Coalescent theory

Coalescent theory is a model of how gene variants sampled from a population may have originated from a common ancestor. In the simplest case, coalescent theory assumes no recombination, no natural selection, and no gene flow or population structure, meaning that each variant is equally likely to have been passed from one generation to the next. The model looks backward in time, merging alleles into a single ancestral copy according to a random process in coalescence events. Under this model, the expected time between successive coalescence events increases almost exponentially back in time (with wide variance). Variance in the model comes from both the random passing of alleles from one generation to the next, and the random occurrence of mutations in these alleles.

The mathematical theory of the coalescent was developed independently by several groups in the early 1980s as a natural extension of classical population genetics theory and models[1][2][3][4], but can be primarily attributed to [5] Advances in coalescent theory include recombination, selection, overlapping generations and virtually any arbitrarily complex evolutionary or demographic model in population genetic analysis.

The model can be used to produce many theoretical genealogies, and then compare observed data to these simulations to test assumptions about the demographic history of a population. Coalescent theory can be used to make inferences about population genetic parameters, such as migration, population size and recombination.

Theory

Time to coalescence

Consider a single gene locus sampled from two haploid individuals in a population. The ancestry of this sample is traced backwards in time to the point where these two lineages coalesce in their most recent common ancestor (MRCA). Coalescent theory seeks to estimate the expectation of this time period and its variance.

The probability that two lineages coalesce in the immediately preceding generation is the probability that they share a parental DNA sequence. In a population with a constant effective population size with 2Ne copies of each locus, there are 2Ne "potential parents" in the previous generation. Under a random mating model, the probability that two alleles originate from the same parental copy is thus 1/(2Ne) and, correspondingly, the probability that they do not coalesce is 1 − 1/(2Ne).

At each successive preceding generation, the probability of coalescence is geometrically distributed—that is, it is the probability of noncoalescence at the t − 1 preceding generations multiplied by the probability of coalescence at the generation of interest:

${\displaystyle P_{c}(t)=\left(1-{\frac {1}{2N_{e}}}\right)^{t-1}\left({\frac {1}{2N_{e}}}\right).}$

For sufficiently large values of Ne, this distribution is well approximated by the continuously defined exponential distribution

${\displaystyle P_{c}(t)={\frac {1}{2N_{e}}}e^{-{\frac {t-1}{2N_{e}}}}.}$

This is mathematically convenient, as the standard exponential distribution has both the expected value and the standard deviation equal to 2Ne. Therefore, although the expected time to coalescence is 2Ne, actual coalescence times have a wide range of variation. Note that coalescent time is the number of preceding generations where the coalescence took place and not calendar time, though an estimation of the latter can be made multiplying 2Ne with the average time between generations. The above calculations apply equally to a diploid population of effective size Ne (in other words, for a non-recombining segment of DNA, each chromosome can be treated as equivalent to an independent haploid individual; in the absence of inbreeding, sister chromosomes in a single individual are no more closely related than two chromosomes randomly sampled from the population). Some effectively haploid DNA elements, such as mitochondrial DNA, however, are only carried by one sex, and therefore have one quarter the effective size of the equivalent diploid population (Ne/2)

Neutral variation

Coalescent theory can also be used to model the amount of variation in DNA sequences expected from genetic drift and mutation. This value is termed the mean heterozygosity, represented as ${\displaystyle {\bar {H}}}$. Mean heterozygosity is calculated as the probability of a mutation occurring at a given generation divided by the probability of any "event" at that generation (either a mutation or a coalescence). The probability that the event is a mutation is the probability of a mutation in either of the two lineages: ${\displaystyle 2\mu }$. Thus the mean heterozygosity is equal to

{\displaystyle {\begin{aligned}{\bar {H}}&={\frac {2\mu }{2\mu +{\frac {1}{2N_{e}}}}}\\[3pt]&={\frac {4N_{e}\mu }{1+4N_{e}\mu }}\\[3pt]&={\frac {\theta }{1+\theta }}\end{aligned}}}

For ${\displaystyle 4N_{e}\mu \gg 1}$, the vast majority of allele pairs have at least one difference in nucleotide sequence.