A fast and accurate coalescent approximation

Blog author Suyash Shringarpure is a postdoc in Carlos Bustamante's lab.

Blog author Suyash Shringarpure is a postdoc in Carlos Bustamante’s lab. Suyash is interested in statistical and computational problems involved in the analysis of biological data.

The coalescent model is a powerful tool in the population geneticist’s toolbox. It traces the history of a sample back to its most recent common ancestor (MRCA) by looking at coalescence events between pairs of lineages. Starting from assumptions of random mating, selective neutrality, and constant population size, the coalescent uses a simple stochastic process that allows us to study properties of genealogies, such as the time to the MRCA and the length of the genealogy, analytically and through efficient simulation. Extensions to the coalescent allow us to incorporate effects of mutation, recombination, selection and demographic events in the coalescent model. A short introduction to the coalescent model can be found here and a longer, more detailed introduction can be read here.

However, coalescent analyses can be slow or suffer from numerical instability, especially for large samples. In a study published earlier this year in Theoretical Population Biology, CEHG fellow Ethan Jewett and CEHG professor Noah Rosenberg proposed fast and accurate approximations to general coalescent formulas and procedures for applying such approximations. Their work also examined the asymptotic behavior of existing coalescent approximations analytically and empirically.

Computational challenges with the coalescent

For a given sample, there are many possible genealogical histories, i.e., tree topologies and branch lengths, which are consistent with the allelic states of the sample. Analyses involving the coalescent therefore often require us to condition on a specific genealogical property and then sum over all possible genealogies that display the property, weighted by the probability of the genealogy. A genealogical property that is often conditioned on is n_t, the number of ancestral lineages in the genealogy at a time t in the past. However, computing the distribution P(n_t) of n_t is computationally expensive for large samples and can suffer from numerical instability.

A general approximation procedure for formulas conditioning on n_t

Coalescent formulas conditioning on n_t typically involve sums of the form f(x)=\sum_{n_t} f(x|n_t) \cdot P(n_t)

For large samples and recent times, these computations have two drawbacks:

–       The range of possible values for n_t may be quite large (especially if multiple populations are being analyzed) and a summation over these values may be computationally expensive.

–       Expressions for P(n_t) are susceptible to round-off errors.

Slatkin (2000) proposed an approximation to the summation in f(x) by a single term f(x|E[n_t]). This deterministic approximation was based on the observation that n_t changes almost deterministically over time, even though it is a stochastic variable in theory. Thus we can write n_t \approx E[n_t]. From Figure 2 in the paper (reproduced here), we can see that this approximation is quite accurate. The authors prove the asymptotic accuracy of this approximation and also prove that under regularity assumptions, f(x|E[n_t]) converges to f(x) uniformly in the limits of t \rightarrow 0 and t \rightarrow \infty . This is an important result since it shows that the general procedure produces a good approximation for both very recent and very ancient history of the sample. Further, the paper shows how this method can be used to approximate quantities that depend on the trajectory of n_t over time, which can be used to calculate interesting quantities such as the expected number of segregating sites in a genealogy.

Approximating E[n_t] for single populations

A difficulty with using the deterministic approximation is that E[n_t] often has no closed-form formula, and if one exists, it is typically not easy to compute when the sample is large.

For a single population with changing size, two deterministic approximations have previously been developed (one by Slatkin and Rannala 1997, Volz et al. 2009 and one by Frost and Volz, 2010, Maruvka et al., 2011). Using theoretical and empirical methods, the authors examine the asymptotic behavior and computational complexity of these approximations and a Gaussian approximation by Griffiths. A summary of their results is in the table below.

Method Accuracy
Griffith’s approximation Accurate for large samples and recent history.
Slatkin and Rannala (1997), Volz et al. (2009) Accurate for recent history and arbitrary sample size, inaccurate for very ancient history.
Frost and Volz (2010), Maruvka et al. (2011) Accurate for both recent and ancient history and for arbitrary sample size.
Jewett and Rosenberg (2014) Accurate for both recent and ancient history and arbitrary sample size, and for multiple populations with migration.


Approximating E[n_t] for multiple populations

Existing approaches only work for single populations of changing size and cannot account for migration between multiple populations. Ethan and Noah extend the framework for single populations to allow multiple populations with migration. The result is a system of simultaneous differential equations, one for each population. While it does not allow for analytical solutions except in very special cases, the system can be easily solved numerically for any given demographic scenario.

Significance of this work

The extension of the coalescent framework to multiple populations with migration is an important result for demographic inference. The extended framework with multiple populations allows efficient computation of demographically informative quantities such as the expected number of private alleles in a sample, divergence times between populations.

Ethan and Noah describe a general procedure that can be used to approximate coalescent formulas that involve summing over distributions conditioned on n_t or the trajectory of n_t over time. This procedure is particularly accurate for studying very recent or very ancient genealogical history.

The analysis of existing approximations to E[n_t] show that different approximations have different asymptotic behavior and computational complexities. The choice of which approximation to use is therefore often a tradeoff between the computational complexity of the approximation and the likely behavior of the approximation in the parameter ranges of interest.

Future Directions

As increasingly large genomic samples from populations with complex demographic histories become available for study, exact methods either become intractable or very slow. This work adds to a growing set of approximations to the coalescent and its extensions, joining other methods such as conditional sampling distributions and the sequentially markov coalescent. Ethan and Noah are already exploring applications of these approximate methods to reconciling gene trees with species trees. In the future, I expect that these and other approximations will be important for fast and accurate analysis of large genomic datasets.


[1] Jewett, E. M., & Rosenberg, N. A. (2014). Theory and applications of a deterministic approximation to the coalescent model. Theoretical population biology.

[2] Griffiths, R. C. (1984). Asymptotic line-of-descent distributions. Journal of Mathematical Biology21(1), 67-75.

[3] Frost, S. D., & Volz, E. M. (2010). Viral phylodynamics and the search for an ‘effective number of infections’. Philosophical Transactions of the Royal Society B: Biological Sciences365(1548), 1879-1890.

[4] Maruvka, Y. E., Shnerb, N. M., Bar-Yam, Y., & Wakeley, J. (2011). Recovering population parameters from a single gene genealogy: an unbiased estimator of the growth rate. Molecular biology and evolution28(5), 1617-1631.

[5] Slatkin, M., & Rannala, B. (1997). Estimating the age of alleles by use of intraallelic variability. American journal of human genetics60(2), 447.

[6] Slatkin, M. (2000). Allele age and a test for selection on rare alleles.Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences355(1403), 1663-1668.

[7] Volz, E. M., Pond, S. L. K., Ward, M. J., Brown, A. J. L., & Frost, S. D. (2009). Phylodynamics of infectious disease epidemics. Genetics183(4), 1421-1430.

Paper author Ethan Jewett is a PhD student in the lab of Noah Rosenberg.

Paper author Ethan Jewett is a PhD student in the lab of Noah Rosenberg.

How recombination and changing environments affect new mutations

Blog author: David Lawrie was a graduate student in Dmitri Petrov’s lab. He is now a postdoc at USC.

I recently sat down with Oana Carja, a graduate student with Marc Feldman, to discuss her paper published in the journal of Theoretical Population Biology entitled “Evolution with stochastic fitnesses: A role for recombination”. In it, the authors Oana Carja, Uri Liberman, and Marcus Feldman explore when a new mutation can invade an infinite, randomly mating population that experiences temporal fluctuations in selection.

The one locus case

This work builds off of previous research in the field on how the fluctuations in fitness over time (i.e., increased variance of fitness) affect the invasion dynamics of a mutation at a single locus. For a single locus, it has been shown that the geometric mean of the fitness of the allele over time determines the ability of an allele to invade a population. This effect is known as the geometric mean principle. Fluctuations in fitness increase the variance and therefore decrease the geometric mean fitness. The variance of the fitness of the allele over time thus greatly impacts the ability of that allele to invade a population.

What if there are two loci?

In investigating a two locus model, the researchers split the loci by their effect on the temporally-varying fitness: one locus only affects the mean, while the other controls the variance. The authors demonstrate through theory and simulation that:

1)    allowing for recombination between the two loci increases the threshold for the combined fitness of the two mutant alleles to invade the population beyond the geometric mean (see figure).

2)    periodic oscillations in the fitness of the alleles over time lead to higher fitness thresholds for invasion over completely random fluctuations (see figure).

3)    edge case scenarios allow for the maintenance of polymorphisms in the population despite clear selective advantages of a subset of allelic combinations.

Temporally changing environments and recombination thus make it overall more difficult for new alleles to invade a population.

Invasibility thresholds as a function of recombination rate. Recombination makes it more difficult for new alleles to invade a population.

Invasibility thresholds as a function of recombination rate. If there is no recombination (the left-most edge of the figure), the geometric mean of the pair of new alleles needs to be higher than 0.5 to allow for invasion, because the resident alleles’ geometric mean fitness is set to 0.5. However, as recombination between the two loci increases, the geometric mean needed for invasion increases rapidly. If there is free recombination (r = 0.5) then  invasion can only happen if the new alleles’ geometric mean fitness is twice the resident alleles’ geometric mean fitness (light grey area). If the environment is changing periodically, it is even harder for new alleles to invade a population (dark grey area).

The evolution of models of evolution

This work is important for addressing the evolutionary dynamics of loci controlling phenotypic variance – in this case, controlling the ability of a phenotype to maintain its fitness even if the environment is variable. Most environments undergo significant temporal shifts from the simple changing of the seasons to larger scale weather changes such as El Niño and climate change, in which species must survive and thrive. For organisms in the wild, many alleles that confer a benefit in one environment will be deleterious when the environment and selective pressures change. There may be modifier-loci which buffer the fitness of those loci in the face of changing environments. Such modifier-loci have been recently found in GWAS studies and may be important for overall phenotypic variance. Thus modeling the patterns of evolution for multiple loci in temporarily varying environments is a key component to advancing our understanding of the patterns found in nature.

Future work

Epigenetic modifiers are a hot area of research and one potential biological mechanism to control phenotypic variance. The evolution of such epigenetic regulation is a particular research interest of Oana. Future work will continue to explore the evolutionary dynamics of epigenetic regulation and focus on applying the above results to finite populations.

Paper author: Oana Carja is a graduate student with Marc Feldman


Oana Carja, Uri Liberman, Marcus W. Feldman, Evolution with stochastic fitnesses: A role for recombination, Theoretical Population Biology, Volume 86, June 2013, Pages 29-42, ISSN 0040-5809.