# A fast and accurate coalescent approximation

Blog author Suyash Shringarpure is a postdoc in Carlos Bustamante’s lab. Suyash is interested in statistical and computational problems involved in the analysis of biological data.

The coalescent model is a powerful tool in the population geneticist’s toolbox. It traces the history of a sample back to its most recent common ancestor (MRCA) by looking at coalescence events between pairs of lineages. Starting from assumptions of random mating, selective neutrality, and constant population size, the coalescent uses a simple stochastic process that allows us to study properties of genealogies, such as the time to the MRCA and the length of the genealogy, analytically and through efficient simulation. Extensions to the coalescent allow us to incorporate effects of mutation, recombination, selection and demographic events in the coalescent model. A short introduction to the coalescent model can be found here and a longer, more detailed introduction can be read here.

However, coalescent analyses can be slow or suffer from numerical instability, especially for large samples. In a study published earlier this year in Theoretical Population Biology, CEHG fellow Ethan Jewett and CEHG professor Noah Rosenberg proposed fast and accurate approximations to general coalescent formulas and procedures for applying such approximations. Their work also examined the asymptotic behavior of existing coalescent approximations analytically and empirically.

## Computational challenges with the coalescent

For a given sample, there are many possible genealogical histories, i.e., tree topologies and branch lengths, which are consistent with the allelic states of the sample. Analyses involving the coalescent therefore often require us to condition on a specific genealogical property and then sum over all possible genealogies that display the property, weighted by the probability of the genealogy. A genealogical property that is often conditioned on is $n_t$, the number of ancestral lineages in the genealogy at a time $t$ in the past. However, computing the distribution $P(n_t)$ of $n_t$ is computationally expensive for large samples and can suffer from numerical instability.

## A general approximation procedure for formulas conditioning on $n_t$

Coalescent formulas conditioning on $n_t$ typically involve sums of the form $f(x)=\sum_{n_t} f(x|n_t) \cdot P(n_t)$

For large samples and recent times, these computations have two drawbacks:

–       The range of possible values for $n_t$ may be quite large (especially if multiple populations are being analyzed) and a summation over these values may be computationally expensive.

–       Expressions for $P(n_t)$ are susceptible to round-off errors.

Slatkin (2000) proposed an approximation to the summation in $f(x)$ by a single term $f(x|E[n_t])$. This deterministic approximation was based on the observation that $n_t$ changes almost deterministically over time, even though it is a stochastic variable in theory. Thus we can write $n_t \approx E[n_t]$. From Figure 2 in the paper (reproduced here), we can see that this approximation is quite accurate. The authors prove the asymptotic accuracy of this approximation and also prove that under regularity assumptions, $f(x|E[n_t])$ converges to $f(x)$ uniformly in the limits of $t \rightarrow 0$ and $t \rightarrow \infty$ . This is an important result since it shows that the general procedure produces a good approximation for both very recent and very ancient history of the sample. Further, the paper shows how this method can be used to approximate quantities that depend on the trajectory of $n_t$ over time, which can be used to calculate interesting quantities such as the expected number of segregating sites in a genealogy.

## Approximating $E[n_t]$ for single populations

A difficulty with using the deterministic approximation is that $E[n_t]$ often has no closed-form formula, and if one exists, it is typically not easy to compute when the sample is large.

For a single population with changing size, two deterministic approximations have previously been developed (one by Slatkin and Rannala 1997, Volz et al. 2009 and one by Frost and Volz, 2010, Maruvka et al., 2011). Using theoretical and empirical methods, the authors examine the asymptotic behavior and computational complexity of these approximations and a Gaussian approximation by Griffiths. A summary of their results is in the table below.

 Method Accuracy Griffith’s approximation Accurate for large samples and recent history. Slatkin and Rannala (1997), Volz et al. (2009) Accurate for recent history and arbitrary sample size, inaccurate for very ancient history. Frost and Volz (2010), Maruvka et al. (2011) Accurate for both recent and ancient history and for arbitrary sample size. Jewett and Rosenberg (2014) Accurate for both recent and ancient history and arbitrary sample size, and for multiple populations with migration.

## Approximating $E[n_t]$ for multiple populations

Existing approaches only work for single populations of changing size and cannot account for migration between multiple populations. Ethan and Noah extend the framework for single populations to allow multiple populations with migration. The result is a system of simultaneous differential equations, one for each population. While it does not allow for analytical solutions except in very special cases, the system can be easily solved numerically for any given demographic scenario.

## Significance of this work

The extension of the coalescent framework to multiple populations with migration is an important result for demographic inference. The extended framework with multiple populations allows efficient computation of demographically informative quantities such as the expected number of private alleles in a sample, divergence times between populations.

Ethan and Noah describe a general procedure that can be used to approximate coalescent formulas that involve summing over distributions conditioned on $n_t$ or the trajectory of $n_t$ over time. This procedure is particularly accurate for studying very recent or very ancient genealogical history.

The analysis of existing approximations to $E[n_t]$ show that different approximations have different asymptotic behavior and computational complexities. The choice of which approximation to use is therefore often a tradeoff between the computational complexity of the approximation and the likely behavior of the approximation in the parameter ranges of interest.

## Future Directions

As increasingly large genomic samples from populations with complex demographic histories become available for study, exact methods either become intractable or very slow. This work adds to a growing set of approximations to the coalescent and its extensions, joining other methods such as conditional sampling distributions and the sequentially markov coalescent. Ethan and Noah are already exploring applications of these approximate methods to reconciling gene trees with species trees. In the future, I expect that these and other approximations will be important for fast and accurate analysis of large genomic datasets.

## References

[1] Jewett, E. M., & Rosenberg, N. A. (2014). Theory and applications of a deterministic approximation to the coalescent model. Theoretical population biology.

[2] Griffiths, R. C. (1984). Asymptotic line-of-descent distributions. Journal of Mathematical Biology21(1), 67-75.

[3] Frost, S. D., & Volz, E. M. (2010). Viral phylodynamics and the search for an ‘effective number of infections’. Philosophical Transactions of the Royal Society B: Biological Sciences365(1548), 1879-1890.

[4] Maruvka, Y. E., Shnerb, N. M., Bar-Yam, Y., & Wakeley, J. (2011). Recovering population parameters from a single gene genealogy: an unbiased estimator of the growth rate. Molecular biology and evolution28(5), 1617-1631.

[5] Slatkin, M., & Rannala, B. (1997). Estimating the age of alleles by use of intraallelic variability. American journal of human genetics60(2), 447.

[6] Slatkin, M. (2000). Allele age and a test for selection on rare alleles.Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences355(1403), 1663-1668.

[7] Volz, E. M., Pond, S. L. K., Ward, M. J., Brown, A. J. L., & Frost, S. D. (2009). Phylodynamics of infectious disease epidemics. Genetics183(4), 1421-1430.

Paper author Ethan Jewett is a PhD student in the lab of Noah Rosenberg.

# Demographic inference from genomic data in nonmodel insect populations

Blog author Martin Sikora is a postdoc in the lab of Carlos Bustamante.

Reconstructing the demographic history of species and populations is one of the major goals of evolutionary genetics. Inferring the timing and magnitude of past events in the history of a population is not only of interest in its own right, but also in order to form realistic null models for the expected patterns of neutral genetic variation in present-day natural populations. A variety of methods exist that allow the inference of these parameters from genomic data, which, in the absence of detailed historical records in most situations, is often the only feasible way to obtain them. As a consequence, it is generally not possible to empirically validate the parameters inferred from genomic data in a direct comparison with a known “truth” from a natural population. Furthermore, until recently, the application of these methods was limited to model organisms with well-developed genomic resources (e.g., humans and fruitflies), excluding a large number of non-model organisms with potentially considerable evolutionary and ecological interest.

## Chasing butterflies?

In an elegant study recently published in the journal Molecular Ecology, Rajiv McCoy, a graduate student with Dmitri Petrov and Carol Boggs, and colleagues tackle both of these problems in natural populations of Euphydryas gillettii, a species of butterfly native to the northern Rocky Mountains. About 30 years ago, a small founder population of this species from Wyoming was intentionally introduced to a new habitat at the Rocky Mountain Biological Laboratory field site in Colorado, and population sizes were recorded every year since the introduction. The beauty of this system is that it allows the authors to perform a direct comparison of the known demography (i.e. a recent split from the parental population and bottleneck ~30 generations ago, with census data in the newly introduced population) with estimates inferred from genomic data.

Gillete’s Checkerspot (Euphydryas gillettii). Photo taken by Carol Boggs, co-advisor of Rajiv and one of the senior authors of the study.

## A genomic dataset from a non-model organism

The researchers sampled eight larvae each from both the parental as well as the derived population for this study. In the world of model organisms, the next steps for constructing the dataset would be straightforward: Extract genomic DNA, sequence to the desired depth, map to the reference genome and finally call SNPs. In the case of E. gillettii however, no reference genome is available, so the authors had to use a different strategy. They decided to use RNA-sequencing in order to first build a reference transcriptome, which was then used as a reference sequence to map against and discover single nucleotide variants. An additional advantage of this approach is that the data generated can potentially also be utilized for other types of research questions, such as analyses of gene expression differences between the populations. On the downside, SNP calling from a transcriptome without a reference genome is challenging and can lead to false positives, for example due to reads from lowly expressed paralogs erroneously mapping to the highly expressed copy present in the assembled transcriptome. The authors therefore went to great lengths to stringently filter these false positive variants from their dataset.

## Demographic inference using δαδι

For the demographic inference, McCoy and colleagues used δαδι (diffusion approximation for demographic inference), a method developed by Ryan Gutenkunst while he was a postdoc in the group of CEHG faculty member Carlos Bustamante. This method uses a diffusion approximation to calculate the expected allele frequency spectrum under a demographic model of interest. The observed allele frequency spectrum is then fit to the expected spectrum by optimization of the demographic parameters to maximize the likelihood of the data. δαδι has been widely used to infer the demographic history of a number of species, from humans to domesticated rice, and is particularly suited to large-scale genomic datasets due to its flexibility and computational efficiency.

Excerpt of Figure 2 from McCoy et al., illustrating the demographic models tested using δαδι.

## Models vs History

The authors then fit a demographic model reflecting the known population history of E. gillettii, as illustrated in Figure 2 of their article (Model A). Encouragingly, they found that the model provided a very good fit to the data, with an the estimate of the split time between 40 and 47 generations ago, which is very close to the known time of establishment of the Colorado population 33 generations ago. Furthermore, they also tested how robust these results were to using a misspecified demographic model, by incorporating migration between the Colorado and Wyoming populations in their model (which in reality are isolated from each other). However, both alternative models with migration (Models B1 and B2) did not significantly improve the fit, again nicely consistent with the known population history.

## Three butterflies is enough?

Finally, the researchers also tested the robustness of the results to variations in the number of samples or SNPs used in the analysis, from datasets simulated under the best-fit model A. They found that δαδι performed remarkably well even with sample sizes as low as three individuals per population. While this is in principle good news for researchers limited by low number of available samples, one has to be aware of the fact that this results will be to a certain extent specific to this particular type of system, where one population undergoes a very strong bottleneck resulting in large effects on the allele frequency spectrum. A good strategy suggested by McCoy and colleagues is then to use these types of simulations in the planning stages of an experiment, in order to inform researchers of the number of samples and markers necessary to confidently estimate the demographic parameters of interest.

## Conclusions and future directions

For me, this study is a great example of how next-generation sequencing and sophisticated statistical modeling can open up a new world of possibilities to researchers interested in the ecology and evolution of natural populations. McCoy and colleagues constructed their genomic dataset essentially from scratch, without the “luxuries” of a reference genome or database of known polymorphisms. Moving forward, Rajiv has been busy collecting more samples over the past year. He and his colleagues plan to sequence over a thousand of them for the next phase of the project, as well as assemble a reference genome for E. gillettii, and important next step in the development of genomic tools for this fascinating ecological system.

The author of the paper Rajiv McCoy, sampling larvae of Euphydryas gillettii

McCoy, R. C., Garud, N. R., Kelley, J. L., Boggs, C. L. and Petrov, D. A. (2013), Genomic inference accurately predicts the timing and severity of a recent bottleneck in a nonmodel insect population. Molecular Ecology. doi: 10.1111/mec.12591