# A framework for identifying and quantifying fitness effects across loci

Blog author Ethan Jewett is a PhD student in the lab of Noah Rosenberg.

The degree to which similarities and differences among species are the result of natural selection, rather than genetic drift, is a major question in population genetics. Related questions include: what fraction of sites in the genome of a species are affected by selection? What is the distribution of the strength of selection across genomic sites, and how have selective pressures changed over time? To address these questions, we must be able to accurately identify sites in a genome that are under selection and quantify the selective pressures that act on them.

## Difficulties with existing approaches for quantifying fitness effects

A recent paper in Trends in Genetics by David Lawrie and Dmitri Petrov (Lawrie and Petrov, 2014) provides intuition about the power of existing methods for identifying genomic regions affected by purifying selection and for quantifying the selective pressures at different sites. The paper proposes a new framework for quantifying the distribution of fitness effects across a genome. This new framework is a synthesis of two existing forms of analysis – comparative genomic analyses to identify genomic regions in which the level of divergence among two or more species is smaller than expected, and analyses of the distribution of the frequencies of polymorphisms (the site frequency spectrum, or SFS) within a single species (Figure 1). Using simulations and heuristic arguments, Lawrie and Petrov demonstrate that these two forms of analysis can be combined into a framework for quantifying selective pressures that has greater power to identify selected regions and to quantify selective strengths than either approach has on its own.

Figure 1. Using the site frequency spectrum (SFS) to quantify the strength of purifying selection. The SFS tabulates the number of polymorphisms at a given frequency in a sample of haplotypes. Under neutrality (black dots) many high-frequency polymorphisms are observed. Under purifying selection (higher values of the effective selection strength |4Nes|), a higher fraction of new mutations are deleterious, leading to fewer high-frequency polymorphisms (red and blue dots). Adapted from Lawrie and Petrov (2014).
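To make the definition of the SFS in the caption concrete, here is a minimal sketch of how it could be tabulated from a small sample. The function name `site_frequency_spectrum`, the 0/1 encoding (0 = ancestral allele, 1 = derived allele) and the toy haplotypes are illustrative assumptions, not data or code from Lawrie and Petrov.

```python
from collections import Counter


def site_frequency_spectrum(haplotypes):
    """Tabulate the unfolded SFS of a sample.

    haplotypes: list of equal-length lists of 0/1 alleles
                (assumed encoding: 0 = ancestral, 1 = derived).
    Returns a list sfs where sfs[i - 1] is the number of sites at which
    the derived allele is carried by exactly i of the n haplotypes.
    """
    n = len(haplotypes)
    n_sites = len(haplotypes[0])
    # Count derived alleles at each site, then count sites per allele count.
    derived_counts = Counter(
        sum(h[site] for h in haplotypes) for site in range(n_sites)
    )
    # Only polymorphic classes 1..n-1 are kept; monomorphic sites are dropped.
    return [derived_counts.get(i, 0) for i in range(1, n)]


# Toy sample: 4 haplotypes, 4 sites (hypothetical data).
sample = [
    [1, 0, 0, 1],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
]
sfs = site_frequency_spectrum(sample)  # two singletons, no doubletons, one tripleton
```

Purifying selection would show up in such a table as an excess of low-count classes relative to the neutral expectation.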

Lawrie and Petrov begin by discussing the strengths and weaknesses of the two existing approaches. Comparative analyses of genomic divergence are well suited to identifying genomic regions under purifying selection, which exhibit lower-than-expected levels of divergence among species. However, as Lawrie and Petrov note, it can be difficult to use comparative analyses to quantify the strength of selection in a region because even mild purifying selection can result in complete conservation among species within the region (Figure 2). For example, whether the population-scaled selective strength, 4Nes, in a region is 20 or 200, the same genomic signal is observed: complete conservation.

Figure 2. Adapted from Lawrie and Petrov (2014). The evolution of several 100 kb regions was simulated in 32 different mammalian species under varying strengths of selection |4Nes|. The number of substitutions in each region was then estimated using genomic evolutionary rate profiling (GERP). The plot shows the median across regions of the number of inferred substitutions. From the plot, it can be seen that, once the strength of selection exceeds a weak threshold value (3 for the example given), there is full conservation among species.

In contrast to comparative approaches, analyses of within-species polymorphisms based on the site frequency spectrum (SFS) within a region can be used to more precisely quantify the strength of selection. For example, Figure 1 shows that different selective strengths can produce very different site frequency spectra. Moreover, if the SFS can be estimated precisely enough, it can allow us to distinguish between two different selective strengths (e.g., 4Nes1 = 20 and 4Nes2 = 200) that would both lead to total conservation in a comparative study, and would therefore be indistinguishable by that approach. The problem is that it takes a lot of polymorphisms to obtain an accurate estimate of the SFS, and a genomic region of interest may contain too few polymorphisms, especially if the region is under purifying selection, which decreases the apparent mutation rate. Sampling additional individuals from the same species may provide little additional information about the SFS because few novel polymorphisms may be observed in the additional sample. For example, recall that for a sample of n individuals from a wildly idealized panmictic species, the expected number of novel polymorphisms observed in the (n+1)st sampled individual is proportional to 1/n (Watterson, 1975).
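The 1/n behaviour can be checked numerically. The sketch below assumes the standard neutral infinite-sites result that the expected number of segregating sites in a sample of n sequences is θ(1 + 1/2 + ... + 1/(n-1)), where θ is the population-scaled mutation rate of the region. The function names and the value of θ are illustrative assumptions.

```python
def harmonic(m):
    """Return 1 + 1/2 + ... + 1/m (0.0 when m == 0)."""
    return sum(1.0 / i for i in range(1, m + 1))


def expected_segregating_sites(n, theta):
    """Expected number of polymorphic sites in a sample of n sequences
    under the neutral infinite-sites model (assumed setting)."""
    return theta * harmonic(n - 1)


def gain_from_next_sample(n, theta):
    """Expected number of new polymorphisms contributed by sample n + 1."""
    return expected_segregating_sites(n + 1, theta) - expected_segregating_sites(n, theta)


theta = 5.0  # hypothetical scaled mutation rate for the region
for n in (10, 100, 1000):
    # The gain equals theta / n, so each extra individual adds less and less.
    print(n, gain_from_next_sample(n, theta))
```

Going from 10 to 1000 sampled sequences lowers the expected return from the next sequence by a factor of 100, which is why deeper sampling of one species eventually yields few new variants.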

Lawrie and Petrov demonstrate that studying polymorphisms by sampling many individuals across several related species (rather than sampling more individuals within a single species) could increase the observed number of polymorphisms in a region, and therefore, could increase the power to quantify the strength of selection (Figure 3) – as long as the selective forces in the genomic region are sufficiently similar across the different species.

Figure 3. The benefits of studying polymorphisms in many populations, rather than within a single population. Three populations (A, B, and C) diverge from an ancestral population, D. The genealogy of a single region is shown (slanted lines) with mutations in the region denoted by orange slashes. Additional lineages sampled in population A are likely to coalesce recently with other lineages (for example, the red clade in population A) and, therefore, carry few mutations that have not already been observed in the sample. In comparison, the same number of lineages sampled from a second population are likely to carry additional independent polymorphisms (for example, the red lineages in population B). If the selective pressures at the locus in populations A and B are similar, then the SFS in the two populations should be similar, and the additional lineages in B can provide additional information about the SFS. For example, if the demographic histories and selective pressures at the locus are identical in populations A and B, and if the samples from populations A and B are sufficiently diverged, then a sample of K lineages from each population, A and B, will contain double the number of independent polymorphisms that are observed in a sample of K lineages from population A alone, providing double the number of mutations that can be used to estimate the SFS.
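As a rough illustration of the trade-off in Figure 3, the sketch below compares two ways of spending 2K sampled lineages: all in population A, or K in each of A and B. It assumes neutrality, the infinite-sites model, identical θ in both populations, and populations diverged enough that their polymorphisms are independent. These assumptions and the function names are illustrative and much simpler than the simulations in the paper.

```python
def harmonic(m):
    """Return 1 + 1/2 + ... + 1/m (0.0 when m == 0)."""
    return sum(1.0 / i for i in range(1, m + 1))


def expected_segregating_sites(n, theta):
    """Expected polymorphic sites in n sequences from one population
    under the neutral infinite-sites model (assumed setting)."""
    return theta * harmonic(n - 1)


def compare_designs(k, theta):
    """Expected polymorphism counts for two sampling designs of 2k lineages."""
    # Design 1: all 2k lineages from a single population.
    one_population = expected_segregating_sites(2 * k, theta)
    # Design 2: k lineages from each of two fully diverged populations,
    # so their polymorphisms are independent and simply add up.
    two_populations = 2 * expected_segregating_sites(k, theta)
    return one_population, two_populations


deep, broad = compare_designs(k=20, theta=1.0)  # hypothetical K and theta
print(f"2K from A: {deep:.2f}   K from A + K from B: {broad:.2f}")
```

Under these assumptions, doubling the depth in one population adds only about θ·ln 2 polymorphisms, whereas adding the second population doubles the count.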

## The need for sampling depth and breadth

Without getting bogged down in the details, it’s the rare variants that are often the most important for quantifying the effects of purifying selection, so one still has to sample deeply within each species; however, overall, sampling from additional species is a more efficient way of increasing the absolute number of variants that can be used to estimate the SFS in a region, compared with sampling more deeply within the same species.

The simulations and heuristic arguments presented by Lawrie and Petrov consider idealized cases for simplicity; however, the usefulness of approaches that consider polymorphisms across multiple species has been demonstrated in methods such as the McDonald-Kreitman test (McDonald and Kreitman, 1991), which have long been important tools for studying selection. More recent empirical applications of approaches that consider information about polymorphisms across multiple species appear to do a good job of quantifying selective pressures across genomes (Wilson et al., 2011; Gronau et al., 2013), even when species are closely related (De Maio et al., 2013). Overall, the simulations and arguments presented in Lawrie and Petrov’s paper provide useful guidelines for researchers interested in identifying and quantifying selective forces, and their recommendation to sample deeply within species and broadly across many species comes at a time when such analyses are becoming increasingly practical, given the recent availability of sequencing data from many species.

## References:

Paper author: David Lawrie was a graduate student in Dmitri Petrov’s lab. He is now a postdoc at USC.

# A fast and accurate coalescent approximation

Blog author Suyash Shringarpure is a postdoc in Carlos Bustamante’s lab. Suyash is interested in statistical and computational problems involved in the analysis of biological data.

The coalescent model is a powerful tool in the population geneticist’s toolbox. It traces the history of a sample back to its most recent common ancestor (MRCA) by looking at coalescence events between pairs of lineages. Starting from assumptions of random mating, selective neutrality, and constant population size, the coalescent uses a simple stochastic process that allows us to study properties of genealogies, such as the time to the MRCA and the length of the genealogy, analytically and through efficient simulation. Extensions to the coalescent allow us to incorporate effects of mutation, recombination, selection and demographic events in the coalescent model. A short introduction to the coalescent model can be found here and a longer, more detailed introduction can be read here.

However, coalescent analyses can be slow or suffer from numerical instability, especially for large samples. In a study published earlier this year in Theoretical Population Biology, CEHG fellow Ethan Jewett and CEHG professor Noah Rosenberg proposed fast and accurate approximations to general coalescent formulas and procedures for applying such approximations. Their work also examined the asymptotic behavior of existing coalescent approximations analytically and empirically.

## Computational challenges with the coalescent

For a given sample, there are many possible genealogical histories, i.e., tree topologies and branch lengths, which are consistent with the allelic states of the sample. Analyses involving the coalescent therefore often require us to condition on a specific genealogical property and then sum over all possible genealogies that display the property, weighted by the probability of the genealogy. A genealogical property that is often conditioned on is $n_t$, the number of ancestral lineages in the genealogy at a time $t$ in the past. However, computing the distribution $P(n_t)$ of $n_t$ is computationally expensive for large samples and can suffer from numerical instability.

## A general approximation procedure for formulas conditioning on $n_t$

Coalescent formulas conditioning on $n_t$ typically involve sums of the form $f(x)=\sum_{n_t} f(x|n_t) \cdot P(n_t)$.

For large samples and recent times, these computations have two drawbacks:

- The range of possible values for $n_t$ may be quite large (especially if multiple populations are being analyzed) and a summation over these values may be computationally expensive.

- Expressions for $P(n_t)$ are susceptible to round-off errors.

Slatkin (2000) proposed approximating the summation in $f(x)$ by the single term $f(x|E[n_t])$. This deterministic approximation rests on the observation that $n_t$ changes almost deterministically over time, even though it is in principle a random variable, so that we can write $n_t \approx E[n_t]$. From Figure 2 in the paper (reproduced here), we can see that this approximation is quite accurate. Jewett and Rosenberg prove the asymptotic accuracy of this approximation and also prove that, under regularity assumptions, $f(x|E[n_t])$ converges to $f(x)$ uniformly in the limits $t \rightarrow 0$ and $t \rightarrow \infty$. This is an important result since it shows that the general procedure produces a good approximation for both the very recent and the very ancient history of the sample. Further, the paper shows how this method can be used to approximate quantities that depend on the trajectory of $n_t$ over time, such as the expected number of segregating sites in a genealogy.
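The idea of replacing the sum by a single term can be tried out on a toy case. The sketch below is not the authors' code: it assumes a single population of constant size, measures time in coalescent units (each pair of lineages coalesces at rate 1), estimates the distribution of $n_t$ by Monte Carlo simulation rather than by an exact formula, and uses a hypothetical smooth function `g(k) = 2 * log(k)` as a stand-in for $f(x|n_t)$.

```python
import math
import random


def simulate_n_t(n, t, rng):
    """Simulate the number of ancestral lineages remaining at time t,
    starting from n lineages, under the standard constant-size coalescent."""
    k, elapsed = n, 0.0
    while k > 1:
        rate = k * (k - 1) / 2.0          # total coalescence rate with k lineages
        elapsed += rng.expovariate(rate)  # waiting time to the next coalescence
        if elapsed > t:
            break
        k -= 1
    return k


def g(k):
    """Hypothetical stand-in for f(x | n_t = k); any smooth function would do."""
    return 2.0 * math.log(k)


def compare(n=50, t=0.05, reps=2000, seed=1):
    rng = random.Random(seed)
    draws = [simulate_n_t(n, t, rng) for _ in range(reps)]
    # Monte Carlo estimate of the full sum  sum_k g(k) * P(n_t = k).
    full_sum = sum(g(k) for k in draws) / reps
    # Single-term approximation g(E[n_t]), with E[n_t] estimated from the draws.
    mean_n = sum(draws) / reps
    return full_sum, g(mean_n), mean_n


full_sum, single_term, mean_n = compare()
print(f"E[n_t] ~ {mean_n:.2f}, full sum ~ {full_sum:.4f}, single term ~ {single_term:.4f}")
```

Because $n_t$ is tightly concentrated around its mean, the two numbers should be close. The gap depends on how curved the chosen function is over the range where $n_t$ has appreciable probability.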

## Approximating $E[n_t]$ for single populations

A difficulty with using the deterministic approximation is that $E[n_t]$ often has no closed-form formula, and if one exists, it is typically not easy to compute when the sample is large.

For a single population with changing size, two deterministic approximations have previously been developed: one by Slatkin and Rannala (1997) and Volz et al. (2009), and one by Frost and Volz (2010) and Maruvka et al. (2011). Using theoretical and empirical methods, the authors examine the asymptotic behavior and computational complexity of these approximations and of a Gaussian approximation by Griffiths (1984). A summary of their results is in the table below.

| Method | Accuracy |
| --- | --- |
| Griffiths' approximation (1984) | Accurate for large samples and recent history. |
| Slatkin and Rannala (1997), Volz et al. (2009) | Accurate for recent history and arbitrary sample size, inaccurate for very ancient history. |
| Frost and Volz (2010), Maruvka et al. (2011) | Accurate for both recent and ancient history and for arbitrary sample size. |
| Jewett and Rosenberg (2014) | Accurate for both recent and ancient history and arbitrary sample size, and for multiple populations with migration. |
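To give a feel for why deterministic approximations can differ in ancient history, the sketch below compares two simple ODE approximations for a population of constant size, with time in coalescent units. One uses $dn/dt = -n^2/2$ and the other uses $dn/dt = -n(n-1)/2$; both have closed-form solutions. These equations are written down here for illustration, and mapping them onto the specific papers in the table is an assumption that this post does not confirm.

```python
import math


def n_quadratic(n0, t):
    """Solution of dn/dt = -n^2 / 2 with n(0) = n0.
    Decays to 0 as t grows, although a sample always keeps at least 1 lineage."""
    return n0 / (1.0 + n0 * t / 2.0)


def n_pairs(n0, t):
    """Solution of dn/dt = -n(n - 1) / 2 with n(0) = n0.
    Uses the number of pairs n(n - 1)/2 as the loss rate, so n(t) -> 1."""
    return n0 / (n0 - (n0 - 1.0) * math.exp(-t / 2.0))


n0 = 50  # hypothetical sample size
for t in (0.01, 0.05, 0.5, 5.0, 50.0):
    print(f"t={t:<5}  quadratic={n_quadratic(n0, t):8.3f}  pairs={n_pairs(n0, t):8.3f}")
```

For recent times the two curves nearly coincide, but for large $t$ the first drops below one lineage while the second levels off at one. That matches the qualitative recent-versus-ancient distinction drawn in the table.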

## Approximating $E[n_t]$ for multiple populations

Existing approaches only work for single populations of changing size and cannot account for migration between multiple populations. Ethan and Noah extend the framework for single populations to allow multiple populations with migration. The result is a system of simultaneous differential equations, one for each population. While it does not allow for analytical solutions except in very special cases, the system can be easily solved numerically for any given demographic scenario.

## Significance of this work

The extension of the coalescent framework to multiple populations with migration is an important result for demographic inference. The extended framework allows efficient computation of demographically informative quantities such as the expected number of private alleles in a sample and divergence times between populations.

Ethan and Noah describe a general procedure that can be used to approximate coalescent formulas that involve summing over distributions conditioned on $n_t$ or the trajectory of $n_t$ over time. This procedure is particularly accurate for studying very recent or very ancient genealogical history.

The analysis of existing approximations to $E[n_t]$ shows that different approximations have different asymptotic behavior and computational complexities. The choice of which approximation to use is therefore often a tradeoff between the computational complexity of the approximation and its likely behavior in the parameter ranges of interest.

## Future Directions

As increasingly large genomic samples from populations with complex demographic histories become available for study, exact methods become either intractable or very slow. This work adds to a growing set of approximations to the coalescent and its extensions, joining other methods such as conditional sampling distributions and the sequentially Markov coalescent. Ethan and Noah are already exploring applications of these approximate methods to reconciling gene trees with species trees. In the future, I expect that these and other approximations will be important for fast and accurate analysis of large genomic datasets.

## References

[1] Jewett, E. M., & Rosenberg, N. A. (2014). Theory and applications of a deterministic approximation to the coalescent model. Theoretical Population Biology.

[2] Griffiths, R. C. (1984). Asymptotic line-of-descent distributions. Journal of Mathematical Biology, 21(1), 67-75.

[3] Frost, S. D., & Volz, E. M. (2010). Viral phylodynamics and the search for an ‘effective number of infections’. Philosophical Transactions of the Royal Society B: Biological Sciences, 365(1548), 1879-1890.

[4] Maruvka, Y. E., Shnerb, N. M., Bar-Yam, Y., & Wakeley, J. (2011). Recovering population parameters from a single gene genealogy: an unbiased estimator of the growth rate. Molecular Biology and Evolution, 28(5), 1617-1631.

[5] Slatkin, M., & Rannala, B. (1997). Estimating the age of alleles by use of intraallelic variability. American Journal of Human Genetics, 60(2), 447.

[6] Slatkin, M. (2000). Allele age and a test for selection on rare alleles. Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences, 355(1403), 1663-1668.

[7] Volz, E. M., Pond, S. L. K., Ward, M. J., Brown, A. J. L., & Frost, S. D. (2009). Phylodynamics of infectious disease epidemics. Genetics, 183(4), 1421-1430.

Paper author Ethan Jewett is a PhD student in the lab of Noah Rosenberg.