# BAPGXII Saturday May 30, 2015

Stanford is hosting the 12th Bay Area Population Genomics (BAPG) meeting. The Bay Area Population Genomics meeting is a great place to (re)connect with your pop gen/genomics colleagues in the area and to present your work in a talk or a poster.

BAPGXI, held in December at UC Davis, was a great event with over 100 participants and a lineup of excellent talks. Thanks to the Coop lab! You can read more here, including the storified tweets. We are excited to continue this success at Stanford!

Logistics

The meeting will take place on May 30th on the Stanford campus in the Alway building, room M106. We start at 8:30am with breakfast and registration. Dr. Dmitri Petrov’s opening remarks will begin at 9:25am, and the first talk will be at 9:30am. The last talk (Dr. Jonathan Pritchard’s keynote) ends at 2:10pm, followed by a poster session with amazing wine, beer, and cheese! Here is a general outline of the agenda to help you plan your day:

Breakfast and Registration in Alway Courtyard 8:30-9:25am (pick up your BAPGXII gift!)
Opening Remarks 9:25-9:30am
Talk Session 1 9:30-10:30am (20 mins per talk)
Coffee Break in Courtyard 10:30-11am
Talk Session 2 11am-12pm (20 mins per talk)
Lunch in Courtyard 12-1pm
Talk Session 3 and Keynote 1-2:10pm (two 20-minute talks and one 30-minute talk)
Poster Session with Wine, Beer, and Cheese Reception at 2:10pm, ends at 3pm

Talks and Posters

Sorry. Speaker and poster slots are now full. No longer accepting sign-ups.

How to Attend BAPGXII

1. Please register here by 10am Friday, May 29th to join us at BAPGXII. Registration is free and open to all, but required.

2. Encourage your colleagues to sign up! Forward this email to your lab mailing list and watch for updates on the CEHG Facebook page and on Twitter @StanfordCEHG. Help us get the momentum going by tweeting us using #BAPGXII.

3. And finally, once you’ve signed up, all you need to do is get up early and ride-share, VTA/Caltrain or bike to our beautiful campus on May 30th. Come for the science, stay for the social! Use the Stanford campus map and this Google Map to find the Alway Building, located at 500 Pasteur Drive, Stanford, CA. Be green and consider ride-sharing: there is a dedicated tab for making travel plans in the sign up doc!

We hope to see you at Stanford!

The BAPGXII organizing committee: Bridget Algee-Hewitt (@BridgetAH), David Enard (@DavidEnard), Katie Kanagawa (@KatieKanagawa), Alison Nguyen, Dmitri Petrov (@PetrovADmitri), Susanne Tilk, and Elena Yujuico. If you have any questions, feel free to contact Bridget Algee-Hewitt at bridgeta@stanford.edu.

# What’s Sardinia got to do with it? Ancient and modern genomes shed light on the genetic structure of Europe.

Blog author Yuan Zhu, formerly a PhD student in the Petrov lab, is now a Research Fellow at the Genome Institute of Singapore.

The Neolithic Revolution is the oldest documented agricultural revolution in human history. More than just the domestication of certain crops and animals, it describes a critical time in human history when hunter-gatherer groups transitioned into sedentary farming communities. This drastic change in lifestyle led to a major shift in living conditions and cultural practices, setting up the necessary prerequisites to support the kind of population density eventually possible in modern society.

In Central Europe, the Neolithic Revolution is thought to have taken place around 8,000-4,000 BC. Historians have long wondered about how farming was introduced and spread across the continents. Was the new practice brought in as novel ideas incorporated by local communities? Did new immigrants bring their lifestyle with them, possibly outcompeting existing hunter-gatherers and eventually displacing them altogether? Was it perhaps even more complicated? What happened after?

## What Ötzi can tell us

Ancient human remains from around the time of the revolution can yield some insight. Ötzi the Tyrolean Iceman, a 5,300-year-old natural mummy found frozen in the Alps on the border of Italy and Austria, was recently shown (by a group that included CEHG researchers Martin Sikora and Carlos Bustamante) to belong to a Y-chromosome lineage mostly found in contemporary Sardinia [1]. This was surprising information. The Iceman’s life was spent in a narrow range within 60 km of his site of discovery [2]. He was unequivocally local, and clearly a farmer. Yet his lineage has since disappeared from Central Europe, suggesting that demographic scenarios were more complex than expected, and that at some point this Sardinian-like ancestry may have spanned Neolithic Europe.

A) The location of the discovery sites of the ancient individuals studied, with hunter-gatherers (HG) represented as circles and farming (F) individuals represented as squares. B) ADMIXTURE results for modern populations in the left panel, and inferred genetic composition of ancient individuals in the right panel. [Adapted from Figure 1, Sikora et al. 2014.]

## Sardinia: a genetic snapshot of the Neolithic?

In a recent paper published in PLOS Genetics, Sikora and colleagues sought to address this hypothesis by making full use of recent advancements in the sequencing of nuclear ancient DNA [3]. However, the Iceman alone was not sufficient to represent a continent. Ancient DNA sequences from six individuals from across Europe, including both farmer and hunter-gatherer individuals, were analyzed by the authors in order to paint a clearer picture of the demographics of Neolithic Europe. Two of the farmers were found in Bulgaria and were previously sequenced using an ancient DNA capture method developed by Sikora’s colleague in the Bustamante lab, Meredith Carpenter [4, and see blog post here]. In addition, Sikora made use of contemporary population SNP data, including sequence data from over 400 modern Sardinians, to provide a solid reference from which to estimate the true genetic affiliation of these ancient humans.

Some of the most interesting results from the analysis came from contrasts between the farmers (Iceman, gok4, and P192-1), the hunter-gatherers (ajv7 and brana1), and modern-day European populations. When the authors applied the clustering algorithm ADMIXTURE to the data, they found that the farmer individuals had significant portions of shared ancestry with modern Sardinians (Southern Europe), a characteristic largely absent in the HG individuals, who showed mainly Northern European (Basque) and Russian affiliated ancestry. Principal component analysis (PCA) and a statistic called the D-test agreed with high confidence: hunter-gatherers looked more Northern European, whereas farmers seemed more Sardinian than any other European group tested. TreeMix, a program that models population splits while allowing for admixture between branches, provided a similar answer when applied to the data from 1000 Genomes and the modern Sardinians, and further suggested a possible admixture scenario involving at least three major events, all of which fall neatly in line with previous work.
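To give a feel for what a D-test measures, here is a minimal sketch of the usual ABBA-BABA form of the statistic. It is not the authors' pipeline: the function name `d_statistic`, the 0/1 allele coding, and the toy sites are all assumptions made for illustration, and published analyses typically work with allele frequencies and use a block jackknife to assess significance. Sign conventions also vary between papers.

```python
def d_statistic(sites):
    """Compute D = (ABBA - BABA) / (ABBA + BABA) from biallelic sites.

    Each site is a tuple (p1, p2, p3, outgroup) of alleles coded 0 or 1.
    This coding and the function name are illustrative assumptions.
    """
    abba = 0
    baba = 0
    for p1, p2, p3, outgroup in sites:
        # A site is informative only if P3 differs from the outgroup
        # and P1 differs from P2.
        if p3 == outgroup or p1 == p2:
            continue
        if p2 == p3:
            abba += 1  # P2 shares the P3 allele: pattern ABBA
        else:
            baba += 1  # P1 shares the P3 allele: pattern BABA
    if abba + baba == 0:
        raise ValueError("no informative sites")
    return (abba - baba) / (abba + baba)


# Toy data: three ABBA sites, one BABA site, two uninformative sites.
toy_sites = [(0, 1, 1, 0)] * 3 + [(1, 0, 1, 0), (0, 0, 1, 0), (1, 1, 1, 0)]
print(d_statistic(toy_sites))
```

Under this convention, a value near zero means P3 shares alleles equally often with P1 and P2, while a positive value means P3 shares more alleles with P2 than with P1.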

Taken together, the data support the authors’ original hypothesis—Sardinian-like ancestry was probably once common in Neolithic Europe. The Iceman, gok4, and P192-1 were discovered in very different locations, and P192-1 in particular was 2,000 years younger than the others, making it even more unlikely that all three were recurrent immigrants from Sardinia (which was thought to be uninhabited by hunter-gatherers prior to the Neolithic), and further suggesting that the lineage may have persisted for a while on the continent. In fact, Sikora and colleagues propose that Sardinia is a “modern-day ‘snapshot’ of the genetic structure of the people associated with the spread of agriculture in Europe.”

A proposed, highly simplified version of recent European demographic history. A) Early hunter-gatherers (closest to modern-day Russian/Basque) were B) heavily influenced by an influx of farmers, C) who spread across all of Europe and into Sardinia, D) with this ancestry subsequently maintained only in Sardinia due to genetic isolation. [Adapted from Figure 4, Sikora et al. 2014.]

## Bridging the past and the future with ancient DNA

From here, the story is far from over. In fact, it only gets more complicated, and more work remains to be done. While a simplified model was proposed, the authors note that multiple sources of evidence suggest a far more complex and nuanced recent demographic history for Europe that we have yet to untangle. Ancient DNA sequences also come with issues of their own, such as characteristic DNA damage patterns. Because current methods may not handle such patterns well, the authors had to analyze every ancient DNA sample against modern populations individually. As ancient DNA sequencing becomes more accurate and accessible, new analytical methods must be developed to take full advantage of the data, as with every advance in sequencing technology.

## References

Paper author Martin Sikora was a postdoctoral fellow in Carlos Bustamante’s lab. He is now a group leader at the Center for GeoGenetics in Copenhagen, Denmark.

# A fast and accurate coalescent approximation

Blog author Suyash Shringarpure is a postdoc in Carlos Bustamante’s lab. Suyash is interested in statistical and computational problems involved in the analysis of biological data.

The coalescent model is a powerful tool in the population geneticist’s toolbox. It traces the history of a sample back to its most recent common ancestor (MRCA) by looking at coalescence events between pairs of lineages. Starting from assumptions of random mating, selective neutrality, and constant population size, the coalescent uses a simple stochastic process that allows us to study properties of genealogies, such as the time to the MRCA and the length of the genealogy, analytically and through efficient simulation. Extensions to the coalescent allow us to incorporate effects of mutation, recombination, selection and demographic events in the coalescent model. A short introduction to the coalescent model can be found here and a longer, more detailed introduction can be read here.
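As a small illustration of the "efficient simulation" mentioned above, here is a sketch that simulates genealogies under the standard coalescent (random mating, neutrality, constant population size) and compares the simulated time to the MRCA and total tree length with their well-known expectations, $2(1 - 1/n)$ and $2\sum_{k=1}^{n-1} 1/k$ in coalescent time units. The function names and default settings are my own choices, not something taken from the paper.

```python
import random


def simulate_genealogy(n, rng):
    """Return (time to MRCA, total branch length) for a sample of size n.

    Time is measured in coalescent units: while k lineages remain, the
    waiting time to the next coalescence is exponential with rate k(k-1)/2.
    """
    tmrca = 0.0
    total_length = 0.0
    for k in range(n, 1, -1):
        wait = rng.expovariate(k * (k - 1) / 2.0)
        tmrca += wait
        total_length += k * wait  # each of the k branches grows by `wait`
    return tmrca, total_length


def expected_tmrca(n):
    """Analytical expectation of the time to the MRCA."""
    return 2.0 * (1.0 - 1.0 / n)


def expected_total_length(n):
    """Analytical expectation of the total length of the genealogy."""
    return 2.0 * sum(1.0 / k for k in range(1, n))


def simulated_means(n, reps=20000, seed=42):
    """Average the two quantities over `reps` simulated genealogies."""
    rng = random.Random(seed)
    draws = [simulate_genealogy(n, rng) for _ in range(reps)]
    mean_tmrca = sum(d[0] for d in draws) / reps
    mean_length = sum(d[1] for d in draws) / reps
    return mean_tmrca, mean_length


if __name__ == "__main__":
    n = 10
    print("simulated:", simulated_means(n))
    print("expected: ", (expected_tmrca(n), expected_total_length(n)))
```

The simulated averages should land close to the analytical values, which is a quick sanity check before adding mutation, recombination, or demography on top.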

However, coalescent analyses can be slow or suffer from numerical instability, especially for large samples. In a study published earlier this year in Theoretical Population Biology, CEHG fellow Ethan Jewett and CEHG professor Noah Rosenberg proposed fast and accurate approximations to general coalescent formulas and procedures for applying such approximations. Their work also examined the asymptotic behavior of existing coalescent approximations analytically and empirically.

## Computational challenges with the coalescent

For a given sample, there are many possible genealogical histories, i.e., tree topologies and branch lengths, which are consistent with the allelic states of the sample. Analyses involving the coalescent therefore often require us to condition on a specific genealogical property and then sum over all possible genealogies that display the property, weighted by the probability of the genealogy. A genealogical property that is often conditioned on is $n_t$, the number of ancestral lineages in the genealogy at a time $t$ in the past. However, computing the distribution $P(n_t)$ of $n_t$ is computationally expensive for large samples and can suffer from numerical instability.

## A general approximation procedure for formulas conditioning on $n_t$

Coalescent formulas conditioning on $n_t$ typically involve sums of the form $f(x)=\sum_{n_t} f(x|n_t) \cdot P(n_t)$

For large samples and recent times, these computations have two drawbacks:

- The range of possible values for $n_t$ may be quite large (especially if multiple populations are being analyzed) and a summation over these values may be computationally expensive.

- Expressions for $P(n_t)$ are susceptible to round-off errors.

Slatkin (2000) proposed an approximation to the summation in $f(x)$ by a single term $f(x|E[n_t])$. This deterministic approximation was based on the observation that $n_t$ changes almost deterministically over time, even though it is a stochastic variable in theory. Thus we can write $n_t \approx E[n_t]$. From Figure 2 in the paper (reproduced here), we can see that this approximation is quite accurate. The authors prove the asymptotic accuracy of this approximation and also prove that under regularity assumptions, $f(x|E[n_t])$ converges to $f(x)$ uniformly in the limits $t \rightarrow 0$ and $t \rightarrow \infty$. This is an important result since it shows that the general procedure produces a good approximation for both the very recent and the very ancient history of the sample. Further, the paper shows how this method can be used to approximate quantities that depend on the trajectory of $n_t$ over time, such as the expected number of segregating sites in a genealogy.
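The idea of replacing the full sum by a single term can be checked numerically. The sketch below is only an illustration under my own assumptions: it uses a constant-size population, estimates both $E[n_t]$ and the sum over $n_t$ by simulation rather than by any formula from the paper, and picks one simple conditional quantity, the expected remaining time to the MRCA given $n$ lineages, $2(1 - 1/n)$, as a stand-in for $f(x|n_t)$.

```python
import random


def draw_n_t(n0, t, rng):
    """Draw the number of ancestral lineages at time t (coalescent units)."""
    n = n0
    elapsed = 0.0
    while n > 1:
        elapsed += rng.expovariate(n * (n - 1) / 2.0)
        if elapsed > t:
            break  # the next coalescence happens after time t
        n -= 1
    return n


def expected_remaining_tmrca(n):
    """Stand-in for f(x | n_t): expected further time to the MRCA."""
    return 2.0 * (1.0 - 1.0 / n)


def compare_sum_to_single_term(f_given_n, n0, t, reps=5000, seed=7):
    """Return (Monte Carlo estimate of the full sum, single-term value, mean n_t)."""
    rng = random.Random(seed)
    draws = [draw_n_t(n0, t, rng) for _ in range(reps)]
    mean_n = sum(draws) / reps
    full_sum = sum(f_given_n(n) for n in draws) / reps  # approximates sum f(x|n) P(n)
    single_term = f_given_n(mean_n)                     # f(x | E[n_t])
    return full_sum, single_term, mean_n


if __name__ == "__main__":
    print(compare_sum_to_single_term(expected_remaining_tmrca, n0=100, t=0.05))
```

For a large sample and a recent time point, the two numbers should agree closely, because $n_t$ varies little around its mean. In practice $E[n_t]$ would come from one of the approximations discussed next rather than from simulation.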

## Approximating $E[n_t]$ for single populations

A difficulty with using the deterministic approximation is that $E[n_t]$ often has no closed-form formula, and if one exists, it is typically not easy to compute when the sample is large.

For a single population with changing size, two deterministic approximations have previously been developed: one by Slatkin and Rannala (1997) and Volz et al. (2009), and one by Frost and Volz (2010) and Maruvka et al. (2011). Using theoretical and empirical methods, the authors examine the asymptotic behavior and computational complexity of these approximations and of a Gaussian approximation by Griffiths (1984). A summary of their results is in the table below.

| Method | Accuracy |
| --- | --- |
| Griffiths’ approximation | Accurate for large samples and recent history. |
| Slatkin and Rannala (1997), Volz et al. (2009) | Accurate for recent history and arbitrary sample size, inaccurate for very ancient history. |
| Frost and Volz (2010), Maruvka et al. (2011) | Accurate for both recent and ancient history and for arbitrary sample size. |
| Jewett and Rosenberg (2014) | Accurate for both recent and ancient history and arbitrary sample size, and for multiple populations with migration. |
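To make the difference between "accurate for recent history only" and "accurate for both recent and ancient history" concrete, here is a sketch comparing two simple deterministic curves with simulation for a constant-size population. The two differential equations, $dn/dt = -n^2/2$ and $dn/dt = -n(n-1)/2$, are my own assumed stand-ins for the two kinds of approximation. They reproduce the qualitative behavior summarized above, but they are not quoted from the cited papers, and the function names are hypothetical.

```python
import math
import random


def lineages_ode_quadratic(n0, t):
    """Solution of dn/dt = -n^2/2 with n(0) = n0; tends to 0 as t grows."""
    return n0 / (1.0 + n0 * t / 2.0)


def lineages_ode_binomial(n0, t):
    """Solution of dn/dt = -n(n-1)/2 with n(0) = n0; tends to 1 as t grows."""
    return n0 / (n0 - (n0 - 1) * math.exp(-t / 2.0))


def simulate_lineages(n0, t, rng):
    """One draw of n_t under the standard constant-size coalescent."""
    n = n0
    elapsed = 0.0
    while n > 1:
        elapsed += rng.expovariate(n * (n - 1) / 2.0)
        if elapsed > t:
            break
        n -= 1
    return n


def mean_lineages(n0, t, reps=2000, seed=1):
    """Monte Carlo estimate of E[n_t]."""
    rng = random.Random(seed)
    return sum(simulate_lineages(n0, t, rng) for _ in range(reps)) / reps


if __name__ == "__main__":
    n0 = 50
    for t in (0.01, 0.1, 1.0, 10.0):
        print(t, mean_lineages(n0, t),
              lineages_ode_quadratic(n0, t), lineages_ode_binomial(n0, t))
```

At small $t$ both curves track the simulated mean. At large $t$ the true number of lineages settles at one, which the second curve respects while the first drifts toward zero.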

## Approximating $E[n_t]$ for multiple populations

Existing approaches only work for single populations of changing size and cannot account for migration between multiple populations. Ethan and Noah extend the framework for single populations to allow multiple populations with migration. The result is a system of simultaneous differential equations, one for each population. While it does not allow for analytical solutions except in very special cases, the system can be easily solved numerically for any given demographic scenario.
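Solving such a system numerically takes only a few lines. The sketch below is a guess at what a system of this kind could look like, not the equations from the paper: I assume each population loses lineages to coalescence at rate $n_i(n_i-1)/(2N_i)$ and exchanges them through backward migration rates `mig[i][j]`, and I integrate with a hand-written fourth-order Runge-Kutta step. All names, the form of the equations, and the guard against fewer than one expected lineage are illustrative assumptions.

```python
import math


def lineage_derivatives(n, sizes, mig):
    """Assumed system: coalescence within populations plus migration between them."""
    k = len(n)
    derivs = []
    for i in range(k):
        # Ad hoc guard (assumption): no coalescence once fewer than one
        # lineage is expected in population i.
        coal = -n[i] * max(n[i] - 1.0, 0.0) / (2.0 * sizes[i])
        flow_in = sum(mig[j][i] * n[j] for j in range(k) if j != i)
        flow_out = sum(mig[i][j] for j in range(k) if j != i) * n[i]
        derivs.append(coal + flow_in - flow_out)
    return derivs


def rk4_step(n, dt, sizes, mig):
    """Advance the system by one classical Runge-Kutta step of size dt."""
    def shifted(base, slope, h):
        return [b + h * s for b, s in zip(base, slope)]

    k1 = lineage_derivatives(n, sizes, mig)
    k2 = lineage_derivatives(shifted(n, k1, dt / 2.0), sizes, mig)
    k3 = lineage_derivatives(shifted(n, k2, dt / 2.0), sizes, mig)
    k4 = lineage_derivatives(shifted(n, k3, dt), sizes, mig)
    return [x + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
            for x, a, b, c, d in zip(n, k1, k2, k3, k4)]


def expected_lineages(n0, t, sizes, mig, steps=2000):
    """Approximate the expected lineage count in each population at time t."""
    n = [float(x) for x in n0]
    dt = t / steps
    for _ in range(steps):
        n = rk4_step(n, dt, sizes, mig)
    return n


def single_population_closed_form(n0, t):
    """Closed form of dn/dt = -n(n-1)/2, used only as a sanity check."""
    return n0 / (n0 - (n0 - 1) * math.exp(-t / 2.0))


if __name__ == "__main__":
    mig = [[0.0, 0.3], [0.3, 0.0]]  # hypothetical symmetric migration rates
    print(expected_lineages([20, 20], 0.5, [1.0, 1.0], mig))
    print(single_population_closed_form(20, 0.5))
```

With two identical populations and symmetric migration, the flows cancel and each population should follow the single-population curve, which gives an easy check that the integrator behaves. Asymmetric sizes, sample sizes, or migration rates have no such shortcut, which is where a numerical solver earns its keep.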

## Significance of this work

The extension of the coalescent framework to multiple populations with migration is an important result for demographic inference. The extended framework with multiple populations allows efficient computation of demographically informative quantities such as the expected number of private alleles in a sample and divergence times between populations.

Ethan and Noah describe a general procedure that can be used to approximate coalescent formulas that involve summing over distributions conditioned on $n_t$ or the trajectory of $n_t$ over time. This procedure is particularly accurate for studying very recent or very ancient genealogical history.

The analysis of existing approximations to $E[n_t]$ shows that different approximations have different asymptotic behavior and computational complexities. The choice of which approximation to use is therefore often a tradeoff between the computational complexity of the approximation and the likely behavior of the approximation in the parameter ranges of interest.

## Future Directions

As increasingly large genomic samples from populations with complex demographic histories become available for study, exact methods become either intractable or very slow. This work adds to a growing set of approximations to the coalescent and its extensions, joining other methods such as conditional sampling distributions and the sequentially Markov coalescent. Ethan and Noah are already exploring applications of these approximate methods to reconciling gene trees with species trees. In the future, I expect that these and other approximations will be important for fast and accurate analysis of large genomic datasets.

## References

[1] Jewett, E. M., & Rosenberg, N. A. (2014). Theory and applications of a deterministic approximation to the coalescent model. Theoretical Population Biology.

[2] Griffiths, R. C. (1984). Asymptotic line-of-descent distributions. Journal of Mathematical Biology, 21(1), 67-75.

[3] Frost, S. D., & Volz, E. M. (2010). Viral phylodynamics and the search for an ‘effective number of infections’. Philosophical Transactions of the Royal Society B: Biological Sciences, 365(1548), 1879-1890.

[4] Maruvka, Y. E., Shnerb, N. M., Bar-Yam, Y., & Wakeley, J. (2011). Recovering population parameters from a single gene genealogy: an unbiased estimator of the growth rate. Molecular Biology and Evolution, 28(5), 1617-1631.

[5] Slatkin, M., & Rannala, B. (1997). Estimating the age of alleles by use of intraallelic variability. American Journal of Human Genetics, 60(2), 447.

[6] Slatkin, M. (2000). Allele age and a test for selection on rare alleles. Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences, 355(1403), 1663-1668.

[7] Volz, E. M., Pond, S. L. K., Ward, M. J., Brown, A. J. L., & Frost, S. D. (2009). Phylodynamics of infectious disease epidemics. Genetics, 183(4), 1421-1430.

Paper author Ethan Jewett is a PhD student in the lab of Noah Rosenberg.

# Missing the forest for the trees: How frequent adaptation can confound its own inference

Blog author: Fabian Staubach was a postdoc in Dmitri Petrov’s lab and is now an assistant professor in Freiburg, Germany.

This post was written by Fabian Staubach.

The neutral theory of molecular evolution assumes that adaptation is rare and that the effect of adaptation on linked variation, the so-called hitchhiking effect, typically has little influence on the dynamics of molecular genetic variation. As a result, it is widely assumed that in most natural populations, hitchhiking can be neglected, or at least reasonably well approximated by the introduction of effective parameters, such as an effective population size. But if molecular adaptation is in fact common, then the assumption may be violated, and we should worry whether population genetic methods and estimates of evolutionary parameters obtained from them are robust to frequent hitchhiking.

In their paper “Frequent adaptation and the McDonald-Kreitman test” (PNAS, 2013), Philipp Messer and Dmitri Petrov investigate this question for one of the key population genetic methods — the McDonald-Kreitman (MK) test. This test forms the basis of most commonly used approaches to measure the rate of adaptation from population genomic data and has been used to argue that in some organisms, such as Drosophila, the rate of adaptation is surprisingly high.

## The MK test can substantially underestimate the true rate of adaptation

Messer and Petrov employ their powerful forward simulation software, SLiM (see here), to simulate the evolution of entire chromosomes under a range of parameter values relevant to humans and other organisms, and apply various forms of the MK test to the population genomic data resulting from their simulations. They then study how accurately these methods re-infer the true evolutionary parameters in the simulations. Strikingly, they find that the MK test can substantially underestimate the true rate of adaptation. For instance, they present scenarios where 40% of the amino acid changing substitutions were in fact strongly adaptive in the simulations, while other population parameters resembled those commonly inferred for human evolution, yet the standard MK estimates suggest that none of these substitutions were actually adaptive. Fortunately, Messer and Petrov propose a way to avoid these problems by using a simple, asymptotic extension of the MK test.

Figure: Illustration of the asymptotic MK estimation of the rate of adaptive substitutions. The standard MK approach assumes that all polymorphisms (non-synonymous and synonymous) are neutral. This assumption is likely violated for low frequency polymorphisms, as some of these are likely to be deleterious. The assumption should hold for very high frequency polymorphisms, because they are very unlikely to be deleterious. The asymptotic MK approach uses this fact by looking at the estimated rate from different frequency classes of alleles, and extrapolating to x=1, where the rate is expected to have reached its asymptote.
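The extrapolation described in the caption can be sketched in a few lines. The sketch assumes the usual MK estimate of the fraction of adaptive substitutions, $\alpha = 1 - (D_s/D_n)(P_n/P_s)$, computed separately for each derived allele frequency class $x$, and then fits an exponential curve $\alpha(x) = a + b\,e^{-cx}$ that is evaluated at $x = 1$. The function names, the simple grid search over $c$, and the toy numbers are my own illustrative choices, not the authors' implementation.

```python
import math


def standard_alpha(d_n, d_s, p_n, p_s):
    """Standard MK estimate: 1 - (Ds/Dn) * (Pn/Ps)."""
    return 1.0 - (d_s / d_n) * (p_n / p_s)


def alpha_by_frequency(d_n, d_s, pn_by_x, ps_by_x):
    """MK estimate computed separately for each frequency class."""
    return [standard_alpha(d_n, d_s, pn, ps) for pn, ps in zip(pn_by_x, ps_by_x)]


def fit_exponential(xs, alphas, c_grid=None):
    """Least-squares fit of alpha(x) = a + b * exp(-c * x).

    For each candidate c on a grid, a and b follow from ordinary linear
    regression of alpha on exp(-c * x); the best c minimizes squared error.
    """
    if c_grid is None:
        c_grid = [0.1 * k for k in range(1, 501)]
    m = len(xs)
    y_mean = sum(alphas) / m
    best = None
    for c in c_grid:
        z = [math.exp(-c * x) for x in xs]
        z_mean = sum(z) / m
        szz = sum((zi - z_mean) ** 2 for zi in z)
        if szz == 0.0:
            continue
        b = sum((zi - z_mean) * (yi - y_mean) for zi, yi in zip(z, alphas)) / szz
        a = y_mean - b * z_mean
        sse = sum((a + b * zi - yi) ** 2 for zi, yi in zip(z, alphas))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    if best is None:
        raise ValueError("could not fit: need at least two distinct frequencies")
    return best[1], best[2], best[3]


def asymptotic_alpha(xs, alphas):
    """Extrapolate the fitted curve to x = 1."""
    a, b, c = fit_exponential(xs, alphas)
    return a + b * math.exp(-c)


if __name__ == "__main__":
    # Made-up counts: nonsynonymous polymorphisms are enriched at low frequency.
    xs = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]
    pn = [90, 40, 25, 18, 14, 12, 11, 10, 10, 10]
    ps = [100, 60, 45, 36, 30, 27, 25, 24, 23, 23]
    alphas = alpha_by_frequency(d_n=300, d_s=400, pn_by_x=pn, ps_by_x=ps)
    print(asymptotic_alpha(xs, alphas))
```

Because slightly deleterious polymorphisms pile up at low frequencies, the per-class estimates start low and rise with $x$. Pooling all classes, as the standard test does, averages over that rise, while the fitted asymptote does not.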

The bigger claim of this straightforward and easy-to-read paper is that the effects of linked selection cannot be simply swept under the rug by introducing effective parameters, such as effective population size or effective strength of selection, and then using these effective parameters in formulae derived from the diffusion approximation under the assumption of free recombination.

## Quantifying known biases

Surely, this paper will ruffle some feathers. Some people will argue that these problems have been known for a while in theory. Yet despite this, the vast majority of studies that continue to appear in the literature still pay only cursory lip service, if anything, to these issues. Presumably, this is because it is not well understood analytically to what extent linkage effects affect population genetic estimates, and Messer and Petrov therefore do an important job in quantifying these biases. Hopefully this will help focus the community’s attention on figuring out how to modify commonly used approaches to place them on a more solid foundation.

Citation: Messer, P. W., & Petrov, D. A. (2013). Frequent adaptation and the McDonald-Kreitman test. Proceedings of the National Academy of Sciences of the United States of America, 110(21), 8615–20. doi:10.1073/pnas.1220835110

Paper author: Philipp Messer is a research associate in Dmitri Petrov’s lab at Stanford, where he studies the population genetics of adaptation using theoretical and computational approaches in concert with the analysis of large-scale population genomic data.