What’s Sardinia got to do with it? Ancient and modern genomes shed light on the genetic structure of Europe.

Blog author Yuan Zhu, formerly a PhD student in the Petrov lab, is now a Research Fellow at the Genome Institute of Singapore.

The Neolithic Revolution is the oldest documented agricultural revolution in human history. More than just the domestication of certain crops and animals, it describes a critical time in human history when hunter-gatherer groups transitioned into sedentary farming communities. This drastic change in lifestyle led to a major shift in living conditions and cultural practices, setting up the necessary prerequisites to support the kind of population density eventually possible in modern society.

In Central Europe, the Neolithic Revolution is thought to have taken place around 8,000-4,000 BC. Historians have long wondered how farming was introduced and how it spread across the continent. Was the new practice brought in as novel ideas that local communities adopted? Did new immigrants bring their lifestyle with them, possibly outcompeting existing hunter-gatherers and eventually displacing them altogether? Was it perhaps even more complicated? What happened after?

What Ötzi can tell us

Ancient human remains from around the time of the revolution can yield some insight. Ötzi the Tyrolean Iceman, a 5,300-year-old natural mummy found frozen in the Alps on the border of Italy and Austria, was recently shown (by a group that included CEHG researchers Martin Sikora and Carlos Bustamante) to belong to a Y-chromosome lineage mostly found in contemporary Sardinia [1]. This was surprising information. The Iceman’s life was spent in a narrow range within 60 km of his site of discovery [2]. He was unequivocally local, and clearly a farmer. Yet his lineage has since disappeared from Central Europe, suggesting that demographic scenarios were more complex than expected, and that at some point this Sardinian-like ancestry may have spanned Neolithic Europe.

(A) The locations of the discovery sites of the ancient individuals studied, with hunter-gatherers (HG) represented as circles and farming (F) individuals represented as squares. (B) ADMIXTURE results for modern populations in the left panel, and inferred genetic composition of ancient individuals in the right panel. [Adapted from Figure 1, Sikora et al. 2014.]

Sardinia: a genetic snapshot of the Neolithic?

In a recent paper published in PLOS Genetics, Sikora and colleagues sought to address this hypothesis by making full use of recent advancements in the sequencing of nuclear ancient DNA [3]. However, the Iceman alone was not sufficient to represent a continent. Ancient DNA sequences from six individuals from across Europe, including both farmer and hunter-gatherer individuals, were analyzed by the authors in order to paint a clearer picture of the demographics of Neolithic Europe. Two of the farmers were found in Bulgaria and were previously sequenced using an ancient DNA capture method developed by Sikora’s colleague in the Bustamante lab, Meredith Carpenter [4, and see blog post here]. In addition, Sikora made use of contemporary population SNP data, including sequence data from over 400 modern Sardinians, to provide a solid reference from which to estimate the true genetic affiliation of these ancient humans.

Some of the most interesting results from the analysis came from contrasts between the farmers (Iceman, gok4, and P192-1), the hunter-gatherers (ajv7 and brana1), and modern-day European populations. When the authors applied the clustering algorithm ADMIXTURE to the data, they found that the farmer individuals shared a substantial portion of their ancestry with modern Sardinians (Southern Europe), a characteristic largely absent in the HG individuals, who showed mainly Northern European (Basque) and Russian affiliated ancestry. Principal component analysis (PCA) and a statistical test called the D-test agreed with high confidence: hunter-gatherers looked more Northern European, whereas farmers looked more Sardinian than any other European group tested. TreeMix, a program that models population splits while allowing for admixture between branches, gave a similar answer when applied to the data from 1000 Genomes and the modern Sardinians, and further suggested a possible admixture scenario involving at least three major events, all of which fall neatly in line with previous work.
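To make the D-test a little more concrete, here is a minimal sketch of the ABBA-BABA count that underlies it. This is not the authors' pipeline: the function name, the toy sequences, and the layout of one haploid allele call per site are assumptions made for illustration. A real analysis would also estimate a standard error (typically with a block jackknife) before calling any value significant.

```python
def d_statistic(p1, p2, p3, outgroup):
    """Return (D, n_abba, n_baba) for four aligned haploid sequences.

    Hypothetical layout: one allele call per site per sample. The outgroup
    allele is treated as ancestral ("A"), the other allele as derived ("B").
    """
    n_abba = n_baba = 0
    for a, b, c, o in zip(p1, p2, p3, outgroup):
        if len({a, b, c, o}) != 2:
            continue  # keep only strictly biallelic sites
        if a == o and b == c:
            n_abba += 1  # ABBA: P2 and P3 share the derived allele
        elif b == o and a == c:
            n_baba += 1  # BABA: P1 and P3 share the derived allele
    informative = n_abba + n_baba
    if informative == 0:
        raise ValueError("no ABBA or BABA sites found")
    return (n_abba - n_baba) / informative, n_abba, n_baba


# Toy data (invented): P1 and P2 are two modern populations, P3 an ancient sample.
d, n_abba, n_baba = d_statistic(
    p1="AAGAAA",
    p2="GGAAAG",
    p3="GGGAGG",
    outgroup="AAAAAA",
)
print(d, n_abba, n_baba)  # prints 0.5 3 1
```

A positive D means the third sample shares more derived alleles with P2 than with P1. In the setting of this post, one could imagine P2 being Sardinians, P1 another European population, and P3 an ancient farmer, though that assignment is only an example.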

Taken together, the data support the authors’ original hypothesis—Sardinian-like ancestry was probably once common in Neolithic Europe. The Iceman, gok4, and P192-1 were discovered in very different locations, and P192-1 in particular was 2,000 years younger than the others, making it even more unlikely that all three were recurrent immigrants from Sardinia (which was thought to be uninhabited by hunter-gatherers prior to the Neolithic), and further suggesting that the lineage may have persisted for a while on the continent. In fact, Sikora and colleagues propose that Sardinia is a “modern-day ‘snapshot’ of the genetic structure of the people associated with the spread of agriculture in Europe.”

A proposed, highly simplified version of recent European demographic history. (A) Early hunter-gatherers (closest to modern-day Russian/Basque populations) were (B) heavily influenced by an influx of farmers, (C) whose ancestry spread across all of Europe and into Sardinia and (D) was subsequently maintained only in Sardinia due to genetic isolation. [Adapted from Figure 4, Sikora et al. 2014.]

Bridging the past and the future with ancient DNA

From here, the story is far from over. In fact, it only gets more complicated, and more work remains to be done. While a simplified model was proposed, the authors note that multiple sources of evidence suggest a far more complex and nuanced recent demographic history for Europe that we have yet to untangle. There are issues with ancient DNA sequences, such as characteristic DNA damage patterns, that are unique to the nature of the data. Potential issues with current methods being unable to handle such underlying patterns forced the authors to analyze every ancient DNA sample against modern populations individually. As with every advance in sequencing technology, with ancient DNA sequencing getting more accurate and accessible, new analytical methods must be developed to take full advantage of the data.

References

[1] Keller A, Graefen A, Ball M, Matzas M, Boisguerin V, et al. (2012) New insights into the Tyrolean Iceman’s origin and phenotype as inferred by whole-genome sequencing. Nature Communications 3: 698.
[2] Müller W, Fricke H, Halliday AN, McCulloch MT, Wartho J-A (2003) Origin and Migration of the Alpine Iceman. Science 302: 862–866. doi: 10.1126/science.1089837
[3] Sikora M, Carpenter ML, Moreno-Estrada A, Henn BM, Underhill PA, et al. (2014) Population Genomic Analysis of Ancient and Modern Genomes Yields New Insights into the Genetic Ancestry of the Tyrolean Iceman and the Genetic Structure of Europe. PLOS Genetics. doi: 10.1371/journal.pgen.1004353
[4] Carpenter ML, Buenrostro JD, Valdiosera C, Schroeder H, Allentoft ME, Sikora M, Rasmussen M, et al. (2013) Pulling out the 1%: Whole-Genome Capture for the Targeted Enrichment of Ancient DNA Sequencing Libraries. American Journal of Human Genetics 93: 852–864. doi: 10.1016/j.ajhg.2013.10.002

Paper author: Martin Sikora was a postdoc in Carlos Bustamante's lab. He is now a group leader at the Centre for GeoGenetics in Copenhagen, Denmark.

Demographic inference from genomic data in nonmodel insect populations

Blog author Martin Sikora is a postdoc in the lab of Carlos Bustamante.

Reconstructing the demographic history of species and populations is one of the major goals of evolutionary genetics. Inferring the timing and magnitude of past events in the history of a population is not only of interest in its own right, but is also needed to build realistic null models for the expected patterns of neutral genetic variation in present-day natural populations. A variety of methods exist that allow the inference of these parameters from genomic data, which, in the absence of detailed historical records in most situations, is often the only feasible way to obtain them. As a consequence, it is generally not possible to empirically validate the parameters inferred from genomic data in a direct comparison with a known “truth” from a natural population. Furthermore, until recently, the application of these methods was limited to model organisms with well-developed genomic resources (e.g., humans and fruit flies), excluding a large number of non-model organisms of potentially considerable evolutionary and ecological interest.

Chasing butterflies?

In an elegant study recently published in the journal Molecular Ecology, Rajiv McCoy, a graduate student with Dmitri Petrov and Carol Boggs, and colleagues tackle both of these problems in natural populations of Euphydryas gillettii, a species of butterfly native to the northern Rocky Mountains. About 30 years ago, a small founder population of this species from Wyoming was intentionally introduced to a new habitat at the Rocky Mountain Biological Laboratory field site in Colorado, and population sizes have been recorded every year since the introduction. The beauty of this system is that it allows the authors to directly compare the known demography (i.e., a recent split from the parental population and a bottleneck ~30 generations ago, with census data in the newly introduced population) with estimates inferred from genomic data.

Gillette’s Checkerspot (Euphydryas gillettii). Photo taken by Carol Boggs, co-advisor of Rajiv and one of the senior authors of the study.

A genomic dataset from a non-model organism

The researchers sampled eight larvae from each of the parental and derived populations for this study. In the world of model organisms, the next steps for constructing the dataset would be straightforward: extract genomic DNA, sequence to the desired depth, map to the reference genome, and finally call SNPs. In the case of E. gillettii, however, no reference genome is available, so the authors had to use a different strategy. They used RNA sequencing to first build a reference transcriptome, which then served as the reference sequence to map reads against and discover single nucleotide variants. An additional advantage of this approach is that the data generated can potentially also be used for other types of research questions, such as analyses of gene expression differences between the populations. On the downside, SNP calling from a transcriptome without a reference genome is challenging and can lead to false positives, for example when reads from lowly expressed paralogs erroneously map to the highly expressed copy present in the assembled transcriptome. The authors therefore went to great lengths to stringently filter these false positive variants from their dataset.
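The post does not spell out which filters McCoy and colleagues applied, so the following is only an illustration of one common heuristic for the paralog problem, not their method. When reads from two gene copies collapse onto one contig, fixed differences between the copies tend to show up as heterozygous calls in nearly every individual. The function name, the genotype encoding, and both thresholds below are assumptions.

```python
def excess_het_sites(genotypes_by_site, max_het_fraction=0.75, min_called=4):
    """Flag sites where an implausibly large share of individuals is heterozygous.

    genotypes_by_site: dict mapping a site label to a list of genotype strings
    ("0/0", "0/1", "1/1", or "./." for a missing call). Hypothetical format.
    """
    flagged = []
    for site, genotypes in genotypes_by_site.items():
        called = [g for g in genotypes if g != "./."]
        if len(called) < min_called:
            continue  # too few calls to judge this site
        n_het = sum(1 for g in called if g in ("0/1", "1/0"))
        if n_het / len(called) > max_het_fraction:
            flagged.append(site)
    return flagged


# Invented example calls for eight individuals at three sites.
calls = {
    "contig1:100": ["0/1"] * 8,  # every individual heterozygous
    "contig1:250": ["0/0", "0/1", "1/1", "0/0", "0/1", "0/0", "0/0", "0/0"],
    "contig2:40": ["0/1", "0/1", "./.", "./.", "./.", "./.", "./.", "./."],
}
print(excess_het_sites(calls))  # prints ['contig1:100']
```

Under random mating, at most about half of the individuals are expected to be heterozygous at a true biallelic SNP, which is why a site where everyone is heterozygous looks suspicious.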

Demographic inference using δαδι

For the demographic inference, McCoy and colleagues used δαδι (diffusion approximation for demographic inference), a method developed by Ryan Gutenkunst while he was a postdoc in the group of CEHG faculty member Carlos Bustamante. This method uses a diffusion approximation to calculate the expected allele frequency spectrum under a demographic model of interest. The demographic parameters are then optimized so that the expected spectrum matches the observed one as closely as possible, that is, to maximize the likelihood of the data. δαδι has been widely used to infer the demographic history of a number of species, from humans to domesticated rice, and is particularly suited to large-scale genomic datasets due to its flexibility and computational efficiency.
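The fitting step can be sketched in a few lines. The example below does not use δαδι or its API. It swaps the diffusion solution for the textbook expectation under a constant-size neutral model, where the expected number of SNPs with i derived copies is theta / i, and it fits the single parameter theta with a grid search instead of a numerical optimizer. All names and the toy counts are assumptions.

```python
import math


def expected_sfs_neutral(theta, n):
    """Expected unfolded spectrum for n chromosomes under constant size: theta / i."""
    return [theta / i for i in range(1, n)]


def poisson_log_likelihood(observed, expected):
    """Log-likelihood treating each frequency class as an independent Poisson count."""
    return sum(
        -e + o * math.log(e) - math.lgamma(o + 1)
        for o, e in zip(observed, expected)
    )


def fit_theta(observed, n, grid):
    """Return the grid value of theta that maximizes the likelihood of the data."""
    return max(
        grid,
        key=lambda theta: poisson_log_likelihood(
            observed, expected_sfs_neutral(theta, n)
        ),
    )


# Invented spectrum for n = 5 chromosomes: SNP counts with 1, 2, 3, 4 derived copies.
observed = [100, 50, 33, 25]
print(fit_theta(observed, n=5, grid=range(50, 151)))  # prints 100
```

A real demographic model has several parameters (split time, population sizes, migration rates) and a joint spectrum across populations, but the logic is the same: compute the expected spectrum for a candidate parameter set, score it against the data, and keep the best-scoring set.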

Excerpt of Figure 2 from McCoy et al., illustrating the demographic models tested using δαδι.

Models vs History

The authors then fit a demographic model reflecting the known population history of E. gillettii, as illustrated in Figure 2 of their article (Model A). Encouragingly, they found that the model provided a very good fit to the data, with an estimated split time between 40 and 47 generations ago, which is close to the known establishment of the Colorado population 33 generations ago. They also tested how robust these results were to a misspecified demographic model by incorporating migration between the Colorado and Wyoming populations (which in reality are isolated from each other). Neither of the alternative models with migration (Models B1 and B2) significantly improved the fit, again nicely consistent with the known population history.
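One standard way to ask whether extra parameters "significantly improve the fit" of a nested model is a likelihood ratio test. The post does not say which test the authors used, so the sketch below is a generic illustration with made-up log-likelihood values. The function name is an assumption, and the closed-form p-values only cover one or two extra parameters.

```python
import math


def lrt_p_value(loglik_simple, loglik_complex, extra_params):
    """P-value of a likelihood ratio test between two nested models.

    The statistic 2 * (logL_complex - logL_simple) is compared to a chi-square
    distribution with `extra_params` degrees of freedom (1 or 2 supported here).
    """
    stat = 2.0 * (loglik_complex - loglik_simple)
    if stat <= 0:
        return 1.0  # the richer model fits no better
    if extra_params == 1:
        return math.erfc(math.sqrt(stat / 2.0))  # chi-square survival, 1 d.f.
    if extra_params == 2:
        return math.exp(-stat / 2.0)  # chi-square survival, 2 d.f.
    raise ValueError("only 1 or 2 extra parameters are handled in this sketch")


# Hypothetical values: a no-migration model versus one with a migration rate added.
p = lrt_p_value(loglik_simple=-1000.0, loglik_complex=-999.5, extra_params=1)
print(round(p, 4))  # prints 0.3173
```

With these invented numbers, adding migration buys half a log-likelihood unit, which is far from significant. Two caveats apply in practice: likelihoods computed from linked SNPs are composite likelihoods, so the chi-square approximation usually needs an adjustment, and a migration rate of zero sits on the boundary of the parameter space.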

Three butterflies is enough?

Finally, the researchers also tested the robustness of the results to variations in the number of samples or SNPs used in the analysis, using datasets simulated under the best-fit Model A. They found that δαδι performed remarkably well even with sample sizes as low as three individuals per population. While this is in principle good news for researchers limited by a low number of available samples, one has to be aware that this result will be to a certain extent specific to this particular type of system, where one population undergoes a very strong bottleneck with large effects on the allele frequency spectrum. A good strategy suggested by McCoy and colleagues is to use these types of simulations in the planning stages of an experiment, in order to inform researchers of the number of samples and markers necessary to confidently estimate the demographic parameters of interest.
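As a rough illustration of this kind of planning simulation, the sketch below repeatedly simulates a founded population under Wright-Fisher drift, samples a chosen number of individuals from the source and the founded population, and records how a simple summary varies across replicates. It is a deliberately simplified stand-in: the summary is a heterozygosity ratio rather than a δαδι fit, and the population size, number of generations, SNP count, and function names are all assumptions.

```python
import random
import statistics


def binomial_draw(n, p, rng):
    """Number of successes in n Bernoulli(p) trials (stdlib-only helper)."""
    return sum(rng.random() < p for _ in range(n))


def drift(freq, pop_size, generations, rng):
    """Wright-Fisher drift of one allele frequency in a diploid population."""
    copies = 2 * pop_size
    for _ in range(generations):
        freq = binomial_draw(copies, freq, rng) / copies
        if freq in (0.0, 1.0):
            break  # allele lost or fixed
    return freq


def sample_heterozygosity(freq, n_individuals, rng):
    """Unbiased estimate of 2p(1-p) from 2 * n_individuals sampled chromosomes."""
    m = 2 * n_individuals
    p_hat = binomial_draw(m, freq, rng) / m
    return 2.0 * p_hat * (1.0 - p_hat) * m / (m - 1)


def simulate_het_ratio(n_individuals, pop_size=20, generations=10,
                       n_snps=200, replicates=20, seed=1):
    """Replicate estimates of H_founded / H_source for one sample size."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(replicates):
        h_source = h_founded = 0.0
        for _ in range(n_snps):
            p0 = rng.uniform(0.1, 0.9)  # assumed source-population frequency
            p_t = drift(p0, pop_size, generations, rng)
            h_source += sample_heterozygosity(p0, n_individuals, rng)
            h_founded += sample_heterozygosity(p_t, n_individuals, rng)
        estimates.append(h_founded / h_source)
    return estimates


# Compare the spread of the estimate for a few candidate sample sizes.
for n in (3, 8, 20):
    est = simulate_het_ratio(n)
    print(n, round(statistics.mean(est), 3), round(statistics.stdev(est), 3))
```

Under this model the expected ratio is (1 - 1/(2N))^t for a founded population of N diploids after t generations, so the replicate means can be checked against that value and the replicate standard deviations show how much precision each sample size buys.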

Conclusions and future directions

For me, this study is a great example of how next-generation sequencing and sophisticated statistical modeling can open up a new world of possibilities to researchers interested in the ecology and evolution of natural populations. McCoy and colleagues constructed their genomic dataset essentially from scratch, without the “luxuries” of a reference genome or database of known polymorphisms. Moving forward, Rajiv has been busy collecting more samples over the past year. He and his colleagues plan to sequence over a thousand of them for the next phase of the project, as well as assemble a reference genome for E. gillettii, an important next step in the development of genomic tools for this fascinating ecological system.

Paper author Rajiv McCoy sampling larvae of Euphydryas gillettii.

McCoy, R. C., Garud, N. R., Kelley, J. L., Boggs, C. L. and Petrov, D. A. (2013), Genomic inference accurately predicts the timing and severity of a recent bottleneck in a nonmodel insect population. Molecular Ecology. doi: 10.1111/mec.12591