# A Snapshot of #bigdatamed tweets and Facebook Posts

As you can see from the curated Twitter and Facebook feed included at the link above, Stanford’s Center for Computational, Evolutionary and Human Genomics (CEHG) had a strong presence at this year’s Big Data Conference. In addition to the numerous CEHG community members in the audience (and livestreaming the event), we also had CEHG faculty members Russ Altman, Euan Ashley, Carlos Bustamante, Mildred Cho, Hank Greely, Susan Holmes, and Julia Salzman on the stage, serving as session moderators and featured speakers

# Afterword: CEHG Genetics and Society Symposium 2015

Founded in 2012, CEHG is a research program that fosters interdisciplinary research. Home to more than 25 faculty and more than 200 grads and postdocs, CEHG bridges the divides between various member labs across Stanford campus.

The 2015 CEHG Genetics and Society Symposium (GSS15), which took place on April 13th and 14th in Stanford’s Paul Brest Hall, was a smashing success. It featured 25 speakers from Stanford campus and the San Francisco Bay academic and scientific industry communities. Approximately 175 Stanford affiliates and non-affiliates came together to celebrate the Center’s spirit of interdisciplinary collaboration and meet with experts in the fields of computational, evolutionary and human genomics This is a significant increase from last year’s 150 attendees!

The Mission:

The Genetics and Society Symposium is integral to CEHG’s mission: it provides postdocs and graduate fellows with the opportunity to share their developing research with faculty advisors and their colleagues, encourages conversation between faculty working in diverse scientific disciplines across campus, and introduces CEHG members to speakers from around the Bay Area and beyond (and vice versa).

The Venue:

As you can see from our photos of the space and catering service, Paul Brest Hall was the perfect home for this year’s two-day symposium. The hall was spacious, the food delicious, the staff hands on, and the outdoor picnic area well suited for our lunch and coffee breaks. We enjoyed the venue so much, in fact, that CEHG staff are currently in the process of booking the space for next year!

The Speakers:

GSS15 featured four brilliant keynote speakers, each distinguished in his/her field of research.

Gene Myers and CEHG Exec Committee members Marc Feldman, Chiara Sabatti, and Carlos Bustamante

Founding director of a new Systems Biology Center at the Max-Planck Institute of Molecular Cell Biology and Genetics, Dr. Eugene (Gene) Myers presented his open-sourced research on the resurrection of de novo DNA sequencing. Best known for the development of BLAST, the most widely used tool in bioinformatics and the assembler he developed at Celera that delivered the fly, human, and mouse genomes in a three-year period, Dr. Myers participated in GSS15, courtesy of DNAnexus. Follow his blog: https://github.com/thegenemyers.

Co-founding director Carlos Bustamante and Ed Green catch up during a break at GSS15.

Assistant Professor in Biomolecular Engineering at the University of California, Santa Cruz, Richard (Ed) Green presented his research on a novel approach for highly contiguous genome assemblies, which draws on his work as an NSF Fellow at the Max Planck Institute in Leipzig, Germany and head of an analysis consortium responsible for publishing the draft genome sequence of Neanderthal. Click here for his 2014 CARTA talk, “The Genetics of Humanness: The Neanderthal and Denisovan Genomes.

Dr. Michelle Mello, Stanford Law School and School of Medicine

Michelle Mello, Professor of Law at Stanford Law School and Professor of Health Research and Policy in Stanford’s School of Medicine, presented findings from her extensive research on the ethics of data sharing. As the author of more than 140 articles and book chapters on the medical malpractice system, medical errors and patient safety, public health law, research ethics, the obesity epidemic, and pharmaceuticals, Dr. Mello provided a valuable perspective from the intersections of law, ethics, and health policy. Click here to read Dr. Mello’s SLS profile.

Dr. Ami Bhatt, Stanford Medicine

Ami Bhatt shared her passion for improving outcomes for patients with hematological malignancies in her talk, “Bugs, drugs, and cancer.” Best known for her recent work demonstrating the discovery of a novel bacterium using sequence-based analysis of a diseased human tissue, her research has been presented nationally and internationally and published in 2013 in the New England Journal of Medicine. Click here for links to Dr. Bhatt’s CAP profile and lab homepage.

We had a large group of CEHG faculty members at this year’s event, showcasing the cutting edge research being done in CEHG labs across Stanford campus and indicating considerable faculty commitment to ensuring the Center’s continuing success.

Our symposium would not be complete without our invited CEHG Fellows. These speakers were nominated by organizing committee members to present on topics relating to their CEHG-funded research projects. These young scholars embody CEHG’s continuing commitment to provide funding support to researchers as they transition from graduate studies to postdoctoral scholarships.

The Workshop:

There was standing room only when facilitators Chiara Sabatti (Associate Professor of Health Research and Policy at Stanford), Ken Lange (Chair of the Human Genetics Department at UCLA), and Suyash Shringarpure (postdoctoral scholar in Stanford’s Bustamante Lab) presented their approaches to contemporary problems in statistical genetics!

Social Media:

Did you know? CEHG is on social media!

GSS15 social media moderators, Bridget Algee-Hewitt, Jeremy Hsu, Katie Kanagawa, and Rajiv McCoy were posting live throughout both days of the event. And our efforts to reach the larger community paid off, with a total reach of 815 on Facebook and more than 7,000 impressions on Twitter!

To catch up on our GSS15 coverage, check out our Facebook page at https://www.facebook.com/StanfordCEHG?ref=hl and our Twitter feed @StanfordCEHG. Follow both to make sure you are the first to know when we post CEHG-related news and announcements.

Want to know when speaker videos from the symposium will be available on CEHG’s forthcoming youtube channel? Follow us on Facebook and Twitter!

Special Thanks:

From left to right: Bridget Algee-Hewitt, Cody Sam, Yang Li, Anand Bhaskar, and Katie Kanagawa

The GSS15 organizing committee—including Bridget Algee-Hewitt, Anand Bhaskar, Katie Kanagawa, Yang Li, and Cody Sam—would like to take this opportunity to thank CEHG Directors Carlos Bustamante and Marc Feldman, Executive Committee members Hank Greely, Dmitri Petrov, Noah Rosenberg, and Chiara Sabatti, event volunteers Alex Adams, Maude David, and Chris Gignoux, event photographer Deneb Semprum, and everyone who attended this year’s symposium.

We hope you enjoyed attending as much as we enjoyed working behind-the-scenes. We hope to see you all again at GSS16! If you are interested in volunteering for future CEHG events, please contact us at stanfordcehg@stanford.edu.

Upcoming CEHG events:

Don’t miss our popular weekly Evolgenome seminar series, which will continue through Spring term, usually on Wednesdays at noon (location varies). Lunch is always provided. Details will follow, but here is a quick overview so you can mark your calendars!

April 29: Fernando Racimo (Nielsen/Slatkin Lab)
May 6: Pleuni Pennings (UCSF)
May 20: Kelly Harkin
June 3: Sandeep Ventakaram (Petrov Lab)
June 10: Emilia Huerta-Sanchez

# What’s Sardinia got to do with it? Ancient and modern genomes shed light on the genetic structure of Europe.

Blog author Yuan Zhu, formerly a PhD student in the Petrov lab, is now a Research Fellow at the Genome Institute of Singapore.

The Neolithic Revolution is the oldest documented agricultural revolution in human history. More than just the domestication of certain crops and animals, it describes a critical time in human history when hunter-gatherer groups transitioned into sedentary farming communities. This drastic change in lifestyle led to a major shift in living conditions and cultural practices, setting up the necessary prerequisites to support the kind of population density eventually possible in modern society.

In Central Europe, the Neolithic Revolution is thought to have taken place around 8,000-4,000 BC. Historians have long wondered about how farming was introduced and spread across the continents. Was the new practice brought in as novel ideas incorporated by local communities? Did new immigrants bring their lifestyle with them, possibly outcompeting existing hunter-gatherers and eventually displacing them all together? Was it perhaps even more complicated? What happened after?

## What Ötzi can tell us

Ancient human remains from around the time of the revolution can yield some insight. Ötzi the Tyrolean Iceman, a 5,300-year-old natural mummy found frozen in the Alps on the border of Italy and Austria, was recently shown (by a group that included CEHG researchers Martin Sikora and Carlos Bustamante) to belong to a Y-chromosome lineage mostly found in contemporary Sardinia [1]. This was surprising information. The Iceman’s life was spent in a narrow range within 60 km of his site of discovery [2]. He was unequivocally local, and clearly a farmer. Yet his lineage has since disappeared from Central Europe, suggesting that demographic scenarios were more complex than expected, and that at some point this Sardinian-like ancestry may have spanned Neolithic Europe.

A). The location of the discovery sites of ancient individuals studied, with hunter-gatherers (HG) represented as circles, and farming (F) individuals represented as squares. B). ADMIXTURE results of modern populations on the left panel, and inferred genetic composition of ancient individuals on the right. [Adapted from Figure 1, Sikora et al. 2014.]

## Sardinia: a genetic snapshot of the Neolithic?

In a recent paper published in PLOS Genetics, Sikora and colleagues sought to address this hypothesis by making full use of recent advancements in the sequencing of nuclear ancient DNA [3]. However, the Iceman alone was not sufficient to represent a continent. Ancient DNA sequences from six individuals from across Europe, including both farmer and hunter-gatherer individuals, were analyzed by the authors in order to paint a clearer picture of the demographics of Neolithic Europe. Two of the farmers were found in Bulgaria and were previously sequenced using an ancient DNA capture method developed by Sikora’s colleague in the Bustamante lab, Meredith Carpenter [4, and see blog post here]. In addition, Sikora made use of contemporary population SNP data, including sequence data from over 400 modern Sardinians, to provide a solid reference from which to estimate the true genetic affiliation of these ancient humans.

Some of the most interesting results from the analysis came from contrasts between the farmers (Iceman, gok4, and P192-1), the hunter-gatherers (ajv7 and brana1), and modern-day European populations. When the authors applied the clustering algorithm ADMIXTURE to the data, they found that the farmer individuals had significant portions of shared ancestry with modern Sardinians (Southern Europe), a characteristic largely absent in the HG individuals, who showed mainly Northern European (Basque) and Russian affiliated ancestry. Principal component analysis (PCA) and a statistic called the D-test agreed with high confidence—hunter-gatherers looked more Northern European, whereas farmers seemed more Sardinian than any other European group tested. TreeMix, a program that models population splits while allowing for admixture between branches, provided a similar answer when applied to the data from 1000 Genomes and the modern Sardinians, and further suggested a possible admixture scenario involving at least three major events, all of which falls neatly in line with previous work.

Taken together, the data support the authors’ original hypothesis—Sardinian-like ancestry was probably once common in Neolithic Europe. The Iceman, gok4, and P192-1 were discovered in very different locations, and P192-1 in particular was 2,000 years younger than the others, making it even more unlikely that all three were recurrent immigrants from Sardinia (which was thought to be uninhabited by hunter-gatherers prior to the Neolithic), and further suggesting that the lineage may have persisted for a while on the continent. In fact, Sikora and colleagues propose that Sardinia is a “modern-day ‘snapshot’ of the genetic structure of the people associated with the spread of agriculture in Europe.”

A proposed, highly simplified version of recent European demographic history. A). Early hunter-gatherers (closest to modern day Russian/Basque) were B). heavily influenced by an influx of farmers C) who spread across all of Europe and into Sardinia D). and subsequently maintained only in Sardinia due to genetic isolation. [Adapted from Figure 4, Sikora et al. 2014]

## Bridging the past and the future with ancient DNA

From here, the story is far from over. In fact, it only gets more complicated, and more work remains to be done. While a simplified model was proposed, the authors note that multiple sources of evidence suggest a far more complex and nuanced recent demographic history for Europe that we have yet to untangle. There are issues with ancient DNA sequences, such as characteristic DNA damage patterns, that are unique to the nature of the data. Potential issues with current methods being unable to handle such underlying patterns forced the authors to analyze every ancient DNA sample against modern populations individually. As with every advance in sequencing technology, with ancient DNA sequencing getting more accurate and accessible, new analytical methods must be developed to take full advantage of the data.

## References

Paper author Martin Sikora was a postdoctoral fellow in Carlos Bustamante’s lab. He is now a group leader at the Center for GeoGenetics in Copenhagen, Denmark.

# A fast and accurate coalescent approximation

Blog author Suyash Shringarpure is a postdoc in Carlos Bustamante’s lab. Suyash is interested in statistical and computational problems involved in the analysis of biological data.

The coalescent model is a powerful tool in the population geneticist’s toolbox. It traces the history of a sample back to its most recent common ancestor (MRCA) by looking at coalescence events between pairs of lineages. Starting from assumptions of random mating, selective neutrality, and constant population size, the coalescent uses a simple stochastic process that allows us to study properties of genealogies, such as the time to the MRCA and the length of the genealogy, analytically and through efficient simulation. Extensions to the coalescent allow us to incorporate effects of mutation, recombination, selection and demographic events in the coalescent model. A short introduction to the coalescent model can be found here and a longer, more detailed introduction can be read here.

However, coalescent analyses can be slow or suffer from numerical instability, especially for large samples. In a study published earlier this year in Theoretical Population Biology, CEHG fellow Ethan Jewett and CEHG professor Noah Rosenberg proposed fast and accurate approximations to general coalescent formulas and procedures for applying such approximations. Their work also examined the asymptotic behavior of existing coalescent approximations analytically and empirically.

## Computational challenges with the coalescent

For a given sample, there are many possible genealogical histories, i.e., tree topologies and branch lengths, which are consistent with the allelic states of the sample. Analyses involving the coalescent therefore often require us to condition on a specific genealogical property and then sum over all possible genealogies that display the property, weighted by the probability of the genealogy. A genealogical property that is often conditioned on is $n_t$, the number of ancestral lineages in the genealogy at a time $t$ in the past. However, computing the distribution $P(n_t)$ of $n_t$ is computationally expensive for large samples and can suffer from numerical instability.

## A general approximation procedure for formulas conditioning on $n_t$

Coalescent formulas conditioning on $n_t$ typically involve sums of the form $f(x)=\sum_{n_t} f(x|n_t) \cdot P(n_t)$

For large samples and recent times, these computations have two drawbacks:

–       The range of possible values for $n_t$ may be quite large (especially if multiple populations are being analyzed) and a summation over these values may be computationally expensive.

–       Expressions for $P(n_t)$ are susceptible to round-off errors.

Slatkin (2000) proposed an approximation to the summation in $f(x)$ by a single term $f(x|E[n_t])$. This deterministic approximation was based on the observation that $n_t$ changes almost deterministically over time, even though it is a stochastic variable in theory. Thus we can write $n_t \approx E[n_t]$. From Figure 2 in the paper (reproduced here), we can see that this approximation is quite accurate. The authors prove the asymptotic accuracy of this approximation and also prove that under regularity assumptions, $f(x|E[n_t])$ converges to $f(x)$ uniformly in the limits of $t \rightarrow 0$ and $t \rightarrow \infty$ . This is an important result since it shows that the general procedure produces a good approximation for both very recent and very ancient history of the sample. Further, the paper shows how this method can be used to approximate quantities that depend on the trajectory of $n_t$ over time, which can be used to calculate interesting quantities such as the expected number of segregating sites in a genealogy.

## Approximating $E[n_t]$ for single populations

A difficulty with using the deterministic approximation is that $E[n_t]$ often has no closed-form formula, and if one exists, it is typically not easy to compute when the sample is large.

For a single population with changing size, two deterministic approximations have previously been developed (one by Slatkin and Rannala 1997, Volz et al. 2009 and one by Frost and Volz, 2010, Maruvka et al., 2011). Using theoretical and empirical methods, the authors examine the asymptotic behavior and computational complexity of these approximations and a Gaussian approximation by Griffiths. A summary of their results is in the table below.

 Method Accuracy Griffith’s approximation Accurate for large samples and recent history. Slatkin and Rannala (1997), Volz et al. (2009) Accurate for recent history and arbitrary sample size, inaccurate for very ancient history. Frost and Volz (2010), Maruvka et al. (2011) Accurate for both recent and ancient history and for arbitrary sample size. Jewett and Rosenberg (2014) Accurate for both recent and ancient history and arbitrary sample size, and for multiple populations with migration.

## Approximating $E[n_t]$ for multiple populations

Existing approaches only work for single populations of changing size and cannot account for migration between multiple populations. Ethan and Noah extend the framework for single populations to allow multiple populations with migration. The result is a system of simultaneous differential equations, one for each population. While it does not allow for analytical solutions except in very special cases, the system can be easily solved numerically for any given demographic scenario.

## Significance of this work

The extension of the coalescent framework to multiple populations with migration is an important result for demographic inference. The extended framework with multiple populations allows efficient computation of demographically informative quantities such as the expected number of private alleles in a sample, divergence times between populations.

Ethan and Noah describe a general procedure that can be used to approximate coalescent formulas that involve summing over distributions conditioned on $n_t$ or the trajectory of $n_t$ over time. This procedure is particularly accurate for studying very recent or very ancient genealogical history.

The analysis of existing approximations to $E[n_t]$ show that different approximations have different asymptotic behavior and computational complexities. The choice of which approximation to use is therefore often a tradeoff between the computational complexity of the approximation and the likely behavior of the approximation in the parameter ranges of interest.

## Future Directions

As increasingly large genomic samples from populations with complex demographic histories become available for study, exact methods either become intractable or very slow. This work adds to a growing set of approximations to the coalescent and its extensions, joining other methods such as conditional sampling distributions and the sequentially markov coalescent. Ethan and Noah are already exploring applications of these approximate methods to reconciling gene trees with species trees. In the future, I expect that these and other approximations will be important for fast and accurate analysis of large genomic datasets.

## References

[1] Jewett, E. M., & Rosenberg, N. A. (2014). Theory and applications of a deterministic approximation to the coalescent model. Theoretical population biology.

[2] Griffiths, R. C. (1984). Asymptotic line-of-descent distributions. Journal of Mathematical Biology21(1), 67-75.

[3] Frost, S. D., & Volz, E. M. (2010). Viral phylodynamics and the search for an ‘effective number of infections’. Philosophical Transactions of the Royal Society B: Biological Sciences365(1548), 1879-1890.

[4] Maruvka, Y. E., Shnerb, N. M., Bar-Yam, Y., & Wakeley, J. (2011). Recovering population parameters from a single gene genealogy: an unbiased estimator of the growth rate. Molecular biology and evolution28(5), 1617-1631.

[5] Slatkin, M., & Rannala, B. (1997). Estimating the age of alleles by use of intraallelic variability. American journal of human genetics60(2), 447.

[6] Slatkin, M. (2000). Allele age and a test for selection on rare alleles.Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences355(1403), 1663-1668.

[7] Volz, E. M., Pond, S. L. K., Ward, M. J., Brown, A. J. L., & Frost, S. D. (2009). Phylodynamics of infectious disease epidemics. Genetics183(4), 1421-1430.

Paper author Ethan Jewett is a PhD student in the lab of Noah Rosenberg.

# Learning from 69 sequenced Y chromosomes

## Why the Y?

Blog author Amy Goldberg is a graduate students in Noah Rosenberg’s lab.

While mitochondria have been extensively sequenced for decades because of their short length and abundance, the Y chromosome has been under-studied.  Unlike autosomal DNA, the mitochondria and (most of) the Y chromosome are inherited exclusively maternally and paternally, respectively.  Therefore, they do not undergo meiotic recombination.  Without recombination, mutations accumulate on a stable background, preserving a wealth of information about population history.  Each background, shared through a common ancestor, is called a haplogroup. To leverage this information, Poznik et al. set out to sequence 69 males from nine diverse human populations, including a large representation of African individuals.  The paper, published in Science last summer, is by Stanford graduate student David Poznik and a group lead by CEHG professor Dr. Carlos Bustamante.

The structure of the Y chromosome is complex, with large heterochromatic regions, pseudo-autosomal regions that recombine with the X chromosome, and repetitive elements, making mapping reads difficult.  But, the Y chromosome is haploid, allowing for accurate variant calls at lower coverage than the autosomes, which have heterozygotes.  Using high-throughput sequencing (3.1x mean coverage) and a haploid expectation-maximization algorithm, Poznik et al. called genotypes with an error rate around 0.1%. The paper developed important methods for analyzing high-throughput sequences of the difficult Y chromosome, including determining the subset of regions within which accurate genotypes can be called.

## Reconstructing the human Y-chromosome tree

Poznik et al. constructed a phylogenetic tree of the Y chromosome using sequence data and a maximum likelihood approach.  While the overall structure of the tree was known, Poznik et al. were able to accurately calculate branch lengths based on the number of variants differing between individuals and resolve previously indeterminate finer structure.

Figure 2 of the paper: Y-chromosome phylogeny inferred from genomic sequencing.

Incredible African Diversity: One of the key findings of the paper was the depth of diversity within Africans lineages.  While both uniparental and autosomal markers have indicated an African root for human diversity, Poznik et al. find lineages within a single population, the San hunter-gatherers, that coalesce almost at the same time as the entire tree (see haplogroup A). This indicates African diversity and structure has existed for tens of thousands of years, and there is likely more to discover.  A large sample of African populations were considered, which lead to previously unseen structure within haplogroup B2, including structure not mirrored by modern population clustering, that dates to approximately 35,000 years ago.

Evidence of population expansionShort internal branches of the tree, such as those seen within haplogroup E and the non-African group FT, indicate periods of rapid population growth.  When a population expands quickly, new variants that might otherwise drift to extinction can persist.  A large number of coalescence events occur at the time of growth, as there were fewer lineages alive in the population before this time.  For non-African haplogroups, this pattern is likely a remnant of the Out of Africa migration.  For haplogroup E, this corresponds to the Bantu agricultural expansion.

Resolved Eurasian polytomy: Previously, the topology of the Eurasian tree separating haplogroups G-H-IJK was unresolved.  Because of the higher coverage sequencing for this study, Poznik et al. found a single variant, a C to T transition, that differentiates G from the other groups.  Haplogroup G retains the ancestral variant, while H-IJK share the derived variant and are therefore more closely related to each other.

## Sequencing vs. genotyping

In contrast to previous studies, which analyzed small repetitive elements called microsatellites or small sets of single base-pair changes called SNPs, whole-genome sequencing data contains not only more information, but potentially more accurate information.  In particular, before the advent of high-throughput sequencing, SNPs were usually ascertained in a subset of individuals that did not capture worldwide diversity levels.  Therefore, diversity measures are often underestimated and biased.  Without sequence data, the branch lengths of the tree did not have a meaningful interpretation, and the depth of variation within Africa was not seen.

## MRCA of Human Maternal and Paternal Lineages

There was a lot of public discussion spurred by the publication of Poznik’s paper last year.  The discussion mainly focused on their result that, contrary to previous estimates, the most recent common ancestor (MRCA) of all mitochondrial DNA lived at a similar time as that of all Y chromosomes.  Previous estimates put the mitochondrial TMRCA around 200 thousand years ago, with the Y chromosome coalescing a bit over 100 thousand years ago.  These different estimates for Y and mitochondria were often obtained through different sequencing and analysis methods, and are therefore less comparable.  In particular, varying estimates of the mutation rates have led to different TMRCA estimates.  By analyzing both the Y and mitochondria in the same framework, calibrated by archeological evidence and within-species comparisons, Poznik et al. found largely overlapping confidence intervals for the TMRCA of both Y and mitochondria.

But, should the coalescence times of the mitochondria and the Y chromosome be the same? Not necessarily.  While discrepancies between the mitochondria and Y chromosome have often been interpreted as sex-biased population histories or sizes, strictly neutral models can predict large differences between the two, as well.  Because neither the analyzed part of the Y chromosome nor the mitochondria undergo recombination, each acts as a single locus – and therefore represents the history of a single lineage.  For a population, there is a wide distribution of the ages when lineages would coalesce for a given population history, and these loci represent only two with largely independent histories (given the overall population history), therefore they may differ by chance alone.  Similarly, different loci across autosomal DNA have TMRCA ranging from thousands to millions of years. Additionally, as single loci, any effects of selection would distort the entire genealogy of the Y chromosome and mitochondria.

## Future directions

Human population history is far from fully fleshed out, and Poznik et al. provide a framework to leverage increasingly available high-throughput sequencing of Y chromosomes.  The method used to calculate the mutation rate and TMRCA is a valuable contribution in itself, with applications to a wide range of evolutionary and ecological questions.  This study demonstrated that we have only characterized a fraction of worldwide diversity, particularly in Africa, and that increased sampling will be critical to parsing close and far ties in human history.

## Reference

Poznik GD, Henn BM, Yee MC, Sliwerska E, Euskirchen GM, Lin AA, Snyder M, Quintana-Murci L, Kidd JM, Underhill PA, Bustamante CD. Sequencing Y chromosomes resolves discrepancy in time to common ancestor of males versus females. Science. 2013 Aug 2;341(6145):562-5. doi: 10.1126/science.1237619.

Paper author David Poznik is a PhD student in Carlos Bustamante’s lab.

# Demographic inference from genomic data in nonmodel insect populations

Blog author Martin Sikora is a postdoc in the lab of Carlos Bustamante.

Reconstructing the demographic history of species and populations is one of the major goals of evolutionary genetics. Inferring the timing and magnitude of past events in the history of a population is not only of interest in its own right, but also in order to form realistic null models for the expected patterns of neutral genetic variation in present-day natural populations. A variety of methods exist that allow the inference of these parameters from genomic data, which, in the absence of detailed historical records in most situations, is often the only feasible way to obtain them. As a consequence, it is generally not possible to empirically validate the parameters inferred from genomic data in a direct comparison with a known “truth” from a natural population. Furthermore, until recently, the application of these methods was limited to model organisms with well-developed genomic resources (e.g., humans and fruitflies), excluding a large number of non-model organisms with potentially considerable evolutionary and ecological interest.

## Chasing butterflies?

In an elegant study recently published in the journal Molecular Ecology, Rajiv McCoy, a graduate student with Dmitri Petrov and Carol Boggs, and colleagues tackle both of these problems in natural populations of Euphydryas gillettii, a species of butterfly native to the northern Rocky Mountains. About 30 years ago, a small founder population of this species from Wyoming was intentionally introduced to a new habitat at the Rocky Mountain Biological Laboratory field site in Colorado, and population sizes were recorded every year since the introduction. The beauty of this system is that it allows the authors to perform a direct comparison of the known demography (i.e. a recent split from the parental population and bottleneck ~30 generations ago, with census data in the newly introduced population) with estimates inferred from genomic data.

Gillete’s Checkerspot (Euphydryas gillettii). Photo taken by Carol Boggs, co-advisor of Rajiv and one of the senior authors of the study.

## A genomic dataset from a non-model organism

The researchers sampled eight larvae each from both the parental as well as the derived population for this study. In the world of model organisms, the next steps for constructing the dataset would be straightforward: Extract genomic DNA, sequence to the desired depth, map to the reference genome and finally call SNPs. In the case of E. gillettii however, no reference genome is available, so the authors had to use a different strategy. They decided to use RNA-sequencing in order to first build a reference transcriptome, which was then used as a reference sequence to map against and discover single nucleotide variants. An additional advantage of this approach is that the data generated can potentially also be utilized for other types of research questions, such as analyses of gene expression differences between the populations. On the downside, SNP calling from a transcriptome without a reference genome is challenging and can lead to false positives, for example due to reads from lowly expressed paralogs erroneously mapping to the highly expressed copy present in the assembled transcriptome. The authors therefore went to great lengths to stringently filter these false positive variants from their dataset.

## Demographic inference using δαδι

For the demographic inference, McCoy and colleagues used δαδι (diffusion approximation for demographic inference), a method developed by Ryan Gutenkunst while he was a postdoc in the group of CEHG faculty member Carlos Bustamante. This method uses a diffusion approximation to calculate the expected allele frequency spectrum under a demographic model of interest. The observed allele frequency spectrum is then fit to the expected spectrum by optimization of the demographic parameters to maximize the likelihood of the data. δαδι has been widely used to infer the demographic history of a number of species, from humans to domesticated rice, and is particularly suited to large-scale genomic datasets due to its flexibility and computational efficiency.

Excerpt of Figure 2 from McCoy et al., illustrating the demographic models tested using δαδι.

## Models vs History

The authors then fit a demographic model reflecting the known population history of E. gillettii, as illustrated in Figure 2 of their article (Model A). Encouragingly, they found that the model provided a very good fit to the data, with an the estimate of the split time between 40 and 47 generations ago, which is very close to the known time of establishment of the Colorado population 33 generations ago. Furthermore, they also tested how robust these results were to using a misspecified demographic model, by incorporating migration between the Colorado and Wyoming populations in their model (which in reality are isolated from each other). However, both alternative models with migration (Models B1 and B2) did not significantly improve the fit, again nicely consistent with the known population history.

## Three butterflies is enough?

Finally, the researchers also tested the robustness of the results to variations in the number of samples or SNPs used in the analysis, from datasets simulated under the best-fit model A. They found that δαδι performed remarkably well even with sample sizes as low as three individuals per population. While this is in principle good news for researchers limited by low number of available samples, one has to be aware of the fact that this results will be to a certain extent specific to this particular type of system, where one population undergoes a very strong bottleneck resulting in large effects on the allele frequency spectrum. A good strategy suggested by McCoy and colleagues is then to use these types of simulations in the planning stages of an experiment, in order to inform researchers of the number of samples and markers necessary to confidently estimate the demographic parameters of interest.

## Conclusions and future directions

For me, this study is a great example of how next-generation sequencing and sophisticated statistical modeling can open up a new world of possibilities to researchers interested in the ecology and evolution of natural populations. McCoy and colleagues constructed their genomic dataset essentially from scratch, without the “luxuries” of a reference genome or database of known polymorphisms. Moving forward, Rajiv has been busy collecting more samples over the past year. He and his colleagues plan to sequence over a thousand of them for the next phase of the project, as well as assemble a reference genome for E. gillettii, and important next step in the development of genomic tools for this fascinating ecological system.

The author of the paper Rajiv McCoy, sampling larvae of Euphydryas gillettii

McCoy, R. C., Garud, N. R., Kelley, J. L., Boggs, C. L. and Petrov, D. A. (2013), Genomic inference accurately predicts the timing and severity of a recent bottleneck in a nonmodel insect population. Molecular Ecology. doi: 10.1111/mec.12591

# Genomic analyses of ancestry of Caribbean populations

Blog author Rajiv McCoy is a graduate student in the lab of Dmitri Petrov.

In the Author Summary of their paper, “Reconstructing the Population Genetic History of the Caribbean”, Andrés Moreno-Estrada and colleagues point out that Latinos are often falsely depicted as a homogeneous ethnic or cultural group.  In reality, however, Latinos, including inhabitants of the Caribbean basin, represent a diverse mixture of previously separate human populations, such as indigenous groups, European colonists, and West Africans brought over during the Atlantic slave trade.  This mixing process, which geneticists call “admixture”, left a distinct footprint on genetic variation within and between Caribbean populations.  By surveying genotypes of 330 Caribbean individuals and comparing to a database of variation from more than 3000 individuals from European, African, and Native American populations, Moreno et al., explore the genomic outcomes of this complex admixture process and reveal intriguing demographic patterns that could not be obtained from the historical record alone. The paper, featured in the latest edition of PLOS Genetics, represents a collaborative project with co-senior authorship by Stanford CEHG professor Carlos Bustamante and Professor Eden Martin from the University of Miami Miller School of Medicine.

Reconstructing the demographic history of admixed populations

Because parental DNA is only moderately shuffled before being incorporated into gametes (the process of meiotic recombination), admixture results in discrete genomic segments that can be traced to a particular ancestral population.  In early generations after the onset of admixture, these segments are large.  However, after many generations, segments will be quite small.  By investigating the distribution of sizes of these ancestry “tracts”, Moreno and colleagues inferred the timing of various waves of migration and admixture.  For Caribbean Island populations, they infer that European gene flow first occurred ~16-17 generations ago, which matches very closely to the historical record of ~500 years, assuming ~30 years per generation.  In contrast, for neighboring mainland populations from Colombia and Honduras, they find that European gene flow occurred in waves, starting more recently (~14 generations ago).

Identifying sub-continental ancestry of admixed individuals

Those familiar with human population genetics will recognize principal component analysis (PCA), which transforms a matrix of correlated observed genotypes into a set of uncorrelated variables where the first component explains the most possible variance, the second variable explains the second most variance, and so on.  Individuals’ transformed genotypes can be plotted on the first two principle components, and when performed on a worldwide scale, distinct clusters appear which represent populations of ancestry.  On conventional PCA plots, admixed individuals fall between their different ancestral populations, as they possess sets of genotypes diagnostic of multiple ancestral groups.  As virtually all Caribbean individuals are admixed to some degree, this pattern is apparent for Caribbean populations (see Figure 1B from the paper, reproduced below).

While interesting, this means that the sub-continental ancestry of these admixed individuals is difficult to ascertain.  An individual may want to know which Native American, West African, and European populations contribute to his or her ancestry, and this analysis does not have sufficient resolution to answer these questions.

Moreno and colleagues therefore devised a new version of PCA called ancestry-specific PCA (ASPCA), which extracts genomic segments assigned to Native American, West African, and European ancestry, then analyzes these segments separately, dealing with the large proportions of missing data that result.  In the case of Native American ASPCA, they observe two overlapping clusters.  The first represents mostly Colombians and Hondurans, who cluster most closely with indigenous groups from Western Colombia and Central America and have a greater overall proportion of Native American ancestry.  The second cluster represents mostly Cubans, Dominicans, and Puerto Ricans, who cluster most closely with Eastern Colombian and Amazonian indigenous groups.  This makes sense in light of the fact that Amazonian populations from the Lower Orinoco Valley settled on rivers and streams, which could have facilitated their migration.  Because indigenous ancestry proportions were relatively consistent and closely clustered across different Caribbean Islands, the authors posit that there was a single pulse of expansion of Amazonian natives across the Caribbean prior to European arrival, along with gene flow among the islands.

In the case of European ASPCA, Moreno et al. found that Caribbean samples clustered closest to, but clearly distinct from, present day individuals from the Iberian Peninsula in Southern Europe.  In fact, the differentiation between this “Latino-specific component” and Southern Europe is at least as great as the differentiation between Northern and Southern Europe.  The authors hypothesize that this is due to very small population sizes among European colonists, which would have introduced noise into patterns of genomic variation through the process of random genetic drift.

Finally, the authors demonstrate that Caribbean populations have a higher proportion of African ancestry compared to mainland American populations, a result of admixture during and after the Atlantic slave trade.  Surprisingly, the authors found that all samples tightly clustered with present day Yoruba samples from Nigeria rather than being dispersed throughout West Africa.  However, because other analyses suggested that there might have been two major waves of migration from West Africa, the authors decided to analyze “old” and “young” blocks of African ancestry separately.  This analysis revealed that “older” segments are primarily derived from groups from the Senegambia region of Northwest Africa, while “younger” segments likely trace to groups from the Gulf of Guinea and Equatorial West Africa (including the Yoruba).

Conclusions and perspectives

This groundbreaking study has immediate implications for the field of personalized medicine, especially due to the discovery of a distinct Latino-specific component of European ancestry.  The hypothesis that European colonists underwent a demographic bottleneck (a process termed the “founder effect”) has expected consequences for the frequency of damaging mutations contributing to genetic disease. The observation of extensive genetic differences among Caribbean populations also argues for more such studies characterizing genetic variation on a smaller geographic scale. The newly developed ASPCA method will surely be valuable for other admixed populations.  In addition to medical implications, studies such as this help dispel simplistic notions of race and ethnicity and inform cultural identities based on unique and complex demographic history.

Citation: Moreno-Estrada A, Gravel S, Zakharia F, McCauley JL, Byrnes JK, et al. (2013) Reconstructing the Population Genetic History of the Caribbean. PLoS Genet 9(11): e1003925. doi:10.1371/journal.pgen.1003925

Paper author Andrés Moreno-Estrada is a research associate in the lab of Carlos Bustamante.

Blog-author: Sandeep Venkataram is a grad student in Dmitri Petrov’s lab.

By Sandeep Venkataram – Modern humans have been evolving independently of our nearest living relatives, chimpanzees, for over 7 million years. To study our evolutionary history since this divergence, our major source of information is from the fossilized remains of our ancestors and closely related species such as Neanderthals. Physiological information from the remains can tell us a lot about human evolution, but the majority of the information is locked up in the tiny amounts of highly degraded and fragmented DNA left from the specimen.

Studying ancient DNA (aDNA) from bones is extremely challenging. Each sample contains not only the fossil’s DNA, but contaminating DNA from enormous numbers of microbes and other organisms. There is also the possibility of human contamination from handling the fossil. Since the contaminants have much higher quality DNA than the endogenous sample, simply processing a raw DNA extract of the sample typically yields sequencing results with less than 1% aDNA. Sequencing such a sample is incredibly inefficient, as less than 1% of the sequence is actually useful and only the best funded studies can afford the sequencing costs to generate a high coverage genome. Therefore, most aDNA studies tend to focus on mitochondrial DNA using targeted capture methods, or PCR amplicon sequencing. This reduces wasted sequence capacity, but greatly limits the amount of information obtained from the sample.

## Getting the most out of ancient DNA

Meredith Carpenter, who is a postdoc here at Stanford, and her colleagues have developed a novel method called whole genome in-solution capture (WISC) to purify aDNA from next-gen sequencing libraries generated from fossil DNA samples. The authors make use of RNA bait technology (Figure 1), which uses RNA complementary to the targeted DNA that has been synthesized using biotinylated nucleotides (Gnirke et al 2009). The RNA can be hybridized to the DNA pool, and purified from solution by linking the biotin to commercially available purification beads. By then eluting the hybrids from the beads and removing the RNA from solution, one can greatly enrich for the targeted DNA. Previous methods using RNA baits required synthesizing DNA strands on microarrays that are complementary to the RNA, then inducing transcription in vitro to generate the RNA bait library. This methodology has been successfully used to capture the entirety of chromosome 21 (Fu et al 2013) from human fossils, but is not cost effective for producing a bait library that covers the entire human genome.

Meredith Carpenter and coauthors circumvent this by mechanically fragmenting human genomic DNA and use blunt end ligation to attach adapter sequences containing an RNA polymerase promoter. This modified DNA library can then be used to generate the biotinylated RNA bait library via in vitro transcription, after which the RNA library is purified using standard protocols. The authors also generate non-biotinylated RNA complementary to the adapter sequences present on every DNA fragment in the library, to block nonspecific binding to the adapters. The bait and block RNA libraries are hybridized to the aDNA libraries, purified using beads to select only the aDNA fragments and sequenced. The cost of their enrichment method is estimated at \$50 per sample, and is accessible to most labs conducting aDNA genomic studies.

## WISC greatly enriches for ancient DNA across a variety of samples

The authors tested this method on a variety of aDNA libraries prepared from both high quality and low quality samples, including hair remains, teeth and bones, and fossils from tropical and more temperate regions, which can greatly influence DNA quality. They sequenced the libraries both before and after WISC, and found a 3-13x increase in the number of uniquely mapped reads after using WISC. In addition, most of the unique reads in the enriched library are sequenced with five million reads in both hair and bone samples. WISC allows most of the endogenous sequence to be read from dozens of aDNA samples in a single lane of Illumina HiSeq, opening the possibility of sequencing the millions of fossils in museums and collections around the world.

Dr. Carpenter says they are now focusing on adapting the method to removing human DNA contamination from microbiome sequencing projects with promising preliminary results, as well as applications in forensics and studying extinct species. As WISC generates the RNA bait library from genomic DNA of an extant relative instead of synthetic DNA arrays, bait libraries can be prepared regardless of whether the genome of the organism that is the source of the bait library is known. Combined with recent advances in aDNA library construction methods (Meyer et al 2012), WISC promises to make sequencing of contaminated and degraded samples widely accessible.

Carpenter et al (2013) Figure 1. Schematic of the Whole-Genome In-Solution Capture Process
To generate the RNA “bait” library, a human genomic library is created via adapters containing T7 RNA polymerase promoters (green boxes). This library is subjected to in vitro transcription via T7 RNA polymerase and biotin-16-UTP (stars), creating a biotinylated bait library. Meanwhile, the ancient DNA library (aDNA “pond”) is prepared via standard indexed Illumina adapters (purple boxes). These aDNA libraries often contain <1% endogenous DNA, with the remainder being environmental in origin. During hybridization, the bait and pond are combined in the presence of adaptor-blocking RNA oligos (blue zigzags), which are complimentary to the indexed Illumina adapters and thus prevent nonspecific hybridization between adapters in the aDNA library. After hybridization, the biotinylated bait and bound aDNA is pulled down with streptavidin-coated magnetic beads, and any unbound DNA is washed away. Finally, the DNA is eluted and amplified for sequencing.

Paper author Meredith Carpenter is a postdoc in Carlos Bustamante’s lab.

## References

Carpenter, M. L., Buenrostro, J. D., Valdiosera, C., Schroeder, H., Allentoft, M. E., Sikora, M., Rasmussen, M., et al. (2013). Pulling out the 1%: Whole-Genome Capture for the Targeted Enrichment of Ancient DNA Sequencing Libraries. The American Journal of Human Genetics, 1–13. doi:10.1016/j.ajhg.2013.10.002

Fu, Q., Meyer, M., Gao, X., Stenzel, U., Burbano, H. a, Kelso, J., & Pääbo, S. (2013). DNA analysis of an early modern human from Tianyuan Cave, China. Proceedings of the National Academy of Sciences of the United States of America, 110(6), 2223–7. doi:10.1073/pnas.1221359110

Gnirke, A., Melnikov, A., Maguire, J., Rogov, P., LeProust, E. M., Brockman, W., Fennell, T., et al. (2009). Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing. Nature biotechnology, 27(2), 182–9. doi:10.1038/nbt.1523

Meyer, M., Kircher, M., Gansauge, M.-T., Li, H., Racimo, F., Mallick, S., Schraiber, J. G., et al. (2012). A high-coverage genome sequence from an archaic Denisovan individual. Science (New York, N.Y.), 338(6104), 222–6. doi:10.1126/science.1224344

Update: The second paragraph originally talked about fossils, but it should have been bones (now corrected). A fossil is mineralized and would not yield aDNA.