# A fast and accurate coalescent approximation

Blog author Suyash Shringarpure is a postdoc in Carlos Bustamante’s lab. Suyash is interested in statistical and computational problems involved in the analysis of biological data.

The coalescent model is a powerful tool in the population geneticist’s toolbox. It traces the history of a sample back to its most recent common ancestor (MRCA) by looking at coalescence events between pairs of lineages. Starting from assumptions of random mating, selective neutrality, and constant population size, the coalescent uses a simple stochastic process that allows us to study properties of genealogies, such as the time to the MRCA and the length of the genealogy, analytically and through efficient simulation. Extensions to the coalescent allow us to incorporate the effects of mutation, recombination, selection, and demographic events into the model. Short and more detailed introductions to the coalescent model are readily available online.

However, coalescent analyses can be slow or suffer from numerical instability, especially for large samples. In a study published earlier this year in Theoretical Population Biology, CEHG fellow Ethan Jewett and CEHG professor Noah Rosenberg proposed fast and accurate approximations to general coalescent formulas and procedures for applying such approximations. Their work also examined the asymptotic behavior of existing coalescent approximations analytically and empirically.

## Computational challenges with the coalescent

For a given sample, there are many possible genealogical histories, i.e., tree topologies and branch lengths, which are consistent with the allelic states of the sample. Analyses involving the coalescent therefore often require us to condition on a specific genealogical property and then sum over all possible genealogies that display the property, weighted by the probability of the genealogy. A genealogical property that is often conditioned on is $n_t$, the number of ancestral lineages in the genealogy at a time $t$ in the past. However, computing the distribution $P(n_t)$ of $n_t$ is computationally expensive for large samples and can suffer from numerical instability.

## A general approximation procedure for formulas conditioning on $n_t$

Coalescent formulas conditioning on $n_t$ typically involve sums of the form $f(x)=\sum_{n_t} f(x|n_t) \cdot P(n_t)$.

For large samples and recent times, these computations have two drawbacks:

- The range of possible values for $n_t$ may be quite large (especially if multiple populations are being analyzed) and a summation over these values may be computationally expensive.

- Expressions for $P(n_t)$ are susceptible to round-off errors.

Slatkin (2000) proposed approximating the summation in $f(x)$ by the single term $f(x|E[n_t])$. This deterministic approximation rests on the observation that although $n_t$ is a random variable, it changes almost deterministically over time, so we can write $n_t \approx E[n_t]$. Figure 2 of the paper shows that this approximation is quite accurate. The authors prove the asymptotic accuracy of this approximation and show that, under regularity assumptions, $f(x|E[n_t])$ converges to $f(x)$ uniformly in the limits $t \rightarrow 0$ and $t \rightarrow \infty$. This is an important result: it shows that the general procedure produces a good approximation for both the very recent and the very ancient history of the sample. Further, the paper shows how the method can be used to approximate quantities that depend on the trajectory of $n_t$ over time, such as the expected number of segregating sites in a genealogy.
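To see the idea in miniature, the sketch below (a toy example of mine, not a computation from the paper) simulates the standard coalescent to draw values of $n_t$, then compares the full expectation $E[f(n_t)]$ with the single-term approximation $f(E[n_t])$ for the nonlinear function $f(n) = n(n-1)/2$, the number of lineage pairs:

```python
import random

random.seed(0)

def lineages_at(t, n0):
    """Simulate the number of ancestral lineages remaining at coalescent time t,
    starting from a sample of n0 lineages. With k lineages, the waiting time to
    the next coalescence is Exponential with rate k*(k-1)/2 (coalescent units)."""
    k, time = n0, 0.0
    while k > 1:
        time += random.expovariate(k * (k - 1) / 2)
        if time > t:
            break
        k -= 1
    return k

def f(n):  # a nonlinear summand: the number of lineage pairs
    return n * (n - 1) / 2

n0, t, reps = 20, 0.1, 20000
draws = [lineages_at(t, n0) for _ in range(reps)]

mean_nt = sum(draws) / reps              # Monte Carlo estimate of E[n_t]
exact = sum(f(n) for n in draws) / reps  # full expectation E[f(n_t)]
approx = f(mean_nt)                      # single-term approximation f(E[n_t])

print(f"E[n_t] ~ {mean_nt:.2f}; E[f(n_t)] ~ {exact:.2f}; f(E[n_t]) ~ {approx:.2f}")
```

Because the distribution of $n_t$ is tightly concentrated, the two quantities agree to within a few percent even though $f$ is nonlinear (by Jensen’s inequality, $f(E[n_t])$ slightly underestimates $E[f(n_t)]$ here since $f$ is convex).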

## Approximating $E[n_t]$ for single populations

A difficulty with using the deterministic approximation is that $E[n_t]$ often has no closed-form formula, and if one exists, it is typically not easy to compute when the sample is large.

For a single population of changing size, two deterministic approximations have previously been developed: one by Slatkin and Rannala (1997) and Volz et al. (2009), and one by Frost and Volz (2010) and Maruvka et al. (2011). Using theoretical and empirical methods, the authors examine the asymptotic behavior and computational complexity of these approximations and of a Gaussian approximation by Griffiths (1984). A summary of their results is in the table below.

| Method | Accuracy |
| --- | --- |
| Griffiths (1984) | Accurate for large samples and recent history. |
| Slatkin and Rannala (1997); Volz et al. (2009) | Accurate for recent history and arbitrary sample size; inaccurate for very ancient history. |
| Frost and Volz (2010); Maruvka et al. (2011) | Accurate for both recent and ancient history and for arbitrary sample size. |
| Jewett and Rosenberg (2014) | Accurate for both recent and ancient history, arbitrary sample size, and multiple populations with migration. |
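For intuition about the deterministic approximations summarized above: in a constant-size population (time in coalescent units), $E[n_t]$ approximately satisfies the ODE $dn/dt = -n(n-1)/2$, which can be integrated numerically or solved in closed form. The sketch below (my derivation under the constant-size assumption; the published methods also handle time-varying population size) checks a simple Euler integration against the closed-form solution of that ODE:

```python
import math

def n_det_euler(t, n0, steps=10000):
    """Euler integration of dn/dt = -n*(n-1)/2, the deterministic
    approximation to the lineage count (constant population size,
    time in coalescent units)."""
    n, dt = float(n0), t / steps
    for _ in range(steps):
        n -= n * (n - 1) / 2 * dt
    return n

def n_det_closed(t, n0):
    """Closed-form solution of the same ODE."""
    return n0 / (n0 - (n0 - 1) * math.exp(-t / 2))

for t in (0.05, 0.1, 0.5, 2.0):
    print(t, round(n_det_euler(t, 20), 3), round(n_det_closed(t, 20), 3))
```

Both computations run in time independent of the sample size once the ODE is fixed, which is part of the appeal over summing exact expressions for $P(n_t)$.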

## Approximating $E[n_t]$ for multiple populations

Existing approaches only work for single populations of changing size and cannot account for migration between multiple populations. Ethan and Noah extend the single-population framework to allow multiple populations with migration. The result is a system of simultaneous differential equations, one for each population. While the system admits analytical solutions only in very special cases, it can easily be solved numerically for any given demographic scenario.
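As an illustration of how such a system can be integrated, the sketch below uses simple Euler steps for two constant-size populations exchanging migrants. The equations here (within-population coalescence plus per-lineage migration at hypothetical rates `m12` and `m21`) are an illustrative stand-in of mine, not the paper’s exact system, which also accommodates changing population sizes:

```python
def two_pop_lineages(t, n1_0, n2_0, m12, m21, steps=20000):
    """Euler integration of an illustrative two-population system:
    lineages coalesce within each population at rate n*(n-1)/2 and
    migrate between populations at per-lineage rates m12, m21.
    (Hypothetical parameterization; see Jewett & Rosenberg 2014 for
    the actual equations.)"""
    n1, n2 = float(n1_0), float(n2_0)
    dt = t / steps
    for _ in range(steps):
        d1 = -n1 * (n1 - 1) / 2 - m12 * n1 + m21 * n2
        d2 = -n2 * (n2 - 1) / 2 - m21 * n2 + m12 * n1
        n1 += d1 * dt
        n2 += d2 * dt
    return n1, n2

n1, n2 = two_pop_lineages(t=0.5, n1_0=15, n2_0=5, m12=0.5, m21=0.5)
print(round(n1, 3), round(n2, 3))
```

The same loop extends directly to more populations: one equation per population, coupled only through the migration terms.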

## Significance of this work

The extension of the coalescent framework to multiple populations with migration is an important result for demographic inference. The extended framework allows efficient computation of demographically informative quantities such as the expected number of private alleles in a sample and divergence times between populations.

Ethan and Noah describe a general procedure that can be used to approximate coalescent formulas that involve summing over distributions conditioned on $n_t$ or the trajectory of $n_t$ over time. This procedure is particularly accurate for studying very recent or very ancient genealogical history.

The analysis of existing approximations to $E[n_t]$ shows that different approximations have different asymptotic behaviors and computational complexities. The choice of approximation is therefore often a tradeoff between its computational cost and its likely behavior in the parameter ranges of interest.

## Future Directions

As increasingly large genomic samples from populations with complex demographic histories become available for study, exact methods become intractable or prohibitively slow. This work adds to a growing set of approximations to the coalescent and its extensions, joining other methods such as conditional sampling distributions and the sequentially Markovian coalescent. Ethan and Noah are already exploring applications of these approximate methods to reconciling gene trees with species trees. In the future, I expect that these and other approximations will be important for fast and accurate analysis of large genomic datasets.

## References

[1] Jewett, E. M., & Rosenberg, N. A. (2014). Theory and applications of a deterministic approximation to the coalescent model. Theoretical Population Biology.

[2] Griffiths, R. C. (1984). Asymptotic line-of-descent distributions. Journal of Mathematical Biology, 21(1), 67-75.

[3] Frost, S. D., & Volz, E. M. (2010). Viral phylodynamics and the search for an ‘effective number of infections’. Philosophical Transactions of the Royal Society B: Biological Sciences, 365(1548), 1879-1890.

[4] Maruvka, Y. E., Shnerb, N. M., Bar-Yam, Y., & Wakeley, J. (2011). Recovering population parameters from a single gene genealogy: an unbiased estimator of the growth rate. Molecular Biology and Evolution, 28(5), 1617-1631.

[5] Slatkin, M., & Rannala, B. (1997). Estimating the age of alleles by use of intraallelic variability. American Journal of Human Genetics, 60(2), 447.

[6] Slatkin, M. (2000). Allele age and a test for selection on rare alleles. Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences, 355(1403), 1663-1668.

[7] Volz, E. M., Pond, S. L. K., Ward, M. J., Brown, A. J. L., & Frost, S. D. (2009). Phylodynamics of infectious disease epidemics. Genetics, 183(4), 1421-1430.

Paper author Ethan Jewett is a PhD student in the lab of Noah Rosenberg.

# Learning from 69 sequenced Y chromosomes

## Why the Y?

Blog author Amy Goldberg is a graduate student in Noah Rosenberg’s lab.

While mitochondria have been extensively sequenced for decades because of their short length and abundance, the Y chromosome has been under-studied. Unlike autosomal DNA, the mitochondria and (most of) the Y chromosome are inherited exclusively maternally and paternally, respectively. Therefore, they do not undergo meiotic recombination. Without recombination, mutations accumulate on a stable background, preserving a wealth of information about population history. Each background, shared through a common ancestor, is called a haplogroup. To leverage this information, Poznik et al. set out to sequence 69 males from nine diverse human populations, including a large representation of African individuals. The paper, published in Science last summer, is by Stanford graduate student David Poznik and a group led by CEHG professor Dr. Carlos Bustamante.

The structure of the Y chromosome is complex, with large heterochromatic regions, pseudo-autosomal regions that recombine with the X chromosome, and repetitive elements, making read mapping difficult. But the Y chromosome is haploid, allowing accurate variant calls at lower coverage than the autosomes, which have heterozygotes. Using high-throughput sequencing (3.1x mean coverage) and a haploid expectation-maximization algorithm, Poznik et al. called genotypes with an error rate around 0.1%. The paper developed important methods for analyzing high-throughput sequences of the difficult Y chromosome, including determining the subset of regions within which accurate genotypes can be called.
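The paper’s algorithm is more involved, but the core idea of calling haploid genotypes while jointly estimating an error rate can be sketched with a toy EM. The model below is my simplification: each site’s true base is REF or ALT, each read reports it correctly with probability $1 - \epsilon$, and EM alternates between computing genotype posteriors and re-estimating $\epsilon$ and the ALT prior:

```python
def haploid_em(sites, n_iter=50):
    """Toy EM for haploid genotype calling.
    sites: list of (ref_reads, alt_reads) pileup counts per site.
    Assumed model (for illustration only): the true allele is REF or ALT
    with a shared prior; each read matches the true allele with prob 1 - eps.
    E-step: posterior P(ALT) per site. M-step: update eps and the prior."""
    eps, prior_alt = 0.05, 0.5
    for _ in range(n_iter):
        post, mism, total = [], 0.0, 0
        for r, a in sites:
            like_ref = (1 - prior_alt) * (1 - eps) ** r * eps ** a
            like_alt = prior_alt * eps ** r * (1 - eps) ** a
            p_alt = like_alt / (like_ref + like_alt)
            post.append(p_alt)
            mism += p_alt * r + (1 - p_alt) * a  # expected mismatching reads
            total += r + a
        eps = max(mism / total, 1e-6)         # M-step: error rate
        prior_alt = sum(post) / len(post)     # M-step: ALT prior
    calls = ["ALT" if p > 0.5 else "REF" for p in post]
    return eps, calls

# toy pileups: clean REF sites, ALT sites, plus one sequencing error each
sites = [(9, 0), (10, 1), (8, 0), (0, 12), (1, 9), (11, 0), (10, 0), (0, 10)]
eps, calls = haploid_em(sites)
print(round(eps, 3), calls)
```

With haploid data there are no heterozygotes to distinguish from errors, which is why modest coverage suffices: any minor allele at a site is evidence of error, and the EM pools that evidence across sites.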

## Reconstructing the human Y-chromosome tree

Poznik et al. constructed a phylogenetic tree of the Y chromosome using sequence data and a maximum likelihood approach.  While the overall structure of the tree was known, Poznik et al. were able to accurately calculate branch lengths based on the number of variants differing between individuals and resolve previously indeterminate finer structure.

Figure 2 of the paper: Y-chromosome phylogeny inferred from genomic sequencing.

Incredible African Diversity: One of the key findings of the paper was the depth of diversity within African lineages. While both uniparental and autosomal markers have indicated an African root for human diversity, Poznik et al. find lineages within a single population, the San hunter-gatherers, that coalesce almost at the same time as the entire tree (see haplogroup A). This indicates that African diversity and structure have existed for tens of thousands of years, and there is likely more to discover. The large sample of African populations led to previously unseen structure within haplogroup B2, including structure not mirrored by modern population clustering, that dates to approximately 35,000 years ago.

Evidence of population expansion: Short internal branches of the tree, such as those seen within haplogroup E and the non-African group FT, indicate periods of rapid population growth. When a population expands quickly, new variants that might otherwise drift to extinction can persist. A large number of coalescence events occur at the time of growth, because fewer lineages were alive in the population before this time. For non-African haplogroups, this pattern is likely a remnant of the Out of Africa migration. For haplogroup E, it corresponds to the Bantu agricultural expansion.

Resolved Eurasian polytomy: Previously, the topology of the Eurasian tree separating haplogroups G-H-IJK was unresolved.  Because of the higher coverage sequencing for this study, Poznik et al. found a single variant, a C to T transition, that differentiates G from the other groups.  Haplogroup G retains the ancestral variant, while H-IJK share the derived variant and are therefore more closely related to each other.

## Sequencing vs. genotyping

In contrast to previous studies, which analyzed small repetitive elements called microsatellites or small sets of single base-pair changes called SNPs, whole-genome sequencing data contains not only more information, but potentially more accurate information.  In particular, before the advent of high-throughput sequencing, SNPs were usually ascertained in a subset of individuals that did not capture worldwide diversity levels.  Therefore, diversity measures are often underestimated and biased.  Without sequence data, the branch lengths of the tree did not have a meaningful interpretation, and the depth of variation within Africa was not seen.

## MRCA of Human Maternal and Paternal Lineages

There was a lot of public discussion spurred by the publication of Poznik’s paper last year.  The discussion mainly focused on their result that, contrary to previous estimates, the most recent common ancestor (MRCA) of all mitochondrial DNA lived at a similar time as that of all Y chromosomes.  Previous estimates put the mitochondrial TMRCA around 200 thousand years ago, with the Y chromosome coalescing a bit over 100 thousand years ago.  These different estimates for Y and mitochondria were often obtained through different sequencing and analysis methods, and are therefore less comparable.  In particular, varying estimates of the mutation rates have led to different TMRCA estimates.  By analyzing both the Y and mitochondria in the same framework, calibrated by archeological evidence and within-species comparisons, Poznik et al. found largely overlapping confidence intervals for the TMRCA of both Y and mitochondria.

But should the coalescence times of the mitochondria and the Y chromosome be the same? Not necessarily. While discrepancies between the mitochondria and Y chromosome have often been interpreted as evidence of sex-biased population histories or sizes, strictly neutral models can predict large differences between the two as well. Because neither the analyzed part of the Y chromosome nor the mitochondria undergo recombination, each acts as a single locus, and therefore represents the history of a single lineage. For a given population history there is a wide distribution of ages at which lineages coalesce, and these two loci represent only two draws with largely independent histories; they may therefore differ by chance alone. Similarly, different loci across autosomal DNA have TMRCAs ranging from thousands to millions of years. Additionally, as single loci, any effects of selection would distort the entire genealogy of the Y chromosome or mitochondria.
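The magnitude of this chance variation is easy to check by simulation. Under the standard constant-size coalescent (a simplification, and the sample size and thresholds below are my own choices), two independent nonrecombining loci frequently differ substantially in their TMRCA:

```python
import random

random.seed(2)

def tmrca(n):
    """Time to the MRCA of a sample of n lineages under the standard
    coalescent (constant population size, time in coalescent units):
    the sum of exponential waiting times with rates k*(k-1)/2."""
    t, k = 0.0, n
    while k > 1:
        t += random.expovariate(k * (k - 1) / 2)
        k -= 1
    return t

# draw pairs of independent loci and ask how often TMRCAs differ > 2-fold
pairs = [(tmrca(50), tmrca(50)) for _ in range(5000)]
frac_2fold = sum(1 for a, b in pairs if max(a, b) / min(a, b) > 2) / len(pairs)
print(round(frac_2fold, 3))
```

In this sketch, roughly a quarter of independent locus pairs differ by more than two-fold, with no demography or selection involved, because the oldest coalescent interval alone contributes about half the expected tree depth.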

## Future directions

Human population history is far from fully fleshed out, and Poznik et al. provide a framework to leverage increasingly available high-throughput sequencing of Y chromosomes.  The method used to calculate the mutation rate and TMRCA is a valuable contribution in itself, with applications to a wide range of evolutionary and ecological questions.  This study demonstrated that we have only characterized a fraction of worldwide diversity, particularly in Africa, and that increased sampling will be critical to parsing close and far ties in human history.

## Reference

Poznik GD, Henn BM, Yee MC, Sliwerska E, Euskirchen GM, Lin AA, Snyder M, Quintana-Murci L, Kidd JM, Underhill PA, Bustamante CD. Sequencing Y chromosomes resolves discrepancy in time to common ancestor of males versus females. Science. 2013 Aug 2;341(6145):562-5. doi: 10.1126/science.1237619.

Paper author David Poznik is a PhD student in Carlos Bustamante’s lab.

# Caught in the act: how drug-resistance mutations sweep through populations of HIV

Blog author Meredith Carpenter is a postdoc in Carlos Bustamante’s lab.

It has been over 30 years since the emergence of HIV/AIDS, yet the disease continues to kill over one million people worldwide per year [UNAIDS report]. One of the reasons that this epidemic has been so difficult to control is because HIV evolves quickly—it has a short replication time and a high mutation rate, so viruses harboring new mutations that confer drug resistance tend to arise often and spread quickly.

However, the likelihood of one of these beneficial mutations popping up and subsequently “sweeping” through the viral population—i.e., becoming more common because of the survival advantage—also depends on the underlying population genetics, much of which is still poorly understood. In a paper just published in PLoS Genetics, Pleuni Pennings, postdoc in the Petrov lab, and colleagues Sergey Kryazhimskiy and John Wakeley from Harvard tracked the genetic diversity in adapting populations of HIV to better understand how and when new mutations arise.

## Mutations and populations

Mutations are usually caused either by DNA damage (e.g., from environmental factors like UV radiation) or by a mistake during DNA replication. Because HIV is a retrovirus, meaning it must copy its RNA genome into DNA before it can be reproduced in the host cell, it is especially prone to errors that happen during the replication process. The rate at which these errors occur, also called the mutation rate, is constant on a per-virus basis: for example, a specific mutation might happen in one virus in a million. As a consequence, the overall number of viruses in the population determines how many new mutations will be present, with a larger population harboring more mutations at any given time.

Whether these mutations will survive, however, is related to what population geneticists call the “effective population size” (also known as Ne), which takes into account genetic diversity. Due to a combination of factors, including the purely random destruction of some viruses, not all mutations will be preserved in the population, regardless of how beneficial they are. The Ne is a purely theoretical measure that can tell us how easily and quickly a new mutation can spread throughout a population. Because it accounts for factors that affect diversity, it is usually smaller than the actual (or “census”) population size.

Pennings and colleagues wanted to determine the Ne for HIV in a typical patient undergoing drug treatment. This is a contentious area: previous researchers examining this question using different methods, including simply summing up overall mutation numbers, came up with estimates of Ne ranging from one thousand to one million (in contrast, the actual number of virus-producing cells in the body is closer to one hundred million, but more on that later). To get a more exact estimate, Pennings took a new approach. Using previously published DNA sequences of HIV sampled from patients over the course of a drug treatment regimen, she looked at the actual dynamics of the development of drug-resistant virus populations over time.

## Swept away

Specifically, Pennings focused on selective sweeps, wherein an advantageous mutation appears and then rises in frequency in the population. Features of these sweeps can give estimates of Ne because they reveal information about the diversity present in the initial population. Pennings sought to distinguish between “hard” and “soft” selective sweeps occurring as the viruses became drug resistant. A hard sweep occurs when a mutation appears in one virus and then rises in frequency, whereas a soft sweep happens when multiple viruses independently gain different mutations, which again rise in frequency over time (see Figure 1). These two types of sweeps have distinct fingerprints, and their relative frequencies depend on the underlying effective population size: soft sweeps are more likely in a larger population, because it becomes more likely that different beneficial mutations arise independently in different viruses. Soft sweeps also leave more diversity in the adapted population compared to hard sweeps (Figure 1).

Figure 1, an illustration of a hard sweep (left) and a soft sweep (right).
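A toy simulation can make the Ne dependence concrete. In the sketch below (my simplification, not Pennings’ model, and all parameter values are illustrative), resistance mutants arise at rate N·μ per generation, each escapes early drift loss with probability about 2s, and any additional origin that establishes during the sweep’s duration makes the sweep soft. The sketch ignores the shrinking wild-type population during the sweep, which inflates softness somewhat:

```python
import math
import random

random.seed(1)

def fraction_soft(N, mu=1e-5, s=0.05, reps=2000):
    """Toy estimate of how often a resistance sweep is 'soft' in a
    population of size N. Per generation, the chance that a new mutant
    both arises and establishes is ~ N*mu*2*s; the first establishment
    starts the sweep, which lasts roughly log(2*N*s)/s generations,
    and any further establishing origin makes the sweep soft."""
    p_establish = N * mu * 2 * s
    duration = int(math.log(2 * N * s) / s)
    soft = 0
    for _ in range(reps):
        extra = sum(1 for _ in range(duration) if random.random() < p_establish)
        if extra >= 1:  # a second independent origin established
            soft += 1
    return soft / reps

for N in (1_000, 100_000):
    print(N, fraction_soft(N))
```

Even this crude model reproduces the qualitative point in the paragraph above: at small N almost all sweeps are hard, while at large N nearly every sweep involves multiple independent origins.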

To tell these types of sweeps apart, Pennings took advantage of a specific amino acid change in the HIV gene that encodes reverse transcriptase (RT). This change can result from two different nucleotide changes, either one of which will change the amino acid from lysine to asparagine and confer resistance to drugs that target the RT protein.  Pennings used this handy feature to identify hard and soft sweeps: if she observed both mutations in the same drug-resistant population, then the sweep was soft. If only one mutation was observed, the sweep could be soft or hard, so she also factored in diversity levels to tell these apart. Pennings found evidence of both hard and soft sweeps in her study populations. Based on the frequencies of each, she estimated the Ne of HIV in the patients. Her estimate was 150,000, which is higher than some previous estimates but still lower than the actual number of virus-infected cells in the body. Pennings suggests that this discrepancy could be due to the background effects of other mutations in the viruses that gain the drug-resistance mutation—that is, even if a virus gets the valuable resistance mutation, it might still end up disappearing from the population because it happened to harbor some other damaging mutation as well. This would reduce the effective population size as measured by selective sweeps.

## Implications and future work

Pennings’ findings have several implications. The first is that HIV populations have a limited supply of resistance mutations, as evidenced by the presence of hard sweeps (which, remember, occur when a sweep starts from a single mutation). This means that even small reductions in Ne, such as those produced by combination drug therapies, could have a big impact on preventing drug resistance. The second relates to the fact that, as described above, the likelihood that a mutation will sweep the population may be affected by background mutations in the virus in which it appears. This finding suggests that mutagenic drugs, given in combination with standard antiretrovirals, could be particularly useful for reducing drug resistance.  Now, Pennings is using larger datasets to determine whether some types of drugs lead to fewer soft sweeps (presumably because they reduce Ne). She is also trying to understand why drug resistance in HIV evolves in a stepwise fashion (one mutation at a time), even if three drugs are used in combination.

Paper author Pleuni Pennings is a postdoc in the lab of Dmitri Petrov.

## References

Pennings PS, Kryazhimskiy S, Wakeley J (2014). Loss and recovery of genetic diversity in adapting populations of HIV. PLoS Genetics.

# Using phyloseq for the reproducible analysis of high-throughput sequencing data in microbial ecology

Blog author Diana Proctor is a graduate student in David Relman’s lab.

## The Problem: Data Availability & Scientific Reproducibility

A Current Biology (1) paper evaluating the accessibility of scientific data recently inspired articles and blog posts (2, 3) as well as a lively conversation on Reddit about the “alarming rate of data disappearance” (4). Solutions to the problem of disappearing data include the NIH data sharing policy, as well as data sharing policies set by scientific journals, requiring the deposition of data into public repositories.

As a trainee in David Relman’s lab thinking about the eventual fate of the high-throughput, next-generation sequencing data generated over the course of my dissertation (http://www.hyposalivation.com), this conversation about data accessibility brings to mind a related question: how can I ensure that my data are not lost as fast as the Current Biology paper predicts?

## The Solution: Phyloseq allows microbial ecologists to make reproducible research reports

The solution to data disappearance probably needs to involve not only deposition of data into public repositories, but also the widespread use of reproducible research reports. Luckily for those microbial ecologists among us, Paul McMurdie and Susan Holmes of Stanford University developed an R-based Bioconductor package (i.e., a package for bioinformatics) called phyloseq to facilitate the reproducible statistical analysis of high throughput phylogenetic sequencing datasets, including those generated by barcoded amplicon sequencing, metabolomic, and metagenomic experiments (5, 6). Phyloseq, initially released in 2012, was recently updated by McMurdie & Holmes, and described in an April 2013 publication (6).

## Phyloseq Key Features

Phyloseq allows the user to import a species x sample data matrix (aka, an OTU Table) or data matrices from metagenomic, metabolomic, and/or other –omics type experiments into the R computing environment. Previous R extensions, such as OTUbase, also have the capacity to import these data matrices into R, but phyloseq is unique in that it allows the user to integrate the OTU Table, the phylogenetic tree, the “representative sequence” fasta file, and the metadata mapping file into a single “phyloseq-class” R object. The microbial ecologist can then harness all the statistical and graphical tools available in R, including Knitr, R-Markdown and ggplot2 (among others), to generate reproducible research reports with beautiful graphics, as detailed below. To see the report McMurdie used to prepare the phyloseq publication, visit this link: http://www.hyposalivation.com/wp-content/uploads/2014/01/phyloseq_plos1_2012-source-doc.html.

1. Phyloseq incidentally allows the user to curate data

When phyloseq imports the myriad phylogenetic sequencing data objects into R, it scrutinizes the data, making sure that the OTU Table matches the metadata mapping file, the phylogenetic tree, and the representative sequence labels. If not, the user gets an error. If the data descriptors are congruent, a single phyloseq object can be created, which can then be saved along with the R code used to create the object. I have found that this enables me to curate my data – consolidating all the data objects (OTU Table, mapping file, phylogenetic tree, etc.) describing a single experiment into a single multi-level data frame.

2. Phyloseq gives the user the analytical power of R

Importantly, by importing data into the R computing environment, one may easily perform beta diversity analysis using any of over 40 different ecological distance metrics before performing virtually any ordination under the sun. Several alpha diversity metrics are implemented in phyloseq as well. Finally, after getting the data into R, it’s easy to perform more sophisticated analyses than have previously been possible with this type of dataset, such as k-tables analysis (7), using R’s repertoire of extension packages.

3. Phyloseq makes standardization of sequence data pretty simple

Of particular note, the authors have included in phyloseq several methods to standardize and/or normalize high throughput sequence data. Most of us of course realize the need for data standardization (as evidenced by our reliance on rarefaction), but the tools to easily standardize data, aside from rarefaction, have been lacking (8). The authors of phyloseq have equipped us with several methods (one new!) to standardize our microbial census data, as well as the code needed to accomplish the task (https://github.com/joey711/phyloseq/wiki/Vignettes).

4. Phyloseq makes subsetting large datasets easy

One of my favorite uses for the phyloseq package is that it allows me to easily subset my dataset. In my work, I study the spatial variation of oral microbial communities. I have taken samples from all teeth from the mouths of just a handful of research subjects, but I have samples for certain teeth from all subjects. Phyloseq makes it easy for me to take a complete OTU Table, and subset it on only those teeth that were sampled in all subjects. Similarly, I can subset my OTU Table on a single bacterial phylum or on a single species, or on any covariate in my metadata mapping file, using a single line of R code.
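Phyloseq’s subsetting syntax is R code that isn’t reproduced in this post; purely to illustrate the idea in a language-neutral way, the sketch below performs the same metadata-driven subsetting on a toy species × sample table in Python (all taxon and sample names are hypothetical):

```python
# toy OTU table: rows = taxa, columns = samples (names hypothetical)
otu = {
    "Streptococcus": {"tooth_3_subjA": 120, "tooth_14_subjA": 80, "tooth_3_subjB": 95},
    "Veillonella":   {"tooth_3_subjA": 30,  "tooth_14_subjA": 10, "tooth_3_subjB": 40},
    "Neisseria":     {"tooth_3_subjA": 5,   "tooth_14_subjA": 60, "tooth_3_subjB": 0},
}

# sample metadata (the "mapping file"): which tooth each sample came from
metadata = {"tooth_3_subjA": 3, "tooth_14_subjA": 14, "tooth_3_subjB": 3}

# subset to samples from tooth 3 only, driven entirely by the metadata
keep = {s for s, tooth in metadata.items() if tooth == 3}
subset = {taxon: {s: n for s, n in counts.items() if s in keep}
          for taxon, counts in otu.items()}
print(subset["Streptococcus"])
```

Because phyloseq keeps the OTU table, tree, sequences, and metadata in one object, its version of this operation subsets all four components consistently in a single call, which is the point of the integration.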

5. Phyloseq enables the user to generate reproducible graphics

The authors of phyloseq created several custom ggplot2 (9) functions, enabling the phyloseq user, with just a few lines of code, to generate all of the most common graphics used in microbial census research (e.g., heatmaps, networks, ordination plots, phylogenetic trees, stacked bar plots for abundance measurements, etc.). Examples of these plots are shown in Figure 1 (though many other possibilities are supported, which can be seen here: http://joey711.github.io/phyloseq/tutorials-index).

Fig 1A. The NMDS ordination plot shows the separation of samples by weighted UniFrac distance for the Global Patterns dataset. Human-associated communities appear to cluster towards the right side of NMDS1 while non-human associated communities cluster towards the left.

Fig 1B. The dodged boxplots show three alpha diversity metrics (Observed species, Chao1, and ACE) on the Y-axis with data classified on the X-axis as either a human-associated or a non-human associated microbial community. This plot shows that non-human associated communities, in general, appear to be much more diverse than human-associated communities.

6. Phyloseq allows covariate data to be visualized with the phylogenetic tree

In particular, phyloseq solves very well the problem of visualizing the phylogenetic tree: it allows the user to project covariate data (such as sample habitat, host gender, etc.) onto the phylogenetic tree, so that relationships between microbes, microbial communities, and the habitats from which they were derived can easily be seen. As an example, in Figure 2 the relative abundance of taxa in samples is projected onto the phylogenetic tree, together with the environment from which the samples were derived and the bacterial order. I’ve not seen any other application that allows similar visualizations of the tree, and bootstrapping is also supported. For additional examples, refer to the phyloseq tutorial (http://joey711.github.io/phyloseq/plot_tree-examples.html).

Fig 2: An example of using phyloseq to visualize phylogenetic trees along with covariate data using the Global Patterns dataset. In this figure, the sample type is shown in color, the shapes are bacterial Order, and the size of the shapes indicates the relative abundance of the taxon in the sample.

7. Data & code & results can be saved together to improve scientific reproducibility

One of the key features of phyloseq is that it gives researchers, on any system where R runs, a framework (R, R-markdown, Knitr, and RStudio) for performing reproducible statistical analysis of high-throughput sequencing data. Using phyloseq, RStudio, R-markdown, and Knitr, it’s possible to see in a single .html file the data used to generate a set of figures alongside the code that was used to generate those figures. I now keep a collection of reproducible research reports as part of my lab notebook, and I look forward to publishing the final report for my first study along with my first scientific manuscript. For an example, please see the phyloseq tutorials, which were also generated using this approach (http://joey711.github.io/phyloseq/import-data.html).

8. Phyloseq is easy to learn.

When I first began working with phyloseq, after taking a class taught by the author Susan Holmes, I knew only some basic R commands from an undergraduate statistics class. Working with phyloseq made learning R easy for me. Since phyloseq has a built-in set of datasets, it’s easy to reproduce the figures published in the phyloseq paper as a stepping-stone for creating figures of one’s own.

Conclusion

The R package phyloseq makes it easy to analyze high-throughput microbial census data, visualize the data, and perform reproducible statistical analysis. With its emphasis on reproducible research, phyloseq should facilitate conversations between researchers who publish data and those who consume it. This should help those of us in the infancy of microbiome research ensure that our data do not disappear as quickly as Vines et al. currently predict.

The paper and the phyloseq package are co-authored by Paul McMurdie and Professor Susan Holmes.

REFERENCES

2. Stromberg J. The vast majority of raw data from old scientific studies may now be missing. Smithsonianmag.com; 2013. Available from: http://blogs.smithsonianmag.com/science/2013/12/the-vast-majority-of-raw-data-from-old-scientific-studies-may-now-be-missing/.

3. Van Noorden R. Scientists losing data at a rapid rate. Nature News; 2014.

4. reddit.com thread: Scientific data is disappearing at alarming rate. 2014 [cited 2014]. Available from: http://www.reddit.com/r/science/comments/1tb2d3/scientific_data_is_disappearing_at_alarming_rate/.

# Which genetic variants determine histone marks?

Blog author Joe Davis is a graduate student with Stephen Montgomery & Carlos Bustamante.

The wealth of genetic variation in the human genome is found not within protein-coding genes but within non-coding regions. This comes as no surprise given that only about 1% of the genome codes for proteins. Until recently, efforts to determine the effects of genetic variation on trait variation and disease have focused on coding regions. Results of genome-wide association studies (GWAS), however, have shown that trait- and disease-associated variants are often regulatory variants, such as expression quantitative trait loci (eQTLs), found in non-coding regions. These results have spurred an effort to understand the functional role of non-coding, regulatory variation. Efforts have thus far relied on characterizing the association between variants and gene expression. This association alone, however, will not reveal the complete functional mechanism by which non-coding variants influence gene expression. Recent efforts have therefore begun to characterize numerous molecular phenotypes such as transcription factor (TF) binding, histone modification, and chromatin state to determine the mechanisms by which regulatory variants affect gene expression.

## One issue, four papers

In the November 8 issue of Science, three papers were published that address the effects of non-coding genetic variation on TF binding, histone modifications, and chromatin state (i.e., active versus inactive enhancer status). The first study was completed by the Dermitzakis Lab at the University of Geneva. They analyzed three TFs, RNA polymerase II (Pol II), and five histone modifications using chromatin immunoprecipitation and sequencing (ChIP-Seq) in lymphoblastoid cell lines (LCLs) from two parent-child trios [1]. The second was completed by the Pritchard Lab, which has recently moved to Stanford, and the Gilad Lab at the University of Chicago. They identified genetic variants affecting variation in four histone modifications and Pol II occupancy in ten unrelated Yoruba LCLs [2]. The third study was performed by the Snyder Lab at Stanford. They characterized the genetic variation underlying changes in chromatin state using RNA-Seq and ChIP-Seq for four histone modifications and two DNA binding factors in 19 LCLs from diverse populations [3]. This work was the subject of a recent CEHG Evolgenome talk given by Maya Kasowski, the study’s first author. Finally, the fourth study, published in the November 28 issue of Nature, was performed by the Glass Lab at UCSD. They characterized the effect of natural genetic variation between two mouse strains on the binding of two TFs involved in cell differentiation (PU.1 and C/EBPα) using ChIP-Seq [4]. In this post, I will focus primarily on the work presented by the Pritchard Lab, but I strongly recommend reading all four papers to understand the challenges in characterizing non-coding variation and the methods available to do so.

## Motivation

The four studies seek to answer the general question of how regulatory variation affects gene expression. They characterize diverse molecular phenotypes such as histone modifications and TF binding to understand the mechanisms of action for non-coding variants. The Pritchard Lab focused their study on four histone modifications (three active and one repressive: H3K4me3, H3K4me1, H3K27ac, and H3K27me3, respectively) and Pol II occupancy.

## Histone modifications 101

Histone modifications refer to the addition of chemical groups such as methyl or acetyl to specific amino acids on the tails of histone proteins comprising the nucleosome. These chemical groups are referred to as histone marks. They can serve a wide range of functions, but in general they are associated with the accessibility of a chromatin region. For example, the tri-methylation of lysine 4 of histone 3 (H3K4me3) is associated with increased chromatin accessibility and gene activation. On the other hand, increased levels of the repressive mark H3K27me3 (tri-methylation of lysine 27 of histone 3) at promoters is associated with gene inactivation.

Histone mark levels are measured in a high-throughput manner using ChIP-Seq. Briefly, an antibody targeting the mark of interest is used to pull down modified genomic regions. These immunoprecipitated regions are then sequenced to determine which genomic segments are modified and at what level. The procedure usually requires a large number of cells (on the order of 10^7), so the modification level is, in some ways, a population-level measurement. Analysis of ChIP-Seq data typically involves testing for genomic regions with more reads than expected by chance. These regions, ranging from 200 bp to 1000 bp or more, are referred to as peaks and represent a modification level above the genomic background. Repressive marks like H3K27me3 tend to have broad peak regions, while activating marks like H3K4me3 can have much tighter peaks.
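To make the notion of a "peak" concrete, here is a minimal sketch of the idea, not the pipeline used by any of these studies: call a window a peak when its read count is improbably high under a uniform Poisson background. The window size, read counts, and significance cutoff below are all hypothetical.

```python
import math

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed from the CDF."""
    cdf = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return 1.0 - cdf

def is_peak(reads_in_window, total_reads, genome_length, window_length, alpha=1e-5):
    """Flag a window whose read count exceeds the uniform background."""
    # Expected reads in the window if reads fell uniformly on the genome
    expected = total_reads * window_length / genome_length
    return poisson_sf(reads_in_window, expected) < alpha

# With 10M reads on a 3 Gb genome, a 1 kb window expects ~3.3 reads,
# so 30 reads in that window is called a peak while 3 reads is not.
```

Real peak callers add refinements such as local background estimation and broad-versus-narrow peak modes, which matters for broad marks like H3K27me3.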

Since modification levels represent measurements on a population of cells and histone residues can carry multiple modifications, genomic regions can show evidence for multiple marks. The combination of marks over a region can indicate its function. For example, regions with high levels of H3K27ac and a high ratio of H3K4me1 to H3K4me3 can mark active enhancers. Until now, the variation of these marks between individuals and the genetic causes of this variation were uncharacterized. Moreover, the causal role of these marks remains unknown: do they alter gene expression directly, or are they altered by gene regulation? The two guiding questions for this study are therefore:

1. What genetic variants influence histone modifications?

2. Are these modifications “a cause or a consequence of gene regulation?”

## Variation in histone modifications, a real whodunit

The authors first seek to identify and characterize genetic variants that influence histone marks. They generated ChIP-Seq data for the four histone marks and Pol II in LCLs derived from ten unrelated Yoruba individuals who were previously genotyped as part of the 1000 Genomes Project. Similar studies of regulatory variants, such as eQTL studies, require large sample sizes to detect the effects of regulatory variants that often lie outside the gene. Histone marks, in contrast, cover fairly broad genomic regions that often encompass the causal regulatory variants. As a result, the authors could use a smaller sample size and still be confident about interrogating the effects of causal regulatory SNPs. The authors developed a statistical test that models total read depth between individuals and allelic imbalance between haplotypes within individuals to increase power to detect cis-QTLs (i.e., variants that affect histone marks and Pol II occupancy nearby in the genome). Using this method, they identified over 1200 distinct QTLs for histone marks and Pol II occupancy (FDR 20%).
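The authors' combined test models read depth and allelic imbalance jointly; the allelic-imbalance ingredient on its own can be sketched as a simple two-sided binomial test at a heterozygous site. This is a simplification of their method, and the read counts are made up.

```python
import math

def binom_two_sided_p(k, n, p=0.5):
    """Two-sided binomial p-value: total probability of outcomes at
    least as extreme as observing k of n reads from one haplotype."""
    def pmf(i):
        return math.comb(n, i) * p**i * (1 - p)**(n - i)
    p_obs = pmf(k)
    # Sum probabilities of all outcomes no more likely than the observed one
    return min(1.0, sum(pmf(i) for i in range(n + 1) if pmf(i) <= p_obs + 1e-12))

# 18 of 20 ChIP-Seq reads coming from one haplotype suggests a
# cis-acting variant; 11 of 20 is consistent with balanced signal.
```

Sharing information across individuals and across sites, as the authors do, is what makes such a test powerful at a sample size of ten.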

The authors then analyze these histone mark and Pol II QTLs to determine the overlap of these variants with other known regulatory variants. The hypothesis is that regulatory variants that affect gene expression will have effects on diverse molecular phenotypes. Therefore, variants that influence histone marks and Pol II should show significant overlap with known regulatory variants such as eQTLs and DNase I sensitivity QTLs (dsQTLs). DNase I sensitivity is a measure of chromatin accessibility, with higher sensitivity associated with higher accessibility. The Pritchard Lab mapped eQTLs and dsQTLs in a larger sample of ~75 Yoruba LCLs in two previous studies that I also recommend reading [5,6]. Their analysis revealed an enrichment of low p-values for dsQTLs and, to a lesser extent, eQTLs when tested as histone mark and Pol II QTLs. In addition, the authors observed a coordinated change in multiple molecular phenotypes at dsQTLs and eQTLs. For example, higher levels of the three active histone marks were observed at dsQTLs for the more DNase I sensitive genotype. At eQTLs, H3K4me3, H3K27ac, and Pol II levels were higher for individuals with the high expression genotype. These results show that non-coding regulatory variants impact multiple molecular phenotypes ranging from chromatin accessibility and transcription to histone modifications. The authors provide strong evidence in response to their first guiding question, namely that non-coding regulatory polymorphisms associate with variation in histone marks and Pol II.

## TFs and a question of directionality

The authors then turned to the question of causality for these marks. To do so, they analyze genetic variants in TF binding sites. The main hypothesis is that regulatory variants that alter a TF binding site will modify TF binding, which in turn will cause changes in histone mark and Pol II levels nearby. If this is the case, then changes in histone marks are a consequence of TF binding strength. On the other hand, if these marks were causal, polymorphisms in TF binding sites would not be expected to show strong association with changes in these marks.

To test their hypothesis, the authors examine ~11.5K TF binding sites with polymorphisms heterozygous in at least 1 of their 10 individuals. They calculate the change in position weight matrix (PWM) score between the two alleles for polymorphic TF binding sites within each individual. They then test for significant association between this change in PWM score and allelic imbalance of ChIP-Seq reads at nearby heterozygous sites. The idea is that if a variant improves (or disrupts) TF binding for one allele at a binding site, then active histone marks nearby on the same allele will increase (or decrease). Repressive histone marks (in this case H3K27me3) are expected to show the opposite response. Indeed, when they apply their test, they find a significant positive association for the active marks and a negative association for the repressive mark. This result supports the hypothesis that changes in histone marks are a consequence of TF binding and gene regulation. However, it does not rule out other possibilities. Histone marks can still play a causal role in the establishment of TF binding. In other words, the relationship between TF binding and histone marks does not have to be unidirectional. In addition, there is evidence that long non-coding RNAs may play a role in the establishment and regulation of histone marks.
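To illustrate the PWM side of this test, the sketch below scores a binding site under a toy log-odds matrix and computes the score change when one allele is swapped for the other. The motif, the scores, and the site are entirely hypothetical, and the authors' actual association test is not shown.

```python
# Toy log-odds position weight matrix for a hypothetical 4-bp motif:
# each entry is log2(base frequency in the motif / 0.25 background).
PWM = [
    {"A": 1.5, "C": -2.0, "G": -2.0, "T": -0.5},
    {"A": -2.0, "C": 1.8, "G": -1.0, "T": -2.0},
    {"A": 1.2, "C": -0.5, "G": -2.0, "T": -1.0},
    {"A": -2.0, "C": -2.0, "G": 1.7, "T": -0.5},
]

def pwm_score(site):
    """Summed log-odds score of a binding site against the matrix."""
    return sum(column[base] for column, base in zip(PWM, site))

def allele_delta(site, pos, ref, alt):
    """Change in PWM score when the base at `pos` flips ref -> alt;
    a negative delta predicts weaker TF binding for the alt allele."""
    assert site[pos] == ref
    return PWM[pos][alt] - PWM[pos][ref]
```

In this toy example, the site "ACAG" scores 6.2, and a C-to-T variant at the second position lowers the score by 3.8; under the authors' hypothesis, reads carrying the T allele would then show lower levels of the active marks.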

## dsQTLs and eQTLs, a match made on chromatin

In their final analysis, the authors examine dsQTLs that are also eQTLs. Since these variants associate with both gene expression and chromatin accessibility at distal regulatory regions (>5kb from associated TSS), the authors can assign the regulatory region to a specific gene. A variant that is both a dsQTL and an eQTL likely disrupts a distal regulatory region. In addition to disrupting the accessibility of the regulatory region, the variant also perturbs the expression of a gene influenced by the regulatory region. For example, a variant may decrease the chromatin accessibility of an enhancer region and thereby decrease the level of active histone marks for the enhancer. This decreased enhancer activity can result in decreased transcription from a nearby gene and similarly decreased active mark levels for the gene. Therefore, the hypothesis guiding this analysis is that variants influencing the histone marks of a distal regulatory region will have a coordinated effect on histone marks at genes under the control of the regulatory region. The authors examine the allelic imbalance in ChIP-Seq reads at regulatory regions and their associated transcription start sites (TSS). Indeed, the authors observe that variants that increase DNase I sensitivity have a significant positive allelic imbalance for active marks at both the regulatory region and the TSS. The opposite is true for the repressive mark. This result again emphasizes the complexity of gene regulation and the impact of non-coding variation. Not only do regulatory variants influence diverse molecular phenotypes nearby, they can direct changes at distal loci. As the authors note, this coordinated change in histone marks between distal regions possibly reflects the 3D organization of chromatin. Regulatory variants that impact chromatin looping interactions between distal regulatory regions and genes may cause changes in activity levels for both the gene and the regulatory region.

## Conclusions

This paper provides clear evidence that regulatory variation has very complex impacts affecting multiple and diverse molecular phenotypes at multiple regions simultaneously. This complexity implies potentially numerous and diverse mechanisms by which regulatory variants act on gene regulation. The authors set out to find evidence for one of these mechanisms, namely perturbation of TF binding sites. They begin by showing that variation in histone modifications has a strong genetic basis and that the polymorphisms influencing these marks overlap with known regulatory variants such as eQTLs. They then show that polymorphisms in TF binding sites associate with changes in histone marks, providing evidence for directionality in the relationship between these marks and gene regulation. In essence, their results suggest that histone modifications are directed, at least in part, by TF binding. Finally, they find that regulatory variants can have an impact on the molecular phenotypes of distal regions.

I found this paper, as well as the other three previously mentioned, to be quite interesting. I think these papers show that our understanding of gene regulation is still very simplistic. With the advent of high-throughput molecular assays like ChIP-Seq and DNase-Seq, we can begin to interrogate the complex role of regulatory variation on many phenotypes. In doing so, it is of primary interest to ask questions regarding directionality. How do a given set of molecular phenotypes relate? Do these phenotypes represent a cause or a consequence of genome function? How do the diverse elements of gene regulation function together to build complex phenotypes?

## References

Paper author Jonathan Pritchard is a professor in the Departments of Genetics and Biology.

# Genomic analyses of ancestry of Caribbean populations

Blog author Rajiv McCoy is a graduate student in the lab of Dmitri Petrov.

In the Author Summary of their paper, “Reconstructing the Population Genetic History of the Caribbean”, Andrés Moreno-Estrada and colleagues point out that Latinos are often falsely depicted as a homogeneous ethnic or cultural group. In reality, however, Latinos, including inhabitants of the Caribbean basin, represent a diverse mixture of previously separate human populations, such as indigenous groups, European colonists, and West Africans brought over during the Atlantic slave trade. This mixing process, which geneticists call “admixture”, left a distinct footprint on genetic variation within and between Caribbean populations. By surveying genotypes of 330 Caribbean individuals and comparing them to a database of variation from more than 3,000 individuals from European, African, and Native American populations, Moreno et al. explore the genomic outcomes of this complex admixture process and reveal intriguing demographic patterns that could not be obtained from the historical record alone. The paper, featured in the latest edition of PLOS Genetics, represents a collaborative project with co-senior authorship by Stanford CEHG professor Carlos Bustamante and Professor Eden Martin from the University of Miami Miller School of Medicine.

Reconstructing the demographic history of admixed populations

Because parental DNA is only moderately shuffled before being incorporated into gametes (the process of meiotic recombination), admixture results in discrete genomic segments that can be traced to a particular ancestral population. In early generations after the onset of admixture, these segments are large; after many generations, they become quite small. By investigating the distribution of sizes of these ancestry “tracts”, Moreno and colleagues inferred the timing of various waves of migration and admixture. For Caribbean Island populations, they infer that European gene flow first occurred ~16-17 generations ago, which matches the historical record of ~500 years very closely, assuming ~30 years per generation. In contrast, for neighboring mainland populations from Colombia and Honduras, they find that European gene flow occurred in waves, starting more recently (~14 generations ago).
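The arithmetic behind this kind of dating can be sketched under the simplest single-pulse model, in which recombination breaks ancestry tracts at roughly one crossover per Morgan per generation. This back-of-envelope version ignores the refinements (multiple migration waves, ancestry proportions) used in the paper.

```python
def expected_tract_length_cm(generations):
    """Rough expected ancestry tract length (in centimorgans) for a
    single admixture pulse `generations` ago: about 100/g cM."""
    return 100.0 / generations

def years_since_admixture(generations, years_per_generation=30):
    """Convert a generation estimate to calendar years."""
    return generations * years_per_generation

# ~16.5 generations at ~30 years per generation lands on the ~500-year
# historical record of European contact; more recent pulses leave
# longer tracts (4 generations -> ~25 cM vs 16 generations -> ~6 cM).
```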

Identifying sub-continental ancestry of admixed individuals

Those familiar with human population genetics will recognize principal component analysis (PCA), which transforms a matrix of correlated observed genotypes into a set of uncorrelated variables, where the first component explains the most possible variance, the second the second-most, and so on. Individuals’ transformed genotypes can be plotted on the first two principal components, and when this is done on a worldwide scale, distinct clusters appear that represent populations of ancestry. On conventional PCA plots, admixed individuals fall between their different ancestral populations, as they possess sets of genotypes diagnostic of multiple ancestral groups. As virtually all Caribbean individuals are admixed to some degree, this pattern is apparent for Caribbean populations (see Figure 1B from the paper, reproduced below).
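A minimal sketch of this kind of genotype PCA (computed via singular value decomposition, with a made-up matrix of 0/1/2 allele counts) shows an "admixed" individual landing between two population clusters on the first component:

```python
import numpy as np

def pca_coordinates(genotypes, n_components=2):
    """Project a (samples x variants) matrix of 0/1/2 allele counts
    onto its top principal components."""
    centered = genotypes - genotypes.mean(axis=0)  # center each variant
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    return U[:, :n_components] * S[:n_components]  # PC coordinates

# Two toy populations differing at the first three variants, plus one
# individual with intermediate genotypes at those variants.
G = np.array([
    [2, 2, 2, 0], [2, 2, 1, 1], [2, 1, 2, 0],  # population A
    [0, 0, 0, 1], [0, 0, 1, 0], [0, 1, 0, 1],  # population B
    [1, 1, 1, 0],                              # "admixed" individual
], dtype=float)
coords = pca_coordinates(G)
```

On PC1, the last individual's coordinate falls between the two cluster means, which is the pattern described above for admixed Caribbean samples.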

While interesting, this means that the sub-continental ancestry of these admixed individuals is difficult to ascertain.  An individual may want to know which Native American, West African, and European populations contribute to his or her ancestry, and this analysis does not have sufficient resolution to answer these questions.

Moreno and colleagues therefore devised a new version of PCA called ancestry-specific PCA (ASPCA), which extracts genomic segments assigned to Native American, West African, and European ancestry, then analyzes these segments separately, dealing with the large proportions of missing data that result.  In the case of Native American ASPCA, they observe two overlapping clusters.  The first represents mostly Colombians and Hondurans, who cluster most closely with indigenous groups from Western Colombia and Central America and have a greater overall proportion of Native American ancestry.  The second cluster represents mostly Cubans, Dominicans, and Puerto Ricans, who cluster most closely with Eastern Colombian and Amazonian indigenous groups.  This makes sense in light of the fact that Amazonian populations from the Lower Orinoco Valley settled on rivers and streams, which could have facilitated their migration.  Because indigenous ancestry proportions were relatively consistent and closely clustered across different Caribbean Islands, the authors posit that there was a single pulse of expansion of Amazonian natives across the Caribbean prior to European arrival, along with gene flow among the islands.

In the case of European ASPCA, Moreno et al. found that Caribbean samples clustered closest to, but clearly distinct from, present day individuals from the Iberian Peninsula in Southern Europe.  In fact, the differentiation between this “Latino-specific component” and Southern Europe is at least as great as the differentiation between Northern and Southern Europe.  The authors hypothesize that this is due to very small population sizes among European colonists, which would have introduced noise into patterns of genomic variation through the process of random genetic drift.

Finally, the authors demonstrate that Caribbean populations have a higher proportion of African ancestry compared to mainland American populations, a result of admixture during and after the Atlantic slave trade.  Surprisingly, the authors found that all samples tightly clustered with present day Yoruba samples from Nigeria rather than being dispersed throughout West Africa.  However, because other analyses suggested that there might have been two major waves of migration from West Africa, the authors decided to analyze “old” and “young” blocks of African ancestry separately.  This analysis revealed that “older” segments are primarily derived from groups from the Senegambia region of Northwest Africa, while “younger” segments likely trace to groups from the Gulf of Guinea and Equatorial West Africa (including the Yoruba).

Conclusions and perspectives

This groundbreaking study has immediate implications for the field of personalized medicine, especially due to the discovery of a distinct Latino-specific component of European ancestry.  The hypothesis that European colonists underwent a demographic bottleneck (a process termed the “founder effect”) has expected consequences for the frequency of damaging mutations contributing to genetic disease. The observation of extensive genetic differences among Caribbean populations also argues for more such studies characterizing genetic variation on a smaller geographic scale. The newly developed ASPCA method will surely be valuable for other admixed populations.  In addition to medical implications, studies such as this help dispel simplistic notions of race and ethnicity and inform cultural identities based on unique and complex demographic history.

Citation: Moreno-Estrada A, Gravel S, Zakharia F, McCauley JL, Byrnes JK, et al. (2013) Reconstructing the Population Genetic History of the Caribbean. PLoS Genet 9(11): e1003925. doi:10.1371/journal.pgen.1003925

Paper author Andrés Moreno-Estrada is a research associate in the lab of Carlos Bustamante.

# How recombination and changing environments affect new mutations

Blog author: David Lawrie was a graduate student in Dmitri Petrov’s lab. He is now a postdoc at USC.

I recently sat down with Oana Carja, a graduate student with Marc Feldman, to discuss her paper published in Theoretical Population Biology, entitled “Evolution with stochastic fitnesses: A role for recombination”. In it, authors Oana Carja, Uri Liberman, and Marcus Feldman explore when a new mutation can invade an infinite, randomly mating population that experiences temporal fluctuations in selection.

## The one locus case

This work builds on previous research in the field on how fluctuations in fitness over time (i.e., increased variance in fitness) affect the invasion dynamics of a mutation at a single locus. For a single locus, it has been shown that the geometric mean of an allele’s fitness over time determines its ability to invade a population, a result known as the geometric mean principle. Fluctuations in fitness increase the variance and therefore decrease the geometric mean fitness, so the variance of an allele’s fitness over time greatly impacts its ability to invade a population.
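A quick numerical illustration of the geometric mean principle, with made-up fitness values: an allele whose fitness alternates between 1.2 and 0.8 has an arithmetic mean fitness of exactly 1, yet its geometric mean is below 1, so over time it is expected to fail to invade.

```python
import math

def geometric_mean(values):
    """Geometric mean computed via logs to avoid numerical over/underflow."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical fitness alternating between good and bad generations.
fitnesses = [1.2, 0.8] * 50

arithmetic = sum(fitnesses) / len(fitnesses)  # exactly 1.0
geometric = geometric_mean(fitnesses)         # sqrt(1.2 * 0.8) ~ 0.98, below 1
```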

## What if there are two loci?

In investigating a two locus model, the researchers split the loci by their effect on the temporally-varying fitness: one locus only affects the mean, while the other controls the variance. The authors demonstrate through theory and simulation that:

1)    allowing for recombination between the two loci increases the threshold for the combined fitness of the two mutant alleles to invade the population beyond the geometric mean (see figure).

2)    periodic oscillations in the fitness of the alleles over time lead to higher fitness thresholds for invasion over completely random fluctuations (see figure).

3)    edge case scenarios allow for the maintenance of polymorphisms in the population despite clear selective advantages of a subset of allelic combinations.

Temporally changing environments and recombination thus make it overall more difficult for new alleles to invade a population.

Invasibility thresholds as a function of recombination rate. If there is no recombination (the left-most edge of the figure), the geometric mean fitness of the pair of new alleles needs to be higher than 0.5 to allow for invasion, because the resident alleles’ geometric mean fitness is set to 0.5. However, as recombination between the two loci increases, the geometric mean needed for invasion increases rapidly. If there is free recombination (r = 0.5), invasion can only happen if the new alleles’ geometric mean fitness is twice the resident alleles’ geometric mean fitness (light grey area). If the environment is changing periodically, it is even harder for new alleles to invade the population (dark grey area).

## The evolution of models of evolution

This work is important for addressing the evolutionary dynamics of loci controlling phenotypic variance – in this case, controlling the ability of a phenotype to maintain its fitness even when the environment is variable. Most environments undergo significant temporal shifts, from the simple changing of the seasons to larger-scale changes such as El Niño and climate change, through which species must survive and thrive. For organisms in the wild, many alleles that confer a benefit in one environment will be deleterious when the environment and selective pressures change. There may be modifier loci that buffer the fitness of those loci in the face of changing environments. Such modifier loci have recently been found in genome-wide association studies and may be important for overall phenotypic variance. Modeling the patterns of evolution for multiple loci in temporally varying environments is thus a key step toward understanding the patterns found in nature.

## Future work

Epigenetic modifiers are a hot area of research and one potential biological mechanism to control phenotypic variance. The evolution of such epigenetic regulation is a particular research interest of Oana. Future work will continue to explore the evolutionary dynamics of epigenetic regulation and focus on applying the above results to finite populations.

Paper author: Oana Carja is a graduate student with Marc Feldman


Blog-author: Sandeep Venkataram is a grad student in Dmitri Petrov’s lab.

Modern humans have been evolving independently of our nearest living relatives, chimpanzees, for over 7 million years. To study our evolutionary history since this divergence, our major source of information is the fossilized remains of our ancestors and of closely related species such as Neanderthals. Physiological information from these remains can tell us a lot about human evolution, but the majority of the information is locked up in the tiny amounts of highly degraded and fragmented DNA left in the specimen.

Studying ancient DNA (aDNA) from bones is extremely challenging. Each sample contains not only the fossil’s DNA but also contaminating DNA from enormous numbers of microbes and other organisms, plus possible human contamination from handling the fossil. Since the contaminants have much higher-quality DNA than the endogenous sample, simply processing a raw DNA extract typically yields sequencing results with less than 1% aDNA. Sequencing such a sample is incredibly inefficient, as less than 1% of the sequence is actually useful, and only the best-funded studies can afford the sequencing costs to generate a high-coverage genome. Therefore, most aDNA studies tend to focus on mitochondrial DNA using targeted capture methods or PCR amplicon sequencing. This reduces wasted sequencing capacity but greatly limits the amount of information obtained from the sample.

## Getting the most out of ancient DNA

Meredith Carpenter, a postdoc here at Stanford, and her colleagues have developed a novel method called whole-genome in-solution capture (WISC) to purify aDNA from next-gen sequencing libraries generated from fossil DNA samples. The authors make use of RNA bait technology (Figure 1): RNA complementary to the targeted DNA is synthesized using biotinylated nucleotides (Gnirke et al. 2009), hybridized to the DNA pool, and purified from solution by linking the biotin to commercially available purification beads. By then eluting the hybrids from the beads and removing the RNA from solution, one can greatly enrich for the targeted DNA. Previous methods using RNA baits required synthesizing DNA strands complementary to the RNA on microarrays, then inducing transcription in vitro to generate the RNA bait library. This methodology has been used successfully to capture the entirety of chromosome 21 from human fossils (Fu et al. 2013), but it is not cost-effective for producing a bait library that covers the entire human genome.

Meredith Carpenter and coauthors circumvent this by mechanically fragmenting human genomic DNA and using blunt-end ligation to attach adapter sequences containing an RNA polymerase promoter. This modified DNA library can then be used to generate the biotinylated RNA bait library via in vitro transcription, after which the RNA library is purified using standard protocols. The authors also generate non-biotinylated RNA complementary to the adapter sequences present on every DNA fragment in the library, to block nonspecific binding to the adapters. The bait and block RNA libraries are hybridized to the aDNA libraries, which are then purified using beads to select only the aDNA fragments and sequenced. The cost of their enrichment method is estimated at $50 per sample, making it accessible to most labs conducting aDNA genomic studies.

## WISC greatly enriches for ancient DNA across a variety of samples

The authors tested this method on a variety of aDNA libraries prepared from both high-quality and low-quality samples, including hair, teeth, and bone remains, and fossils from tropical as well as more temperate regions, since climate can greatly influence DNA quality. They sequenced the libraries both before and after WISC and found a 3-13x increase in the number of uniquely mapped reads after capture. In addition, most of the unique reads in the enriched libraries could be recovered with just five million sequencing reads in both hair and bone samples. WISC thus allows most of the endogenous sequence to be read from dozens of aDNA samples in a single lane of Illumina HiSeq, opening the possibility of sequencing the millions of fossils in museums and collections around the world.
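The practical payoff is easiest to see with back-of-envelope arithmetic: the 1% endogenous fraction comes from the text above, and the 10-fold enrichment used here is one value within the reported 3-13x range.

```python
def total_reads_needed(endogenous_reads_wanted, endogenous_fraction):
    """Total sequencing reads required to recover a target number of
    endogenous (ancient) reads, given the on-target read fraction."""
    return endogenous_reads_wanted / endogenous_fraction

# To collect 1M ancient reads from a library that is 1% endogenous:
before_capture = total_reads_needed(1_000_000, 0.01)      # ~100M reads
# After a 10x enrichment of the endogenous fraction:
after_capture = total_reads_needed(1_000_000, 0.01 * 10)  # ~10M reads
```

An order-of-magnitude drop in required sequencing is what makes pooling dozens of samples into one HiSeq lane plausible.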

Dr. Carpenter says they are now focusing on adapting the method to remove human DNA contamination from microbiome sequencing projects, with promising preliminary results, as well as on applications in forensics and the study of extinct species. Because WISC generates the RNA bait library from genomic DNA of an extant relative instead of from synthetic DNA arrays, bait libraries can be prepared even for organisms whose genomes have not been sequenced. Combined with recent advances in aDNA library construction methods (Meyer et al. 2012), WISC promises to make sequencing of contaminated and degraded samples widely accessible.

Carpenter et al (2013) Figure 1. Schematic of the Whole-Genome In-Solution Capture Process
To generate the RNA “bait” library, a human genomic library is created via adapters containing T7 RNA polymerase promoters (green boxes). This library is subjected to in vitro transcription via T7 RNA polymerase and biotin-16-UTP (stars), creating a biotinylated bait library. Meanwhile, the ancient DNA library (aDNA “pond”) is prepared via standard indexed Illumina adapters (purple boxes). These aDNA libraries often contain <1% endogenous DNA, with the remainder being environmental in origin. During hybridization, the bait and pond are combined in the presence of adaptor-blocking RNA oligos (blue zigzags), which are complementary to the indexed Illumina adapters and thus prevent nonspecific hybridization between adapters in the aDNA library. After hybridization, the biotinylated bait and bound aDNA are pulled down with streptavidin-coated magnetic beads, and any unbound DNA is washed away. Finally, the DNA is eluted and amplified for sequencing.

Paper author Meredith Carpenter is a postdoc in Carlos Bustamante’s lab.

## References

Carpenter, M. L., Buenrostro, J. D., Valdiosera, C., Schroeder, H., Allentoft, M. E., Sikora, M., Rasmussen, M., et al. (2013). Pulling out the 1%: Whole-Genome Capture for the Targeted Enrichment of Ancient DNA Sequencing Libraries. The American Journal of Human Genetics, 1–13. doi:10.1016/j.ajhg.2013.10.002

Fu, Q., Meyer, M., Gao, X., Stenzel, U., Burbano, H. a, Kelso, J., & Pääbo, S. (2013). DNA analysis of an early modern human from Tianyuan Cave, China. Proceedings of the National Academy of Sciences of the United States of America, 110(6), 2223–7. doi:10.1073/pnas.1221359110

Gnirke, A., Melnikov, A., Maguire, J., Rogov, P., LeProust, E. M., Brockman, W., Fennell, T., et al. (2009). Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing. Nature biotechnology, 27(2), 182–9. doi:10.1038/nbt.1523

Meyer, M., Kircher, M., Gansauge, M.-T., Li, H., Racimo, F., Mallick, S., Schraiber, J. G., et al. (2012). A high-coverage genome sequence from an archaic Denisovan individual. Science (New York, N.Y.), 338(6104), 222–6. doi:10.1126/science.1224344

Update: The second paragraph originally talked about fossils, but it should have been bones (now corrected). A fossil is mineralized and would not yield aDNA.

# Biology Thinks Big! Symposium great success

Jeremy Hsu is a graduate student in the Hadly lab. He writes about the Biology Thinks Big! Symposium, which he organized and which was a great success. This post was previously published on the Hadly Lab blog.

On Wednesday evening, we held the first Stanford Biology Thinks Big! symposium, where four Stanford faculty from across the biosciences each gave a short, ten-minute TED-like talk about a big idea in their research or their field.

I first started thinking about hosting such an event last year, as a way to showcase the amazing diversity of scientific research and viewpoints within the biosciences at Stanford, and to introduce people to science and research in an accessible, engaging format. To that end, I was thrilled with the four speakers we had, with John Boothroyd, Mary Teruel, Carlos Bustamante, and Susan McConnell each representing different departments here at Stanford. Even with such an amazing line-up of faculty, though, I have to admit that I was a bit nervous about turnout for the event – this is the first time anyone’s tried doing such an event here, and I had no idea what to expect.

The first indication that my nervousness was unwarranted, though, came when people started arriving thirty minutes prior to the start time! More and more people filled in, and the response was overwhelming – not only was our room at maximum capacity (with the aisles filled with people sitting and standing), but our overflow room next door was also absolutely jam packed, with people still trying to peer in from outside the doors.

It was clear that the audience was very enthusiastic and eager about this event. I took a moment to poll the audience at the beginning, and there was a good distribution of undergraduates, graduate students, post-docs, and general community members!

Despite some of our A/V equipment not cooperating in our overflow room, the excitement of the audience was palpable. Dr. Boothroyd (Department of Microbiology and Immunology) began by talking about his work on a specific microbe, and Dr. Teruel (Department of Chemical and Systems Biology) highlighted the mechanisms behind the development of fat cells.

Next, Dr. Bustamante (Department of Genetics) engaged the audience in a wide-ranging discussion on genetics, before Dr. McConnell (Department of Biology) wrapped up the symposium with an inspiring message on science and conservation, using her own photographs of nature and animals.

It was a great experience to hear four different perspectives on science, and to listen to such a diversity of ideas. Afterward, I was thrilled to see so many students come up to each of the speakers to ask more questions about their work, and many of the students even asked about advice regarding finding and joining a lab or the graduate school process!

Overall, it was clear that everyone was eager to hear the big ideas that Stanford researchers have, and I’m already looking forward to the next time we do this!

— Jeremy

# The fruit fly and its microbiome

Philipp Messer is a research associate in the Petrov lab. This post was written by him.

Although fruit flies are one of the most important model organisms in genetics, evolution, and immunology, surprisingly little is known about their associated microorganisms (their microbiome). This is all the more surprising if you consider that the microbiome can strongly affect quantitative traits in flies, for example their growth rate and cold tolerance. Furthermore, the natural environment of fruit flies – rotting fruit – is very rich in microorganisms.

## All organisms interact with associated microbes

Because microbes can influence the phenotype of organisms, we expect such interactions to be subject to natural selection. Genes involved in pathogen defense are indeed amongst the fastest evolving genes. But interactions with microbes do not always just lead to an evolutionary arms race between microbes and their hosts; they can also facilitate major evolutionary innovations. Prominent examples of such innovations are the light organ of the bobtail squid, which arose through a symbiotic relationship between squids and bioluminescent bacteria, and cellulose digestion in termites, which relies on microbes in their guts. Hence, to improve our understanding of the evolution of fruit flies, we need to better understand how they interact and coevolve with their associated microorganisms.

In their paper “Host species and environmental effects on bacterial communities associated with Drosophila in the laboratory and in the natural environment”, Fabian Staubach and his colleagues at Stanford and the Max Planck Institute for Evolutionary Biology in Plön shed light on some of the major questions regarding Drosophila associated microbes. Beyond finding out which bacteria are present in flies, they assess the relative roles of host species and environmental effects on bacterial communities, detect candidate natural pathogens, and find interesting results regarding lab-of-origin-effects on the fly microbial community.

## We need more studies like this

These results are not only highly relevant for everyone working with Drosophila, but are also a strong reminder that we cannot understand any model organism without taking its associated microbiota into account. We therefore need more microbiome studies like that of Staubach et al to identify the microbes that coevolve with their hosts and understand how the genomes of hosts and microbes interact in the evolutionary process. I would not be surprised if interactions between microbes and their hosts turn out to be among the biggest selective forces in many organisms.

The paper is a fun and easy read and can be found here. Fabian was a postdoc in the Petrov lab from 2010 to 2013 and has just moved to the University of Freiburg in Germany to start his own group, where he plans to pursue his interest in the role of microbes in adaptation.

Citation: Staubach F, Baines JF, Künzel S, Bik EM, Petrov DA (2013) Host Species and Environmental Effects on Bacterial Communities Associated with Drosophila in the Laboratory and in the Natural Environment. PLoS ONE 8(8): e70749. doi:10.1371/journal.pone.0070749

Fabian Staubach studies the microbiome of fruit flies.