Using phyloseq for the reproducible analysis of high-throughput sequencing data in microbial ecology

Blog author Diana Proctor is a graduate student in David Relman’s lab.

Blog author Diana Proctor is a graduate student in David Relman’s lab.

The Problem: Data Availability & Scientific Reproducibility

A Current Biology (1) paper evaluating the accessibility of scientific data recently inspired articles and blog posts (2, 3) as well as a lively conversation on Reddit about the “alarming rate of data disappearance” (4). Solutions to the problem of disappearing data include the NIH data sharing policy, as well as data sharing policies set by scientific journals, requiring the deposition of data into public repositories.

As a trainee in David Relman’s Lab thinking about the eventual fate of the high-throughput, next generation sequencing data generated over the course of my dissertation (, this conversation about data accessibility brings to mind a related question – how can I ensure that my data are not lost as fast as the current biology paper predicts?

The Solution: Phyloseq allows microbial ecologists to make reproducible research reports

The solution to data disappearance probably needs to involve not only deposition of data into public repositories, but also the widespread use of reproducible research reports. Luckily for those microbial ecologists among us, Paul McMurdie and Susan Holmes of Stanford University developed an R-based Bioconductor package (i.e., a package for bioinformatics) called phyloseq to facilitate the reproducible statistical analysis of high throughput phylogenetic sequencing datasets, including those generated by barcoded amplicon sequencing, metabolomic, and metagenomic experiments (5, 6). Phyloseq, initially released in 2012, was recently updated by McMurdie & Holmes, and described in an April 2013 publication (6).

Phyloseq Key Features

Phyloseq allows the user to import a species x sample data matrix (aka, an OTU Table) or data matrices from metagenomic, metabolomic, and/or other –omics type experiments into the R computing environment. Previous R extensions, such as OTUbase, also have the capacity to import these data matrices into R, but phyloseq is unique in that it allows the user to integrate the OTU Table, the phylogenetic tree, the “representative sequence” fasta file, and the metadata mapping file into a single “phyloseq-class” R object. The microbial ecologist can then harness all the statistical and graphical tools available in R, including Knitr, R-Markdown and ggplot2 (among others), to generate reproducible research reports with beautiful graphics, as detailed below. To see the report McMurdie used to prepare the phyloseq publication, visit this link:

1. Phyloseq incidentally allows the user to curate data

When phyloseq imports the myriad phylogenetic sequencing data objects into R, it scrutinizes the data, making sure that the OTU Table matches the metadata mapping file, the phylogenetic tree, and the representative sequence labels. If not, the user gets an error. If the data descriptors are congruent, a single phyloseq object can be created, which can then be saved along with the R code used to create the object. I have found that this enables me to curate my data – consolidating all the data objects (OTU Table, mapping file, phylogenetic tree, etc.) describing a single experiment into a single multi-level data frame.

2. Phyloseq gives the user the analytical power of R

Importantly, by importing data into the R-computing environment, one may easily perform beta diversity analysis using any or all of over 40 different ecological distance metrics before performing virtually any ordination under the sun. Several alpha diversity metrics are implemented in phyloseq, as well. Finally, after getting the data into R, it’s easy to perform more sophisticated analyses than has previously been possible with this type of dataset, such as k-tables analysis (7), using R’s repertoire of extension packages.

3. Phyloseq makes standardization of sequence data pretty simple

Of particular note, the authors have included in phyloseq several methods to standardize and/or normalize high throughput sequence data. Most of us of course realize the need for data standardization (as evidenced by our reliance on rarefaction), but the tools to easily standardize data, aside from rarefaction, have been lacking (8). The authors of phyloseq have equipped us with several methods (one new!) to standardize our microbial census data, as well as the code needed to accomplish the task (

4. Phyloseq makes subsetting large datasets easy

One of my favorite uses for the phyloseq package is that it allows me to easily subset my dataset. In my work, I study the spatial variation of oral microbial communities. I have taken samples from all teeth from the mouths of just a handful of research subjects, but I have samples for certain teeth from all subjects. Phyloseq makes it easy for me to take a complete OTU Table, and subset it on only those teeth that were sampled in all subjects. Similarly, I can subset my OTU Table on a single bacterial phylum or on a single species, or on any covariate in my metadata mapping file, using a single line of R code.

5. Phyloseq enables the user to generate reproducible graphics

The authors of phyloseq created several custom ggplot2 (9) functions, enabling the phyloseq user, with just a few lines of code, to generate all of the most common graphics used in microbial census research (e.g., heatmaps, networks, ordination plots, phylogenetic trees, stacked bar plots for abundance measurements, etc.). Examples of these plots are shown in Figure 1 (though many other possibilities are supported, which can be seen here:

Fig 1A. The NMDS ordination plot  shows the separation of samples by weighted UniFrac distance for the Global Patterns dataset. Human-associated communities appear to cluster towards the right side of NMDS1 while non-human associated communities cluster towards the left.

Fig 1A. The NMDS ordination plot shows the separation of samples by weighted UniFrac distance for the Global Patterns dataset. Human-associated communities appear to cluster towards the right side of NMDS1 while non-human associated communities cluster towards the left.

Fig 1B. The dodged boxplots  show three alpha diversity metrics (Observed species, Chao1, and ACE) on the Y-axis with data classified on the X-axis as either a human-associated or a non-human associated microbial community.  This plot shows that non-human associated communities, in general, appear to be much more diverse than human-associated communities.

Fig 1B. The dodged boxplots show three alpha diversity metrics (Observed species, Chao1, and ACE) on the Y-axis with data classified on the X-axis as either a human-associated or a non-human associated microbial community. This plot shows that non-human associated communities, in general, appear to be much more diverse than human-associated communities.

6. Phyloseq allows covariate data to be visualized with the phylogenetic tree

In particular, phyloseq solves very well the problem of visualizing the phylogenetic tree – it allows the user to project covariate data (such as sample habitat, host gender, etc.) onto the phylogenetic tree, so that relationships between microbes, microbial communities, and the habitat from which they were derived can easily be seen. As an example, the relative abundance of taxa in samples was projected onto the phylogenetic tree along with the environment from which the samples were derived along with the bacterial order in Figure 2. I’ve not seen any other application that allows similar visualizations of the tree, and bootstrapping is also supported. For additional examples, refer to the phyloseq tutorial (

Fig 2: An example of using phyloseq to visualize phylogenetic trees along with covariate data using the Global Patterns dataset. In this figure, the sample type is shown in color, the shapes are bacterial Order, and the size of the shapes indicates the relative abundance of the taxon in the sample.

Fig 2: An example of using phyloseq to visualize phylogenetic trees along with covariate data using the Global Patterns dataset. In this figure, the sample type is shown in color, the shapes are bacterial Order, and the size of the shapes indicates the relative abundance of the taxon in the sample.

7. Data & code & results can be saved together to improve scientific reproducibility

One of the key features of phyloseq is that it provides researchers who have access to any system where R is able to run with a framework (i.e., R, R-markdown, Knitr & Rstudio) to perform reproducible statistical analysis of high throughput sequencing data. Using Phyloseq, Rstudio, R-markdown, and Knitr, it’s possible to see in a single .html file the data used to generate a set of figures alongside the code that was used to generate those figures. I now keep a collection of reproducible research reports as part of my lab notebook, and I look forward to being able to publish the final report for my first study along with my first scientific manuscript. For an example, please see the phyloseq tutorials, which were also generated using this approach (

8. Phyloseq is easy to learn.

When I first began working with Phyloseq, after taking a class taught by the author Susan Holmes, I knew some basic R commands from an undergraduate statistics class. Working with phyloseq made learning R easy for me. Since Phyloseq has a built in set of datasets one can use, it’s easy to reproduce the figures published in the phyloseq paper, as a stepping-stone for creating figures of one’s own.


An R-based package called Phyloseq makes it easy to analyze high throughput microbial census data, visualize the data, and perform reproducible statistical analysis. Phyloseq should facilitate conversations between researchers who publish data and the consumers of it with its emphasis on reproducible research. This should help those of us in the infancy of microbiome research ensure that our data do not disappear as quickly as Vine et al. currently predicts.

The paper was co-authored by Paul McMurdie and Professor Susan Holmes (in the picture).

The paper and the phyloseq package are co-authored by Paul McMurdie and Professor Susan Holmes (in the picture).


1.         Vines TH, Albert AY, Andrew RL, Debarre F, Bock DG, Franklin MT, et al. The Availability of Research Data Declines Rapidly with Article Age. Current biology : CB. 2013. doi: 10.1016/j.cub.2013.11.014. PubMed PMID: 24361065.

2.         Stromberg J. 2014.]. Available from:

3.         Noorden EGRV. Scientists losing data at a rapid rate. Nature News. 2014.

4.         bobmanyun. 2014 1-7-2014. [cited 2014]. Available from:

5.         McMurdie PJ, Holmes S. Phyloseq: a bioconductor package for handling and analysis of high-throughput phylogenetic sequence data. Pac Symp Biocomput. 2012:235-46. Epub 2011/12/17. PubMed PMID: 22174279; PubMed Central PMCID: PMC3357092.

6.         McMurdie PJ, Holmes S. phyloseq: an R package for reproducible interactive analysis and graphics of microbiome census data. PLoS One. 2013;8(4):e61217. doi: 10.1371/journal.pone.0061217. PubMed PMID: 23630581; PubMed Central PMCID: PMC3632530.

7.         Thioulouse J. Simultaneous analysis of a sequence of paired ecological tables: A comparison of several methods. Ann Appl Stat. 2011;5(4):2300-25. doi: 10.1214/10-AOAS372.

8.         Holmes PJMS. Waste Not, Want Not: Why Rarefying Microbiome Data is Inadmissible. ARXIV. 2013. Epub 10/2013.

9.         Wickham H. ggplot2: Elegant Graphics for Data Analysis: Springer; 2009. 213 p.


Demographic inference from genomic data in nonmodel insect populations

Blog author Martin Sikora is a postdoc in the lab of Carlos Bustamante.

Blog author Martin Sikora is a postdoc in the lab of Carlos Bustamante.

Reconstructing the demographic history of species and populations is one of the major goals of evolutionary genetics. Inferring the timing and magnitude of past events in the history of a population is not only of interest in its own right, but also in order to form realistic null models for the expected patterns of neutral genetic variation in present-day natural populations. A variety of methods exist that allow the inference of these parameters from genomic data, which, in the absence of detailed historical records in most situations, is often the only feasible way to obtain them. As a consequence, it is generally not possible to empirically validate the parameters inferred from genomic data in a direct comparison with a known “truth” from a natural population. Furthermore, until recently, the application of these methods was limited to model organisms with well-developed genomic resources (e.g., humans and fruitflies), excluding a large number of non-model organisms with potentially considerable evolutionary and ecological interest.

Chasing butterflies?

In an elegant study recently published in the journal Molecular Ecology, Rajiv McCoy, a graduate student with Dmitri Petrov and Carol Boggs, and colleagues tackle both of these problems in natural populations of Euphydryas gillettii, a species of butterfly native to the northern Rocky Mountains. About 30 years ago, a small founder population of this species from Wyoming was intentionally introduced to a new habitat at the Rocky Mountain Biological Laboratory field site in Colorado, and population sizes were recorded every year since the introduction. The beauty of this system is that it allows the authors to perform a direct comparison of the known demography (i.e. a recent split from the parental population and bottleneck ~30 generations ago, with census data in the newly introduced population) with estimates inferred from genomic data.

Gillete’s Checkerspot (Euphydryas gillettii). Photo taken by Carol Boggs, co-advisor of Rajiv and one of the senior authors of the study.

Gillete’s Checkerspot (Euphydryas gillettii). Photo taken by Carol Boggs, co-advisor of Rajiv and one of the senior authors of the study.

A genomic dataset from a non-model organism

The researchers sampled eight larvae each from both the parental as well as the derived population for this study. In the world of model organisms, the next steps for constructing the dataset would be straightforward: Extract genomic DNA, sequence to the desired depth, map to the reference genome and finally call SNPs. In the case of E. gillettii however, no reference genome is available, so the authors had to use a different strategy. They decided to use RNA-sequencing in order to first build a reference transcriptome, which was then used as a reference sequence to map against and discover single nucleotide variants. An additional advantage of this approach is that the data generated can potentially also be utilized for other types of research questions, such as analyses of gene expression differences between the populations. On the downside, SNP calling from a transcriptome without a reference genome is challenging and can lead to false positives, for example due to reads from lowly expressed paralogs erroneously mapping to the highly expressed copy present in the assembled transcriptome. The authors therefore went to great lengths to stringently filter these false positive variants from their dataset.

Demographic inference using δαδι

For the demographic inference, McCoy and colleagues used δαδι (diffusion approximation for demographic inference), a method developed by Ryan Gutenkunst while he was a postdoc in the group of CEHG faculty member Carlos Bustamante. This method uses a diffusion approximation to calculate the expected allele frequency spectrum under a demographic model of interest. The observed allele frequency spectrum is then fit to the expected spectrum by optimization of the demographic parameters to maximize the likelihood of the data. δαδι has been widely used to infer the demographic history of a number of species, from humans to domesticated rice, and is particularly suited to large-scale genomic datasets due to its flexibility and computational efficiency.

Excerpt of Figure 2 from McCoy et al., illustrating the demographic models tested using δαδι.

Excerpt of Figure 2 from McCoy et al., illustrating the demographic models tested using δαδι.

Models vs History

The authors then fit a demographic model reflecting the known population history of E. gillettii, as illustrated in Figure 2 of their article (Model A). Encouragingly, they found that the model provided a very good fit to the data, with an the estimate of the split time between 40 and 47 generations ago, which is very close to the known time of establishment of the Colorado population 33 generations ago. Furthermore, they also tested how robust these results were to using a misspecified demographic model, by incorporating migration between the Colorado and Wyoming populations in their model (which in reality are isolated from each other). However, both alternative models with migration (Models B1 and B2) did not significantly improve the fit, again nicely consistent with the known population history.

Three butterflies is enough?

Finally, the researchers also tested the robustness of the results to variations in the number of samples or SNPs used in the analysis, from datasets simulated under the best-fit model A. They found that δαδι performed remarkably well even with sample sizes as low as three individuals per population. While this is in principle good news for researchers limited by low number of available samples, one has to be aware of the fact that this results will be to a certain extent specific to this particular type of system, where one population undergoes a very strong bottleneck resulting in large effects on the allele frequency spectrum. A good strategy suggested by McCoy and colleagues is then to use these types of simulations in the planning stages of an experiment, in order to inform researchers of the number of samples and markers necessary to confidently estimate the demographic parameters of interest.

Conclusions and future directions

For me, this study is a great example of how next-generation sequencing and sophisticated statistical modeling can open up a new world of possibilities to researchers interested in the ecology and evolution of natural populations. McCoy and colleagues constructed their genomic dataset essentially from scratch, without the “luxuries” of a reference genome or database of known polymorphisms. Moving forward, Rajiv has been busy collecting more samples over the past year. He and his colleagues plan to sequence over a thousand of them for the next phase of the project, as well as assemble a reference genome for E. gillettii, and important next step in the development of genomic tools for this fascinating ecological system.

The author of the paper Rajiv McCoy, sampling larvae of Euphydryas gillettii

The author of the paper Rajiv McCoy, sampling larvae of Euphydryas gillettii

McCoy, R. C., Garud, N. R., Kelley, J. L., Boggs, C. L. and Petrov, D. A. (2013), Genomic inference accurately predicts the timing and severity of a recent bottleneck in a nonmodel insect population. Molecular Ecology. doi: 10.1111/mec.12591

Which genetic variants determine histone marks?


Blog author Joe Davis is a graduate student with Stephen Montgomery & Carlos Bustamante.

The wealth of genetic variation in the human genome is found not within protein-coding genes but within non-protein coding regions. This comes as no surprise given that only 1% percent of the genome codes for proteins. Until recently, efforts to determine the effects of genetic variation on trait variation and disease have focused on coding regions. Results of genome-wide association studies (GWAS), however, have shown that trait and disease associated variants are often regulatory variants such as expression quantitative trait loci (eQTLs) found in non-coding regions. These results have spurred an effort to understand the functional role of non-coding, regulatory variation. Efforts have thus far relied on characterizing the association between variants and gene expression. This association alone, however, will not reveal the complete functional mechanism by which non-coding variants influence gene expression. Recent efforts have therefore begun to characterize numerous molecular phenotypes such as transcription factor (TF) binding, histone modification, and chromatin state to determine the mechanisms by which regulatory variants affect gene expression.

One issue, four papers

In the November 8 issue of Science, three papers were published that address the role of non-coding genetic variation on TF binding, histone modifications, and chromatin state (i.e. active versus inactive enhancer status). The first study was completed by the Dermitzakis Lab at the University of Geneva. They analyzed three TFs, RNA polymerase II (Pol II), and five histone modifications using chromatin immunoprecipitation and sequencing (ChIP-Seq) in lymphoblastoid cell lines (LCLs) from two parent-child trios [1]. The second was completed by the Pritchard Lab, which has recently moved to Stanford, and the Gilad Lab at the University of Chicago. They identified genetic variants affecting variation in four histone modifications and Pol II occupancy in ten unrelated Yoruba LCLs [2]. The third study was performed by the Snyder Lab at Stanford. They characterized the genetic variation underlying changes in chromatin state using RNA-Seq and ChIP-Seq for four histone modifications and two DNA binding factors in 19 LCLs from diverse populations [3]. This work was the subject of a recent CEHG Evolgenome talk given by Maya Kasowski, the study’s first author. Finally, the fourth study, published in the November 28 issue of Nature, was performed by the Glass Lab at UCSD. They characterized the effect of natural genetic variation between two mouse strains on the binding of two TFs involved in cell differentiation (PU.1 and C/EBPα) using ChIP-Seq [4]. In this post, I will analyze primarily the work presented by the Pritchard Lab, but I strongly recommend reading all four papers to understand the challenges in characterizing non-coding variation and the methods available to do so.


The four studies seek to answer the general question of how regulatory variation affects gene expression. They characterize diverse molecular phenotypes such as histone modifications and TF binding to understand the mechanisms of action for non-coding variants. The Pritchard Lab focused their study on four histone modifications (three active and one repressive: H3K4me3, H3K4me1, H3K27ac, and H3K27me3, respectively) and Pol II occupancy.


Histone modifications 101

Histone modifications refer to the addition of chemical groups such as methyl or acetyl to specific amino acids on the tails of histone proteins comprising the nucleosome. These chemical groups are referred to as histone marks. They can serve a wide range of functions, but in general they are associated with the accessibility of a chromatin region. For example, the tri-methylation of lysine 4 of histone 3 (H3K4me3) is associated with increased chromatin accessibility and gene activation. On the other hand, increased levels of the repressive mark H3K27me3 (tri-methylation of lysine 27 of histone 3) at promoters is associated with gene inactivation.

Histone mark levels are measured in a high-throughput manner using ChIP-Seq. Briefly, an antibody targeting the mark of interest is used to pull down modified genomic regions. These immunoprecipitated regions are then sequenced to determine which genomic segments are modified and at what level. The procedure usually requires a large number of cells (on the order of 10^7). Therefore, the modification level is, in some ways, a population level measurement. Analysis of ChIP-Seq data typically involves testing for genomic regions with more reads than expected by chance. These regions, ranging from 200bp to 1000bp or more, are referred to as peaks that represent a modification level above the genomic background. Repressive marks like H3K27me3 tend to have broad peak regions, while activating marks like H3K4me3 can have much tighter peaks.

Since modification levels represent measurements on a population of cells and histone residues can have multiple modifications, genomic regions can show evidence for multiple marks. The combinations of these marks over a region can mark the function of the region. For example, regions with high levels of H3K27ac and a high ratio of H3K4me1 to H3K4me3 can mark active enhancer regions. Until now, the variation of these marks between individuals and the genetic cause of this variation was uncharacterized. Moreover, the causal impact of these marks remains unknown. Do they alter gene expression directly or are they altered by gene regulation? Therefore, the two guiding questions for this study are:

1. What genetic variants influence histone modifications?

2. Are these modifications “a cause or a consequence of gene regulation?”

Variation in histone modifications, a real whodunit

The authors first seek to identify and characterize genetic variants that influence histone marks. They generated ChIP-Seq data for the four histone marks and Pol II in LCLs derived from ten unrelated Yoruba individuals who were previously genotyped as part of the 1000 Genomes Project. Similar studies of regulatory variants such as eQTL studies require large sample sizes to detect the effects of regulatory variants that often lie outside the gene. Unlike eQTL studies, histone marks cover fairly broad regions often encompassing causal regulatory variants. As a result, the authors can use a smaller sample size and still be confident about interrogating the effects of causal regulatory SNPs. The authors developed a statistical test that models total read depth between individuals and allelic imbalance between haplotypes within individuals to increase power to detect cis-QTLs (i.e. variants that affect histone marks and Pol II occupancy nearby in the genome). Using this method, they identified over 1200 distinct QTLs for histone marks and Pol II occupancy (FDR 20%).

The authors then analyze these histone mark and Pol II QTLs to determine the overlap of these variants with other known regulatory variants. The hypothesis is that regulatory variants that affect gene expression will have effects on diverse molecular phenotypes. Therefore, variants that influence histone marks and Pol II should show significant overlap with known regulatory variants such as eQTLs and DNase I sensitivity QTLs (dsQTLs). DNase I sensitivity is a measure of chromatin accessibility with higher sensitivity associated with higher accessibility. The Pritchard Lab mapped eQTLs and dsQTLs in a larger sample of ~75 Yoruban LCLs in two previous studies that I also recommend reading [5,6]. Their analysis revealed an enrichment of low p-values for dsQTLs and, to a lesser extent, eQTLs when tested as histone mark and Pol II QTLs. In addition, the authors observed a coordinated change in multiple molecular phenotypes at dsQTLs and eQTLs. For example, higher levels of the three histone active marks were observed at dsQTLs for the more DNase I sensitive genotype. At eQTLs, H3K4me3, H3K27ac, and Pol II levels were higher for individuals with the high expression genotype. These results show that non-coding regulatory variants impact multiple molecular phenotypes ranging from chromatin accessibility and transcription to histone modifications. The authors provide strong evidence in response to their first guiding question, namely that non-coding regulatory polymorphisms associate with variation in histone marks and Pol II.

TFs and a question of directionality

The authors then turned to addressing the questions of causality for these marks. To do so, they analyze genetic variants in TF binding sites. The main hypothesis is that regulatory variants that alter a TFBS will modify TF binding which will cause changes in histone mark and Pol II levels nearby. If this is the case, then changes in histone marks are a consequence of how strong the TF binding site is. On the other hand, if these marks were causal, polymorphisms in TF binding sites would not be expected to show strong association with changes in these marks.

To test their hypothesis, the authors examine ~11.5K TF binding sites with polymorphisms heterozygous in at least 1 of their 10 individuals. They calculate the change in position weight matrix (PWM) score between the two alleles for polymorphic TF binding sites within each individual. They then test for significant association between this change in PWM and allelic imbalance of ChIP-Seq reads at nearby heterozygous sites. The idea is that if a variant improves (or disrupts) TF binding for one allele at a TF binding site then active histone marks nearby on the same allele will increase (or decrease). Repressive histone marks (in this case H3K27me3) are expected to have the opposite response. Indeed, when they apply their test, they find a significant positive association for the active marks and a negative association for the repressive mark. This result supports the hypothesis of changes histone marks as a consequence of TF binding and gene regulation. However, this result does not rule out other possibilities. Histone marks can still play a causal role in the establishment of TF binding. In other words, the relationship between TF binding and histone marks does not have to be unidirectional. In addition, there is evidence that long non-coding RNAs may play a role in the establishment and regulation of histone marks.

dsQTLs and eQTLs, a match made on chromatin

In their final analysis, the authors examine dsQTLs that are also eQTLs. Since these variants associate with both gene expression and chromatin accessibility at distal regulatory regions (>5kb from associated TSS), the authors can assign the regulatory region to a specific gene. A variant that is both a dsQTL and an eQTL likely disrupts a distal regulatory region. In addition to disrupting the accessibility of the regulatory region, the variant also perturbs the expression of a gene influenced by the regulatory region. For example, a variant may decrease the chromatin accessibility of an enhancer region and thereby decrease the level of active histone marks for the enhancer. This decreased enhancer activity can result in decreased transcription from a nearby gene and similarly decreased active mark levels for the gene. Therefore, the hypothesis guiding this analysis is that variants influencing the histone marks of a distal regulatory region will have a coordinated effect on histone marks at genes under the control of the regulatory region. The authors examine the allelic imbalance in ChIP-Seq reads at regulatory regions and their associated transcription start sites (TSS). Indeed, the authors observe that variants that increase DNase I sensitivity have a significant positive allelic imbalance for active marks at both the regulatory region and the TSS. The opposite is true for the repressive mark. This result again emphasizes the complexity of gene regulation and the impact of non-coding variation. Not only do regulatory variants influence diverse molecular phenotypes nearby, they can direct changes at distal loci. As the authors note, this coordinated change in histone marks between distal regions possibly reflects the 3D organization of chromatin. Regulatory variants that impact chromatin looping interactions between distal regulatory regions and genes may cause changes in activity levels for both the gene and the regulatory region.


This paper provides clear evidence that regulatory variation has very complex impacts affecting multiple and diverse molecular phenotypes at multiple regions simultaneously. This complexity implies potentially numerous and diverse mechanisms by which regulatory variants act on gene regulation. The authors set out to find evidence for one of these mechanisms, namely perturbation of TF binding sites. They begin by showing that variation in histone modifications has a strong genetic basis and that the polymorphisms influencing these marks overlap with known regulatory variants such as eQTLs. They then show that polymorphisms in TF binding sites associate with changes in histone marks, providing evidence for directionality in the relationship between these marks and gene regulation. In essence, their results suggest that histone modifications are directed, at least in part, by TF binding. Finally, they find that regulatory variants can have an impact on the molecular phenotypes of distal regions.

I found this paper, as well as the other three previously mentioned, to be quite interesting. I think these papers show that our understanding of gene regulation is still very simplistic. With the advent of high-throughput molecular assays like ChIP-Seq and DNase-Seq, we can begin to interrogate the complex role of regulatory variation on many phenotypes. In doing so, it is of primary interest to ask questions regarding directionality. How do a given set of molecular phenotypes relate? Do these phenotypes represent a cause or a consequence of genome function? How do the diverse elements of gene regulation function together to build complex phenotypes?


[1] Kilpinen, H., Waszak, S. M., Gschwind, A. R., Raghav, S. K., Witwicki, R. M., Orioli, A., Dermitzakis, E. T., et al. (2013). Coordinated Effects of Sequence Variation on DNA Binding, Chromatin Structure, and Transcription. Science (New York, N.Y.), 744. doi:10.1126/science.1242463

[2] McVicker, G., van de Geijn, B., Degner, J. F., Cain, C. E., Banovich, N. E., Raj, A., Pritchard, J. K., et al. (2013). Identification of Genetic Variants That Affect Histone Modifications in Human Cells. Science (New York, N.Y.), 747. doi:10.1126/science.1242429

[3] Kasowski, M., Kyriazopoulou-Panagiotopoulou, S., Grubert, F., Zaugg, J. B., Kundaje, A., Liu, Y., Snyder, M., et al. (2013). Extensive Variation in Chromatin States Across Humans. Science (New York, N.Y.), 750. doi:10.1126/science.1242510

[4] Heinz, S., Romanoski, C. E., Benner, C., Allison, K. a, Kaikkonen, M. U., Orozco, L. D., & Glass, C. K. (2013). Effect of natural genetic variation on enhancer selection and function. Nature, 503(7477), 487–492. doi:10.1038/nature12615

[5] Pickrell, J. K., Marioni, J. C., Pai, A. a, Degner, J. F., Engelhardt, B. E., Nkadori, E., Pritchard, J. K., et al. (2010). Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature, 464(7289), 768–72. doi:10.1038/nature08872

[6] Degner, J. F., Pai, A. a, Pique-Regi, R., Veyrieras, J.-B., Gaffney, D. J., Pickrell, J. K., Pritchard, J. K., et al. (2012). DNase I sensitivity QTLs are a major determinant of human expression variation. Nature, 482(7385), 390–4. doi:10.1038/nature10808

Paper author Jonathan Pritchard is a professor in the Departments of Genetics and Biology.

Paper author Jonathan Pritchard is a professor in the Departments of Genetics and Biology.

Genomic analyses of ancestry of Caribbean populations

Blog author Rajiv McCoy is a graduate student in the lab of Dmitri Petrov.

Blog author Rajiv McCoy is a graduate student in the lab of Dmitri Petrov.

In the Author Summary of their paper, “Reconstructing the Population Genetic History of the Caribbean”, Andrés Moreno-Estrada and colleagues point out that Latinos are often falsely depicted as a homogeneous ethnic or cultural group.  In reality, however, Latinos, including inhabitants of the Caribbean basin, represent a diverse mixture of previously separate human populations, such as indigenous groups, European colonists, and West Africans brought over during the Atlantic slave trade.  This mixing process, which geneticists call “admixture”, left a distinct footprint on genetic variation within and between Caribbean populations.  By surveying genotypes of 330 Caribbean individuals and comparing to a database of variation from more than 3000 individuals from European, African, and Native American populations, Moreno et al., explore the genomic outcomes of this complex admixture process and reveal intriguing demographic patterns that could not be obtained from the historical record alone. The paper, featured in the latest edition of PLOS Genetics, represents a collaborative project with co-senior authorship by Stanford CEHG professor Carlos Bustamante and Professor Eden Martin from the University of Miami Miller School of Medicine.

Reconstructing the demographic history of admixed populations

Because parental DNA is only moderately shuffled before being incorporated into gametes (the process of meiotic recombination), admixture results in discrete genomic segments that can be traced to a particular ancestral population.  In early generations after the onset of admixture, these segments are large.  However, after many generations, segments will be quite small.  By investigating the distribution of sizes of these ancestry “tracts”, Moreno and colleagues inferred the timing of various waves of migration and admixture.  For Caribbean Island populations, they infer that European gene flow first occurred ~16-17 generations ago, which matches very closely to the historical record of ~500 years, assuming ~30 years per generation.  In contrast, for neighboring mainland populations from Colombia and Honduras, they find that European gene flow occurred in waves, starting more recently (~14 generations ago).

Identifying sub-continental ancestry of admixed individuals

Those familiar with human population genetics will recognize principal component analysis (PCA), which transforms a matrix of correlated observed genotypes into a set of uncorrelated variables where the first component explains the most possible variance, the second variable explains the second most variance, and so on.  Individuals’ transformed genotypes can be plotted on the first two principle components, and when performed on a worldwide scale, distinct clusters appear which represent populations of ancestry.  On conventional PCA plots, admixed individuals fall between their different ancestral populations, as they possess sets of genotypes diagnostic of multiple ancestral groups.  As virtually all Caribbean individuals are admixed to some degree, this pattern is apparent for Caribbean populations (see Figure 1B from the paper, reproduced below).


While interesting, this means that the sub-continental ancestry of these admixed individuals is difficult to ascertain.  An individual may want to know which Native American, West African, and European populations contribute to his or her ancestry, and this analysis does not have sufficient resolution to answer these questions.

Moreno and colleagues therefore devised a new version of PCA called ancestry-specific PCA (ASPCA), which extracts genomic segments assigned to Native American, West African, and European ancestry, then analyzes these segments separately, dealing with the large proportions of missing data that result.  In the case of Native American ASPCA, they observe two overlapping clusters.  The first represents mostly Colombians and Hondurans, who cluster most closely with indigenous groups from Western Colombia and Central America and have a greater overall proportion of Native American ancestry.  The second cluster represents mostly Cubans, Dominicans, and Puerto Ricans, who cluster most closely with Eastern Colombian and Amazonian indigenous groups.  This makes sense in light of the fact that Amazonian populations from the Lower Orinoco Valley settled on rivers and streams, which could have facilitated their migration.  Because indigenous ancestry proportions were relatively consistent and closely clustered across different Caribbean Islands, the authors posit that there was a single pulse of expansion of Amazonian natives across the Caribbean prior to European arrival, along with gene flow among the islands.

In the case of European ASPCA, Moreno et al. found that Caribbean samples clustered closest to, but clearly distinct from, present day individuals from the Iberian Peninsula in Southern Europe.  In fact, the differentiation between this “Latino-specific component” and Southern Europe is at least as great as the differentiation between Northern and Southern Europe.  The authors hypothesize that this is due to very small population sizes among European colonists, which would have introduced noise into patterns of genomic variation through the process of random genetic drift.

Finally, the authors demonstrate that Caribbean populations have a higher proportion of African ancestry compared to mainland American populations, a result of admixture during and after the Atlantic slave trade.  Surprisingly, the authors found that all samples tightly clustered with present day Yoruba samples from Nigeria rather than being dispersed throughout West Africa.  However, because other analyses suggested that there might have been two major waves of migration from West Africa, the authors decided to analyze “old” and “young” blocks of African ancestry separately.  This analysis revealed that “older” segments are primarily derived from groups from the Senegambia region of Northwest Africa, while “younger” segments likely trace to groups from the Gulf of Guinea and Equatorial West Africa (including the Yoruba).

Conclusions and perspectives

This groundbreaking study has immediate implications for the field of personalized medicine, especially due to the discovery of a distinct Latino-specific component of European ancestry.  The hypothesis that European colonists underwent a demographic bottleneck (a process termed the “founder effect”) has expected consequences for the frequency of damaging mutations contributing to genetic disease. The observation of extensive genetic differences among Caribbean populations also argues for more such studies characterizing genetic variation on a smaller geographic scale. The newly developed ASPCA method will surely be valuable for other admixed populations.  In addition to medical implications, studies such as this help dispel simplistic notions of race and ethnicity and inform cultural identities based on unique and complex demographic history.

Citation: Moreno-Estrada A, Gravel S, Zakharia F, McCauley JL, Byrnes JK, et al. (2013) Reconstructing the Population Genetic History of the Caribbean. PLoS Genet 9(11): e1003925. doi:10.1371/journal.pgen.1003925

Paper author Andres Moreno-Estrada is a research associate in the lab of Carlos Bustamante.

Paper author Andrés Moreno-Estrada is a research associate in the lab of Carlos Bustamante.

How recombination and changing environments affect new mutations

Blog author: David Lawrie was a graduate student in Dmitri Petrov’s lab. He is now a postdoc at USC.

I recently sat down with Oana Carja, a graduate student with Marc Feldman, to discuss her paper published in the journal of Theoretical Population Biology entitled “Evolution with stochastic fitnesses: A role for recombination”. In it, the authors Oana Carja, Uri Liberman, and Marcus Feldman explore when a new mutation can invade an infinite, randomly mating population that experiences temporal fluctuations in selection.

The one locus case

This work builds off of previous research in the field on how the fluctuations in fitness over time (i.e., increased variance of fitness) affect the invasion dynamics of a mutation at a single locus. For a single locus, it has been shown that the geometric mean of the fitness of the allele over time determines the ability of an allele to invade a population. This effect is known as the geometric mean principle. Fluctuations in fitness increase the variance and therefore decrease the geometric mean fitness. The variance of the fitness of the allele over time thus greatly impacts the ability of that allele to invade a population.

What if there are two loci?

In investigating a two locus model, the researchers split the loci by their effect on the temporally-varying fitness: one locus only affects the mean, while the other controls the variance. The authors demonstrate through theory and simulation that:

1)    allowing for recombination between the two loci increases the threshold for the combined fitness of the two mutant alleles to invade the population beyond the geometric mean (see figure).

2)    periodic oscillations in the fitness of the alleles over time lead to higher fitness thresholds for invasion over completely random fluctuations (see figure).

3)    edge case scenarios allow for the maintenance of polymorphisms in the population despite clear selective advantages of a subset of allelic combinations.

Temporally changing environments and recombination thus make it overall more difficult for new alleles to invade a population.

Invasibility thresholds as a function of recombination rate. Recombination makes it more difficult for new alleles to invade a population.

Invasibility thresholds as a function of recombination rate. If there is no recombination (the left-most edge of the figure), the geometric mean of the pair of new alleles needs to be higher than 0.5 to allow for invasion, because the resident alleles’ geometric mean fitness is set to 0.5. However, as recombination between the two loci increases, the geometric mean needed for invasion increases rapidly. If there is free recombination (r = 0.5) then  invasion can only happen if the new alleles’ geometric mean fitness is twice the resident alleles’ geometric mean fitness (light grey area). If the environment is changing periodically, it is even harder for new alleles to invade a population (dark grey area).

The evolution of models of evolution

This work is important for addressing the evolutionary dynamics of loci controlling phenotypic variance – in this case, controlling the ability of a phenotype to maintain its fitness even if the environment is variable. Most environments undergo significant temporal shifts from the simple changing of the seasons to larger scale weather changes such as El Niño and climate change, in which species must survive and thrive. For organisms in the wild, many alleles that confer a benefit in one environment will be deleterious when the environment and selective pressures change. There may be modifier-loci which buffer the fitness of those loci in the face of changing environments. Such modifier-loci have been recently found in GWAS studies and may be important for overall phenotypic variance. Thus modeling the patterns of evolution for multiple loci in temporarily varying environments is a key component to advancing our understanding of the patterns found in nature.

Future work

Epigenetic modifiers are a hot area of research and one potential biological mechanism to control phenotypic variance. The evolution of such epigenetic regulation is a particular research interest of Oana. Future work will continue to explore the evolutionary dynamics of epigenetic regulation and focus on applying the above results to finite populations.

Paper author: Oana Carja is a graduate student with Marc Feldman


Oana Carja, Uri Liberman, Marcus W. Feldman, Evolution with stochastic fitnesses: A role for recombination, Theoretical Population Biology, Volume 86, June 2013, Pages 29-42, ISSN 0040-5809.

We sequence dead people

Blog-author: Sandeep Venkataram is a grad student in Dmitri Petrov’s lab.

By Sandeep Venkataram – Modern humans have been evolving independently of our nearest living relatives, chimpanzees, for over 7 million years. To study our evolutionary history since this divergence, our major source of information is from the fossilized remains of our ancestors and closely related species such as Neanderthals. Physiological information from the remains can tell us a lot about human evolution, but the majority of the information is locked up in the tiny amounts of highly degraded and fragmented DNA left from the specimen.

Studying ancient DNA (aDNA) from bones is extremely challenging. Each sample contains not only the fossil’s DNA, but contaminating DNA from enormous numbers of microbes and other organisms. There is also the possibility of human contamination from handling the fossil. Since the contaminants have much higher quality DNA than the endogenous sample, simply processing a raw DNA extract of the sample typically yields sequencing results with less than 1% aDNA. Sequencing such a sample is incredibly inefficient, as less than 1% of the sequence is actually useful and only the best funded studies can afford the sequencing costs to generate a high coverage genome. Therefore, most aDNA studies tend to focus on mitochondrial DNA using targeted capture methods, or PCR amplicon sequencing. This reduces wasted sequence capacity, but greatly limits the amount of information obtained from the sample.

Getting the most out of ancient DNA

Meredith Carpenter, who is a postdoc here at Stanford, and her colleagues have developed a novel method called whole genome in-solution capture (WISC) to purify aDNA from next-gen sequencing libraries generated from fossil DNA samples. The authors make use of RNA bait technology (Figure 1), which uses RNA complementary to the targeted DNA that has been synthesized using biotinylated nucleotides (Gnirke et al 2009). The RNA can be hybridized to the DNA pool, and purified from solution by linking the biotin to commercially available purification beads. By then eluting the hybrids from the beads and removing the RNA from solution, one can greatly enrich for the targeted DNA. Previous methods using RNA baits required synthesizing DNA strands on microarrays that are complementary to the RNA, then inducing transcription in vitro to generate the RNA bait library. This methodology has been successfully used to capture the entirety of chromosome 21 (Fu et al 2013) from human fossils, but is not cost effective for producing a bait library that covers the entire human genome.

Meredith Carpenter and coauthors circumvent this by mechanically fragmenting human genomic DNA and use blunt end ligation to attach adapter sequences containing an RNA polymerase promoter. This modified DNA library can then be used to generate the biotinylated RNA bait library via in vitro transcription, after which the RNA library is purified using standard protocols. The authors also generate non-biotinylated RNA complementary to the adapter sequences present on every DNA fragment in the library, to block nonspecific binding to the adapters. The bait and block RNA libraries are hybridized to the aDNA libraries, purified using beads to select only the aDNA fragments and sequenced. The cost of their enrichment method is estimated at $50 per sample, and is accessible to most labs conducting aDNA genomic studies.

WISC greatly enriches for ancient DNA across a variety of samples

The authors tested this method on a variety of aDNA libraries prepared from both high quality and low quality samples, including hair remains, teeth and bones, and fossils from tropical and more temperate regions, which can greatly influence DNA quality. They sequenced the libraries both before and after WISC, and found a 3-13x increase in the number of uniquely mapped reads after using WISC. In addition, most of the unique reads in the enriched library are sequenced with five million reads in both hair and bone samples. WISC allows most of the endogenous sequence to be read from dozens of aDNA samples in a single lane of Illumina HiSeq, opening the possibility of sequencing the millions of fossils in museums and collections around the world.

Dr. Carpenter says they are now focusing on adapting the method to removing human DNA contamination from microbiome sequencing projects with promising preliminary results, as well as applications in forensics and studying extinct species. As WISC generates the RNA bait library from genomic DNA of an extant relative instead of synthetic DNA arrays, bait libraries can be prepared regardless of whether the genome of the organism that is the source of the bait library is known. Combined with recent advances in aDNA library construction methods (Meyer et al 2012), WISC promises to make sequencing of contaminated and degraded samples widely accessible.

Carpenter et al (2013) Figure 1. Schematic of the Whole-Genome In-Solution Capture Process
To generate the RNA “bait” library, a human genomic library is created via adapters containing T7 RNA polymerase promoters (green boxes). This library is subjected to in vitro transcription via T7 RNA polymerase and biotin-16-UTP (stars), creating a biotinylated bait library. Meanwhile, the ancient DNA library (aDNA “pond”) is prepared via standard indexed Illumina adapters (purple boxes). These aDNA libraries often contain <1% endogenous DNA, with the remainder being environmental in origin. During hybridization, the bait and pond are combined in the presence of adaptor-blocking RNA oligos (blue zigzags), which are complimentary to the indexed Illumina adapters and thus prevent nonspecific hybridization between adapters in the aDNA library. After hybridization, the biotinylated bait and bound aDNA is pulled down with streptavidin-coated magnetic beads, and any unbound DNA is washed away. Finally, the DNA is eluted and amplified for sequencing.

Paper author Meredith Carpenter

Paper author Meredith Carpenter is a postdoc in Carlos Bustamante’s lab.


Carpenter, M. L., Buenrostro, J. D., Valdiosera, C., Schroeder, H., Allentoft, M. E., Sikora, M., Rasmussen, M., et al. (2013). Pulling out the 1%: Whole-Genome Capture for the Targeted Enrichment of Ancient DNA Sequencing Libraries. The American Journal of Human Genetics, 1–13. doi:10.1016/j.ajhg.2013.10.002

Fu, Q., Meyer, M., Gao, X., Stenzel, U., Burbano, H. a, Kelso, J., & Pääbo, S. (2013). DNA analysis of an early modern human from Tianyuan Cave, China. Proceedings of the National Academy of Sciences of the United States of America, 110(6), 2223–7. doi:10.1073/pnas.1221359110

Gnirke, A., Melnikov, A., Maguire, J., Rogov, P., LeProust, E. M., Brockman, W., Fennell, T., et al. (2009). Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing. Nature biotechnology, 27(2), 182–9. doi:10.1038/nbt.1523

Meyer, M., Kircher, M., Gansauge, M.-T., Li, H., Racimo, F., Mallick, S., Schraiber, J. G., et al. (2012). A high-coverage genome sequence from an archaic Denisovan individual. Science (New York, N.Y.), 338(6104), 222–6. doi:10.1126/science.1224344

Update: The second paragraph originally talked about fossils, but it should have been bones (now corrected). A fossil is mineralized and would not yield aDNA. 

Missing the forest for the trees: How frequent adaptation can confound its own inference


Blog author: Fabian Staubach was a postdoc in Dmitri Petrov’s lab and is now an assistant professor in Freiburg, Germany.

This post was written by Fabian Staubach. 

The neutral theory of molecular evolution assumes that adaptation is rare and that the effect of adaptation on linked variation, the so-called hitchhiking effect, typically has only little influence on the dynamics of molecular genetic variation. Because of this assumption, it is widely assumed that in most natural populations, hitchhiking can be neglected, or at least reasonably well approximated by the introduction of effective parameters, such as an effective population size. But if molecular adaptation is in fact common, then the assumption may be violated, and we should worry whether population genetic methods and estimates of evolutionary parameters obtained from them are robust to frequent hitchhiking.

In their paper “Frequent adaptation and the McDonald-Kreitman test” (PNAS, 2013), Philipp Messer and Dmitri Petrov investigate this question for one of the key population genetic methods — the McDonald-Kreitman (MK) test. This test forms the basis of most commonly used approaches to measure the rate of adaptation from population genomic data and has been used to argue that in some organisms, such as Drosophila, the rate of adaptation is surprisingly high.

The MK test can substantially underestimate the true rate of adaptation

Messer and Petrov employ their powerful forward simulation software, SLiM (see here), to simulate the evolution of entire chromosomes under a range of parameter values relevant to humans and other organisms, and apply various forms of the MK test to the population genomic data resulting from their simulations. They then study how accurately these methods re-infer the true evolutionary parameters in the simulations. Strikingly, they find that the MK test can substantially underestimate the true rate of adaptation. For instance, they present scenarios where 40% of the amino acid changing substitutions were in fact strongly adaptive in the simulations, while other population parameters resembled those commonly inferred for human evolution, yet the standard MK estimates yield that none of these substitutions were actually adaptive. Fortunately, Messer and Petrov propose a way to avoid these problems by using a simple, asymptotic extension of the MK test.

Figure: Illustration of the asymptotic MK estimation of the rate of adaptive substitutions : The standard MK approach assumes that all polymorphisms (non-synonymous and synonymous) are neutral. This assumption is likely violated for low frequency polymorphisms, as some of these are likely to be deleterious. The assumption should hold for very high frequency polymorphisms, because they are very unlikely to be deleterious. The asymptotic MK approach uses this fact by looking at the estimated rate from different frequency classes of alleles, and extrapolating to x=1, where the rate is expected to have asymptoted.  

The bigger claim of this straightforward and easy-to-read paper is that the effects of linked selection cannot be simply swept under the rug by introducing effective parameters, such as effective population size or effective strength of selection, and then using these effective parameters in formulae derived from the diffusion approximation under the assumption of free recombination.

Quantifying known biases

Surely, this paper will ruffle some feathers. Some people will argue that these problems have been know for a while in theory. Yet despite this, the vast majority of studies that continue to appear in the literature still pay only cursory lip service, if anything, to these issues. Presumably, this is because it is not well understood analytically to what extent linkage effects affect population genetic estimates, and Messer and Petrov therefore do an important job in quantifying these biases. Hopefully this will help focus the community’s attention to spend some time figuring out how to modify commonly used approaches to place them on a more solid foundation.

Citation: Messer, P. W., & Petrov, D. A. (2013). Frequent adaptation and the McDonald-Kreitman test. Proceedings of the National Academy of Sciences of the United States of America, 110(21), 8615–20. doi:10.1073/pnas.1220835110


Paper author: Philipp Messer is a research associate in Dmitri Petrov’s lab at Stanford, where he studies the population genetics of adaptation using theoretical and computational approaches in concert with the analysis of large-scale population genomic data.