Imagining Phylogenetics and Recombination as Art

DAFAbout the Artist:
Daniel Friedman is a 1st year Ph.D. student in the Ecology and Evolution program. Working from the Gordon lab, he mainly studies the evolution of collective behavior in ants. Other research interests include fractals, burritos, and metaphors. Contact:


1. Phylogenetics
Ever since “I Think…”, the idea of a bifurcating tree of species relations has guided evolutionary biology. This piece of paper with ink on it plays with the idea of an “evolutionary I”, styled as an evolving Eye. Whether we perform molecular studies on the ontogenetic role of Pax6, or psychophysical explorations into the Self, we are confronted with questions of homology and convergence. Our time-reversible phylogenetic algorithms, so designed for computational simplicity, only contribute to this problem of post hoc ergo propter hoc – “after this, therefore because of this.” The Modern Synthesis was clearly a rEvolutionary moment – now are we ready for a Post-Modern Synthesis?
2. Recombination
DNA recombination is key to many biological processes. Recombination between homologous chromosomes during meiosis creates novel combinations of alleles, and to many, is the teleological “Why?” of Sex. But the reach of recombination goes far, far beyond Sex. Recombination between alleles of the same locus allows a kaleidoscope of DNA error-correcting mechanisms to proceed. And over evolutionary time scales, “errors” in recombination provide large structural creativities in the genome, such as duplication, deletions, and inversions. Recombination during immune cell maturation allows the human body to recognize an essentially infinite cohort of potential invaders. And now that recombination has been mechanistically deconvoluted, derived technologies facilitate guided DNA editing in vitro and in vivo . Recombination is molecular innovation embodied, a topological whirligig, and the workhorse of the genome.

A fast and accurate coalescent approximation

Blog author Suyash Shringarpure is a postdoc in Carlos Bustamante's lab.

Blog author Suyash Shringarpure is a postdoc in Carlos Bustamante’s lab. Suyash is interested in statistical and computational problems involved in the analysis of biological data.

The coalescent model is a powerful tool in the population geneticist’s toolbox. It traces the history of a sample back to its most recent common ancestor (MRCA) by looking at coalescence events between pairs of lineages. Starting from assumptions of random mating, selective neutrality, and constant population size, the coalescent uses a simple stochastic process that allows us to study properties of genealogies, such as the time to the MRCA and the length of the genealogy, analytically and through efficient simulation. Extensions to the coalescent allow us to incorporate effects of mutation, recombination, selection and demographic events in the coalescent model. A short introduction to the coalescent model can be found here and a longer, more detailed introduction can be read here.

However, coalescent analyses can be slow or suffer from numerical instability, especially for large samples. In a study published earlier this year in Theoretical Population Biology, CEHG fellow Ethan Jewett and CEHG professor Noah Rosenberg proposed fast and accurate approximations to general coalescent formulas and procedures for applying such approximations. Their work also examined the asymptotic behavior of existing coalescent approximations analytically and empirically.

Computational challenges with the coalescent

For a given sample, there are many possible genealogical histories, i.e., tree topologies and branch lengths, which are consistent with the allelic states of the sample. Analyses involving the coalescent therefore often require us to condition on a specific genealogical property and then sum over all possible genealogies that display the property, weighted by the probability of the genealogy. A genealogical property that is often conditioned on is n_t, the number of ancestral lineages in the genealogy at a time t in the past. However, computing the distribution P(n_t) of n_t is computationally expensive for large samples and can suffer from numerical instability.

A general approximation procedure for formulas conditioning on n_t

Coalescent formulas conditioning on n_t typically involve sums of the form f(x)=\sum_{n_t} f(x|n_t) \cdot P(n_t)

For large samples and recent times, these computations have two drawbacks:

–       The range of possible values for n_t may be quite large (especially if multiple populations are being analyzed) and a summation over these values may be computationally expensive.

–       Expressions for P(n_t) are susceptible to round-off errors.

Slatkin (2000) proposed an approximation to the summation in f(x) by a single term f(x|E[n_t]). This deterministic approximation was based on the observation that n_t changes almost deterministically over time, even though it is a stochastic variable in theory. Thus we can write n_t \approx E[n_t]. From Figure 2 in the paper (reproduced here), we can see that this approximation is quite accurate. The authors prove the asymptotic accuracy of this approximation and also prove that under regularity assumptions, f(x|E[n_t]) converges to f(x) uniformly in the limits of t \rightarrow 0 and t \rightarrow \infty . This is an important result since it shows that the general procedure produces a good approximation for both very recent and very ancient history of the sample. Further, the paper shows how this method can be used to approximate quantities that depend on the trajectory of n_t over time, which can be used to calculate interesting quantities such as the expected number of segregating sites in a genealogy.

Approximating E[n_t] for single populations

A difficulty with using the deterministic approximation is that E[n_t] often has no closed-form formula, and if one exists, it is typically not easy to compute when the sample is large.

For a single population with changing size, two deterministic approximations have previously been developed (one by Slatkin and Rannala 1997, Volz et al. 2009 and one by Frost and Volz, 2010, Maruvka et al., 2011). Using theoretical and empirical methods, the authors examine the asymptotic behavior and computational complexity of these approximations and a Gaussian approximation by Griffiths. A summary of their results is in the table below.

Method Accuracy
Griffith’s approximation Accurate for large samples and recent history.
Slatkin and Rannala (1997), Volz et al. (2009) Accurate for recent history and arbitrary sample size, inaccurate for very ancient history.
Frost and Volz (2010), Maruvka et al. (2011) Accurate for both recent and ancient history and for arbitrary sample size.
Jewett and Rosenberg (2014) Accurate for both recent and ancient history and arbitrary sample size, and for multiple populations with migration.


Approximating E[n_t] for multiple populations

Existing approaches only work for single populations of changing size and cannot account for migration between multiple populations. Ethan and Noah extend the framework for single populations to allow multiple populations with migration. The result is a system of simultaneous differential equations, one for each population. While it does not allow for analytical solutions except in very special cases, the system can be easily solved numerically for any given demographic scenario.

Significance of this work

The extension of the coalescent framework to multiple populations with migration is an important result for demographic inference. The extended framework with multiple populations allows efficient computation of demographically informative quantities such as the expected number of private alleles in a sample, divergence times between populations.

Ethan and Noah describe a general procedure that can be used to approximate coalescent formulas that involve summing over distributions conditioned on n_t or the trajectory of n_t over time. This procedure is particularly accurate for studying very recent or very ancient genealogical history.

The analysis of existing approximations to E[n_t] show that different approximations have different asymptotic behavior and computational complexities. The choice of which approximation to use is therefore often a tradeoff between the computational complexity of the approximation and the likely behavior of the approximation in the parameter ranges of interest.

Future Directions

As increasingly large genomic samples from populations with complex demographic histories become available for study, exact methods either become intractable or very slow. This work adds to a growing set of approximations to the coalescent and its extensions, joining other methods such as conditional sampling distributions and the sequentially markov coalescent. Ethan and Noah are already exploring applications of these approximate methods to reconciling gene trees with species trees. In the future, I expect that these and other approximations will be important for fast and accurate analysis of large genomic datasets.


[1] Jewett, E. M., & Rosenberg, N. A. (2014). Theory and applications of a deterministic approximation to the coalescent model. Theoretical population biology.

[2] Griffiths, R. C. (1984). Asymptotic line-of-descent distributions. Journal of Mathematical Biology21(1), 67-75.

[3] Frost, S. D., & Volz, E. M. (2010). Viral phylodynamics and the search for an ‘effective number of infections’. Philosophical Transactions of the Royal Society B: Biological Sciences365(1548), 1879-1890.

[4] Maruvka, Y. E., Shnerb, N. M., Bar-Yam, Y., & Wakeley, J. (2011). Recovering population parameters from a single gene genealogy: an unbiased estimator of the growth rate. Molecular biology and evolution28(5), 1617-1631.

[5] Slatkin, M., & Rannala, B. (1997). Estimating the age of alleles by use of intraallelic variability. American journal of human genetics60(2), 447.

[6] Slatkin, M. (2000). Allele age and a test for selection on rare alleles.Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences355(1403), 1663-1668.

[7] Volz, E. M., Pond, S. L. K., Ward, M. J., Brown, A. J. L., & Frost, S. D. (2009). Phylodynamics of infectious disease epidemics. Genetics183(4), 1421-1430.

Paper author Ethan Jewett is a PhD student in the lab of Noah Rosenberg.

Paper author Ethan Jewett is a PhD student in the lab of Noah Rosenberg.

Using phyloseq for the reproducible analysis of high-throughput sequencing data in microbial ecology

Blog author Diana Proctor is a graduate student in David Relman’s lab.

Blog author Diana Proctor is a graduate student in David Relman’s lab.

The Problem: Data Availability & Scientific Reproducibility

A Current Biology (1) paper evaluating the accessibility of scientific data recently inspired articles and blog posts (2, 3) as well as a lively conversation on Reddit about the “alarming rate of data disappearance” (4). Solutions to the problem of disappearing data include the NIH data sharing policy, as well as data sharing policies set by scientific journals, requiring the deposition of data into public repositories.

As a trainee in David Relman’s Lab thinking about the eventual fate of the high-throughput, next generation sequencing data generated over the course of my dissertation (, this conversation about data accessibility brings to mind a related question – how can I ensure that my data are not lost as fast as the current biology paper predicts?

The Solution: Phyloseq allows microbial ecologists to make reproducible research reports

The solution to data disappearance probably needs to involve not only deposition of data into public repositories, but also the widespread use of reproducible research reports. Luckily for those microbial ecologists among us, Paul McMurdie and Susan Holmes of Stanford University developed an R-based Bioconductor package (i.e., a package for bioinformatics) called phyloseq to facilitate the reproducible statistical analysis of high throughput phylogenetic sequencing datasets, including those generated by barcoded amplicon sequencing, metabolomic, and metagenomic experiments (5, 6). Phyloseq, initially released in 2012, was recently updated by McMurdie & Holmes, and described in an April 2013 publication (6).

Phyloseq Key Features

Phyloseq allows the user to import a species x sample data matrix (aka, an OTU Table) or data matrices from metagenomic, metabolomic, and/or other –omics type experiments into the R computing environment. Previous R extensions, such as OTUbase, also have the capacity to import these data matrices into R, but phyloseq is unique in that it allows the user to integrate the OTU Table, the phylogenetic tree, the “representative sequence” fasta file, and the metadata mapping file into a single “phyloseq-class” R object. The microbial ecologist can then harness all the statistical and graphical tools available in R, including Knitr, R-Markdown and ggplot2 (among others), to generate reproducible research reports with beautiful graphics, as detailed below. To see the report McMurdie used to prepare the phyloseq publication, visit this link:

1. Phyloseq incidentally allows the user to curate data

When phyloseq imports the myriad phylogenetic sequencing data objects into R, it scrutinizes the data, making sure that the OTU Table matches the metadata mapping file, the phylogenetic tree, and the representative sequence labels. If not, the user gets an error. If the data descriptors are congruent, a single phyloseq object can be created, which can then be saved along with the R code used to create the object. I have found that this enables me to curate my data – consolidating all the data objects (OTU Table, mapping file, phylogenetic tree, etc.) describing a single experiment into a single multi-level data frame.

2. Phyloseq gives the user the analytical power of R

Importantly, by importing data into the R-computing environment, one may easily perform beta diversity analysis using any or all of over 40 different ecological distance metrics before performing virtually any ordination under the sun. Several alpha diversity metrics are implemented in phyloseq, as well. Finally, after getting the data into R, it’s easy to perform more sophisticated analyses than has previously been possible with this type of dataset, such as k-tables analysis (7), using R’s repertoire of extension packages.

3. Phyloseq makes standardization of sequence data pretty simple

Of particular note, the authors have included in phyloseq several methods to standardize and/or normalize high throughput sequence data. Most of us of course realize the need for data standardization (as evidenced by our reliance on rarefaction), but the tools to easily standardize data, aside from rarefaction, have been lacking (8). The authors of phyloseq have equipped us with several methods (one new!) to standardize our microbial census data, as well as the code needed to accomplish the task (

4. Phyloseq makes subsetting large datasets easy

One of my favorite uses for the phyloseq package is that it allows me to easily subset my dataset. In my work, I study the spatial variation of oral microbial communities. I have taken samples from all teeth from the mouths of just a handful of research subjects, but I have samples for certain teeth from all subjects. Phyloseq makes it easy for me to take a complete OTU Table, and subset it on only those teeth that were sampled in all subjects. Similarly, I can subset my OTU Table on a single bacterial phylum or on a single species, or on any covariate in my metadata mapping file, using a single line of R code.

5. Phyloseq enables the user to generate reproducible graphics

The authors of phyloseq created several custom ggplot2 (9) functions, enabling the phyloseq user, with just a few lines of code, to generate all of the most common graphics used in microbial census research (e.g., heatmaps, networks, ordination plots, phylogenetic trees, stacked bar plots for abundance measurements, etc.). Examples of these plots are shown in Figure 1 (though many other possibilities are supported, which can be seen here:

Fig 1A. The NMDS ordination plot  shows the separation of samples by weighted UniFrac distance for the Global Patterns dataset. Human-associated communities appear to cluster towards the right side of NMDS1 while non-human associated communities cluster towards the left.

Fig 1A. The NMDS ordination plot shows the separation of samples by weighted UniFrac distance for the Global Patterns dataset. Human-associated communities appear to cluster towards the right side of NMDS1 while non-human associated communities cluster towards the left.

Fig 1B. The dodged boxplots  show three alpha diversity metrics (Observed species, Chao1, and ACE) on the Y-axis with data classified on the X-axis as either a human-associated or a non-human associated microbial community.  This plot shows that non-human associated communities, in general, appear to be much more diverse than human-associated communities.

Fig 1B. The dodged boxplots show three alpha diversity metrics (Observed species, Chao1, and ACE) on the Y-axis with data classified on the X-axis as either a human-associated or a non-human associated microbial community. This plot shows that non-human associated communities, in general, appear to be much more diverse than human-associated communities.

6. Phyloseq allows covariate data to be visualized with the phylogenetic tree

In particular, phyloseq solves very well the problem of visualizing the phylogenetic tree – it allows the user to project covariate data (such as sample habitat, host gender, etc.) onto the phylogenetic tree, so that relationships between microbes, microbial communities, and the habitat from which they were derived can easily be seen. As an example, the relative abundance of taxa in samples was projected onto the phylogenetic tree along with the environment from which the samples were derived along with the bacterial order in Figure 2. I’ve not seen any other application that allows similar visualizations of the tree, and bootstrapping is also supported. For additional examples, refer to the phyloseq tutorial (

Fig 2: An example of using phyloseq to visualize phylogenetic trees along with covariate data using the Global Patterns dataset. In this figure, the sample type is shown in color, the shapes are bacterial Order, and the size of the shapes indicates the relative abundance of the taxon in the sample.

Fig 2: An example of using phyloseq to visualize phylogenetic trees along with covariate data using the Global Patterns dataset. In this figure, the sample type is shown in color, the shapes are bacterial Order, and the size of the shapes indicates the relative abundance of the taxon in the sample.

7. Data & code & results can be saved together to improve scientific reproducibility

One of the key features of phyloseq is that it provides researchers who have access to any system where R is able to run with a framework (i.e., R, R-markdown, Knitr & Rstudio) to perform reproducible statistical analysis of high throughput sequencing data. Using Phyloseq, Rstudio, R-markdown, and Knitr, it’s possible to see in a single .html file the data used to generate a set of figures alongside the code that was used to generate those figures. I now keep a collection of reproducible research reports as part of my lab notebook, and I look forward to being able to publish the final report for my first study along with my first scientific manuscript. For an example, please see the phyloseq tutorials, which were also generated using this approach (

8. Phyloseq is easy to learn.

When I first began working with Phyloseq, after taking a class taught by the author Susan Holmes, I knew some basic R commands from an undergraduate statistics class. Working with phyloseq made learning R easy for me. Since Phyloseq has a built in set of datasets one can use, it’s easy to reproduce the figures published in the phyloseq paper, as a stepping-stone for creating figures of one’s own.


An R-based package called Phyloseq makes it easy to analyze high throughput microbial census data, visualize the data, and perform reproducible statistical analysis. Phyloseq should facilitate conversations between researchers who publish data and the consumers of it with its emphasis on reproducible research. This should help those of us in the infancy of microbiome research ensure that our data do not disappear as quickly as Vine et al. currently predicts.

The paper was co-authored by Paul McMurdie and Professor Susan Holmes (in the picture).

The paper and the phyloseq package are co-authored by Paul McMurdie and Professor Susan Holmes (in the picture).


1.         Vines TH, Albert AY, Andrew RL, Debarre F, Bock DG, Franklin MT, et al. The Availability of Research Data Declines Rapidly with Article Age. Current biology : CB. 2013. doi: 10.1016/j.cub.2013.11.014. PubMed PMID: 24361065.

2.         Stromberg J. 2014.]. Available from:

3.         Noorden EGRV. Scientists losing data at a rapid rate. Nature News. 2014.

4.         bobmanyun. 2014 1-7-2014. [cited 2014]. Available from:

5.         McMurdie PJ, Holmes S. Phyloseq: a bioconductor package for handling and analysis of high-throughput phylogenetic sequence data. Pac Symp Biocomput. 2012:235-46. Epub 2011/12/17. PubMed PMID: 22174279; PubMed Central PMCID: PMC3357092.

6.         McMurdie PJ, Holmes S. phyloseq: an R package for reproducible interactive analysis and graphics of microbiome census data. PLoS One. 2013;8(4):e61217. doi: 10.1371/journal.pone.0061217. PubMed PMID: 23630581; PubMed Central PMCID: PMC3632530.

7.         Thioulouse J. Simultaneous analysis of a sequence of paired ecological tables: A comparison of several methods. Ann Appl Stat. 2011;5(4):2300-25. doi: 10.1214/10-AOAS372.

8.         Holmes PJMS. Waste Not, Want Not: Why Rarefying Microbiome Data is Inadmissible. ARXIV. 2013. Epub 10/2013.

9.         Wickham H. ggplot2: Elegant Graphics for Data Analysis: Springer; 2009. 213 p.

Taking studies of regulatory evolution to the next level: translation

Carlo Artieri, a postdoc in the group of Hunter Fraser, wrote this blog post. The paper is written by Carlo and Hunter.

Carlo Artieri, a postdoc in the group of Hunter Fraser, wrote this blog post. The paper is written by Carlo and Hunter.

Carlo Artieri writes about his new paper: Evolution at two levels of gene expression in yeast which is in press in Genome Research.

Understanding the molecular basis of regulatory variation within and between species has become a major focus of modern genetics. For instance, the majority of identified human disease-risk alleles lie in non-coding regions of the genome, suggesting that they affect gene regulation (Epstein 2009). Furthermore, it has been argued that regulatory changes have played a dominant role in explaining uniquely human attributes (King and Wilson 1975). However, our knowledge of gene regulatory evolution is based almost entirely on studies of mRNA levels, despite both the greater functional importance of protein abundance, and evidence that post-transcriptional regulation is pervasive. The availability of high-throughput methods for measuring mRNA abundance, coupled to the lack of comparable methods at the protein level have contributed to this focus; however, a new method known as ribosome profiling, or ‘riboprofiling’ (Ingolia et al. 2009), has enabled us to study the evolution of translation in much greater detail than was possible before. This method involves the construction of two RNA-seq libraries: one measuring mRNA abundance (the ‘mRNA’ fraction), and the second capturing the portion of the transcriptome that is actively being translated by ribosomes (the ‘Ribo’ fraction). On average, the abundance of genes within the Ribo fraction should be proportional to that of the mRNA fraction. Genes with increased translational efficiency are identified when Ribo fraction abundance is higher than that of the mRNA fraction, whereas reduced translational efficiency is inferred when the opposite is observed.

Riboprofiling of yeast hybrids

We performed riboprofiling on hybrids of two closely related species of budding yeast, Saccharomyces cerevisiae and S. paradoxus, (~5 million years diverged). In hybrids, the parental alleles at a locus share the same trans cellular environment; therefore in the absence of cis-regulatory divergence in transcription, both alleles should be expressed at equal levels. Conversely, cis-regulatory divergence will produce unequal expression of alleles (termed allele-specific expression, or ‘ASE’). Cis-regulatory divergence at the translational level is detected when ASE in the mRNA fraction does not equal that measured in the Ribo fraction, indicating independent divergence across levels. We also performed riboprofiling on the two parental strains, as differences in the expression of orthologs between parental species that cannot be explained by the allelic differences in the hybrids can be attributed to trans divergence. Therefore, by measuring differences in the magnitudes of ASE between the two riboprofiling fractions in the hybrids and the parents, we identified independent cis and trans regulatory changes in both mRNA abundance and translational efficiency.


We found that both cis and trans regulatory divergence in translational efficiency is widespread, and of comparable magnitude to divergence at the mRNA level – indicating that we miss much regulatory evolution by focusing on mRNA in isolation. Moreover, we observed an overwhelming bias towards divergence in opposing parental directions, indicating that while many orthologs had higher mRNA abundance in one parent, they often showed increased translational efficiency in the other parent. This suggests that stabilizing selection acts to maintain more similar protein levels between species than would be expected by comparing mRNA abundances alone.

Translational divergence not associated with TATA boxes

Interestingly, while we confirmed the results of previous studies indicating that both cis and trans regulatory divergence at the mRNA level are associated with the presence of TATA boxes and nucleosome free regions in promoters, no such relationship was found for translational divergence, indicating that these regulatory systems have different underlying architectures.

Evidence for polygenic selection at two levels

We also searched for evidence of polygenic selection in and between both regulatory levels by applying a recently developed modification of Orr’s sign test (Orr 1998; Fraser et al. 2010; Bullard et al. 2010). Under neutral divergence, no pattern is expected with regards to the parental direction of up or down-regulating alleles among orthologs within a functional group (e.g., a pathway or multi-gene complex). However, a significant bias towards one parental lineage is evidence of lineage-specific selection. This analysis uncovered evidence of polygenic selection at both regulatory levels in a number of functional groups. In particular, genes involved in tolerance to heavy metals were enriched for reinforcing divergence in mRNA abundance and translation favoring S. cerevisiae. Increased tolerance to these metals has been observed in S. cerevisiae (Warringer et al. 2011), suggesting that domesticated yeasts have experienced a history of polygenic adaptation across regulatory levels allowing them to grow on metals such as copper.

Finally, using data from the Ribo fraction, we also uncovered multiple instances of conserved stop-codon readthrough, a mechanism via which the ribosome ‘ignores’ the canonical stop codon and produces a C-terminally extended peptide. Only two cases of C-terminal extensions have previously been observed in yeast, though in one such case, PDE2, extension of the canonical protein plays a functional role in regulating cAMP levels (Namy et al. 2002). Our data suggests that this mechanism may occur in dozens of genes, highlighting yet another post-transcriptional mechanism leading to increased proteomic diversity.


By applying a novel approach to a long-standing question, our analysis has revealed that post-transcriptional regulation is abundant, and likely as important as transcriptional regulation. We argue that partitioning the search for the locus of selection into the binary categories of ‘coding’ vs. ‘regulatory’ overlooks the many opportunities for selection to act at multiple regulatory levels along the path from genotype to phenotype.


Artieri CG, Fraser HB. 2013. Evolution at two levels of gene expression in yeast. Genome Research (in press).
Preprint on the arXiv. 

Bullard JH, Mostovoy Y, Dudoit S, Brem RB. 2010. Polygenic and directional regulatory evolution across pathways in Saccharomyces. Proc Natl Acad Sci USA 107: 5058-5063.

Epstein DJ. 2009. Cis-regulatory mutations in human disease. Brief Funct Genomic Proteomic 8: 310–316.

Fraser HB, Moses AM, Schadt EE. 2010. Evidence for widespread adaptive evolution of gene expression in budding yeast. Proc Natl Acad Sci USA 107: 2977-2982.

Ingolia NT, Ghaemmaghami S, Newman JR, Weissman JS. 2009. Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling. Science 324:218-223.

King MC, Wilson AC. 1975. Evolution at two levels in humans and chimpanzees. Science 188: 107-116.

Namy O, Duchateau-Nguyen G, Rousset JP. 2002. Translational readthrough of the PDE2 stop codon modulates cAMP levels in Saccharomyces cerevisiae. Mol Microbiol 43: 641-652.

Orr HA. 1998. Testing natural selection vs. genetic drift in phenotypic evolution using quantitative trait locus data. Genetics 149: 2099-2104.

Warringer J, Zörgö E, Cubillos FA, Zia A, Gjuvsland A, Simpson JT, Forsmark A, Durbin R, Omholt SW, Louis EJ, Liti G, Moses A, Blomberg A. 2011. Trait variation in yeast is defined by population history. PLoS Genet 7 :e1002111.

Which genetic variants determine histone marks?


Blog author Joe Davis is a graduate student with Stephen Montgomery & Carlos Bustamante.

The wealth of genetic variation in the human genome is found not within protein-coding genes but within non-protein coding regions. This comes as no surprise given that only 1% percent of the genome codes for proteins. Until recently, efforts to determine the effects of genetic variation on trait variation and disease have focused on coding regions. Results of genome-wide association studies (GWAS), however, have shown that trait and disease associated variants are often regulatory variants such as expression quantitative trait loci (eQTLs) found in non-coding regions. These results have spurred an effort to understand the functional role of non-coding, regulatory variation. Efforts have thus far relied on characterizing the association between variants and gene expression. This association alone, however, will not reveal the complete functional mechanism by which non-coding variants influence gene expression. Recent efforts have therefore begun to characterize numerous molecular phenotypes such as transcription factor (TF) binding, histone modification, and chromatin state to determine the mechanisms by which regulatory variants affect gene expression.

One issue, four papers

In the November 8 issue of Science, three papers were published that address the role of non-coding genetic variation on TF binding, histone modifications, and chromatin state (i.e. active versus inactive enhancer status). The first study was completed by the Dermitzakis Lab at the University of Geneva. They analyzed three TFs, RNA polymerase II (Pol II), and five histone modifications using chromatin immunoprecipitation and sequencing (ChIP-Seq) in lymphoblastoid cell lines (LCLs) from two parent-child trios [1]. The second was completed by the Pritchard Lab, which has recently moved to Stanford, and the Gilad Lab at the University of Chicago. They identified genetic variants affecting variation in four histone modifications and Pol II occupancy in ten unrelated Yoruba LCLs [2]. The third study was performed by the Snyder Lab at Stanford. They characterized the genetic variation underlying changes in chromatin state using RNA-Seq and ChIP-Seq for four histone modifications and two DNA binding factors in 19 LCLs from diverse populations [3]. This work was the subject of a recent CEHG Evolgenome talk given by Maya Kasowski, the study’s first author. Finally, the fourth study, published in the November 28 issue of Nature, was performed by the Glass Lab at UCSD. They characterized the effect of natural genetic variation between two mouse strains on the binding of two TFs involved in cell differentiation (PU.1 and C/EBPα) using ChIP-Seq [4]. In this post, I will analyze primarily the work presented by the Pritchard Lab, but I strongly recommend reading all four papers to understand the challenges in characterizing non-coding variation and the methods available to do so.


The four studies seek to answer the general question of how regulatory variation affects gene expression. They characterize diverse molecular phenotypes such as histone modifications and TF binding to understand the mechanisms of action for non-coding variants. The Pritchard Lab focused their study on four histone modifications (three active and one repressive: H3K4me3, H3K4me1, H3K27ac, and H3K27me3, respectively) and Pol II occupancy.


Histone modifications 101

Histone modifications refer to the addition of chemical groups such as methyl or acetyl to specific amino acids on the tails of histone proteins comprising the nucleosome. These chemical groups are referred to as histone marks. They can serve a wide range of functions, but in general they are associated with the accessibility of a chromatin region. For example, the tri-methylation of lysine 4 of histone 3 (H3K4me3) is associated with increased chromatin accessibility and gene activation. On the other hand, increased levels of the repressive mark H3K27me3 (tri-methylation of lysine 27 of histone 3) at promoters is associated with gene inactivation.

Histone mark levels are measured in a high-throughput manner using ChIP-Seq. Briefly, an antibody targeting the mark of interest is used to pull down modified genomic regions. These immunoprecipitated regions are then sequenced to determine which genomic segments are modified and at what level. The procedure usually requires a large number of cells (on the order of 10^7). Therefore, the modification level is, in some ways, a population level measurement. Analysis of ChIP-Seq data typically involves testing for genomic regions with more reads than expected by chance. These regions, ranging from 200bp to 1000bp or more, are referred to as peaks that represent a modification level above the genomic background. Repressive marks like H3K27me3 tend to have broad peak regions, while activating marks like H3K4me3 can have much tighter peaks.

Since modification levels represent measurements on a population of cells and histone residues can have multiple modifications, genomic regions can show evidence for multiple marks. The combinations of these marks over a region can mark the function of the region. For example, regions with high levels of H3K27ac and a high ratio of H3K4me1 to H3K4me3 can mark active enhancer regions. Until now, the variation of these marks between individuals and the genetic cause of this variation was uncharacterized. Moreover, the causal impact of these marks remains unknown. Do they alter gene expression directly or are they altered by gene regulation? Therefore, the two guiding questions for this study are:

1. What genetic variants influence histone modifications?

2. Are these modifications “a cause or a consequence of gene regulation?”

Variation in histone modifications, a real whodunit

The authors first seek to identify and characterize genetic variants that influence histone marks. They generated ChIP-Seq data for the four histone marks and Pol II in LCLs derived from ten unrelated Yoruba individuals who were previously genotyped as part of the 1000 Genomes Project. Similar studies of regulatory variants such as eQTL studies require large sample sizes to detect the effects of regulatory variants that often lie outside the gene. Unlike eQTL studies, histone marks cover fairly broad regions often encompassing causal regulatory variants. As a result, the authors can use a smaller sample size and still be confident about interrogating the effects of causal regulatory SNPs. The authors developed a statistical test that models total read depth between individuals and allelic imbalance between haplotypes within individuals to increase power to detect cis-QTLs (i.e. variants that affect histone marks and Pol II occupancy nearby in the genome). Using this method, they identified over 1200 distinct QTLs for histone marks and Pol II occupancy (FDR 20%).

The authors then analyze these histone mark and Pol II QTLs to determine the overlap of these variants with other known regulatory variants. The hypothesis is that regulatory variants that affect gene expression will have effects on diverse molecular phenotypes. Therefore, variants that influence histone marks and Pol II should show significant overlap with known regulatory variants such as eQTLs and DNase I sensitivity QTLs (dsQTLs). DNase I sensitivity is a measure of chromatin accessibility with higher sensitivity associated with higher accessibility. The Pritchard Lab mapped eQTLs and dsQTLs in a larger sample of ~75 Yoruban LCLs in two previous studies that I also recommend reading [5,6]. Their analysis revealed an enrichment of low p-values for dsQTLs and, to a lesser extent, eQTLs when tested as histone mark and Pol II QTLs. In addition, the authors observed a coordinated change in multiple molecular phenotypes at dsQTLs and eQTLs. For example, higher levels of the three histone active marks were observed at dsQTLs for the more DNase I sensitive genotype. At eQTLs, H3K4me3, H3K27ac, and Pol II levels were higher for individuals with the high expression genotype. These results show that non-coding regulatory variants impact multiple molecular phenotypes ranging from chromatin accessibility and transcription to histone modifications. The authors provide strong evidence in response to their first guiding question, namely that non-coding regulatory polymorphisms associate with variation in histone marks and Pol II.

TFs and a question of directionality

The authors then turned to addressing the questions of causality for these marks. To do so, they analyze genetic variants in TF binding sites. The main hypothesis is that regulatory variants that alter a TFBS will modify TF binding which will cause changes in histone mark and Pol II levels nearby. If this is the case, then changes in histone marks are a consequence of how strong the TF binding site is. On the other hand, if these marks were causal, polymorphisms in TF binding sites would not be expected to show strong association with changes in these marks.

To test their hypothesis, the authors examine ~11.5K TF binding sites with polymorphisms heterozygous in at least 1 of their 10 individuals. They calculate the change in position weight matrix (PWM) score between the two alleles for polymorphic TF binding sites within each individual. They then test for significant association between this change in PWM and allelic imbalance of ChIP-Seq reads at nearby heterozygous sites. The idea is that if a variant improves (or disrupts) TF binding for one allele at a TF binding site then active histone marks nearby on the same allele will increase (or decrease). Repressive histone marks (in this case H3K27me3) are expected to have the opposite response. Indeed, when they apply their test, they find a significant positive association for the active marks and a negative association for the repressive mark. This result supports the hypothesis of changes histone marks as a consequence of TF binding and gene regulation. However, this result does not rule out other possibilities. Histone marks can still play a causal role in the establishment of TF binding. In other words, the relationship between TF binding and histone marks does not have to be unidirectional. In addition, there is evidence that long non-coding RNAs may play a role in the establishment and regulation of histone marks.

dsQTLs and eQTLs, a match made on chromatin

In their final analysis, the authors examine dsQTLs that are also eQTLs. Since these variants associate with both gene expression and chromatin accessibility at distal regulatory regions (>5kb from associated TSS), the authors can assign the regulatory region to a specific gene. A variant that is both a dsQTL and an eQTL likely disrupts a distal regulatory region. In addition to disrupting the accessibility of the regulatory region, the variant also perturbs the expression of a gene influenced by the regulatory region. For example, a variant may decrease the chromatin accessibility of an enhancer region and thereby decrease the level of active histone marks for the enhancer. This decreased enhancer activity can result in decreased transcription from a nearby gene and similarly decreased active mark levels for the gene. Therefore, the hypothesis guiding this analysis is that variants influencing the histone marks of a distal regulatory region will have a coordinated effect on histone marks at genes under the control of the regulatory region. The authors examine the allelic imbalance in ChIP-Seq reads at regulatory regions and their associated transcription start sites (TSS). Indeed, the authors observe that variants that increase DNase I sensitivity have a significant positive allelic imbalance for active marks at both the regulatory region and the TSS. The opposite is true for the repressive mark. This result again emphasizes the complexity of gene regulation and the impact of non-coding variation. Not only do regulatory variants influence diverse molecular phenotypes nearby, they can direct changes at distal loci. As the authors note, this coordinated change in histone marks between distal regions possibly reflects the 3D organization of chromatin. Regulatory variants that impact chromatin looping interactions between distal regulatory regions and genes may cause changes in activity levels for both the gene and the regulatory region.


This paper provides clear evidence that regulatory variation has very complex impacts affecting multiple and diverse molecular phenotypes at multiple regions simultaneously. This complexity implies potentially numerous and diverse mechanisms by which regulatory variants act on gene regulation. The authors set out to find evidence for one of these mechanisms, namely perturbation of TF binding sites. They begin by showing that variation in histone modifications has a strong genetic basis and that the polymorphisms influencing these marks overlap with known regulatory variants such as eQTLs. They then show that polymorphisms in TF binding sites associate with changes in histone marks, providing evidence for directionality in the relationship between these marks and gene regulation. In essence, their results suggest that histone modifications are directed, at least in part, by TF binding. Finally, they find that regulatory variants can have an impact on the molecular phenotypes of distal regions.

I found this paper, as well as the other three previously mentioned, to be quite interesting. I think these papers show that our understanding of gene regulation is still very simplistic. With the advent of high-throughput molecular assays like ChIP-Seq and DNase-Seq, we can begin to interrogate the complex role of regulatory variation on many phenotypes. In doing so, it is of primary interest to ask questions regarding directionality. How do a given set of molecular phenotypes relate? Do these phenotypes represent a cause or a consequence of genome function? How do the diverse elements of gene regulation function together to build complex phenotypes?


[1] Kilpinen, H., Waszak, S. M., Gschwind, A. R., Raghav, S. K., Witwicki, R. M., Orioli, A., Dermitzakis, E. T., et al. (2013). Coordinated Effects of Sequence Variation on DNA Binding, Chromatin Structure, and Transcription. Science (New York, N.Y.), 744. doi:10.1126/science.1242463

[2] McVicker, G., van de Geijn, B., Degner, J. F., Cain, C. E., Banovich, N. E., Raj, A., Pritchard, J. K., et al. (2013). Identification of Genetic Variants That Affect Histone Modifications in Human Cells. Science (New York, N.Y.), 747. doi:10.1126/science.1242429

[3] Kasowski, M., Kyriazopoulou-Panagiotopoulou, S., Grubert, F., Zaugg, J. B., Kundaje, A., Liu, Y., Snyder, M., et al. (2013). Extensive Variation in Chromatin States Across Humans. Science (New York, N.Y.), 750. doi:10.1126/science.1242510

[4] Heinz, S., Romanoski, C. E., Benner, C., Allison, K. a, Kaikkonen, M. U., Orozco, L. D., & Glass, C. K. (2013). Effect of natural genetic variation on enhancer selection and function. Nature, 503(7477), 487–492. doi:10.1038/nature12615

[5] Pickrell, J. K., Marioni, J. C., Pai, A. a, Degner, J. F., Engelhardt, B. E., Nkadori, E., Pritchard, J. K., et al. (2010). Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature, 464(7289), 768–72. doi:10.1038/nature08872

[6] Degner, J. F., Pai, A. a, Pique-Regi, R., Veyrieras, J.-B., Gaffney, D. J., Pickrell, J. K., Pritchard, J. K., et al. (2012). DNase I sensitivity QTLs are a major determinant of human expression variation. Nature, 482(7385), 390–4. doi:10.1038/nature10808

Paper author Jonathan Pritchard is a professor in the Departments of Genetics and Biology.

Paper author Jonathan Pritchard is a professor in the Departments of Genetics and Biology.

Biology Thinks Big! Symposium great success


Blog author: Jeremy is a graduate student in the Hadly lab.

Jeremy Hsu is a graduate student in the Hadly lab. He writes about the Biology Thinks Big! Symposium which he organized. It was a great success. This post was previously published on the Hadly Lab blog

On Wednesday evening, we held the first Stanford Biology Thinks Big! symposium, where four Stanford faculty from across the biosciences each gave a short, ten-minute TED-like talk about a big idea in their research or their field.

I first starting thinking about hosting such an event last year, as a way to showcase the amazing diversity of scientific research and viewpoints within the biosciences at Stanford, and to introduce people to science and research in an accessible, engaging format. To that end, I was thrilled with the four speakers we had, with John Boothroyd, Mary Teruel, Carlos Bustamante, and Susan McConnell each representing different departments here at Stanford. Even with such an amazing line-up of faculty, though, I have to admit that I was a bit nervous about turnout for the event – this is the first time anyone’s tried doing such an event here, and I had no idea what to expect.


The first indication that my nervousness was unwarranted, though, came when people starting arriving thirty minutes prior to the start time! More and more people filled in, and the response was overwhelming – not only was our room at maximum capacity (with the aisles filled with people sitting and standing), but our overflow room next door was also absolutely jam packed, with people still trying to peer in from outside the doors.


It was clear that the audience was very enthusiastic and eager about this event. I took a moment to poll the audience at the beginning, and there was a good distribution of undergraduates, graduate students, post-docs, and general community members!


Despite some of our A/V equipment not cooperating in our overflow room, the excitement of the audience was palpable. Dr. Boothroyd (Department of Microbiology and Immunology) began by talking about his work on a specific microbe, and Dr. Teruel (Department of Chemical and Systems Biology) highlighted the mechanisms behind the development of fat cells.



Next, Dr. Bustamante (Department of Genetics) engaged the audience in a wide-ranging discussion on genetics, before Dr. McConnell (Department of Biology) wrapped up the symposium with an inspiring message on science and conservation, using her own photographs of nature and animals.


It was a great experience to hear four different perspectives on science, and to listen to such a diversity of ideas. Afterward, I was thrilled to see so many students come up to each of the speakers to ask more questions about their work, and many of the students even asked about advice regarding finding and joining a lab or the graduate school process!

Overall, it was clear that everyone was eager to hear the big ideas that Stanford researchers have, and I’m already looking forward to the next time we do this!

— Jeremy