Testing for selection in regulatory sequences using an empirical mutational distribution

How to detect selection?

Blog author Dave Yuan is a postdoc in Dmitri Petrov’s lab.

Detecting and quantifying selection in genomes is a fundamental task for evolutionary biologists. A common approach compares patterns of polymorphism and divergence between synonymous and non-synonymous sites. Synonymous sites are expected to be nearly neutral, so mutations at these sites are expected to be fixed or lost through genetic drift or draft. At non-synonymous sites, however, mutations may be fixed by positive selection or lost through purifying selection. If many non-synonymous mutations in a specific gene are fixed by positive selection, then these sites as a group will show a high evolutionary rate. Conversely, if most non-synonymous mutations in a gene are lost through purifying selection, then these sites will show a low evolutionary rate. Importantly, to determine whether the rate is high or low, we need a group of sites that can serve as a neutral comparison. For coding regions, synonymous sites are a natural choice for this comparison [McDonald & Kreitman 1991; Bustamante et al. 2001; Keightley & Eyre-Walker 2007].
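
As a rough illustration of the comparison described above, the sketch below summarizes a McDonald-Kreitman-style 2x2 table of fixed differences and polymorphisms at non-synonymous and synonymous sites. The function name and the counts are illustrative assumptions, not data from any paper discussed here, and the sketch computes only the ratio summaries rather than a significance test.

```python
def mk_summary(dn, ds, pn, ps):
    """Summarize a McDonald-Kreitman-style table.

    dn, ds: fixed differences at non-synonymous / synonymous sites
    pn, ps: polymorphisms at non-synonymous / synonymous sites
    (all four counts are hypothetical inputs for illustration)
    """
    if min(dn, ds, pn, ps) <= 0:
        raise ValueError("all counts must be positive for these ratios")
    # Neutrality index: ratio of the polymorphism ratio to the divergence ratio.
    # Values below 1 indicate an excess of non-synonymous divergence.
    neutrality_index = (pn / ps) / (dn / ds)
    # alpha = 1 - NI is often read as the fraction of non-synonymous
    # substitutions attributable to positive selection.
    alpha = 1.0 - neutrality_index
    return {"neutrality_index": neutrality_index, "alpha": alpha}


# Illustrative counts only.
summary = mk_summary(dn=7, ds=17, pn=2, ps=42)
print(summary)
```

Synonymous sites play the role of the neutral yardstick in both ratios, which is exactly the ingredient that is hard to find for non-coding sequence.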

What about non-coding sequences?

Much of the genome, however, consists of non-coding sequence. Such sequence may contain regulatory information critical for gene expression, and changes to it are important for phenotypic evolution. Detecting selection on regulatory variation is thus of interest to evolutionary biologists, but it has been challenging because functional annotation of non-coding DNA tends to be sparse and we do not yet understand the “regulatory genetic code.” Although selection tests developed for coding sequence have been applied to non-coding sequence [reviewed in Zhen & Andolfatto 2012], a common impediment has been the choice of a group of sites that can serve as a neutral comparison. One solution is to generate a large number of mutations in a specific region of the genome and determine whether these mutations have functional effects. Mutations with no apparent functional effect can then serve as the neutral comparison for other groups of mutations. In a recent paper published in Molecular Biology and Evolution, graduate students Justin Smith and Kimberly McManus and CEHG faculty member Hunter Fraser describe their development and application of a novel method that uses such a null distribution of mutations to test for selection on variation in mammalian regulatory elements.

Null distribution of random mutations

Mutagenesis technique used by Patwardhan et al. (2012) to generate a comprehensive collection of cis-regulatory element mutants and test their phenotypes in vivo (figures from Patwardhan et al., 2012)

Generating an empirical null distribution as the neutral comparison is not a trivial task. A sufficiently large (ideally comprehensive) set of mutations needs to be engineered into the regulatory element of interest, and the mutational effects or phenotypes need to be assessed. This distribution of phenotypes is then the null distribution against which the observed variation is compared to test for selection. Fortunately, recent developments in mutagenesis coupled with high-throughput sequencing have made this possible at high resolution. Smith et al. chose data from one such mutagenesis platform that generated over 640,000 mutant haplotypes across three mammalian enhancer sequences [Patwardhan et al. 2012]. Specifically, the library of mutant enhancers was made using polymerase cycling assembly (PCA) with oligonucleotides containing 2-3% degeneracy. All possible single nucleotide variants of the wild-type enhancer were thus represented. The library of enhancers was then cloned into a plasmid upstream of a reporter gene along with unique identification tags. This plasmid library was both sequenced, to identify the tag corresponding to each mutant enhancer, and injected into mice for an in vivo reporter assay. Finally, sequencing of cDNA from the mouse liver quantified the transcriptional abundance of the tags and hence the phenotypic effects of the mutations. Each mutation could thus be classified as up-regulating the reporter gene, down-regulating it, or having no effect.

Developing a test to compare mutations and observed variation

With this dataset, Smith et al. had a comprehensive spectrum of random mutations and their phenotypic effects to use as the null distribution. This allowed them to create metrics for regulatory variation that are similar to the commonly used Ka/Ks ratio, with Ka being the rate of non-synonymous change and Ks the rate of synonymous change (no functional impact on the protein and hence neutral) [Kimura 1977]. The in vivo reporter assay revealed mutations with no phenotypic impact (i.e. no change in transcriptional abundance compared to wild-type), and these are analogous to synonymous or neutral changes. The new metrics are dubbed Ku/Kn and Kd/Kn, where Ku is the rate of change for up-regulatory mutations (those with increased expression in the in vivo reporter assay), Kd is the rate of change for down-regulatory mutations, and Kn is the rate of change for mutations that did not change expression (silent or neutral mutations).

Metrics to compare observed mutations in the phylogeny to possible mutations seen in the mutagenesis data (Figure 1 from Smith et al 2013).

For their analysis, the authors chose enhancer sequences from species within the same phylogenetic orders as the mutagenized enhancers. In addition to enhancer sequences from extant species, the authors also reconstructed ancestral sequences throughout the phylogeny. Combined with the mutagenesis data, each K metric at a node in the phylogeny is then calculated as the ratio of the observed (i.e. in ancestors and extant species) frequency of silent, up-, or down-regulatory polymorphisms to the frequency of all possible silent, up-, or down-regulatory mutations, respectively. Selection is inferred by comparing the rate of up- or down-regulatory polymorphisms to the rate of silent polymorphisms (i.e. Ku/Kn or Kd/Kn). A comparatively low rate of up- or down-regulatory mutations (Ku/Kn or Kd/Kn < 1) would suggest purifying selection on the polymorphisms, while a comparatively high rate (Ku/Kn or Kd/Kn > 1) would suggest positive selection. Smith et al. applied their new test for selection to the three enhancers studied by Patwardhan et al. [2012] across the respective phylogenetic orders: LTV1 in rodents and ALDOB and ECR11 in primates. They detected purifying selection against down-regulatory polymorphisms for all three enhancers, while positive selection for up-regulatory polymorphisms was also detected for LTV1.
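
The ratios described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name, the dictionary layout, and the toy data are all hypothetical, and each mutation is assumed to have already been assigned to one of three classes from the mutagenesis data.

```python
from collections import Counter

CLASSES = ("up", "down", "neutral")


def k_ratios(possible_effects, observed_mutations):
    """Compute Ku/Kn and Kd/Kn for one branch or node.

    possible_effects: dict mapping (position, alt_base) -> "up" | "down" | "neutral"
        for every possible single-nucleotide mutation (hypothetical layout).
    observed_mutations: iterable of (position, alt_base) actually observed.
    """
    possible = Counter(possible_effects.values())
    observed = Counter(possible_effects[m] for m in observed_mutations)
    for c in CLASSES:
        if possible[c] == 0:
            raise ValueError(f"no possible mutations in class {c!r}")
    # K for a class = observed count / count of all possible mutations of that class
    k = {c: observed[c] / possible[c] for c in CLASSES}
    if k["neutral"] == 0:
        raise ValueError("Kn is zero; the ratios are undefined")
    return {"Ku/Kn": k["up"] / k["neutral"], "Kd/Kn": k["down"] / k["neutral"]}


# Toy data: 4 up-regulatory, 6 down-regulatory and 10 silent possible mutations.
effects = ["up"] * 4 + ["down"] * 6 + ["neutral"] * 10
possible_effects = {(pos, "A"): e for pos, e in enumerate(effects)}
# Observed on a branch: 1 up, 1 down, 5 silent.
observed = [(0, "A"), (4, "A"), (10, "A"), (11, "A"), (12, "A"), (13, "A"), (14, "A")]

ratios = k_ratios(possible_effects, observed)
print(ratios)
```

In this toy case both ratios fall below 1, the pattern that would be read as purifying selection. A real analysis would also need a way to judge whether a departure from 1 is statistically meaningful.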

Detecting selection using an empirically-derived null distribution

Making evolutionary sense of variation in the regulatory regions of the genome remains more challenging than for coding sequences. We still do not have a “neutral model of regulatory evolution” against which to compare observed variation. Perhaps the most exciting element of this paper, at least for me, is the use of an empirically derived null distribution as the neutral expectation for this evolutionary inquiry. Patwardhan and the Shendure group at the University of Washington had earlier published a mutagenesis technique that generated a wide spectrum of mutants [Patwardhan et al. 2009]. At that time, I was becoming interested in questions about the “grammar” of gene regulation, the functional characterization of regulatory sequences, and how to understand regulatory variation in evolutionary terms. It was thus very exciting to see both a massively comprehensive interrogation of the mutational consequences in a regulatory element and the clever application of these data to a challenging evolutionary question.

One of the strengths of the Smith et al. study is its reliance on a spectrum of random mutations as the null distribution. Mutations, the original source of all genetic variation, arise in a random manner. Those that are not lethal may persist by chance within a population and then eventually reach certain frequencies or even fixation under selection. Because the null distribution used by Smith et al. comprises all possible mutations, it represents the mutation spectrum prior to the actions of drift or selection. It is thus an even better neutral expectation than synonymous mutations, which may not be truly neutral. In addition, using such an empirical null distribution to test for selection is not limited to regulatory variation; it can also be applied to coding sequence variation to reduce bias and false signals. Furthermore, by categorizing mutational effects as up- and down-regulatory, different modes of selection acting on a regulatory element can be teased apart. The interspersion of mutations (silent, up-, or down-regulatory) across the regulatory element also reduces confounding effects of regional variation in mutation rate.

As with all science, there is more to hope for in the future. Toward the end of the paper, the authors discuss the prospect of more high-resolution mutagenesis data and, perhaps more importantly in terms of accessibility and ease of use, the ability to test for selection using limited mutagenesis data. Tissue- and organism-specificity of mutational effects may also be investigated further, as may the inclusion of mutation types other than single nucleotide substitutions (e.g. insertion/deletion, copy number variation) and the consideration of genomic regional context (e.g. effect of chromatin or epistasis). Nevertheless, this study represents an exciting new method to investigate regulatory variation in evolutionary contexts, one whose development and further application I look forward to seeing.


Bustamante CD, Wakeley J, Sawyer S, and Hartl DL. Directional Selection and the Site-Frequency Spectrum. Genetics 159:1779-1788 (2001).

Keightley PD and Eyre-Walker A. Joint inference of the distribution of fitness effects of deleterious mutations and population demography based on nucleotide polymorphism frequencies. Genetics 177:2251-2261 (2007).

Kimura M. Preponderance of synonymous changes as evidence for the neutral theory of molecular evolution. Nature 267:275-276 (1977).

McDonald JH and Kreitman M. Adaptive Protein Evolution at the Adh Locus in Drosophila. Nature 351:652-654 (1991).

Patwardhan RP, Lee C, Litvin O, Young DL, Pe’er D, and Shendure J. High-resolution analysis of DNA regulatory elements by synthetic saturation mutagenesis. Nature Biotechnology 27:1173-1175 (2009).

Patwardhan RP, Hiatt JB, Witten DM, Kim MJ, Smith RP, May D, Lee C, Andrie JM, Lee S-I, Cooper GM, et al. Massively parallel functional dissection of mammalian enhancers in vivo. Nature Biotechnology 30:265-270 (2012).

Smith JD, McManus KF, and Fraser HB. A Novel Test for Selection on cis-Regulatory Elements Reveals Positive and Negative Selection Acting on Mammalian Transcriptional Enhancers. Molecular Biology and Evolution 30:2509-2518 (2013).

Zhen Y and Andolfatto P. Methods to Detect Selection on Noncoding DNA. In Evolutionary Genomics: Statistical and Computational Methods, Volume 2, Methods in Molecular Biology, vol. 856, edited by Anisimova M. Humana Press, New York (2012).

Paper author Justin Smith is a graduate student in Hunter Fraser's lab.

Taking studies of regulatory evolution to the next level: translation

Carlo Artieri, a postdoc in the group of Hunter Fraser, wrote this blog post. The paper is written by Carlo and Hunter.

Carlo Artieri writes about his new paper, “Evolution at two levels of gene expression in yeast,” which is in press in Genome Research.

Understanding the molecular basis of regulatory variation within and between species has become a major focus of modern genetics. For instance, the majority of identified human disease-risk alleles lie in non-coding regions of the genome, suggesting that they affect gene regulation (Epstein 2009). Furthermore, it has been argued that regulatory changes have played a dominant role in explaining uniquely human attributes (King and Wilson 1975). However, our knowledge of gene regulatory evolution is based almost entirely on studies of mRNA levels, despite both the greater functional importance of protein abundance and evidence that post-transcriptional regulation is pervasive. The availability of high-throughput methods for measuring mRNA abundance, coupled with the lack of comparable methods at the protein level, has contributed to this focus. However, a new method known as ribosome profiling, or ‘riboprofiling’ (Ingolia et al. 2009), has enabled us to study the evolution of translation in much greater detail than was possible before. This method involves the construction of two RNA-seq libraries: one measuring mRNA abundance (the ‘mRNA’ fraction), and the second capturing the portion of the transcriptome that is actively being translated by ribosomes (the ‘Ribo’ fraction). On average, the abundance of a gene in the Ribo fraction should be proportional to its abundance in the mRNA fraction. Increased translational efficiency is inferred when a gene's Ribo fraction abundance is higher than its mRNA fraction abundance, whereas reduced translational efficiency is inferred when the opposite is observed.
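
The Ribo-versus-mRNA comparison can be sketched as a per-gene log ratio. This is a simplified illustration, not the pipeline used in the paper: the function name, the gene names, the read counts, the library-size scaling, and the pseudocount are all assumptions made for the example.

```python
import math


def log2_translational_efficiency(ribo_counts, mrna_counts, pseudocount=0.5):
    """Return per-gene log2(Ribo / mRNA) after scaling each library by its total.

    ribo_counts, mrna_counts: dicts mapping gene name -> read count
    pseudocount: small value added to avoid division by zero (an assumption)
    """
    ribo_total = sum(ribo_counts.values())
    mrna_total = sum(mrna_counts.values())
    efficiency = {}
    for gene in sorted(ribo_counts.keys() & mrna_counts.keys()):
        ribo = (ribo_counts[gene] + pseudocount) / ribo_total
        mrna = (mrna_counts[gene] + pseudocount) / mrna_total
        # > 0: more ribosome occupancy than mRNA level predicts; < 0: less
        efficiency[gene] = math.log2(ribo / mrna)
    return efficiency


# Hypothetical counts for three made-up genes.
ribo = {"geneA": 200, "geneB": 100, "geneC": 100}
mrna = {"geneA": 100, "geneB": 100, "geneC": 200}

te = log2_translational_efficiency(ribo, mrna)
for gene, value in te.items():
    print(gene, round(value, 3))
```

Here geneA would be read as having increased translational efficiency, geneC as reduced, and geneB as translated in proportion to its mRNA level.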

Riboprofiling of yeast hybrids

We performed riboprofiling on hybrids of two closely related species of budding yeast, Saccharomyces cerevisiae and S. paradoxus (~5 million years diverged). In hybrids, the parental alleles at a locus share the same trans cellular environment; therefore, in the absence of cis-regulatory divergence in transcription, both alleles should be expressed at equal levels. Conversely, cis-regulatory divergence will produce unequal expression of alleles (termed allele-specific expression, or ‘ASE’). Cis-regulatory divergence at the translational level is detected when ASE in the mRNA fraction does not equal that measured in the Ribo fraction, indicating independent divergence across levels. We also performed riboprofiling on the two parental strains, because differences in the expression of orthologs between parental species that cannot be explained by the allelic differences in the hybrids can be attributed to trans divergence. Therefore, by measuring differences in the magnitudes of ASE between the two riboprofiling fractions in the hybrids and the parents, we identified independent cis and trans regulatory changes in both mRNA abundance and translational efficiency.
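
The logic of this decomposition can be written out on a log scale. The sketch below is a schematic reading of the paragraph above rather than the statistical procedure used in the paper; the function names and the abundance values are hypothetical, and a real analysis would work from replicated read counts with appropriate significance tests.

```python
import math


def log2_ratio(scer, spar):
    """log2 of S. cerevisiae abundance over S. paradoxus abundance."""
    return math.log2(scer / spar)


def decompose(hybrid_mrna, hybrid_ribo, parent_mrna, parent_ribo):
    """Split divergence into cis and trans parts at two regulatory levels.

    Each argument is a (S. cerevisiae, S. paradoxus) pair of abundances:
    allele-specific values in the hybrid, ortholog values in the parents.
    """
    # mRNA level: hybrid ASE reflects cis; the rest of the parental difference is trans.
    cis_mrna = log2_ratio(*hybrid_mrna)
    trans_mrna = log2_ratio(*parent_mrna) - cis_mrna
    # Translation: ASE in the Ribo fraction beyond that in the mRNA fraction is cis.
    cis_translation = log2_ratio(*hybrid_ribo) - cis_mrna
    total_translation = log2_ratio(*parent_ribo) - log2_ratio(*parent_mrna)
    trans_translation = total_translation - cis_translation
    return {
        "cis_mrna": cis_mrna,
        "trans_mrna": trans_mrna,
        "cis_translation": cis_translation,
        "trans_translation": trans_translation,
    }


# Hypothetical values for one gene.
result = decompose(
    hybrid_mrna=(200, 100),
    hybrid_ribo=(100, 100),
    parent_mrna=(400, 100),
    parent_ribo=(200, 100),
)
print(result)
```

In this made-up case the S. cerevisiae allele has higher mRNA abundance in cis, while the cis effect on translation points the other way, so the two alleles end up equally represented in the hybrid Ribo fraction.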


We found that both cis and trans regulatory divergence in translational efficiency are widespread and of comparable magnitude to divergence at the mRNA level, indicating that we miss much regulatory evolution by focusing on mRNA in isolation. Moreover, we observed an overwhelming bias towards divergence in opposing parental directions: while many orthologs had higher mRNA abundance in one parent, they often showed increased translational efficiency in the other parent. This suggests that stabilizing selection acts to maintain more similar protein levels between species than would be expected by comparing mRNA abundances alone.

Translational divergence not associated with TATA boxes

Interestingly, while we confirmed the results of previous studies indicating that both cis and trans regulatory divergence at the mRNA level are associated with the presence of TATA boxes and nucleosome free regions in promoters, no such relationship was found for translational divergence, indicating that these regulatory systems have different underlying architectures.

Evidence for polygenic selection at two levels

We also searched for evidence of polygenic selection within and between both regulatory levels by applying a recently developed modification of Orr’s sign test (Orr 1998; Fraser et al. 2010; Bullard et al. 2010). Under neutral divergence, no pattern is expected with regard to the parental direction of up- or down-regulating alleles among orthologs within a functional group (e.g., a pathway or multi-gene complex). However, a significant bias towards one parental lineage is evidence of lineage-specific selection. This analysis uncovered evidence of polygenic selection at both regulatory levels in a number of functional groups. In particular, genes involved in tolerance to heavy metals were enriched for reinforcing divergence in mRNA abundance and translation favoring S. cerevisiae. Increased tolerance to these metals has been observed in S. cerevisiae (Warringer et al. 2011), suggesting that domesticated yeasts have experienced a history of polygenic adaptation across regulatory levels allowing them to grow on metals such as copper.
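
The intuition behind a sign test of this kind can be shown with an exact binomial test on the direction of divergence within one functional group. This is a deliberately simplified stand-in, not the modified test used in the cited papers: the function name, the gene counts, and the background proportion are illustrative assumptions.

```python
from math import comb


def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value.

    k: genes in the group whose S. cerevisiae allele is the up-regulating one
    n: genes in the group showing divergence
    p: expected proportion under neutrality (e.g. a background rate; assumed)
    """
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    # Sum the probability of every outcome at least as unlikely as the observed one.
    threshold = pmf[k] * (1 + 1e-9)
    return min(1.0, sum(x for x in pmf if x <= threshold))


# Hypothetical group: 9 of 10 divergent genes favor the same parental lineage.
p_value = binomial_two_sided_p(k=9, n=10, p=0.5)
print(round(p_value, 4))
```

A small p-value means the alleles in the group point in one parental direction more often than chance would predict, which is the signature interpreted as lineage-specific selection.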

Finally, using data from the Ribo fraction, we also uncovered multiple instances of conserved stop-codon readthrough, a mechanism by which the ribosome ‘ignores’ the canonical stop codon and produces a C-terminally extended peptide. Only two cases of C-terminal extensions have previously been observed in yeast, though in one such case, PDE2, extension of the canonical protein plays a functional role in regulating cAMP levels (Namy et al. 2002). Our data suggest that this mechanism may occur in dozens of genes, highlighting yet another post-transcriptional mechanism leading to increased proteomic diversity.


By applying a novel approach to a long-standing question, our analysis has revealed that post-transcriptional regulation is abundant, and likely as important as transcriptional regulation. We argue that partitioning the search for the locus of selection into the binary categories of ‘coding’ vs. ‘regulatory’ overlooks the many opportunities for selection to act at multiple regulatory levels along the path from genotype to phenotype.


Artieri CG, Fraser HB. 2013. Evolution at two levels of gene expression in yeast. Genome Research (in press).
Preprint on the arXiv. 

Bullard JH, Mostovoy Y, Dudoit S, Brem RB. 2010. Polygenic and directional regulatory evolution across pathways in Saccharomyces. Proc Natl Acad Sci USA 107: 5058-5063.

Epstein DJ. 2009. Cis-regulatory mutations in human disease. Brief Funct Genomic Proteomic 8: 310–316.

Fraser HB, Moses AM, Schadt EE. 2010. Evidence for widespread adaptive evolution of gene expression in budding yeast. Proc Natl Acad Sci USA 107: 2977-2982.

Ingolia NT, Ghaemmaghami S, Newman JR, Weissman JS. 2009. Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling. Science 324:218-223.

King MC, Wilson AC. 1975. Evolution at two levels in humans and chimpanzees. Science 188: 107-116.

Namy O, Duchateau-Nguyen G, Rousset JP. 2002. Translational readthrough of the PDE2 stop codon modulates cAMP levels in Saccharomyces cerevisiae. Mol Microbiol 43: 641-652.

Orr HA. 1998. Testing natural selection vs. genetic drift in phenotypic evolution using quantitative trait locus data. Genetics 149: 2099-2104.

Warringer J, Zörgö E, Cubillos FA, Zia A, Gjuvsland A, Simpson JT, Forsmark A, Durbin R, Omholt SW, Louis EJ, Liti G, Moses A, Blomberg A. 2011. Trait variation in yeast is defined by population history. PLoS Genet 7: e1002111.