Testing for selection in regulatory sequences using an empirical mutational distribution

How to detect selection?

Blog author Dave Yuan is a postdoc in Dmitri Petrov’s lab.

Detecting and quantifying selection in genomes is a fundamental task for evolutionary biologists. A common approach compares patterns of polymorphism and divergence between synonymous and non-synonymous sites. Synonymous sites are expected to be nearly neutral, so mutations at these sites are expected to be fixed or lost through genetic drift or draft. At non-synonymous sites, however, mutations may be fixed by positive selection or lost to purifying selection. If many non-synonymous mutations in a specific gene are fixed by positive selection, then these sites as a group will show a high evolutionary rate. Conversely, if most non-synonymous mutations in a gene are lost to purifying selection, then these sites will show a low evolutionary rate. Importantly, to determine whether the rate is high or low, we need a group of sites that can serve as a neutral comparison. For coding regions, synonymous sites are a natural choice for this comparison [McDonald & Kreitman 1991; Bustamante et al. 2001; Keightley & Eyre-Walker 2007].

What about non-coding sequences?

Much of the genome, however, consists of non-coding sequence. Such sequence may contain regulatory information critical for gene expression, and changes to it are important for phenotypic evolution. Detecting selection on regulatory variation is thus of interest to evolutionary biologists, but it has been challenging because functional annotation of non-coding DNA tends to be sparse and we do not yet understand the “regulatory genetic code.” Although selection tests developed for coding sequence have been applied to non-coding sequence [reviewed in Zhen & Andolfatto 2012], a common impediment has been the choice of a group of sites that can serve as a neutral comparison. One solution is to generate a large number of mutations in a specific region of the genome and determine whether these mutations have functional impacts. Mutations with no apparent functional effect can then serve as the neutral group against which other groups are compared. In a recent paper published in Molecular Biology and Evolution, graduate students Justin Smith and Kimberly McManus and CEHG faculty member Hunter Fraser describe the development and application of a novel method that uses such a null distribution of mutations to test for selection on variation in mammalian regulatory elements.

Null distribution of random mutations

Mutagenesis technique used by Patwardhan et al. (2012) to generate a comprehensive collection of cis-regulatory element mutants and test their phenotypes in vivo (figures from Patwardhan et al., 2012)

Generating an empirical null distribution as the neutral comparison is not a trivial task. A sufficiently large (ideally comprehensive) set of mutations needs to be engineered into the regulatory element of interest, and the mutational effects, or phenotypes, need to be assessed. This distribution of phenotypes is then the null distribution against which the observed variation is compared to test for selection. Fortunately, recent developments in mutagenesis coupled with high-throughput sequencing have made this possible at high resolution. Smith et al. chose data from one such mutagenesis platform, which generated over 640,000 mutant haplotypes across three mammalian enhancer sequences [Patwardhan et al. 2012]. Specifically, the library of mutant enhancers was made using polymerase cycling assembly (PCA) with oligonucleotides containing 2-3% degeneracy. All possible single nucleotide variants of the wild-type enhancer were thus represented. The library of enhancers was then cloned into a plasmid upstream of a reporter gene along with unique identification tags. This plasmid library was both sequenced, to identify the tag corresponding to each mutant enhancer, and injected into mice for an in vivo reporter assay. Finally, sequencing of the cDNA from the mouse liver quantified the transcriptional abundance of the tags and hence the phenotypic effects of the mutations. For each mutation it was now clear whether it upregulated the reporter gene, downregulated it, or had no effect.
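
To make the last step concrete, here is a minimal Python sketch of how a tag's readout might be turned into an "up", "down", or "neutral" call. It is an illustration only, not the analysis used by Patwardhan et al. or Smith et al., whose statistical model is not described in this post. The function name classify_effect, the log2 fold-change threshold, and the pseudocount are all assumptions made for the example.

```python
import math

def classify_effect(rna_count, dna_count, wt_rna, wt_dna,
                    threshold=0.5, pseudocount=1.0):
    """Classify a mutant enhancer from its tag counts (hypothetical scheme).

    rna_count / dna_count: cDNA and plasmid read counts for the mutant's tag.
    wt_rna / wt_dna: the same counts for a wild-type tag.
    threshold: assumed cutoff on |log2 fold change|, not taken from the paper.
    """
    # Normalize expression by plasmid abundance; the pseudocount avoids
    # division by zero for tags with no reads.
    mutant = (rna_count + pseudocount) / (dna_count + pseudocount)
    wild_type = (wt_rna + pseudocount) / (wt_dna + pseudocount)
    log2_fold_change = math.log2(mutant / wild_type)

    if log2_fold_change > threshold:
        return "up"
    if log2_fold_change < -threshold:
        return "down"
    return "neutral"

# Toy counts, invented for illustration.
print(classify_effect(400, 100, 100, 100))  # much higher expression than wild-type
print(classify_effect(20, 100, 100, 100))   # much lower expression than wild-type
print(classify_effect(105, 100, 100, 100))  # close to wild-type
```

A real analysis would have to deal with haplotypes carrying several mutations at once and with measurement noise, so a fixed cutoff like this should be read as a placeholder.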

Developing a test to compare mutations and observed variation

With this dataset, Smith et al. had a comprehensive spectrum of random mutations and their phenotypic effects to use as the null distribution. This allowed them to create metrics for regulatory variation that are similar to the commonly used Ka/Ks ratio, with Ka being the rate of non-synonymous change and Ks the rate of synonymous change (which has no functional impact on the protein and is hence assumed neutral) [Kimura 1977]. The in vivo reporter assay revealed mutations with no phenotypic impact (i.e. no change in transcriptional abundance compared to wild-type), and these are analogous to synonymous or neutral changes. The new metrics are dubbed Ku/Kn and Kd/Kn, where Ku is the rate of change for up-regulatory mutations (those that increased expression in the in vivo reporter assay), Kd is the rate of change for down-regulatory mutations, and Kn is the rate of change for mutations that did not change expression (silent or neutral mutations).

Metrics to compare observed mutations in the phylogeny to possible mutations seen in the mutagenesis data (Figure 1 from Smith et al. 2013).

For their analysis, the authors chose enhancer sequences from species within the same phylogenetic orders as the mutagenized enhancers. In addition to enhancer sequences from extant species, they reconstructed ancestral sequences throughout the phylogeny. Combined with the mutagenesis data, each K metric at a node in the phylogeny is then calculated as the ratio of the observed (i.e. in ancestors and extant species) frequency of silent, up-, or down-regulatory polymorphisms to the frequency of all possible silent, up-, or down-regulatory mutations, respectively. Selection is inferred by comparing the rate of up- or down-regulatory polymorphisms to the rate of silent ones (i.e. Ku/Kn or Kd/Kn). A comparatively low rate of up- or down-regulatory changes (Ku/Kn or Kd/Kn < 1) would suggest purifying selection against those changes, while a comparatively high rate (Ku/Kn or Kd/Kn > 1) would suggest positive selection. Smith et al. applied their new test to the three enhancers from [Patwardhan et al. 2012] across the respective phylogenetic orders: LTV1 in rodents, and ALDOB and ECR11 in primates. They detected purifying selection against down-regulatory polymorphisms for all three enhancers, and also positive selection for up-regulatory polymorphisms for LTV1.
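
The arithmetic behind these ratios can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions, not the authors' implementation: the function name k_ratios, the data layout (a dictionary from each possible mutation to its effect class), and the toy counts are all invented, and the sketch leaves out the significance testing that a real analysis would need.

```python
from collections import Counter

CLASSES = ("up", "down", "neutral")

def k_ratios(effect_of, observed):
    """Compute Ku/Kn and Kd/Kn from mutagenesis classes and observed changes.

    effect_of: dict mapping every possible mutation, here a (position, base)
               tuple, to "up", "down", or "neutral" (from the reporter assay).
    observed:  list of (position, base) changes seen in the phylogeny.
    """
    possible = Counter(effect_of.values())         # all possible mutations per class
    seen = Counter(effect_of[m] for m in observed)  # observed changes per class
    # Each K is the fraction of possible mutations of a class that is observed.
    rate = {c: seen[c] / possible[c] for c in CLASSES}
    if rate["neutral"] == 0:
        raise ValueError("no neutral changes observed; ratios are undefined")
    return {"Ku/Kn": rate["up"] / rate["neutral"],
            "Kd/Kn": rate["down"] / rate["neutral"]}

# Toy data: 8 possible up-regulatory, 16 down-regulatory, 32 neutral mutations.
effect_of = {}
for i in range(8):
    effect_of[(i, "A")] = "up"
for i in range(16):
    effect_of[(i, "C")] = "down"
for i in range(32):
    effect_of[(i, "G")] = "neutral"

# Observed changes: 2 up, 1 down, 8 neutral.
observed = [(0, "A"), (1, "A"), (0, "C")] + [(i, "G") for i in range(8)]

ratios = k_ratios(effect_of, observed)
print(ratios)
```

In this toy case Ku = 2/8, Kd = 1/16, and Kn = 8/32, so Ku/Kn = 1 (no evidence of selection on up-regulatory changes) and Kd/Kn = 0.25 (down-regulatory changes are under-represented, as expected under purifying selection).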

Detecting selection using an empirically-derived null distribution

Making evolutionary sense of variation in the regulatory regions of the genome remains more challenging than for coding sequences. We still do not have a “neutral model of regulatory evolution” to compare observed variation against. Perhaps the most exciting element of this paper, at least for me, is the use of an empirically derived null distribution as the neutral expectation for this evolutionary inquiry. Patwardhan and the Shendure group at the University of Washington had earlier published a mutagenesis technique that generated a wide spectrum of mutants [Patwardhan et al. 2009]. At that time, I was becoming interested in the “grammar” of gene regulation, the functional characterization of regulatory sequences, and how to understand regulatory variation evolutionarily. It was thus very exciting to see both a massively comprehensive interrogation of the mutational consequences in a regulatory element and the clever application of these data to a challenging evolutionary question.

One of the strengths of the Smith et al. study is its reliance on a spectrum of random mutations as the null distribution. Mutations, the original source of all genetic variation, arise in a random manner. Those that do not exert lethal effects may persist by chance within a population and then eventually reach certain frequencies or even fixation under selection. Because the null distribution used by Smith et al. comprises all possible mutations, it represents the mutation spectrum prior to the actions of drift or selection. It is thus an even better neutral expectation than synonymous mutations, which may not be truly neutral. In addition, using such an empirical null distribution to test for selection is not limited to regulatory variation but can also be applied to coding sequence variation to reduce bias and false signals. Furthermore, by categorizing mutational effects as up- and down-regulatory, different modes of selection acting on a regulatory element can be teased apart. The interspersion of mutations (silent, up-, or down-regulatory) across the regulatory element also reduces the confounding effects of regional variation in mutation rate.

As with all science, there is more to hope for in the future. Towards the end of the paper, the authors discuss the prospect of more high-resolution mutagenesis data and, perhaps more importantly in terms of accessibility and ease of use, the ability to test for selection using only limited mutagenesis. Tissue- and organism-specificity of mutational effects may also be investigated further, as may the inclusion of mutation types other than single nucleotide substitutions (e.g. insertions/deletions, copy number variation) and the consideration of genomic regional context (e.g. the effects of chromatin or epistasis). Nevertheless, this study represents an exciting new method for investigating regulatory variation in evolutionary contexts, one whose development and further application I look forward to seeing.


Bustamante CD, Wakeley J, Sawyer S, and Hartl DL. Directional Selection and the Site-Frequency Spectrum. Genetics 159:1779-1788 (2001).

Keightley PD and Eyre-Walker A. Joint inference of the distribution of fitness effects of deleterious mutations and population demography based on nucleotide polymorphism frequencies. Genetics 177:2251-2261 (2007).

Kimura M. Preponderance of synonymous changes as evidence for the neutral theory of molecular evolution. Nature 267:275-276 (1977).

McDonald JH and Kreitman M. Adaptive Protein Evolution at the Adh Locus in Drosophila. Nature 351:652-654 (1991).

Patwardhan RP, Lee C, Litvin O, Young DL, Pe’er D, and Shendure J. High-resolution analysis of DNA regulatory elements by synthetic saturation mutagenesis. Nature Biotechnology 27:1173-1175 (2009).

Patwardhan RP, Hiatt JB, Witten DM, Kim MJ, Smith RP, May D, Lee C, Andrie JM, Lee S-I, Cooper GM, et al. Massively parallel functional dissection of mammalian enhancers in vivo. Nature Biotechnology 30:265-270 (2012).

Smith JD, McManus KF, and Fraser HB. A Novel Test for Selection on cis-Regulatory Elements Reveals Positive and Negative Selection Acting on Mammalian Transcriptional Enhancers. Molecular Biology and Evolution 30:2509-2518 (2013).

Zhen Y and Andolfatto P. Methods to Detect Selection on Noncoding DNA. In Evolutionary Genomics: Statistical and Computational Methods, Volume 2, Methods in Molecular Biology, vol. 856, edited by Anisimova M. Humana Press, New York (2012).

Paper author Justin Smith is a graduate student in Hunter Fraser's lab.

