BAPGXII Saturday May 30, 2015

Stanford is hosting the 12th Bay Area Population Genomics (BAPG) meeting. The Bay Area Population Genomics meeting is a great place to (re)connect with your pop gen/genomics colleagues in the area and to present your work in a talk or a poster.

BAPGXI, held in December at UC Davis, was a great event with over 100 participants and a lineup of excellent talks. Thanks to the Coop lab! You can read more here, including the storified tweets. We are excited to continue this success at Stanford!


UPDATE: Click here for detailed event program.

The meeting will take place on May 30th on the Stanford campus in the Alway building, room M106. We start at 8:30am with breakfast and registration. Dr. Dmitri Petrov’s opening remarks will begin at 9:25am, and the first talk will be at 9:30am. The last talk (Dr. Jonathan Pritchard’s keynote) ends at 2:10pm, followed by a poster session with amazing wine, beer, and cheese! Here is a general outline of the agenda to help you plan your day:

Breakfast and Registration in Alway Courtyard 8:30-9:25am (pick up your BAPGXII gift!)
Opening Remarks 9:25-9:30am
Talk Session 1 9:30-10:30am (20 mins per talk)
Coffee Break in Courtyard 10:30-11am
Talk Session 2 11am-12pm (20 mins per talk)
Lunch in Courtyard 12-1pm
Talk Session 3 and Keynote 1-2:10pm (two 20 min talks and one 30 min talk)
Poster Session with Wine, Beer, and Cheese Reception at 2:10pm, ends at 3pm

Talks and Posters

Sorry. Speaker and poster slots are now full. No longer accepting sign-ups.

How to Attend BAPGXII

1. Please register here by 10am Friday, May 29th to join us at BAPGXII. Registration is free and open to all, but required.

2. Encourage your colleagues to sign up! Forward this email to your lab mailing list and watch for updates on the CEHG Facebook page and on Twitter @StanfordCEHG. Help us get the momentum going by tweeting us using #BAPGXII.

3. And finally, once you’ve signed up, all you need to do is get up early and ride-share, VTA/Caltrain or bike to our beautiful campus on May 30th. Come for the science, stay for the social! Use the Stanford campus map and this Google Map to find the Alway Building, located at 500 Pasteur Drive, Stanford, CA. Be green and consider ride-sharing: there is a dedicated tab for making travel plans in the sign up doc!

We hope to see you at Stanford!

The BAPGXII organizing committee: Bridget Algee-Hewitt (@BridgetAH), David Enard (@DavidEnard), Katie Kanagawa (@KatieKanagawa), Alison Nguyen, Dmitri Petrov (@PetrovADmitri), Susanne Tilk, and Elena Yujuico. If you have any questions, feel free to contact Bridget Algee-Hewitt at

To follow BAPGXII on Twitter, check out the hashtag #BAPGXII and also follow @StanfordCEHG.


Afterword: CEHG Genetics and Society Symposium 2015


Founded in 2012, CEHG is a research program that fosters interdisciplinary research. Home to more than 25 faculty and more than 200 grads and postdocs, CEHG bridges the divides between various member labs across Stanford campus.

The 2015 CEHG Genetics and Society Symposium (GSS15), which took place on April 13th and 14th in Stanford’s Paul Brest Hall, was a smashing success. It featured 25 speakers from Stanford campus and the San Francisco Bay academic and scientific industry communities. Approximately 175 Stanford affiliates and non-affiliates came together to celebrate the Center’s spirit of interdisciplinary collaboration and meet with experts in the fields of computational, evolutionary, and human genomics. This is a significant increase from last year’s 150 attendees!

The Mission:

The Genetics and Society Symposium is integral to CEHG’s mission: it provides postdocs and graduate fellows with the opportunity to share their developing research with faculty advisors and their colleagues, encourages conversation between faculty working in diverse scientific disciplines across campus, and introduces CEHG members to speakers from around the Bay Area and beyond (and vice versa).

The Venue:

As you can see from our photos of the space and catering service, Paul Brest Hall was the perfect home for this year’s two-day symposium. The hall was spacious, the food delicious, the staff hands-on, and the outdoor picnic area well suited for our lunch and coffee breaks. We enjoyed the venue so much, in fact, that CEHG staff are currently in the process of booking the space for next year!

The Speakers:

GSS15 featured four brilliant keynote speakers, each distinguished in his/her field of research.

Gene Myers and CEHG Exec Committee members Marc Feldman, Chiara Sabatti, and Carlos Bustamante

Founding director of a new Systems Biology Center at the Max-Planck Institute of Molecular Cell Biology and Genetics, Dr. Eugene (Gene) Myers presented his open-sourced research on the resurrection of de novo DNA sequencing. Best known for the development of BLAST, the most widely used tool in bioinformatics and the assembler he developed at Celera that delivered the fly, human, and mouse genomes in a three-year period, Dr. Myers participated in GSS15, courtesy of DNAnexus. Follow his blog:

Co-founding director Carlos Bustamante and Ed Green catch up during a break at GSS15.

Assistant Professor in Biomolecular Engineering at the University of California, Santa Cruz, Richard (Ed) Green presented his research on a novel approach for highly contiguous genome assemblies, which draws on his work as an NSF Fellow at the Max Planck Institute in Leipzig, Germany and head of an analysis consortium responsible for publishing the draft genome sequence of Neanderthal. Click here for his 2014 CARTA talk, “The Genetics of Humanness: The Neanderthal and Denisovan Genomes.”

Dr. Michelle Mello, Stanford Law School and School of Medicine

Michelle Mello, Professor of Law at Stanford Law School and Professor of Health Research and Policy in Stanford’s School of Medicine, presented findings from her extensive research on the ethics of data sharing. As the author of more than 140 articles and book chapters on the medical malpractice system, medical errors and patient safety, public health law, research ethics, the obesity epidemic, and pharmaceuticals, Dr. Mello provided a valuable perspective from the intersections of law, ethics, and health policy. Click here to read Dr. Mello’s SLS profile.

Dr. Ami Bhatt, Stanford Medicine

Ami Bhatt shared her passion for improving outcomes for patients with hematological malignancies in her talk, “Bugs, drugs, and cancer.” Best known for her recent work demonstrating the discovery of a novel bacterium using sequence-based analysis of a diseased human tissue, her research has been presented nationally and internationally and published in 2013 in the New England Journal of Medicine. Click here for links to Dr. Bhatt’s CAP profile and lab homepage.


We had a large group of CEHG faculty members at this year’s event, showcasing the cutting-edge research being done in CEHG labs across Stanford campus and indicating considerable faculty commitment to ensuring the Center’s continuing success.

Our symposium would not be complete without our invited CEHG Fellows. These speakers were nominated by organizing committee members to present on topics relating to their CEHG-funded research projects. These young scholars embody CEHG’s continuing commitment to provide funding support to researchers as they transition from graduate studies to postdoctoral scholarships.

The Workshop:

There was standing room only when facilitators Chiara Sabatti (Associate Professor of Health Research and Policy at Stanford), Ken Lange (Chair of the Human Genetics Department at UCLA), and Suyash Shringarpure (postdoctoral scholar in Stanford’s Bustamante Lab) presented their approaches to contemporary problems in statistical genetics!

Social Media:

Did you know? CEHG is on social media!

GSS15 social media moderators Bridget Algee-Hewitt, Jeremy Hsu, Katie Kanagawa, and Rajiv McCoy were posting live throughout both days of the event. Our efforts to reach the larger community paid off, with a total reach of 815 on Facebook and more than 7,000 impressions on Twitter!

To catch up on our GSS15 coverage, check out our Facebook page at and our Twitter feed @StanfordCEHG. Follow both to make sure you are the first to know when we post CEHG-related news and announcements.

Want to know when speaker videos from the symposium will be available on CEHG’s forthcoming YouTube channel? Follow us on Facebook and Twitter!

Special Thanks:

From left to right: Bridget Algee-Hewitt, Cody Sam, Yang Li, Anand Bhaskar, and Katie Kanagawa

The GSS15 organizing committee—including Bridget Algee-Hewitt, Anand Bhaskar, Katie Kanagawa, Yang Li, and Cody Sam—would like to take this opportunity to thank CEHG Directors Carlos Bustamante and Marc Feldman, Executive Committee members Hank Greely, Dmitri Petrov, Noah Rosenberg, and Chiara Sabatti, event volunteers Alex Adams, Maude David, and Chris Gignoux, event photographer Deneb Semprum, and everyone who attended this year’s symposium.

We hope you enjoyed attending as much as we enjoyed working behind-the-scenes. We hope to see you all again at GSS16! If you are interested in volunteering for future CEHG events, please contact us at

Upcoming CEHG events:

Don’t miss our popular weekly Evolgenome seminar series, which will continue through Spring term, usually on Wednesdays at noon (location varies). Lunch is always provided. Details will follow, but here is a quick overview so you can mark your calendars!

April 29: Fernando Racimo (Nielsen/Slatkin Lab)
May 6: Pleuni Pennings (UCSF)
May 20: Kelly Harkin
June 3: Sandeep Venkataram (Petrov Lab)
June 10: Emilia Huerta-Sanchez

A framework for identifying and quantifying fitness effects across loci

Blog author Ethan Jewett is a PhD student in the lab of Noah Rosenberg.

The degree to which similarities and differences among species are the result of natural selection, rather than genetic drift, is a major question in population genetics. Related questions include: what fraction of sites in the genome of a species are affected by selection? What is the distribution of the strength of selection across genomic sites, and how have selective pressures changed over time? To address these questions, we must be able to accurately identify sites in a genome that are under selection and quantify the selective pressures that act on them.

Difficulties with existing approaches for quantifying fitness effects    

A recent paper in Trends in Genetics by David Lawrie and Dmitri Petrov (Lawrie and Petrov, 2014) provides intuition about the power of existing methods for identifying genomic regions affected by purifying selection and for quantifying the selective pressures at different sites. The paper proposes a new framework for quantifying the distribution of fitness effects across a genome. This new framework is a synthesis of two existing forms of analysis – comparative genomic analyses to identify genomic regions in which the level of divergence among two or more species is smaller than expected, and analyses of the distribution of the frequencies of polymorphisms (the site frequency spectrum, or SFS) within a single species (Figure 1). Using simulations and heuristic arguments, Lawrie and Petrov demonstrate that these two forms of analysis can be combined into a framework for quantifying selective pressures that has greater power to identify selected regions and to quantify selective strengths than either approach has on its own.

Figure 1. Using the site frequency spectrum (SFS) to quantify the strength of purifying selection. The SFS tabulates the number of polymorphisms at a given frequency in a sample of haplotypes. Under neutrality (black dots) many high-frequency polymorphisms are observed. Under purifying selection (higher values of the effective selection strength |4Nes|), a higher fraction of new mutations are deleterious, leading to fewer high-frequency polymorphisms (red and blue dots). Adapted from Lawrie and Petrov (2014).

Lawrie and Petrov begin by discussing the strengths and weaknesses of the two existing approaches. Comparative analyses of genomic divergence are beneficial for identifying genomic regions under purifying selection, which will exhibit lower-than-expected levels of divergence among species. However, as Lawrie and Petrov note, it can be difficult to use comparative analyses to quantify the strength of selection in a region because even mild purifying selection can result in complete conservation among species within the region (Figure 2). For example, whether the population-scaled selective strength, 4Nes, in a region is 20 or 200, the same genomic signal will be observed: complete conservation.

Figure 2. Adapted from Lawrie and Petrov (2014). The evolution of several 100kb regions was simulated in 32 different mammalian species under varying strengths of selection |4Nes|. The number of substitutions in each region was then estimated using genomic evolutionary rate profiling (GERP). The plot shows the median across regions of the number of inferred substitutions. From the plot, it can be seen that, once the strength of selection exceeds a weak threshold value (3 for the example given), there is full conservation among species.

In contrast to comparative approaches, analyses of within-species polymorphisms based on the site frequency spectrum (SFS) within a region can be used to more precisely quantify the strength of selection. For example, Figure 1 shows that different selective strengths can produce very different site frequency spectra. Moreover, if the SFS can be estimated precisely enough, it can allow us to distinguish between two different selective strengths (e.g., 4Nes1 = 20 and 4Nes2 = 200) that would both lead to total conservation in a comparative study, and would therefore be indistinguishable. The problem is that it takes a lot of polymorphisms to obtain an accurate estimate of the SFS, and a genomic region of interest may contain too few polymorphisms, especially if the region is under purifying selection, which decreases the apparent mutation rate. Sampling additional individuals from the same species may provide little additional information about the SFS because few novel polymorphisms may be observed in the additional sample. For example, recall that for a sample of n individuals from a wildly idealized panmictic species, the expected number of novel polymorphisms observed in the (n+1)st sampled individual is proportional to 1/n (Watterson, 1975).
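The 1/n result quoted above can be checked with a few lines of Python. This is only a sketch under the standard neutral, infinite-sites assumptions behind Watterson's result; the parameter theta (the population-scaled mutation rate of the region) and the function names are illustrative choices, not something taken from Lawrie and Petrov's paper.

```python
def watterson_a(n: int) -> float:
    """Return a_n = 1 + 1/2 + ... + 1/(n-1) for a sample of n haplotypes."""
    return sum(1.0 / i for i in range(1, n))


def expected_segregating_sites(n: int, theta: float) -> float:
    """Expected number of polymorphic sites in a sample of n (neutral model)."""
    return theta * watterson_a(n)


def expected_new_sites(n: int, theta: float) -> float:
    """Expected number of extra polymorphic sites gained by adding sample n + 1."""
    return expected_segregating_sites(n + 1, theta) - expected_segregating_sites(n, theta)


if __name__ == "__main__":
    theta = 5.0  # hypothetical value, chosen only for illustration
    for n in (2, 10, 100, 1000):
        print(n, round(expected_new_sites(n, theta), 4), round(theta / n, 4))
```

The two printed columns agree: the gain from the next sampled individual is theta/n, so it shrinks steadily as the sample grows.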

A proposed paradigm

Lawrie and Petrov demonstrate that studying polymorphisms by sampling many individuals across several related species (rather than sampling more individuals within a single species) could increase the observed number of polymorphisms in a region, and therefore, could increase the power to quantify the strength of selection (Figure 3) – as long as the selective forces in the genomic region are sufficiently similar across the different species.


Figure 3. The benefits of studying polymorphisms in many populations, rather than within a single population. Three populations (A, B, and C) diverge from an ancestral population, D. The genealogy of a single region is shown (slanted lines) with mutations in the region denoted by orange slashes. Additional lineages sampled in population A are likely to coalesce recently with other lineages (for example, the red clade in population A) and, therefore, carry few mutations that have not already been observed in the sample. In comparison, the same number of lineages sampled from a second population are likely to carry additional independent polymorphisms (for example, the red lineages in population B). If the selective pressures at the locus in populations A and B are similar, then the SFS in the two populations should be similar, and the additional lineages in B can provide additional information about the SFS. For example, if the demographic histories and selective pressures at the locus are identical in populations A and B, and if the samples from populations A and B are sufficiently diverged, then a sample of K lineages from each population, A and B, will contain double the number of independent polymorphisms that are observed in a sample of K lineages from population A alone, providing double the number of mutations that can be used to estimate the SFS.

The need for sampling depth and breadth

Without getting bogged down in the details, it’s the rare variants that are often the most important for quantifying the effects of purifying selection, so one still has to sample deeply within each species; however, overall, sampling from additional species is a more efficient way of increasing the absolute number of variants that can be used to estimate the SFS in a region, compared with sampling more deeply within the same species.
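To make the depth-versus-breadth trade-off concrete, here is a rough Python sketch comparing the expected number of polymorphisms from 2K lineages in one species with K lineages in each of two species. It assumes the idealized case from the Figure 3 caption (identical demography and selection, samples diverged enough that their polymorphisms are independent) and, for simplicity, uses the neutral Watterson expectation. The function names and the value of theta are hypothetical.

```python
def watterson_a(n: int) -> float:
    """Return a_n = 1 + 1/2 + ... + 1/(n-1) for a sample of n haplotypes."""
    return sum(1.0 / i for i in range(1, n))


def sites_one_species(n: int, theta: float) -> float:
    """Expected polymorphic sites when all n lineages come from one species."""
    return theta * watterson_a(n)


def sites_across_species(k: int, num_species: int, theta: float) -> float:
    """Expected polymorphic sites when k lineages are drawn from each species.

    Assumes the species contribute independent polymorphisms and share the
    same theta, which is the idealized scenario described in the text.
    """
    return num_species * theta * watterson_a(k)


if __name__ == "__main__":
    theta, k = 5.0, 10  # hypothetical values for illustration
    deep = sites_one_species(2 * k, theta)
    broad = sites_across_species(k, 2, theta)
    print(f"2K lineages, one species:  {deep:.2f}")
    print(f"K lineages, two species:   {broad:.2f}")
```

Under these assumptions, doubling the number of species doubles the expected count, while doubling the depth within one species adds comparatively little.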

The simulations and heuristic arguments presented by Lawrie and Petrov consider idealized cases for simplicity; however, the usefulness of approaches that consider polymorphisms across multiple species has been demonstrated in methods such as the McDonald-Kreitman test (McDonald and Kreitman, 1991), which have long been important tools for studying selection. More recent empirical applications of approaches that consider information about polymorphisms across multiple species appear to do a good job of quantifying selective pressures across genomes (Wilson et al., 2011; Gronau et al., 2013), even when species are closely related (De Maio et al., 2013). Overall, the simulations and arguments presented in Lawrie and Petrov’s paper provide useful guidelines for researchers interested in identifying and quantifying selective forces, and their recommendation to sample deeply within species and broadly across many species comes at a time when such analyses are becoming increasingly practical, given the recent availability of sequencing data from many species.


  1. De Maio, N., Schlötterer, C., and Kosiol, C. (2013). Linking great apes genome evolution across time scales using polymorphism-aware phylogenetic models. Molecular Biology and Evolution 30:2249-2262.
  2. Gronau, I., Arbiza, L., Mohammed, J., and Siepel, A. (2013). Inference of natural selection from interspersed genomic elements based on polymorphism and divergence. Molecular Biology and Evolution 30:1159-1171.
  3. Lawrie, D.S. and Petrov, D.A. (2014). Comparative population genomics: power and principles for the inference of functionality. Trends in Genetics 30:133-139.
  4. Watterson, G.A. (1975). On the number of segregating sites in genetical models without recombination. Theoretical Population Biology 7:256-276.
  5. Wilson, D.J., Hernandez, R.D., Andolfatto, P., and Przeworski, M. (2011). A population genetics-phylogenetics approach to inferring natural selection in coding sequences. PLoS Genetics 7:e1002395.

Paper author: David Lawrie was a graduate student in Dmitri Petrov’s lab. He is now a postdoc at USC.




Testing for selection in regulatory sequences using an empirical mutational distribution

How to detect selection?

Blog author Dave Yuan is a postdoc in Dmitri Petrov’s lab.

Detecting and quantifying selection in genomes is a fundamental task for evolutionary biologists. A common method relies on comparing patterns of polymorphism and divergence between synonymous and non-synonymous sites. Synonymous sites are expected to be almost neutral, and thus mutations at these sites are expected to be fixed or lost due to genetic drift or draft. At non-synonymous sites, however, mutations may get fixed due to positive selection or lost due to purifying selection. If, in a specific gene, many non-synonymous mutations get fixed due to positive selection, then these sites as a group will show a high evolutionary rate. On the other hand, if most non-synonymous mutations in a gene are lost because of purifying selection, then these sites will show a low evolutionary rate. Importantly, to determine whether the rate is high or low, we need a group of sites that can be used as a neutral comparison. For coding regions, synonymous sites are a natural choice for this comparison [McDonald & Kreitman 1991; Bustamante et al. 2001; Keightley & Eyre-Walker 2007].
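One common way to summarize this kind of polymorphism-versus-divergence comparison is the neutrality index used with McDonald-Kreitman-style tables. The Python sketch below is only an illustration of that general idea, not a method described in this post; the counts are made up, and a real analysis would also test the 2x2 table for significance.

```python
def neutrality_index(pn: int, ps: int, dn: int, ds: int) -> float:
    """Neutrality index NI = (Pn / Ps) / (Dn / Ds).

    pn, ps: non-synonymous and synonymous polymorphisms within a species.
    dn, ds: non-synonymous and synonymous fixed differences between species.
    """
    return (pn / ps) / (dn / ds)


def alpha(pn: int, ps: int, dn: int, ds: int) -> float:
    """Rough estimate of the fraction of non-synonymous divergence that is
    adaptive, computed as 1 - NI."""
    return 1.0 - neutrality_index(pn, ps, dn, ds)


if __name__ == "__main__":
    # Hypothetical counts for a single gene, chosen only for illustration.
    pn, ps, dn, ds = 10, 40, 30, 40
    print(f"NI    = {neutrality_index(pn, ps, dn, ds):.3f}")
    print(f"alpha = {alpha(pn, ps, dn, ds):.3f}")
```

With synonymous sites as the neutral reference, NI below 1 points to an excess of non-synonymous divergence (consistent with positive selection), and NI above 1 points to an excess of non-synonymous polymorphism (consistent with purifying selection against mildly deleterious variants).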

What about non-coding sequences?

Much of the genome, however, is composed of non-coding sequence. Such sequence may contain regulatory information critical for gene expression, the modification of which is important for phenotypic evolution. Detecting selection among regulatory variation is thus of interest to evolutionary biologists, but this has been challenging because functional annotation of non-coding DNA tends to be sparse, and we currently do not understand the “regulatory genetic code.” Although selection tests developed for coding sequence have been applied to non-coding sequence [reviewed in Zhen & Andolfatto 2012], a common impediment has been the choice of a group of sites that can function as a neutral comparison. A solution to this is to generate a large number of mutations in a specific region of the genome and determine whether these mutations have functional impacts. The sites at which mutations do not appear to have a functional effect can then serve as the neutral comparison for other groups of sites. In a recent paper published in Molecular Biology and Evolution, graduate students Justin Smith and Kimberly McManus and CEHG faculty member Hunter Fraser describe their development and application of this novel method to test for selection among variation in mammalian regulatory elements using such a null distribution of mutations.

Null distribution of random mutations

Mutagenesis technique used by Patwardhan et al. (2012) to generate a comprehensive collection of cis-regulatory element mutants and test their phenotypes in vivo (figures from Patwardhan et al., 2012)

Generating an empirical null distribution as the neutral comparison is not a trivial task. A sufficiently large (ideally comprehensive) set of mutations needs to be engineered into the regulatory element of interest, and the mutational effects or phenotypes need to be assessed. This distribution of phenotypes is then the null distribution against which the observed variation is compared to test for selection. Fortunately, recent developments in mutagenesis coupled with high-throughput sequencing have made this possible at high resolution. Smith et al. chose data from one such mutagenesis platform that generated over 640,000 mutant haplotypes across three mammalian enhancer sequences [Patwardhan et al. 2012]. Specifically, the library of mutant enhancers was made using polymerase cycling assembly (PCA) with oligonucleotides containing 2-3% degeneracy. All possible single nucleotide variants of the wild-type enhancer were thus represented. The library of enhancers was then cloned into a plasmid upstream of a reporter gene along with unique identification tags. This plasmid library was both sequenced to identify the tag corresponding to each mutant enhancer and injected into mice for an in vivo reporter assay. Finally, sequencing of the cDNA from the mouse liver quantified the transcriptional abundance of the tags and hence the phenotypic effects of the mutations. For each mutation, it was now clear whether it upregulated the reporter gene, downregulated it, or had no effect.

Developing a test to compare mutations and observed variation

With this dataset, Smith et al. had a comprehensive spectrum of random mutations and their phenotypic effects as the null distribution. This allowed them to create metrics for regulatory variation that are similar to the commonly used Ka/Ks ratio, with Ka being the rate of non-synonymous change and Ks the rate of synonymous change (no functional impact on protein and hence neutral) [Kimura 1977]. The in vivo reporter assay revealed mutations with no phenotypic impact (i.e. no change in transcriptional abundance compared to wild-type), and these are analogous to synonymous or neutral changes. The new metrics are dubbed Ku/Kn and Kd/Kn, where Ku is the rate of change for up-regulatory mutations (those with increased expression from the in vivo reporter assay), Kd is the rate of change for down-regulatory mutations, and Kn is the rate of change for mutations that didn’t change expression (silent or neutral mutations).

Metrics to compare observed mutations in the phylogeny to possible mutations seen in the mutagenesis data (Figure 1 from Smith et al. 2013).

For their analysis, the authors chose enhancer sequences from species within the same phylogenetic orders as the mutagenized enhancers. In addition to enhancer sequences from extant species, the authors also reconstructed ancestral sequences throughout the phylogeny. Combined with the mutagenesis data, each K metric at a node in the phylogeny is then calculated as the ratio of observed (i.e. in ancestors and extant species) frequencies of silent, up-, or down-regulatory polymorphisms to the frequencies of all possible silent, up-, or down-regulatory mutations respectively. Selection is inferred by comparing the ratio of up- or down-regulatory polymorphisms to the ratio of silent mutations (i.e. Ku/Kn or Kd/Kn). A comparatively low Ku or Kd, or rate of up- or down-regulatory mutations (Ku/Kn or Kd/Kn < 1) would suggest purifying selection on the polymorphisms, while a higher rate of up- or down-regulatory mutations (Ku/Kn or Kd/Kn > 1) would suggest positive selection. Smith et al. applied their new test for selection on the three enhancers from [Patwardhan et al. 2012] across the respective phylogenetic orders: LTV1 in rodents and ALDOB and ECR11 in primates. They detected purifying selection against down-regulatory polymorphisms for all three enhancers, while positive selection for up-regulatory polymorphisms was also detected for LTV1.
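The arithmetic behind these ratios can be sketched in a few lines of Python. This is a simplified illustration of the description above, not the authors' actual pipeline: the class labels, function names, and counts are all hypothetical, and the statistical testing used in the paper is not reproduced here.

```python
from typing import Dict

CLASSES = ("silent", "up", "down")


def k_rates(observed: Dict[str, int], possible: Dict[str, int]) -> Dict[str, float]:
    """Rate for each class: observed changes divided by the number of
    possible mutations of that class in the mutagenesis data."""
    return {cls: observed[cls] / possible[cls] for cls in CLASSES}


def selection_ratios(observed: Dict[str, int], possible: Dict[str, int]) -> Dict[str, float]:
    """Return Ku/Kn and Kd/Kn, using silent mutations as the neutral reference."""
    k = k_rates(observed, possible)
    return {"Ku/Kn": k["up"] / k["silent"], "Kd/Kn": k["down"] / k["silent"]}


if __name__ == "__main__":
    # Hypothetical counts for one enhancer, for illustration only.
    possible = {"silent": 600, "up": 100, "down": 300}  # from mutagenesis
    observed = {"silent": 30, "up": 8, "down": 3}       # seen in the phylogeny
    for name, value in selection_ratios(observed, possible).items():
        verdict = "positive" if value > 1 else "purifying"
        print(f"{name} = {value:.2f} (suggests {verdict} selection)")
```

With these made-up numbers, up-regulatory changes are over-represented relative to silent ones (ratio above 1) and down-regulatory changes are under-represented (ratio below 1), which mirrors the qualitative pattern reported for LTV1.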

Detecting selection using an empirically-derived null distribution

Making evolutionary sense of variation in the regulatory regions of the genome remains more challenging than for coding sequences. We still do not have a “neutral model of regulatory evolution” to compare observed variation against. Perhaps the most exciting element of this paper, at least for me, is the use of an empirically derived null distribution as the neutral expectation to perform this evolutionary inquiry. Patwardhan and the Shendure group at the University of Washington had earlier published a mutagenesis technique that generated a wide spectrum of mutants [Patwardhan et al. 2009]. At that time, I was getting interested in questions on the “grammar” of gene regulation, the functional characterization of regulatory sequences, and how to understand regulatory variation evolutionarily. It was thus very exciting to see both a massively comprehensive interrogation of the mutational consequences in a regulatory element and the clever application of these data to overcome a challenging evolutionary question.

One of the strengths of the Smith et al. study is the reliance on a spectrum of random mutations as the null distribution. Mutations, the original source of all genetic variation, arise in a random manner. Those that do not exert lethal effects may persist by chance within a population and then eventually reach certain frequencies or even fixation under selection. Because the null distribution used by Smith et al. comprises all possible mutations, it represents the mutation spectrum prior to the actions of drift or selection. It is thus an even better neutral expectation than synonymous mutations, which may not be truly neutral. In addition, using such an empirical null distribution to test for selection is not limited to regulatory variation but can be applied to coding sequence variation to reduce bias and false signals. Furthermore, by categorizing mutational effects as up- and down-regulatory, different modes of selection acting on a regulatory element can be teased apart. The interspersion of mutations (silent, up-, or down-regulatory) across the regulatory element also reduces confounding effects of regional variation in mutation rate.

As with all science, there is more to hope for in the future. Towards the end of the paper, the authors discuss the prospect of more high-resolution mutagenesis data and, perhaps more importantly in terms of accessibility and ease of use, the ability to test for selection using limited mutagenesis. Tissue- and organism-specificity of mutational effects may also be further investigated, as well as the inclusion of mutation types other than single nucleotide substitutions (e.g. insertion/deletion, copy number variation) or consideration of genomic regional context (e.g. effect of chromatin or epistasis). Nevertheless, this study represents an exciting new method to investigate regulatory variation in evolutionary contexts, one whose development and further application I look forward to seeing.


Bustamante CD, Wakeley J, Sawyer S, and Hartl DL. Directional Selection and the Site-Frequency Spectrum. Genetics 159:1779-1788 (2001).

Keightley PD and Eyre-Walker A. Joint inference of the distribution of fitness effects of deleterious mutations and population demography based on nucleotide polymorphism frequencies. Genetics 177:2251-2261 (2007).

Kimura M. Preponderance of synonymous changes as evidence for the neutral theory of molecular evolution. Nature 267:275-276 (1977).

McDonald JH and Kreitman M. Adaptive Protein Evolution at the Adh Locus in Drosophila. Nature 351:652-654 (1991).

Patwardhan RP, Lee C, Litvin O, Young DL, Pe’er D, and Shendure J. High-resolution analysis of DNA regulatory elements by synthetic saturation mutagenesis. Nature Biotechnology 27:1173-1175 (2009).

Patwardhan RP, Hiatt JB, Witten DM, Kim MJ, Smith RP, May D, Lee C, Andrie JM, Lee S-I, Cooper GM, et al. Massively parallel functional dissection of mammalian enhancers in vivo. Nature Biotechnology 30:265-270 (2012).

Smith JD, McManus KF, and Fraser HB. A Novel Test for Selection on cis-Regulatory Elements Reveals Positive and Negative Selection Acting on Mammalian Transcriptional Enhancers. Molecular Biology and Evolution 30:2509-2518 (2013).

Zhen Y and Andolfatto P. Methods to Detect Selection on Noncoding DNA. In: Evolutionary Genomics: Statistical and Computational Methods, Volume 2, Methods in Molecular Biology, vol. 856, edited by Anisimova M. Humana Press, New York (2012).

Paper author Justin Smith is a graduate student in Hunter Fraser's lab.


Caught in the act: how drug-resistance mutations sweep through populations of HIV

Blog author Meredith Carpenter is a postdoc in Carlos Bustamante's lab.


It has been over 30 years since the emergence of HIV/AIDS, yet the disease continues to kill over one million people worldwide per year [UNAIDS report]. One reason this epidemic has been so difficult to control is that HIV evolves quickly: it has a short replication time and a high mutation rate, so viruses harboring new mutations that confer drug resistance tend to arise often and spread quickly.

However, the likelihood of one of these beneficial mutations popping up and subsequently “sweeping” through the viral population (i.e., becoming more common because of the survival advantage) also depends on the underlying population genetics, much of which is still poorly understood. In a paper just published in PLoS Genetics, Pleuni Pennings, a postdoc in the Petrov lab, and colleagues Sergey Kryazhimskiy and John Wakeley from Harvard tracked the genetic diversity in adapting populations of HIV to better understand how and when new mutations arise.

Mutations and populations

Mutations are usually caused by either DNA damage (e.g., from environmental factors like UV radiation) or by a mistake during DNA replication. Because HIV is a retrovirus, meaning it must copy its RNA genome into DNA before it can be reproduced in the host cell, it is especially prone to errors that happen during the replication process. The rate at which these errors occur, also called the mutation rate, is constant on a per-virus basis; for example, a specific mutation might happen in one virus in a million. As a consequence, the overall number of viruses in the population determines how many new mutations will be present, with a larger population harboring more mutations at any given time.

Whether these mutations will survive, however, is related to what population geneticists call the “effective population size” (also known as Ne), which takes into account genetic diversity. Due to a combination of factors, including the purely random destruction of some viruses, not all mutations will be preserved in the population, regardless of how beneficial they are. The Ne is a purely theoretical measure that can tell us how easily and quickly a new mutation can spread throughout a population. Because it accounts for factors that affect diversity, it is usually smaller than the actual (or “census”) population size.

Pennings and colleagues wanted to determine the Ne for HIV in a typical patient undergoing drug treatment. This is a contentious area: previous researchers examining this question using different methods, including simply summing up overall mutation numbers, came up with estimates of Ne ranging from one thousand to one million (in contrast, the actual number of virus-producing cells in the body is closer to one hundred million, but more on that later). To get a more exact estimate, Pennings took a new approach. Using previously published DNA sequences of HIV sampled from patients over the course of a drug treatment regimen, she looked at the actual dynamics of the development of drug-resistant virus populations over time.

Swept away

Specifically, Pennings focused on selective sweeps, wherein an advantageous mutation appears and then rises in frequency in the population. Features of these sweeps can give estimates of Ne because they reveal information about the diversity present in the initial population. Pennings sought to distinguish between “hard” and “soft” selective sweeps occurring as the viruses became drug resistant. A hard sweep occurs when a mutation appears in one virus and then rises in frequency, whereas a soft sweep happens when multiple viruses independently gain different mutations, which again rise in frequency over time (see Figure 1). These two types of sweeps have distinct fingerprints, and their relative frequencies depend on the underlying effective population size: soft sweeps are more likely when a population is larger, because it becomes more likely for different beneficial mutations to arise independently in two different viruses. Soft sweeps also leave more diversity in the adapted population compared to hard sweeps (Figure 1).
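To make the link between population size and soft sweeps concrete, here is a minimal Python sketch. It is not taken from the paper: it assumes a standard approximation from soft-sweep theory, in which the chance that a sample of n sequences contains more than one independent origin of the beneficial mutation is 1 minus the product of i/(i + θ) for i from 1 to n-1, with θ = 2 × Ne × mutation rate. The function name and the example values are purely illustrative.

```python
def prob_soft_sweep(effective_size, mutation_rate, sample_size):
    """Approximate probability that a sample shows a soft sweep.

    Assumes theta = 2 * Ne * mu (haploid population) and the
    sampling-formula approximation described in the text above.
    """
    theta = 2.0 * effective_size * mutation_rate
    prob_single_origin = 1.0
    for i in range(1, sample_size):
        prob_single_origin *= i / (i + theta)
    return 1.0 - prob_single_origin


# Illustrative values only: one-in-a-million mutation rate, sample of 10 sequences.
for ne in (1e3, 1e5, 1e7):
    print(f"Ne = {ne:.0e}: P(soft) = {prob_soft_sweep(ne, 1e-6, 10):.3f}")
```

Under these assumptions the probability of a soft sweep climbs steadily with Ne, which is why the observed mix of hard and soft sweeps carries information about the effective population size.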

Figure 1, an illustration of a hard sweep (left) and a soft sweep (right).


To tell these types of sweeps apart, Pennings took advantage of a specific amino acid change in the HIV gene that encodes reverse transcriptase (RT). This change can result from two different nucleotide changes, either one of which will change the amino acid from lysine to asparagine and confer resistance to drugs that target the RT protein.  Pennings used this handy feature to identify hard and soft sweeps: if she observed both mutations in the same drug-resistant population, then the sweep was soft. If only one mutation was observed, the sweep could be soft or hard, so she also factored in diversity levels to tell these apart. Pennings found evidence of both hard and soft sweeps in her study populations. Based on the frequencies of each, she estimated the Ne of HIV in the patients. Her estimate was 150,000, which is higher than some previous estimates but still lower than the actual number of virus-infected cells in the body. Pennings suggests that this discrepancy could be due to the background effects of other mutations in the viruses that gain the drug-resistance mutation—that is, even if a virus gets the valuable resistance mutation, it might still end up disappearing from the population because it happened to harbor some other damaging mutation as well. This would reduce the effective population size as measured by selective sweeps.
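The first step of that classification can be sketched in Python. This is an illustration, not the authors' code. It assumes the wild-type lysine codon is AAA, so that the two single-nucleotide routes to asparagine are AAC and AAT (this follows from the standard genetic code, but the actual codons in the study are an assumption here). The function name and labels are hypothetical, and the diversity-based second step is deliberately left out.

```python
# Assumed codons: AAA encodes lysine; AAC and AAT both encode asparagine.
RESISTANT_CODONS = {"AAC", "AAT"}


def classify_sweep(codons):
    """Classify one patient's viral sample from the codons at the resistance site."""
    observed = set(codons) & RESISTANT_CODONS
    if len(observed) == 2:
        # Two independent origins of resistance: unambiguously soft.
        return "soft"
    if len(observed) == 1:
        # A single resistant codon: diversity data are needed to decide.
        return "hard or soft"
    return "no resistance"


print(classify_sweep(["AAC", "AAT", "AAC", "AAC"]))
print(classify_sweep(["AAC", "AAC", "AAC", "AAC"]))
```

The ambiguous case is where diversity levels around the resistance site come in, as described above.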

Implications and future work

Pennings’ findings have several implications. The first is that HIV populations have a limited supply of resistance mutations, as evidenced by the presence of hard sweeps (which, remember, occur when a sweep starts from a single mutation). This means that even small reductions in Ne, such as those produced by combination drug therapies, could have a big impact on preventing drug resistance. The second relates to the fact that, as described above, the likelihood that a mutation will sweep the population may be affected by background mutations in the virus in which it appears. This finding suggests that mutagenic drugs, given in combination with standard antiretrovirals, could be particularly useful for reducing drug resistance.  Now, Pennings is using larger datasets to determine whether some types of drugs lead to fewer soft sweeps (presumably because they reduce Ne). She is also trying to understand why drug resistance in HIV evolves in a stepwise fashion (one mutation at a time), even if three drugs are used in combination.

Paper author Pleuni Pennings is a postdoc in the lab of Dmitri Petrov.



Pennings PS, Kryazhimskiy S, and Wakeley J. Loss and recovery of genetic diversity in adapting HIV populations. PLoS Genetics (2014).

Demographic inference from genomic data in nonmodel insect populations

Blog author Martin Sikora is a postdoc in the lab of Carlos Bustamante.


Reconstructing the demographic history of species and populations is one of the major goals of evolutionary genetics. Inferring the timing and magnitude of past events in the history of a population is not only of interest in its own right, but also in order to form realistic null models for the expected patterns of neutral genetic variation in present-day natural populations. A variety of methods exist that allow the inference of these parameters from genomic data, which, in the absence of detailed historical records in most situations, is often the only feasible way to obtain them. As a consequence, it is generally not possible to empirically validate the parameters inferred from genomic data in a direct comparison with a known “truth” from a natural population. Furthermore, until recently, the application of these methods was limited to model organisms with well-developed genomic resources (e.g., humans and fruitflies), excluding a large number of non-model organisms with potentially considerable evolutionary and ecological interest.

Chasing butterflies?

In an elegant study recently published in the journal Molecular Ecology, Rajiv McCoy, a graduate student with Dmitri Petrov and Carol Boggs, and colleagues tackle both of these problems in natural populations of Euphydryas gillettii, a species of butterfly native to the northern Rocky Mountains. About 30 years ago, a small founder population of this species from Wyoming was intentionally introduced to a new habitat at the Rocky Mountain Biological Laboratory field site in Colorado, and population sizes were recorded every year since the introduction. The beauty of this system is that it allows the authors to perform a direct comparison of the known demography (i.e. a recent split from the parental population and bottleneck ~30 generations ago, with census data in the newly introduced population) with estimates inferred from genomic data.

Gillette’s Checkerspot (Euphydryas gillettii). Photo taken by Carol Boggs, co-advisor of Rajiv and one of the senior authors of the study.


A genomic dataset from a non-model organism

The researchers sampled eight larvae each from both the parental as well as the derived population for this study. In the world of model organisms, the next steps for constructing the dataset would be straightforward: Extract genomic DNA, sequence to the desired depth, map to the reference genome and finally call SNPs. In the case of E. gillettii however, no reference genome is available, so the authors had to use a different strategy. They decided to use RNA-sequencing in order to first build a reference transcriptome, which was then used as a reference sequence to map against and discover single nucleotide variants. An additional advantage of this approach is that the data generated can potentially also be utilized for other types of research questions, such as analyses of gene expression differences between the populations. On the downside, SNP calling from a transcriptome without a reference genome is challenging and can lead to false positives, for example due to reads from lowly expressed paralogs erroneously mapping to the highly expressed copy present in the assembled transcriptome. The authors therefore went to great lengths to stringently filter these false positive variants from their dataset.

Demographic inference using δαδι

For the demographic inference, McCoy and colleagues used δαδι (diffusion approximation for demographic inference), a method developed by Ryan Gutenkunst while he was a postdoc in the group of CEHG faculty member Carlos Bustamante. This method uses a diffusion approximation to calculate the expected allele frequency spectrum under a demographic model of interest. The observed allele frequency spectrum is then fit to the expected spectrum by optimization of the demographic parameters to maximize the likelihood of the data. δαδι has been widely used to infer the demographic history of a number of species, from humans to domesticated rice, and is particularly suited to large-scale genomic datasets due to its flexibility and computational efficiency.
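The fitting idea can be sketched in a few lines of Python. This is not the δαδι API and not the two-population model used in the paper. It assumes a single constant-size population, for which the expected number of sites with i derived copies is θ/i, and a Poisson likelihood for each frequency class. The function names and the observed counts are invented for illustration.

```python
import math


def expected_sfs(theta, sample_size):
    """Expected neutral spectrum for a constant-size population: theta / i."""
    return [theta / i for i in range(1, sample_size)]


def poisson_log_likelihood(observed, expected):
    """Log-likelihood of the observed counts, treating each class as Poisson."""
    return sum(o * math.log(e) - e - math.lgamma(o + 1)
               for o, e in zip(observed, expected))


def fit_theta(observed):
    """Maximum-likelihood theta under the model above (closed form)."""
    sample_size = len(observed) + 1
    harmonic = sum(1.0 / i for i in range(1, sample_size))
    return sum(observed) / harmonic


# Toy spectrum for 6 sampled chromosomes (made-up counts).
OBSERVED = [60, 30, 20, 15, 12]
theta_hat = fit_theta(OBSERVED)
for theta in (50.0, theta_hat, 70.0):
    ll = poisson_log_likelihood(OBSERVED, expected_sfs(theta, 6))
    print(f"theta = {theta:.1f}: log-likelihood = {ll:.3f}")
```

In this one-parameter toy model the best fit has a closed form. With several demographic parameters, such as split times and bottleneck sizes, the expected spectrum has to be computed numerically and the likelihood maximized by a numerical optimizer.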

Excerpt of Figure 2 from McCoy et al., illustrating the demographic models tested using δαδι.


Models vs History

The authors then fit a demographic model reflecting the known population history of E. gillettii, as illustrated in Figure 2 of their article (Model A). Encouragingly, they found that the model provided a very good fit to the data, with an estimate of the split time between 40 and 47 generations ago, which is very close to the known time of establishment of the Colorado population 33 generations ago. Furthermore, they also tested how robust these results were to using a misspecified demographic model, by incorporating migration between the Colorado and Wyoming populations in their model (which in reality are isolated from each other). However, neither alternative model with migration (Models B1 and B2) significantly improved the fit, again nicely consistent with the known population history.

Three butterflies is enough?

Finally, the researchers also tested the robustness of the results to variations in the number of samples or SNPs used in the analysis, using datasets simulated under the best-fit Model A. They found that δαδι performed remarkably well even with sample sizes as low as three individuals per population. While this is in principle good news for researchers limited by a low number of available samples, one has to be aware that these results will be to a certain extent specific to this particular type of system, where one population undergoes a very strong bottleneck resulting in large effects on the allele frequency spectrum. A good strategy suggested by McCoy and colleagues is then to use these types of simulations in the planning stages of an experiment, in order to inform researchers of the number of samples and markers necessary to confidently estimate the demographic parameters of interest.

Conclusions and future directions

For me, this study is a great example of how next-generation sequencing and sophisticated statistical modeling can open up a new world of possibilities to researchers interested in the ecology and evolution of natural populations. McCoy and colleagues constructed their genomic dataset essentially from scratch, without the “luxuries” of a reference genome or database of known polymorphisms. Moving forward, Rajiv has been busy collecting more samples over the past year. He and his colleagues plan to sequence over a thousand of them for the next phase of the project, as well as assemble a reference genome for E. gillettii, an important next step in the development of genomic tools for this fascinating ecological system.

The author of the paper, Rajiv McCoy, sampling larvae of Euphydryas gillettii.


McCoy, R. C., Garud, N. R., Kelley, J. L., Boggs, C. L. and Petrov, D. A. (2013), Genomic inference accurately predicts the timing and severity of a recent bottleneck in a nonmodel insect population. Molecular Ecology. doi: 10.1111/mec.12591

Genomic analyses of ancestry of Caribbean populations

Blog author Rajiv McCoy is a graduate student in the lab of Dmitri Petrov.


In the Author Summary of their paper, “Reconstructing the Population Genetic History of the Caribbean”, Andrés Moreno-Estrada and colleagues point out that Latinos are often falsely depicted as a homogeneous ethnic or cultural group. In reality, however, Latinos, including inhabitants of the Caribbean basin, represent a diverse mixture of previously separate human populations, such as indigenous groups, European colonists, and West Africans brought over during the Atlantic slave trade. This mixing process, which geneticists call “admixture”, left a distinct footprint on genetic variation within and between Caribbean populations. By surveying genotypes of 330 Caribbean individuals and comparing them to a database of variation from more than 3000 individuals from European, African, and Native American populations, Moreno et al. explore the genomic outcomes of this complex admixture process and reveal intriguing demographic patterns that could not be obtained from the historical record alone. The paper, featured in the latest edition of PLOS Genetics, represents a collaborative project with co-senior authorship by Stanford CEHG professor Carlos Bustamante and Professor Eden Martin from the University of Miami Miller School of Medicine.

Reconstructing the demographic history of admixed populations

Because parental DNA is only moderately shuffled before being incorporated into gametes (the process of meiotic recombination), admixture results in discrete genomic segments that can be traced to a particular ancestral population. In early generations after the onset of admixture, these segments are large. However, after many generations, segments will be quite small. By investigating the distribution of sizes of these ancestry “tracts”, Moreno and colleagues inferred the timing of various waves of migration and admixture. For Caribbean Island populations, they infer that European gene flow first occurred ~16-17 generations ago, which closely matches the historical record of ~500 years, assuming ~30 years per generation. In contrast, for neighboring mainland populations from Colombia and Honduras, they find that European gene flow occurred in waves, starting more recently (~14 generations ago).
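The arithmetic behind that comparison, and the reason tract lengths carry timing information, can be sketched in Python. The conversion uses the ~30 years per generation quoted above. The tract-length function is a rough single-pulse approximation and not the model fitted in the paper: it assumes about one recombination breakpoint per Morgan per generation, so that mean tract length is roughly 100/T centimorgans after T generations, and it ignores admixture proportions and chromosome ends. Both function names are hypothetical.

```python
def years_to_generations(years, years_per_generation=30.0):
    """Convert a historical time span into generations."""
    return years / years_per_generation


def mean_tract_length_cm(generations):
    """Rough mean ancestry-tract length in centimorgans (assumed 100 / T rule)."""
    return 100.0 / generations


print(f"500 years is about {years_to_generations(500):.1f} generations")
for t in (2, 14, 17):
    print(f"T = {t:2d} generations: mean tract ~ {mean_tract_length_cm(t):.1f} cM")
```

Under this simplification, tracts from admixture two generations ago span tens of centimorgans, while tracts from 14 to 17 generations ago are only a few centimorgans long, which is the contrast the tract-length analysis exploits.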

Identifying sub-continental ancestry of admixed individuals

Those familiar with human population genetics will recognize principal component analysis (PCA), which transforms a matrix of correlated observed genotypes into a set of uncorrelated variables where the first component explains the most possible variance, the second component explains the second most variance, and so on. Individuals’ transformed genotypes can be plotted on the first two principal components, and when performed on a worldwide scale, distinct clusters appear which represent populations of ancestry. On conventional PCA plots, admixed individuals fall between their different ancestral populations, as they possess sets of genotypes diagnostic of multiple ancestral groups. As virtually all Caribbean individuals are admixed to some degree, this pattern is apparent for Caribbean populations (see Figure 1B from the paper, reproduced below).
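For readers who have not seen PCA applied to genotypes, here is a small self-contained Python sketch of why admixed individuals land between their source populations. It is a toy and not the pipeline used in the paper. The genotype matrix is invented, only the first component is computed, and the method (power iteration on the centered individual-by-individual matrix) is chosen so the example needs no external libraries.

```python
import math

# Invented genotypes (0, 1 or 2 copies of an allele) at six SNPs.
GENOTYPES = [
    [0, 0, 0, 0, 1, 0],  # source population A
    [0, 1, 0, 0, 0, 0],  # source population A
    [2, 2, 2, 2, 1, 2],  # source population B
    [2, 1, 2, 2, 2, 2],  # source population B
    [1, 1, 1, 1, 1, 1],  # admixed individual
]


def pc1_scores(genotypes, iterations=100):
    """Return each individual's score on the first principal component."""
    n, m = len(genotypes), len(genotypes[0])
    # Center each SNP column on its mean.
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in genotypes]
    # Gram matrix of dot products between centered individuals.
    gram = [[sum(a * b for a, b in zip(ri, rj)) for rj in centered]
            for ri in centered]
    # Power iteration converges to the leading eigenvector.
    v = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        w = [sum(g * x for g, x in zip(row, v)) for row in gram]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(gram[i][j] * v[j] for j in range(n))
                     for i in range(n))
    # Scale the unit eigenvector so scores are in genotype units.
    return [x * math.sqrt(eigenvalue) for x in v]


for label, score in zip(["A1", "A2", "B1", "B2", "admixed"], pc1_scores(GENOTYPES)):
    print(f"{label:8s} PC1 = {score:+.2f}")
```

In this toy matrix the two source groups receive scores of opposite sign and the admixed individual sits at zero, midway between them. The overall sign of a principal component is arbitrary, so only the relative positions matter.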


While interesting, this means that the sub-continental ancestry of these admixed individuals is difficult to ascertain.  An individual may want to know which Native American, West African, and European populations contribute to his or her ancestry, and this analysis does not have sufficient resolution to answer these questions.

Moreno and colleagues therefore devised a new version of PCA called ancestry-specific PCA (ASPCA), which extracts genomic segments assigned to Native American, West African, and European ancestry, then analyzes these segments separately, dealing with the large proportions of missing data that result.  In the case of Native American ASPCA, they observe two overlapping clusters.  The first represents mostly Colombians and Hondurans, who cluster most closely with indigenous groups from Western Colombia and Central America and have a greater overall proportion of Native American ancestry.  The second cluster represents mostly Cubans, Dominicans, and Puerto Ricans, who cluster most closely with Eastern Colombian and Amazonian indigenous groups.  This makes sense in light of the fact that Amazonian populations from the Lower Orinoco Valley settled on rivers and streams, which could have facilitated their migration.  Because indigenous ancestry proportions were relatively consistent and closely clustered across different Caribbean Islands, the authors posit that there was a single pulse of expansion of Amazonian natives across the Caribbean prior to European arrival, along with gene flow among the islands.

In the case of European ASPCA, Moreno et al. found that Caribbean samples clustered closest to, but clearly distinct from, present day individuals from the Iberian Peninsula in Southern Europe.  In fact, the differentiation between this “Latino-specific component” and Southern Europe is at least as great as the differentiation between Northern and Southern Europe.  The authors hypothesize that this is due to very small population sizes among European colonists, which would have introduced noise into patterns of genomic variation through the process of random genetic drift.

Finally, the authors demonstrate that Caribbean populations have a higher proportion of African ancestry compared to mainland American populations, a result of admixture during and after the Atlantic slave trade.  Surprisingly, the authors found that all samples tightly clustered with present day Yoruba samples from Nigeria rather than being dispersed throughout West Africa.  However, because other analyses suggested that there might have been two major waves of migration from West Africa, the authors decided to analyze “old” and “young” blocks of African ancestry separately.  This analysis revealed that “older” segments are primarily derived from groups from the Senegambia region of Northwest Africa, while “younger” segments likely trace to groups from the Gulf of Guinea and Equatorial West Africa (including the Yoruba).

Conclusions and perspectives

This groundbreaking study has immediate implications for the field of personalized medicine, especially due to the discovery of a distinct Latino-specific component of European ancestry.  The hypothesis that European colonists underwent a demographic bottleneck (a process termed the “founder effect”) has expected consequences for the frequency of damaging mutations contributing to genetic disease. The observation of extensive genetic differences among Caribbean populations also argues for more such studies characterizing genetic variation on a smaller geographic scale. The newly developed ASPCA method will surely be valuable for other admixed populations.  In addition to medical implications, studies such as this help dispel simplistic notions of race and ethnicity and inform cultural identities based on unique and complex demographic history.

Citation: Moreno-Estrada A, Gravel S, Zakharia F, McCauley JL, Byrnes JK, et al. (2013) Reconstructing the Population Genetic History of the Caribbean. PLoS Genet 9(11): e1003925. doi:10.1371/journal.pgen.1003925

Paper author Andrés Moreno-Estrada is a research associate in the lab of Carlos Bustamante.
