Identifying the patterns of spontaneous mutations

Blog author Ryo (Ryosuke) Kita is a graduate student in Hunter Fraser’s lab.

Spontaneous mutations – friend or foe?

Evolution has conflicting opinions about spontaneous mutations. Spontaneous mutations produce the genetic variation that drives evolution in all organisms, but at the same time, most mutations that affect fitness are harmful to the organism. Yet despite their pivotal role in evolution, our understanding of mutations remains limited.

To understand the role of mutations in evolution, the following basic questions are essential: What types of spontaneous mutations occur? How frequently do they occur?

Such simple questions are surprisingly difficult to answer, but a recent study by Yuan Zhu (Zhu et al. 2014) harnessed the power of next-generation sequencing to answer them more precisely for the budding yeast S. cerevisiae.

A tricky measurement

Obtaining an unbiased measurement of spontaneous mutations is challenging because the nucleotide changes we observe can be biased by selection.

Imagine we sequence the genome of a yeast cell. Let’s then wait several generations and sequence its progeny. Sure, you may find a number of mutations – but is that really an unbiased measurement of spontaneous mutations? No! Progeny with deleterious or harmful mutations will be out-competed by others, so they will not be sequenced. In other words, this approach misses the deleterious mutations.

So, what’s the proper way to measure spontaneous mutations? A few methods have been used:

One method measures the mutations that occur in a location with no fitness effect, such as a pseudogene. Because mutations within this location do not affect fitness, it provides an unbiased picture of the mutation rate. The downside is that the frequency and type of spontaneous mutations differ depending on the location in the genome, and this method is restricted to studying only one particular location.

For a genome-wide approach, a pedigree-based method can be employed. This method looks at the differences between parents and offspring, which provide an unbiased view of the mutational landscape (except for the most deleterious mutations). This method, however, is infeasible for measuring many mutations because of the paucity of mutations occurring within one generation.

But not all hope is lost, because one method addresses both of the weaknesses above. This approach uses mutation accumulation (MA) lines – passaging an organism from generation to generation at very small population sizes. The small population size eliminates the effect of selection because the progeny are not allowed to compete with each other, and the passaging can be continued for many generations, yielding a large number of spontaneous mutations (except, again, for the most strongly deleterious ones). MA lines have been studied for over 50 years in yeast, fruit flies, and nematodes, and have both reinforced and altered our understanding of spontaneous mutations and the distribution of fitness effects (Halligan and Keightley 2009).
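The selection-removing effect of the single-colony bottleneck can be illustrated with a toy serial-transfer simulation. This is a sketch under simplified assumptions, not the actual MA protocol; the function name and all parameter values are made up for illustration:

```python
import random

def persistence(s, n_pick, trials=2000, gens=20, seed=1):
    """Toy serial-transfer model: a mutant with relative fitness 1-s
    starts as 1 of n_pick cells carried over at each transfer. Each
    generation, selection acts within the colony, then n_pick cells
    are sampled at random for the next transfer. Returns the fraction
    of trials in which the mutant lineage is still present at the end."""
    rng = random.Random(seed)
    kept = 0
    for _ in range(trials):
        freq = 1.0 / n_pick
        for _ in range(gens):
            # selection within the growing colony
            w = freq * (1 - s)
            freq = w / (w + (1 - freq))
            # bottleneck: sample n_pick cells at random
            k = sum(rng.random() < freq for _ in range(n_pick))
            if k == 0:
                freq = 0.0
                break
            freq = k / n_pick
        if freq > 0:
            kept += 1
    return kept / trials

# With a single-cell bottleneck (n_pick=1), a deleterious mutant is
# never out-competed: it is carried along just like a neutral one.
# With a larger bottleneck, selection purges it.
print(persistence(s=0.1, n_pick=1))
print(persistence(s=0.1, n_pick=100))
```

With `n_pick=1` the mutant persists in every trial regardless of its fitness cost, which is exactly why MA lines accumulate (all but the most strongly) deleterious mutations.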


FIGURE 1: Mutation Accumulation Lines (Taken from Halligan and Keightley 2009 Figure 1). Serial passaging of random single colonies while keeping the effective population size small eliminates the effects of selection, with the exception of the effects of strongly deleterious mutations.

300,000 generations of yeast

Although MA lines have been studied for so long, constraints in sequencing ability have prevented a large-scale analysis of the mutations. Combining the sequencing prowess of the Petrov Lab at Stanford and the MA lines from the Hall Lab at the University of Texas, Yuan and colleagues sequenced the genomes of 145 budding yeast diploid MA lines that were passaged for ~2000 generations each. This amounts to roughly 300,000 generations of mutation accumulation! Using innovative analyses to identify mutations with high confidence, they found numerous single nucleotide mutations, in addition to indels, CNVs, and whole-chromosome copy number changes.

Using these data, they were able to calculate the mutation rate in yeast. Their refined mutation rate is valuable for molecular evolution models in yeast – but the authors also uncovered findings unique to this mutation accumulation study, thanks to the whole-genome sequencing approach and the scale of the study. Here, two of these findings will be discussed: the prevalence of aneuploidies and the genomic context of single nucleotide mutations.
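The rate calculation itself is simple division: mutations observed over the total number of site-generations surveyed. A back-of-the-envelope sketch (the site count below is a round placeholder for twice a ~12 Mb diploid genome, not the paper's exact callable-site count):

```python
def per_site_rate(n_mutations, n_lines, gens_per_line, sites_per_line):
    """Pooled per-site, per-generation mutation rate:
    mutations / (lines x generations x sites surveyed per line)."""
    return n_mutations / (n_lines * gens_per_line * sites_per_line)

# 867 single-nucleotide mutations over 145 diploid lines passaged for
# ~2000 generations each; ~24 million sites per diploid line is a
# rough placeholder genome size.
rate = per_site_rate(867, 145, 2000, 24_000_000)
print(f"{rate:.2e}")  # on the order of 1e-10 per site per generation
```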

Aneuploidy (almost) everywhere!

Just like in humans, aneuploidy can occur in yeast – but how often does it occur spontaneously? With their next-generation sequencing data, Zhu et al. were uniquely poised to answer this question. By analyzing the read-depth across chromosomes for each strain, they found many differences in whole-chromosome copy number. Roughly 20% of their strains (31/145) exhibited aneuploidy. And out of the 16 chromosomes in yeast, all but two had a duplication event (Figure 2).

Only a small fraction (2/31) of the aneuploidy events were chromosome deletions; the rest were chromosome duplications. The relative lack of chromosome loss is likely due to its strongly deleterious effect. But the high prevalence of chromosome duplications raises a number of questions. For example, how common are aneuploidies in yeast strains used in other studies? Analyses of yeast phenotypes or gene expression are often performed assuming that the strains carry no aneuploidies, but an additional chromosome could affect such analyses significantly. To further investigate the role of these aneuploidies, a useful next step would be to study the distribution of fitness effects of these events.
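Read-depth-based detection of whole-chromosome events can be sketched roughly as follows. This is a simplified stand-in for the authors' actual pipeline; the function name, threshold, and coverage numbers are invented for illustration:

```python
from statistics import median

def call_aneuploidies(chrom_depths, base_ploidy=2, tol=0.25):
    """Flag whole-chromosome copy-number changes from mean read depth.
    chrom_depths maps chromosome -> mean coverage. Depths are scaled by
    the genome-wide median so that base_ploidy copies correspond to a
    ratio of 1, and any chromosome near a different integer copy
    number is reported."""
    m = median(chrom_depths.values())
    calls = {}
    for chrom, depth in chrom_depths.items():
        copies = depth / m * base_ploidy
        nearest = round(copies)
        if nearest != base_ploidy and abs(copies - nearest) <= tol:
            calls[chrom] = nearest
    return calls

# A diploid strain with ~50x coverage on most chromosomes but ~75x on
# chrIII (made-up numbers) is called trisomic for chrIII:
depths = {"chrI": 50, "chrII": 51, "chrIII": 75, "chrIV": 49, "chrV": 50}
print(call_aneuploidies(depths))  # {'chrIII': 3}
```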

FIGURE 2: Aneuploidy in MA lines. Adapted from Table 1 in Zhu et al. 2014. Among the 145 sequenced MA lines, 31 strains had an aneuploidy. 29 out of 31 strains had a chromosome duplication.

Patterns of single mutations & Methylation in Yeast?

In addition to aneuploidies, Zhu et al. also identified 867 single-nucleotide mutations. The patterns of these mutations can serve as a baseline for the mutational landscape in the absence of selection.

For example, Zhu et al. examined the frequencies of specific nucleotide changes – such as the frequency of an A to T mutation or an A to G mutation. Because the strand of mutation origin is unknown, there are 6 detectable classes of nucleotide change, and Zhu et al. found that they are not evenly distributed in the MA lines. C to T mutations were particularly frequent at 35%, twice the 17% null expectation. These biases in mutation rate can be used to refine nucleotide evolution models such as Jukes-Cantor, which will improve future phylogenetic analyses and tests for selection in yeast.
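Folding the 12 possible substitutions into 6 strand-symmetric classes (a C to T change on one strand is a G to A change on the other) can be sketched as follows; the function names are my own:

```python
from collections import Counter

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def fold(ref, alt):
    """Collapse a substitution into one of 6 strand-symmetric classes,
    reported from a C or T reference base (e.g. C>T and G>A are
    pooled, because the strand of origin is unknown)."""
    if ref in "AG":
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}>{alt}"

def spectrum(mutations):
    """Fraction of mutations in each folded class. Under a uniform
    null, each of the 6 classes is expected at 1/6 (~17%)."""
    counts = Counter(fold(ref, alt) for ref, alt in mutations)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}

# C>T and G>A land in the same class:
print(spectrum([("C", "T"), ("G", "A"), ("A", "G"), ("T", "C")]))
```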

In addition to looking at the mutations themselves, Zhu et al. examined the frequencies of the neighboring bases – with a peculiar result. They found that GC mutations were twice as likely to occur at CCG and TCG sites (Figure 3). This type of elevation would not be surprising in mammals, where it is associated with methylation, but methylation is considered rare (and possibly absent) in budding yeast. The authors carefully suggest that methylation in yeast could be a “parsimonious explanation” for this difference, but further studies will be necessary to confirm it.
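Tallying the local context of each mutation is a matter of looking one base to either side. A minimal sketch (toy single-sequence version; names invented):

```python
from collections import Counter

def context_counts(sequence, mutated_positions):
    """Count the trinucleotide context (left neighbor, mutated site,
    right neighbor) of each mutated position in a sequence, e.g. to
    ask whether C mutations are enriched at CCG/TCG sites."""
    counts = Counter()
    for pos in mutated_positions:
        # skip positions at the sequence edges with no full context
        if 1 <= pos <= len(sequence) - 2:
            counts[sequence[pos - 1 : pos + 2]] += 1
    return counts

# Two mutated positions in a toy sequence: one in a TCG context,
# one in a GAC context.
print(context_counts("ATCGAC", [2, 4]))
```

In a real analysis these observed context counts would be compared against the genome-wide abundance of each trinucleotide to detect enrichment.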

FIGURE 3: Taken from Figure 5 of Zhu et al. 2014. The mutation rate of particular GC pairs depends on neighboring sites.

Continuing the saga 

300,000 generations of mutation accumulation is a lot. With this long list of confirmations, refinements, and surprises uncovered by Zhu et al., it may seem that this yeast MA experiment could now be stopped, having yielded most of its results. But as similar studies have shown (such as Richard Lenski's long-term evolution experiment), long-term passaging experiments continue to bear fruit even after many years. Time will tell whether these MA lines will continue to bring novel insight into the context and patterns of spontaneous mutations. One area the authors suggest could benefit from further data is the effect of genomic context on mutation rate, such as the difference between mutations occurring in noncoding versus coding regions.

Another area ripe for further experiments is the distribution of fitness effects of these mutations. It remains to be seen whether the various mutations in these lines are mostly neutral, deleterious, or advantageous. Such studies have both theoretical impacts, such as understanding the molecular clock, and practical implications, such as the relation of spontaneous mutations to disease (Eyre-Walker and Keightley 2007). Calculating the distribution of fitness effects will be a difficult endeavor, however: each strain has experienced several mutations, so determining the effect of each individual mutation will be challenging. Nevertheless, Yuan and colleagues have provided a great stepping-stone for studying the spontaneous mutation landscape at an unprecedented level of detail and scale.


Zhu Y. et al. Precise estimates of mutation rate and spectrum in yeast. PNAS. 2014

Halligan D.L. and Keightley P.D. Spontaneous Mutation Accumulation Studies in Evolutionary Genetics. Annu. Rev. Ecol. Evol. Syst. 2009. 40:151-72

Eyre-Walker A. and Keightley P.D. The distribution of fitness effects of new mutations. Nature Reviews Genetics. 2007. 8:610-618

Yuan Zhu was a graduate student in Dmitri Petrov’s lab. She defended her thesis in May 2014.









SMBE symposium organized by three young Stanford researchers

Many young researchers go to conferences to give a talk or present a poster. But occasionally they get a chance to organize part of the conference. At SMBE, three young Stanford researchers (Ethan Jewett, Nicole Creanza and Oana Carja) are responsible for symposium 33. Ethan and Oana are graduate students and Nicole is a postdoc. The topic of their symposium is “Joint analysis of genetic and cultural data.” It takes place on Thursday afternoon (Thursday June 12th, 15.15 – 16.45, San Geronimo). I asked the three of them a few questions.

Ethan Jewett

Oana Carja



When did you decide to propose a symposium? Did you ever do this before?

We heard about this last fall when the call for symposia went out. We have never done this before.

Why did you choose the topic? Do you work on this topic yourself?

We chose the topic because culture is an important factor affecting human diversity and joint analyses of genetic and cultural data can provide a clearer picture of human evolution and diversity than analysis of either kind of data individually. The joint analysis of genetic and cultural data is not a topic that gets a lot of press at most genetics conferences.

Was it hard to find invited speakers? Who did you invite and why are you excited about their work?

No, it wasn’t difficult to find invited speakers. There are a lot of cool people working in this field, so it was great to have the opportunity to hear some of them speak. We invited Paul Verdu, a biological anthropologist with a long history of working at the interface of genetic and non-genetic evolution to answer important questions about human history.

Was it hard to pick abstracts? How did you go about choosing? 

We had a lot of great abstracts to choose from, and it was difficult to narrow things down to a few people. We tried to choose a diverse set of abstracts from individuals at different stages of their careers, from different subject areas, and from different places and types of institutions, and we tried to maintain a gender balance. We began with a blind review of the abstracts, which we felt helped to reduce bias in our decision process.

Are there posters associated with your symposium? When are they on display?

Yes, there are posters. Wednesday evening from 7:30-9:30pm and Thursday evening from 5:30-6:30pm. Posters will also be on display starting Wednesday morning, and will be on display through Thursday.

Posters for our symposium are numbers 2196-2200 (see page 23 of this pdf)

What do you hope most for your symposium?

We’d like to encourage people to work at the interface of genetic and cultural evolution because it is important to understand how genetics and culture shape each other. Culture is very important in shaping genetic variation and vice versa. We currently don’t understand how strongly culture impacts genetic variation.

Will you have time to enjoy Puerto Rico? If so, what do you plan to do?

Yes! Eat good food and talk good science!

A link to the symposium abstracts online is here.

Description of the symposium

Cultural factors—such as marriage customs, farming practices, and languages—can create population substructure, influence admixture, and place selective pressures on genetic variants (Quintana-Murci et al. 2008; Risch et al. 2009; Burger et al. 2007; Barbujani and Sokal 1990, Coia et al. 2012). An understanding of the impact of cultural practices on genetic variation can facilitate the use of genetic data to infer the cultural forces that have shaped modern populations. Moreover, the joint analysis of genetic and cultural data can provide more precise inferences of demographic histories. However, the ways in which cultural forces have shaped genetic diversity remain poorly understood. This symposium aims to explore recent advances in the joint analysis of genetic and cultural data, as well as methodologies for performing these analyses. The work presented in this symposium will connect researchers who work on these questions, and provide a basis for future research into the important role of culture in the evolution of populations.

Program (Thursday June 12th, 15.15 – 16.45, San Geronimo)

3:15 – 3:45  Parallel trajectories of genetic and linguistic admixture in Cape Verdean Kriolu speakers.
Paul Verdu*, Ethan Jewett, Trevor Pemberton, Noah Rosenberg, Marlyse Baptista

3:45 – 4:00  Genes mirror subsistence in prehistoric Europe
Mattias Jakobsson

4:00 – 4:15 Genome-wide analysis of Oceanian ancestry
Ana T. Duggan*, David Reich, Mark Stoneking

4:15 – 4:30 Cultural transmission of reproductive success: a strong evolutionary force that shapes genetic diversity.
Evelyne Heyer, Jean-Tristan Brandenburg, Michela Leonardi, Patricia Balaresque, Bruno Toupance, Tatyana Hegay, Almaz Aldashev, Frederic Austerlitz*

4:30 – 4:45  Biocultural Analysis of Variation in Blood Pressure among African Americans in the Health Equity Alliance of Tallahassee (HEAT) Heart Health Study
Laurel N. Pearson*, Sarah M. Szurek, Clarence C. Gravlee, Connie J. Mulligan

CEHG goes to Puerto Rico for SMBE (posters)



The SMBE meeting is one of the most important evolutionary biology meetings of the year. This year (2014) it takes place in Puerto Rico from June 9th till June 12th. The Stanford Center for Computational, Evolutionary and Human Genomics (CEHG) sponsors the event. We are also well represented with 17 talks, 7 posters and 2 symposia that are (co-)organized by CEHG members. Visit us at Stand 9!

Here is a list of the CEHG posters:

POSTER SESSION 1: P-1001 – P-1278

On Show:
Monday 9th – 13:00 – 15:30 / 17:00 – 17:30
Tuesday 10th – 13:00 – 15:30 / 17:00 – 17:30

Manned Session:
Monday 9th – 19:30 – 21:00
Tuesday 10th – 19:00 – 21:00

Shringarpure, Suyash
Fast, scalable and distributed dimensionality reduction of genome-wide data
S1 P-1016

Kimberly McManus
popRange: a highly flexible spatially and temporally explicit forward genetic simulator
S2 P-1041

Rajiv McCoy
TruSeq synthetic long-reads empower de novo assembly and resolve complex, highly repetitive transposable elements
S5 P-1066

Nilah Ioannidis
Inferring the effects of genetic variants on gene expression and splicing
S11 P-1145

Karla Sandoval
The genetic basis of preeclampsia in populations adapted to high altitude
S12 P-1186

Chris Gignoux
The Role of Human Demographic History in the Identification of Genetic Associations
S17 P-1265

POSTER SESSION 2: P-2001 – P-2279 & U-2280 – U-2289

On Show:
Wednesday 11th – 11:00 – 11:30 / 13:00 – 15:30 / 17:00 – 17:30
Thursday 12th -11:45 – 12:15 / 13:45 – 15:15

Manned Session:
Wednesday 11th – 19:00 – 21:30
Thursday 12th – 17:30 – 18:30

McCoy, Rajiv 
Characterizing patterns of human aneuploidy in a large sample of IVF patients
S26 P-2139

Musharoff, Shaila 
A Novel Likelihood Ratio Test for Sex-Biased Demography and the Effect of Cryptic Sex-Bias on the Estimation of Demographic Parameters
S26 P-2151

Martin, Alicia R 
The Genetic Architecture Of Skin Pigmentation In Southern Africa
S34 P-2206

Carja, Oana
On the evolution of mutation in spatially subdivided populations
S35 P-2220

Giltae Song
Pan genome analysis of Saccharomyces cerevisiae
S39 P-2269

CEHG goes to Puerto Rico for SMBE (talks)

The SMBE meeting is one of the most important evolutionary biology meetings of the year. This year (2014) it takes place in Puerto Rico from June 9th till June 12th. The Stanford Center for Computational, Evolutionary and Human Genomics (CEHG) sponsors the event. We are also well represented with 17 talks, 7 posters and 2 symposia that are (co-)organized by CEHG members. Visit us at Stand 9! Here is a list of the CEHG talks:

Monday June 9th

David Enard A global landscape of protein adaptation to viruses in mammals Monday 9th June:  S9: Evolutionary Networks 10:00 – 10:15

Tuesday June 10th

Dmitri Petrov Balancing selection and maintenance of variation as a natural consequence of adaptation in diploids Tuesday 10th June:  S18: Does ploidy matter? Ploidy impacts on evolutionary process 09:30 – 11:00

Zoe Assaf Staggered sweeps: The obstruction of adaptation in diploids by recessive, strongly deleterious alleles Tuesday 10th June:  S18: Does ploidy matter? Ploidy impacts on evolutionary process 10:00 – 10:15


Nicole Creanza Worldwide linguistic and genetic variation Tuesday 10th June:  S15: Out of Africa: Humans, commensals, pathogens, oh my! 10:00 – 10:30

David Enard Genome-wide signals of positive selection in human evolution.  Tuesday 10th June:  S12: Genomics of adaptation (cont) 12:15 – 12:30

Alan Bergland Genomic evidence of rapid and stable adaptive oscillations over seasonal time scales in Drosophila Tuesday 10th June:  S12: Genomics of adaptation 16:15 – 16:30

Ben Wilson Soft selective sweeps in complex demographic scenarios Tuesday 10th June:  S12: Genomics of adaptation 17:45 – 18:00

Diamantis Sellis Widespread heterozygote advantage in diploids Tuesday 10th June:  S12: Genomics of adaptation 18:45 – 19:00

Wednesday June 11th

Morten Rasmussen The genome of a Late Pleistocene human from a Clovis burial site in western Montana Wednesday 11th June:  S27: Genomic perspectives on the population history of the Americas 09:30 – 09:45

María Ávila-Arcos Tracing the genetic ancestry of enslaved Africans using ancient DNA Wednesday 11th June:  S27: Genomic perspectives on the population history of the Americas 10:45 – 11:00

Andres Moreno Estrada Patterns of genetic diversity in Latin America: insights from human population genomics Wednesday 11th June:  S27: Genomic perspectives on the population history of the Americas 11:30 – 11:45

Philipp Messer New statistical methods detect both hard and soft sweeps in malaria parasites Wednesday 11th June:  S25: Detecting selection in natural populations: making sense of genome scans and towards alternative solutions 12:30 – 12:45

Gavin Sherlock Tracking hundreds of thousands of lineages in an evolving population allows determination of the beneficial mutation rate and elucidation of the distribution of their fitness effects Wednesday 11th June:  S23: Genome-scale Approaches in Experimental Evolution 12:45 – 13:00

Nandita Garud Disentangling the effects of demography and selection on haplotype signatures in Drosophila. Wednesday 11th June:  S25: Detecting selection in natural populations: making sense of genome scans and towards alternative solutions  16:15 – 16:30

Carlo Artieri Accounting for biases in riboprofiling data identifies a conserved effect of proline incorporation as the major determinant of translational stalling Wednesday 11th June:  S24: Creative use of next generation sequencing technology in evolutionary genomics: solving old problems with new approaches 18:45 – 19:00

Thursday June 12th

Tomas Babak An atlas of human and mouse genomic imprinting reveals evolutionary causes and consequences Thursday 12th June:  S36: Evolutionary Epigenomics 13:00 – 13:15

Fernando Mendez Use of Long-Read Sequence-aided phasing to improve ancestry assignment in admixed populations Thursday 12th June:  S39: Next generation Genome Annotation and Analysis 16:15 – 16:30

How we organized the BAPGX conference

Author: Pleuni Pennings, outreach director of CEHG.

(Cross-posted from

I love working with a team to organize an event. The tenth Bay Area Population Genomics conference (BAPGX) was a fun event to organize. It is part of a successful series of conferences and there was plenty of support, both at Stanford and from the community, to organize it. Five Stanford postdocs volunteered to help out and did a great job.

I think the conference was a success. Here are some of the things we did to make it that success.

How it got started

  1.     Dmitri Petrov (who initiated the BAPG series) asked me (Pleuni) if I could organize the conference.
  2.     Dmitri and I picked a date (not realizing it was Memorial day weekend!) and after that I was free to organize it the way I wanted.
  3.     I asked the CEHG mailing list for volunteers and –within 10 minutes!– found five postdocs who were willing to help me organize the event. We were ready to get started!

The BAPGX committee, Pleuni Pennings (@pleunipennings), Maria Avila (@maricugh), Carlo Artieri (@Carlo_Artieri), David Enard (@DavidEnard), Dave Yuan (@13bee_slurpee), Bridget Algee-Hewitt (@BridgetAH), Dmitri Petrov (@PetrovADmitri, not in the picture).

The BAPGX committee

  1.     The BAPGX committee met 3 times. The first meeting was mostly to brainstorm, the other meetings were more focused on logistics.
  2.     We sent many (many!) emails within the committee.
  3.     We kept notes and files in a shared folder on Google Drive.
  4.     We split tasks: Bridget was in charge of communication with speakers and participants.
  5.     Carlo was in charge of booking the location, catering, and poster boards.
  6.     Maria was in charge of mugs and name stickers.
  7. David was in charge of the logo and photography during the event.
  8. Dave was in charge of wine and cheese and printing the schedules and signs.
  9. Pleuni was in charge of the website, Facebook, twitter and the money.
  10. We decided on the program (and many other things) together.

There was a BAPGX mug for every participant.


The location

  1. We looked at several lecture halls at Stanford and chose M106 in the Alway building, even though it wasn’t fancy, because it had the right size (140 seats) and was adjacent to a courtyard (great for registration, breaks and posters). It was also good because people could take their coffee with them into the room and because we could order food from an outside vendor.
  2. We decided to start the conference at 10AM, so that it was convenient for people who wanted to take the Caltrain (the first train arrived at 9:17 in Palo Alto).
  3. We put travel instructions on the website and also sent them by e-mail:
  4. We put up some signs to make it easier for people to find the Alway building.
  5. We encouraged people to ride share and included a column in the registration Google doc with ride share information.
  6. We had two of our committee members (Maria and David) on the Caltrain and at the station to guide people to the lecture hall.
  7. We made sure the lecture hall would be open on the day of the conference.
  8. Two of us washed all the mugs and four of us went to Costco two days before the conference.

Dave and David are preparing the registration area.


Registration

  1. Registration was free and open to all.
  2. We decided to cap participation at 150 (even though we had only 140 seats) under the assumption that some people would cancel or not show.
  3. To sign up, people simply added their name, email, affiliation, food preference, whether they’d bring a poster, and ride share info to a Google doc.
  4. A few days before the conference we “locked” the Google doc and asked people to email us instead.
  5. We reached 150 registrations around one week before the conference. After that approximately 10 people canceled and around 5 additional people were admitted. A few people didn’t show up and a few crashed the conference, but this was no problem as we had enough seats and food.

A full lecture hall.

Using social media to build momentum

One of the great things about working with an active community and a motivated committee is that we could build a lot of momentum before the conference.

  1. We had a simple website.
  2. The committee communicated with the community through emails (to individual people, to the speakers and the poster presenters and to the BAPG Google group).
  3. We hoped for participation from many different universities and made additional efforts to encourage people from SFSU, Santa Clara and UCSC to sign up. We also had participants from UCSF, UC Davis, UC Berkeley, the Cal Academy of Sciences, Stanford and a few other institutions.
  4. We used twitter (all committee members are on twitter).
  5. We decided on a twitter hashtag early on (#BAPGX).
  6. We used Facebook (through the CEHG Facebook page).
  7. We tried to keep everyone updated on the program and everything else we were working on (logo, mugs, cheese etc.) to show that we were working hard and that we were excited about the conference.
  8. We asked three active tweeters from the community to live-tweet the conference (@razibkhan, @mwilsonsayres and @JeremyJBerg).

One of many tweets on the day of #BAPGX


Money

  1. We received financial support from and from CEHG (thank you!!).
  2. For the CEHG money we had to write a short proposal, but basically used the same text as we already had on the website.
  3. We spent money on food and coffee ($2,266), water, juice, soda, wine and grapes ($711), cheese ($430), mugs ($715), and name stickers ($76).
  4. We saved money by buying water, juice, soft drinks, wine, crackers, grapes, paper plates and plastic cups at Costco.
  5. We also saved money by using stickers as badges (instead of more fancy badges).
  6. We saved money by not video-recording anything.
  7. We decided not to spend any money on inviting an outside speaker or gifts for speakers.
  8. We got help with the financial administration from CEHG (Cody Sam) and from Elena Yujuico (Dmitri’s admin).

(Almost) glitches

  1. We didn’t realize until fairly late that we planned the conference in Memorial Day weekend.
  2. We didn’t think about Wi-Fi access for guests until the night before the conference. On the day itself we created a guest login for the Stanford network and that worked.
  3. We were not consistent about the length of the normal talks. We originally announced that they’d be 12 minutes (+3 minutes for questions), but then this somehow became 15 minutes (+5 minutes for questions).


Talks

  1. We allowed for normal talks (15 min) and mini talks (5 min) and got an equal number of abstracts for both.
  2. We set a deadline for talk submissions (4 weeks before conference) and another deadline for registration (one week before conference).
  3. We asked people to submit abstracts (this was not done for previous BAPG conferences, where people were encouraged to add their name to a Google doc to sign up for a talk; we thought that making the process a little more formal would get us different and potentially better-prepared speakers).
  4. We accepted all talks that were submitted before the deadline (but none that were submitted after).
  5. Our program was a bit longer than previous BAPG programs, because we decided to accept all submitted abstracts and because we wanted to be sure there was ample time for questions.
  6. We asked speakers to send us their slides (only Powerpoint or PDF allowed) a few days before the conference (in the end we had all files before the start of the conference!)
  7. We organized a practice-your-talk session for the speakers from Stanford.


Posters

  1. Participants were encouraged to bring a poster. No titles or abstracts needed, just a “yes” in the right column in the Google doc.
  2. 11 People signed up to bring a poster, in the end there were 9 posters.
  3. We borrowed simple poster boards from the Beckman center and brought pins and tape (We considered higher quality poster boards, but they would have cost around $500 rental fee at Stanford).
  4. Posters were up the whole day (from the coffee break in the morning).

The day itself

  1. All of the committee (except Maria who was on the Caltrain) met at 7:30 to set up everything for the day.
  2. We made sure we each had all the other phone numbers.
  3. We brought several computers, VGA cables, thumb drives etc (but in the end all worked from the computer that was already in the lecture hall).
  4. We brought plates and knives for the cheese and grapes.
  5. We asked Dmitri Petrov to say something at the start of the conference.
  6. We split chairing between three members of the organizing committee (Carlo, Maria, David).
  7. Bridget and Pleuni “manned” the registration table from 9 till 10:30 and explained to everyone how to indicate their affiliation on their name sticker.

Name sticker with affiliations indicated (a yellow dot at the approximate location of Stanford and a green dot at the location of SFSU).

  1. We had fairly long breaks and allowed plenty of time for questions. Each session ran a few minutes late, but we caught up during the breaks.
  2. Each of the members of the organizing committee missed at least one of the sessions to be outside to handle food.
  3. Our three designated twitter volunteers did a great job live-tweeting the conference and many others joined in. The result was storified here:
  4. David brought his camera (photos will follow!).
  5. Originally we had someone (not the chair) assigned the task of keeping track of time during the talks, but in the end we decided that it worked better if the chair did it him/herself.

After the conference

  1. We all stayed till the end and cleaned up the mess.
  2. We found a home for the leftover mugs & the leftover wine and cheese.
  3. We wrote this document.
  4. We sent an email to all participants to thank them for coming & for great talks and posters & to tell them about the storified tweets.
  5. We plan to have a nice lunch or dinner with the committee to celebrate the success of the conference.
  6. We’ll publish the photos.
  7. We’ll round up the financial administration.

Other notes

  1. The talks were of very high quality (thank you, speakers!!). The fact that we asked presenters to send us their slides beforehand may have contributed to that.
  2. We got a lot of positive feedback about the mini talks (lightning talks), so it may be a good idea to keep that as part of BAPG.
  3. Many people stayed for the posters and cheese. The cheese may have helped with that; we had a Toma (cow’s milk cheese from Point Reyes), Comté (cow’s milk cheese from France), Brabander (goat’s milk Gouda from Holland) and Casatica (buffalo milk soft cheese from Italy).

A framework for identifying and quantifying fitness effects across loci

Blog author Ethan Jewett is a PhD student in the lab of Noah Rosenberg.

The degree to which similarities and differences among species are the result of natural selection, rather than genetic drift, is a major question in population genetics. Related questions include: what fraction of sites in the genome of a species are affected by selection? What is the distribution of the strength of selection across genomic sites, and how have selective pressures changed over time? To address these questions, we must be able to accurately identify sites in a genome that are under selection and quantify the selective pressures that act on them.

Difficulties with existing approaches for quantifying fitness effects    

A recent paper in Trends in Genetics by David Lawrie and Dmitri Petrov (Lawrie and Petrov, 2014) provides intuition about the power of existing methods for identifying genomic regions affected by purifying selection and for quantifying the selective pressures at different sites. The paper proposes a new framework for quantifying the distribution of fitness effects across a genome. This new framework is a synthesis of two existing forms of analysis – comparative genomic analyses to identify genomic regions in which the level of divergence among two or more species is smaller than expected, and analyses of the distribution of the frequencies of polymorphisms (the site frequency spectrum, or SFS) within a single species (Figure 1). Using simulations and heuristic arguments, Lawrie and Petrov demonstrate that these two forms of analysis can be combined into a framework for quantifying selective pressures that has greater power to identify selected regions and to quantify selective strengths than either approach has on its own.

Figure 1. Using the site frequency spectrum (SFS) to quantify the strength of purifying selection. The SFS tabulates the number of polymorphisms at a given frequency in a sample of haplotypes. Under neutrality (black dots) many high-frequency polymorphisms are observed. Under purifying selection (higher values of the effective selection strength |4Nes|), a higher fraction of new mutations are deleterious, leading to fewer high-frequency polymorphisms (red and blue dots). Adapted from Lawrie and Petrov (2014).
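
For intuition about the neutral baseline in Figure 1, recall the classic infinite-sites result that, under neutrality, the expected number of polymorphisms with derived-allele count i in a sample of n haplotypes is theta/i. A minimal sketch (the values of n and theta below are arbitrary illustrations):

```python
# Expected neutral site frequency spectrum under the infinite-sites
# model: E[xi_i] = theta / i for derived-allele count i = 1..n-1
# in a sample of n haplotypes.
def neutral_sfs(n, theta):
    return [theta / i for i in range(1, n)]

sfs = neutral_sfs(n=10, theta=5.0)
print(sfs[0], sfs[-1])  # singletons (5.0) dominate; the count-9 class is 5/9
```

Purifying selection depresses the high-frequency classes of this spectrum relative to the neutral expectation, which is exactly the signal exploited in Figure 1.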

Lawrie and Petrov begin by discussing the strengths and weaknesses of the two existing approaches. Comparative analyses of genomic divergence are useful for identifying genomic regions under purifying selection, which will exhibit lower-than-expected levels of divergence among species. However, as Lawrie and Petrov note, it can be difficult to use comparative analyses to quantify the strength of selection in a region because even mild purifying selection can result in complete conservation among species within the region (Figure 2). For example, whether the population-scaled selective strength, 4Nes, in a region is 20 or 200, the same genomic signal will be observed: complete conservation.

Figure 2. Adapted from Lawrie and Petrov (2014). The evolution of several 100kb regions was simulated in 32 different mammalian species under varying strengths of selection |4Nes|. The number of substitutions in each region was then estimated using genomic evolutionary rate profiling (GERP). The plot shows the median, across regions, of the number of inferred substitutions. Once the strength of selection exceeds a weak threshold value (3 in this example), there is full conservation among species.

In contrast to comparative approaches, analyses of within-species polymorphisms based on the site frequency spectrum (SFS) within a region can be used to quantify the strength of selection more precisely. For example, Figure 1 shows that different selective strengths can produce very different site frequency spectra. Moreover, if the SFS can be estimated precisely enough, it can allow us to distinguish between two different selective strengths (e.g., 4Nes1 = 20 and 4Nes2 = 200) that would both lead to total conservation in a comparative study, and would therefore be indistinguishable there. The problem is that it takes a lot of polymorphisms to obtain an accurate estimate of the SFS, and a genomic region of interest may contain too few polymorphisms, especially if the region is under purifying selection, which decreases the apparent mutation rate. Sampling additional individuals from the same species may provide little additional information about the SFS because few novel polymorphisms may be observed in the additional sample. For example, recall that for a sample of n individuals from an idealized panmictic species, the expected number of novel polymorphisms observed in the (n+1)st sampled individual is proportional to 1/n (Watterson, 1975).
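
The diminishing 1/n return can be seen directly from Watterson's formula for the expected number of segregating sites; the sketch below uses an arbitrary theta:

```python
# Watterson (1975): the expected number of segregating sites in a
# sample of n haplotypes is E[S_n] = theta * sum_{i=1}^{n-1} 1/i,
# so the (n+1)-st haplotype adds only theta/n new polymorphisms
# in expectation.
def expected_segregating_sites(n, theta):
    return theta * sum(1.0 / i for i in range(1, n))

theta = 10.0
gain_at_10 = expected_segregating_sites(11, theta) - expected_segregating_sites(10, theta)
gain_at_100 = expected_segregating_sites(101, theta) - expected_segregating_sites(100, theta)
print(gain_at_10, gain_at_100)  # about 1.0 vs 0.1: sharply diminishing returns
```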

A proposed paradigm

Lawrie and Petrov demonstrate that studying polymorphisms by sampling many individuals across several related species (rather than sampling more individuals within a single species) could increase the observed number of polymorphisms in a region, and therefore, could increase the power to quantify the strength of selection (Figure 3) – as long as the selective forces in the genomic region are sufficiently similar across the different species.


Figure 3. The benefits of studying polymorphisms in many populations, rather than within a single population. Three populations (A, B, and C) diverge from an ancestral population, D. The genealogy of a single region is shown (slanted lines) with mutations in the region denoted by orange slashes. Additional lineages sampled in population A are likely to coalesce recently with other lineages (for example, the red clade in population A) and, therefore, carry few mutations that have not already been observed in the sample. In comparison, the same number of lineages sampled from a second population are likely to carry additional independent polymorphisms (for example, the red lineages in population B). If the selective pressures at the locus in populations A and B are similar, then the SFS in the two populations should be similar, and the additional lineages in B can provide additional information about the SFS. For example, if the demographic histories and selective pressures at the locus are identical in populations A and B, and if the samples from populations A and B are sufficiently diverged, then a sample of K lineages from each population, A and B, will contain double the number of independent polymorphisms that are observed in a sample of K lineages from population A alone, providing double the number of mutations that can be used to estimate the SFS.

The need for sampling depth and breadth

Without getting bogged down in the details, rare variants are often the most informative for quantifying the effects of purifying selection, so one still has to sample deeply within each species. Overall, however, sampling from additional species is a more efficient way of increasing the absolute number of variants available for estimating the SFS in a region than sampling more deeply within the same species.

The simulations and heuristic arguments presented by Lawrie and Petrov consider idealized cases for simplicity; however, the usefulness of approaches that consider polymorphisms across multiple species has been demonstrated in methods such as the McDonald-Kreitman test (McDonald and Kreitman, 1991), which have long been important tools for studying selection. More recent empirical applications of approaches that consider information about polymorphisms across multiple species appear to do a good job of quantifying selective pressures across genomes (Wilson et al., 2011; Gronau et al., 2013), even when species are closely related (De Maio et al., 2013). Overall, the simulations and arguments presented in Lawrie and Petrov’s paper provide useful guidelines for researchers interested in identifying and quantifying selective forces, and their recommendation to sample deeply within species and broadly across many species comes at a time when such analyses are becoming increasingly practical, given the recent availability of sequencing data from many species.


  1. De Maio, N., Schlötterer, C., and Kosiol, C. (2013). Linking great apes genome evolution across time scales using polymorphism-aware phylogenetic models. Molecular Biology and Evolution 30:2249-2262.
  2. Gronau, I., Arbiza, L., Mohammed, J., and Siepel, A. (2013). Inference of natural selection from interspersed genomic elements based on polymorphism and divergence. Molecular Biology and Evolution 30:1159-1171.
  3. Lawrie, D.S. and Petrov, D.A. (2014). Comparative population genomics: power and principles for the inference of functionality. Trends in Genetics 30:133-139.
  4. Watterson, G.A. (1975). On the number of segregating sites in genetical models without recombination. Theoretical Population Biology 7:256-276.
  5. Wilson, D.J., Hernandez, R.D., Andolfatto, P., and Przeworski, M. (2011). A population genetics-phylogenetics approach to inferring natural selection in coding sequences. PLoS Genetics 7:e1002395.

Paper author: David Lawrie was a graduate student in Dmitri Petrov’s lab. He is now a postdoc at USC.




Testing for selection in regulatory sequences using an empirical mutational distribution

How to detect selection?

Blog author Dave Yuan is a postdoc in Dmitri Petrov’s lab.

Detecting and quantifying selection in genomes is a fundamental task for evolutionary biologists. A common approach relies on comparing patterns of polymorphism and divergence between synonymous and non-synonymous sites. Synonymous sites are expected to be nearly neutral, so mutations at these sites are fixed or lost through genetic drift or draft. At non-synonymous sites, however, mutations may be fixed by positive selection or lost to purifying selection. If, in a specific gene, many non-synonymous mutations are fixed by positive selection, these sites as a group will show a high evolutionary rate. Conversely, if most non-synonymous mutations in a gene are lost to purifying selection, these sites will show a low evolutionary rate. Importantly, to determine whether a rate is high or low, we need a group of sites that can serve as a neutral comparison. For coding regions, synonymous sites are a natural choice [McDonald & Kreitman 1991; Keightley & Eyre-Walker 2007; Bustamante et al. 2001].
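
As a concrete, simplified illustration of this logic, the McDonald-Kreitman framework contrasts polymorphism and divergence counts at the two classes of sites; the neutrality index below is one standard summary (a sketch, not the full test), computed here with the counts reported in the original Adh study:

```python
# A simplified McDonald-Kreitman-style summary. Pn/Ps are counts of
# non-synonymous/synonymous polymorphisms; Dn/Ds are counts of
# non-synonymous/synonymous fixed differences between species.
# Neutrality index NI = (Pn/Ps) / (Dn/Ds); NI < 1 indicates an excess
# of non-synonymous divergence, consistent with positive selection.
def neutrality_index(Pn, Ps, Dn, Ds):
    return (Pn / Ps) / (Dn / Ds)

# Counts from the original Adh study (McDonald & Kreitman 1991):
# 2 replacement and 42 synonymous polymorphisms, 7 replacement and
# 17 synonymous fixed differences.
ni = neutrality_index(Pn=2, Ps=42, Dn=7, Ds=17)
print(round(ni, 3))  # well below 1
```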

What about non-coding sequences?

Much of the genome, however, is non-coding sequence. Such sequence may contain regulatory information critical for gene expression, the modification of which is important for phenotypic evolution. Detecting selection among regulatory variation is thus of interest to evolutionary biologists, but it has been challenging: functional annotation of non-coding DNA tends to be sparse, and we do not yet understand the “regulatory genetic code.” Although selection tests developed for coding sequence have been applied to non-coding sequence [reviewed in Zhen & Andolfatto 2012], a common impediment has been the choice of a group of sites that can serve as a neutral comparison. One solution is to generate a large number of mutations in a specific region of the genome and determine whether these mutations have functional impacts; the sites at which mutations show no apparent function can then serve as the neutral comparison for other groups of sites. In a recent paper published in Molecular Biology and Evolution, graduate students Justin Smith and Kimberly McManus and CEHG faculty Hunter Fraser describe the development and application of this novel method, which uses such a null distribution of mutations to test for selection among variation in mammalian regulatory elements.

Null distribution of random mutations

Mutagenesis technique used by Patwardhan et al. (2012) to generate a comprehensive collection of cis-regulatory element mutants and test their phenotypes in vivo (figures from Patwardhan et al., 2012)

Generating an empirical null distribution as the neutral comparison is not a trivial task. A sufficiently large, ideally comprehensive, set of mutations needs to be engineered into the regulatory element of interest, and the mutational effects or phenotypes need to be assessed. This distribution of phenotypes is then the null distribution against which the observed variation is compared to test for selection. Fortunately, recent developments in mutagenesis coupled with high-throughput sequencing have made this possible at high resolution. Smith et al. chose data from one such mutagenesis platform that generated over 640,000 mutant haplotypes across three mammalian enhancer sequences [Patwardhan et al. 2012]. Specifically, the library of mutant enhancers was made using polymerase cycling assembly (PCA) with oligonucleotides containing 2-3% degeneracy, so that all possible single-nucleotide variants of the wild-type enhancer were represented. The library of enhancers was then cloned into a plasmid upstream of a reporter gene along with unique identification tags. This plasmid library was sequenced to match each tag with its mutant enhancer and injected into mice for an in vivo reporter assay. Finally, sequencing of cDNA from the mouse liver quantified the transcriptional abundance of the tags and hence the phenotypic effects of the mutations. For each mutation, it was now clear whether it upregulated the reporter gene, downregulated it, or had no effect.

Developing a test to compare mutations and observed variation

With this dataset, Smith et al. had a comprehensive spectrum of random mutations and their phenotypic effects as the null distribution. This allowed them to create metrics for regulatory variation that are similar to the commonly used Ka/Ks ratio, with Ka being the rate of non-synonymous change and Ks the rate of synonymous change (no functional impact on the protein and hence neutral) [Kimura 1977]. The in vivo reporter assay revealed mutations with no phenotypic impact (i.e., no change in transcriptional abundance compared to wild-type), and these are analogous to synonymous or neutral changes. The new metrics are dubbed Ku/Kn and Kd/Kn, where Ku is the rate of change for up-regulatory mutations (those with increased expression in the in vivo reporter assay), Kd is the rate of change for down-regulatory mutations, and Kn is the rate of change for mutations that did not change expression (silent or neutral mutations).

Metrics to compare observed mutations in the phylogeny to possible mutations seen in the mutagenesis data (Figure 1 from Smith et al 2013).

For their analysis, the authors chose enhancer sequences from species within the same phylogenetic orders as the mutagenized enhancers. In addition to enhancer sequences from extant species, the authors also reconstructed ancestral sequences throughout the phylogeny. Combined with the mutagenesis data, each K metric at a node in the phylogeny is then calculated as the ratio of the observed (i.e., in ancestors and extant species) frequency of silent, up-, or down-regulatory polymorphisms to the frequency of all possible silent, up-, or down-regulatory mutations, respectively. Selection is inferred by comparing the ratio for up- or down-regulatory polymorphisms to the ratio for silent mutations (i.e., Ku/Kn or Kd/Kn). A comparatively low rate of up- or down-regulatory mutations (Ku/Kn or Kd/Kn < 1) suggests purifying selection on the polymorphisms, while a comparatively high rate (Ku/Kn or Kd/Kn > 1) suggests positive selection. Smith et al. applied their new test to the three enhancers from Patwardhan et al. (2012) across the respective phylogenetic orders: LTV1 in rodents and ALDOB and ECR11 in primates. They detected purifying selection against down-regulatory polymorphisms for all three enhancers, and positive selection for up-regulatory polymorphisms for LTV1.
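
The arithmetic behind the K metrics can be sketched with invented counts (all numbers below are hypothetical, chosen only to illustrate the calculation):

```python
# Sketch of the Ku/Kn and Kd/Kn style metrics with invented counts.
# K_x = (observed changes in category x) / (possible mutations in
# category x), where the categories come from the mutagenesis assay:
# 'up' (up-regulatory), 'down' (down-regulatory), 'neutral' (silent).
observed = {'up': 20, 'down': 3, 'neutral': 40}    # hypothetical
possible = {'up': 300, 'down': 450, 'neutral': 900}  # hypothetical

K = {cat: observed[cat] / possible[cat] for cat in observed}
ku_kn = K['up'] / K['neutral']    # > 1 would hint at positive selection
kd_kn = K['down'] / K['neutral']  # < 1 would hint at purifying selection
print(round(ku_kn, 2), round(kd_kn, 2))
```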

Detecting selection using an empirically-derived null distribution

Making evolutionary sense of variation in the regulatory regions of the genome remains more challenging than for coding sequences. We still do not have a “neutral model of regulatory evolution” against which to compare observed variation. Perhaps the most exciting element of this paper, at least for me, is the use of an empirically derived null distribution as the neutral expectation for this evolutionary inquiry. Patwardhan and the Shendure group at the University of Washington had earlier published a mutagenesis technique that generated a wide spectrum of mutants [Patwardhan et al. 2009]. At the time, I was getting interested in questions about the “grammar” of gene regulation, the functional characterization of regulatory sequences, and how to understand regulatory variation evolutionarily. It was thus very exciting to see both a massively comprehensive interrogation of the mutational consequences in a regulatory element and the clever application of these data to overcome a challenging evolutionary question.

One of the strengths of the Smith et al. study is its reliance on a spectrum of random mutations as the null distribution. As the original source of all genetic variation, mutations arise randomly. Those that are not lethal may persist by chance within a population and eventually reach appreciable frequencies, or even fixation, under selection. Because the null distribution used by Smith et al. comprises all possible mutations, it represents the mutation spectrum prior to the action of drift or selection. It is thus an even better neutral expectation than synonymous mutations, which may not be truly neutral. In addition, using such an empirical null distribution to test for selection is not limited to regulatory variation; it can also be applied to coding sequence variation to reduce bias and false signals. Furthermore, by categorizing mutational effects as up- or down-regulatory, different modes of selection acting on a regulatory element can be teased apart. The interspersion of mutations, whether silent, up-, or down-regulatory, across the regulatory element also reduces confounding effects of regional variation in mutation rate.

As with all science, there is more to hope for in the future. Towards the end of the paper, the authors discuss prospects for higher-resolution mutagenesis data and, perhaps more importantly in terms of accessibility and ease of use, for the ability to test for selection with more limited mutagenesis data. Tissue- and organism-specificity of mutational effects may also be further investigated, as well as the inclusion of mutation types other than single-nucleotide substitutions (e.g., insertions/deletions, copy number variation) and consideration of genomic regional context (e.g., effects of chromatin or epistasis). Nevertheless, this study represents an exciting new method for investigating regulatory variation in evolutionary contexts, one whose development and further application I look forward to seeing.


Bustamante CD, Wakeley J, Sawyer S, and Hartl DL. Directional Selection and the Site-Frequency Spectrum. Genetics 159:1779-1788 (2001).

Keightley PD and Eyre-Walker A. Joint inference of the distribution of fitness effects of deleterious mutations and population demography based on nucleotide polymorphism frequencies. Genetics 177:2251-2261 (2007).

Kimura M. Preponderance of synonymous changes as evidence for the neutral theory of molecular evolution. Nature 267:275-276 (1977).

McDonald JH and Kreitman M. Adaptive Protein Evolution at the Adh Locus in Drosophila. Nature 351:652-654 (1991).

Patwardhan RP, Lee C, Litvin O, Young DL, Pe’er D, and Shendure J. High-resolution analysis of DNA regulatory elements by synthetic saturation mutagenesis. Nature Biotechnology 27:1173-1175 (2009)

Patwardhan RP, Hiatt JB, Witten DM, Kim MJ, Smith RP, May D, Lee C, Andrie JM, Lee S-I, Cooper GM, et al. Massively parallel functional dissection of mammalian enhancers in vivo. Nature Biotechnology 30:265-270 (2012).

Smith JD, McManus KF, and Fraser HB. A Novel Test for Selection on cis-Regulatory Elements Reveals Positive and Negative Selection Acting on Mammalian Transcriptional Enhancers. Molecular Biology and Evolution 30:2509-2518 (2013).

Zhen Y and Andolfatto P. Methods to Detect Selection on Noncoding DNA in Evolutionary Genomics: Statistical and Computational Methods, Volume 2, Methods in Molecular Biology, vol. 856, edited by Anisimova M. Humana Press, New York (2012).

Paper author Justin Smith is a graduate student in Hunter Fraser’s lab.

A fast and accurate coalescent approximation

Blog author Suyash Shringarpure is a postdoc in Carlos Bustamante’s lab. Suyash is interested in statistical and computational problems involved in the analysis of biological data.

The coalescent model is a powerful tool in the population geneticist’s toolbox. It traces the history of a sample back to its most recent common ancestor (MRCA) by looking at coalescence events between pairs of lineages. Starting from assumptions of random mating, selective neutrality, and constant population size, the coalescent uses a simple stochastic process that allows us to study properties of genealogies, such as the time to the MRCA and the length of the genealogy, analytically and through efficient simulation. Extensions to the coalescent allow us to incorporate effects of mutation, recombination, selection and demographic events in the coalescent model. A short introduction to the coalescent model can be found here and a longer, more detailed introduction can be read here.
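
As a small illustration of the kind of quantity the coalescent lets us compute analytically, the expected time to the MRCA and the expected total genealogy length for a sample of n lineages follow directly from the exponential waiting times between coalescence events (time here is in coalescent units):

```python
# Under the standard coalescent, while i lineages remain, the waiting
# time to the next coalescence is exponential with mean 2 / (i*(i-1))
# (time in coalescent units). Summing these means gives two classic
# expectations.
def expected_tmrca(n):
    # E[T_MRCA] = sum_{i=2}^{n} 2/(i*(i-1)) = 2 * (1 - 1/n)
    return sum(2.0 / (i * (i - 1)) for i in range(2, n + 1))

def expected_tree_length(n):
    # E[L_n] = sum_{i=2}^{n} i * 2/(i*(i-1)) = 2 * sum_{i=1}^{n-1} 1/i
    return sum(i * 2.0 / (i * (i - 1)) for i in range(2, n + 1))

print(expected_tmrca(10))        # about 1.8, i.e. 2 * (1 - 1/10)
print(expected_tree_length(10))  # grows only logarithmically with n
```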

However, coalescent analyses can be slow or suffer from numerical instability, especially for large samples. In a study published earlier this year in Theoretical Population Biology, CEHG fellow Ethan Jewett and CEHG professor Noah Rosenberg proposed fast and accurate approximations to general coalescent formulas and procedures for applying such approximations. Their work also examined the asymptotic behavior of existing coalescent approximations analytically and empirically.

Computational challenges with the coalescent

For a given sample, there are many possible genealogical histories, i.e., tree topologies and branch lengths, which are consistent with the allelic states of the sample. Analyses involving the coalescent therefore often require us to condition on a specific genealogical property and then sum over all possible genealogies that display the property, weighted by the probability of the genealogy. A genealogical property that is often conditioned on is n_t, the number of ancestral lineages in the genealogy at a time t in the past. However, computing the distribution P(n_t) of n_t is computationally expensive for large samples and can suffer from numerical instability.

A general approximation procedure for formulas conditioning on n_t

Coalescent formulas conditioning on n_t typically involve sums of the form f(x)=\sum_{n_t} f(x|n_t) \cdot P(n_t)

For large samples and recent times, these computations have two drawbacks:

  1. The range of possible values for n_t may be quite large (especially if multiple populations are being analyzed), and a summation over these values may be computationally expensive.

  2. Expressions for P(n_t) are susceptible to round-off errors.

Slatkin (2000) proposed an approximation to the summation in f(x) by a single term f(x|E[n_t]). This deterministic approximation was based on the observation that n_t changes almost deterministically over time, even though it is a stochastic variable in theory. Thus we can write n_t \approx E[n_t]. From Figure 2 in the paper (reproduced here), we can see that this approximation is quite accurate. The authors prove the asymptotic accuracy of this approximation and also prove that under regularity assumptions, f(x|E[n_t]) converges to f(x) uniformly in the limits of t \rightarrow 0 and t \rightarrow \infty . This is an important result since it shows that the general procedure produces a good approximation for both very recent and very ancient history of the sample. Further, the paper shows how this method can be used to approximate quantities that depend on the trajectory of n_t over time, which can be used to calculate interesting quantities such as the expected number of segregating sites in a genealogy.
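
The flavor of this deterministic approximation can be sketched for a constant-size population, where E[n_t] approximately satisfies dn/dt = -n(n-1)/2 in coalescent time units. This is the general style of approximation discussed here, not the paper's exact procedure:

```python
# Deterministic approximation to the number of ancestral lineages in a
# constant-size population: in coalescent time units, lineages coalesce
# at rate n*(n-1)/2, so n_t approximately obeys dn/dt = -n*(n-1)/2.
# This ODE has the closed form n(t) = 1 / (1 - ((n0-1)/n0) * exp(-t/2)).
import math

def n_deterministic(n0, t):
    return 1.0 / (1.0 - ((n0 - 1) / n0) * math.exp(-t / 2.0))

def n_euler(n0, t, steps=100000):
    # Simple Euler integration of the same ODE, as a sanity check.
    n, dt = float(n0), t / steps
    for _ in range(steps):
        n -= n * (n - 1) / 2.0 * dt
    return n

t = 0.5
print(n_deterministic(100, t), n_euler(100, t))  # the two agree closely
```

Note how quickly the lineage count collapses from 100: this near-deterministic decay is why replacing the distribution of n_t with its expectation works so well.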

Approximating E[n_t] for single populations

A difficulty with using the deterministic approximation is that E[n_t] often has no closed-form formula, and if one exists, it is typically not easy to compute when the sample is large.

For a single population with changing size, two deterministic approximations have previously been developed (one by Slatkin and Rannala, 1997 and Volz et al., 2009, and one by Frost and Volz, 2010 and Maruvka et al., 2011). Using theoretical and empirical methods, the authors examine the asymptotic behavior and computational complexity of these approximations and of a Gaussian approximation due to Griffiths (1984). A summary of their results is given below.

Griffiths (1984): accurate for large samples and recent history.
Slatkin and Rannala (1997); Volz et al. (2009): accurate for recent history and arbitrary sample size, but inaccurate for very ancient history.
Frost and Volz (2010); Maruvka et al. (2011): accurate for both recent and ancient history and for arbitrary sample size.
Jewett and Rosenberg (2014): accurate for both recent and ancient history, for arbitrary sample size, and for multiple populations with migration.


Approximating E[n_t] for multiple populations

Existing approaches only work for single populations of changing size and cannot account for migration between multiple populations. Ethan and Noah extend the framework for single populations to allow multiple populations with migration. The result is a system of simultaneous differential equations, one for each population. While it does not allow for analytical solutions except in very special cases, the system can be easily solved numerically for any given demographic scenario.
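
A hypothetical sketch of such a coupled system for two populations follows. This is illustrative only, not the paper's exact equations; the parameter m is an assumed symmetric backwards-in-time migration rate:

```python
# Hypothetical sketch of coupled deterministic equations for expected
# lineage counts in two populations with migration (not the paper's
# exact system). Backwards in time, lineages in population i coalesce
# at rate n_i*(n_i-1)/(2*N_i) and hop to the other population at
# per-lineage rate m.
def step(n1, n2, N1, N2, m, dt):
    d1 = -n1 * (n1 - 1) / (2.0 * N1) - m * n1 + m * n2
    d2 = -n2 * (n2 - 1) / (2.0 * N2) - m * n2 + m * n1
    return n1 + d1 * dt, n2 + d2 * dt

# Euler-integrate the system for a demographic scenario with
# arbitrary, made-up parameter values.
n1, n2 = 50.0, 50.0
for _ in range(100000):
    n1, n2 = step(n1, n2, N1=1000.0, N2=2000.0, m=0.001, dt=0.01)
print(n1, n2)  # both lineage counts shrink over time
```

Because migration only moves lineages between populations, the total n1 + n2 declines monotonically through coalescence, exactly as in the single-population case.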

Significance of this work

The extension of the coalescent framework to multiple populations with migration is an important result for demographic inference. The extended framework allows efficient computation of demographically informative quantities, such as the expected number of private alleles in a sample and divergence times between populations.

Ethan and Noah describe a general procedure that can be used to approximate coalescent formulas that involve summing over distributions conditioned on n_t or the trajectory of n_t over time. This procedure is particularly accurate for studying very recent or very ancient genealogical history.

The analysis of existing approximations to E[n_t] show that different approximations have different asymptotic behavior and computational complexities. The choice of which approximation to use is therefore often a tradeoff between the computational complexity of the approximation and the likely behavior of the approximation in the parameter ranges of interest.

Future Directions

As increasingly large genomic samples from populations with complex demographic histories become available for study, exact methods either become intractable or very slow. This work adds to a growing set of approximations to the coalescent and its extensions, joining other methods such as conditional sampling distributions and the sequentially Markov coalescent. Ethan and Noah are already exploring applications of these approximate methods to reconciling gene trees with species trees. In the future, I expect that these and other approximations will be important for fast and accurate analysis of large genomic datasets.


[1] Jewett, E. M., & Rosenberg, N. A. (2014). Theory and applications of a deterministic approximation to the coalescent model. Theoretical Population Biology.

[2] Griffiths, R. C. (1984). Asymptotic line-of-descent distributions. Journal of Mathematical Biology 21(1):67-75.

[3] Frost, S. D., & Volz, E. M. (2010). Viral phylodynamics and the search for an ‘effective number of infections’. Philosophical Transactions of the Royal Society B: Biological Sciences 365(1548):1879-1890.

[4] Maruvka, Y. E., Shnerb, N. M., Bar-Yam, Y., & Wakeley, J. (2011). Recovering population parameters from a single gene genealogy: an unbiased estimator of the growth rate. Molecular Biology and Evolution 28(5):1617-1631.

[5] Slatkin, M., & Rannala, B. (1997). Estimating the age of alleles by use of intraallelic variability. American Journal of Human Genetics 60(2):447.

[6] Slatkin, M. (2000). Allele age and a test for selection on rare alleles. Philosophical Transactions of the Royal Society of London, Series B: Biological Sciences 355(1403):1663-1668.

[7] Volz, E. M., Pond, S. L. K., Ward, M. J., Brown, A. J. L., & Frost, S. D. (2009). Phylodynamics of infectious disease epidemics. Genetics 183(4):1421-1430.

Paper author Ethan Jewett is a PhD student in the lab of Noah Rosenberg.

Dissecting the dynamics of adaptation with experimental evolution

Open space within the black box

Blog author: Carlos Araya is a postdoctoral researcher in the laboratories of Michael Snyder and William Greenleaf interested in a broad range of areas, including molecular structure-function relationships, directed evolution, regulatory control, and cancer. Carlos holds a Ph.D. in Genome Sciences from the University of Washington.

We have come to appreciate the central importance of evolution in shaping the composition and dynamics of biological systems and processes. Yet, although evolution via natural selection is a simple concept, it has proven difficult to derive a quantitative theory of evolution. We sorely lack a detailed, quantitative understanding of the mechanisms and parameters that define genomic adaptation generally, not to mention how these features vary within the context of specific evolutionary pressures or combinations thereof, or how these features define evolutionary dynamics. These questions are not solely of academic interest. A contextualized, detailed knowledge of the evolutionary landscape surrounding specific biological systems may enable an array of novel solutions to issues such as antibiotic-resistance, epidemic management, cancer treatment, and genome engineering. Thus, there is much work ahead and plenty of space for breakthroughs in the field of evolutionary genomics.

In the November 2013 issue of PLoS Genetics, Daniel Kvitek and Gavin Sherlock (from our own Genetics Department) presented an elegant analysis of genomic adaptation in a constant environment1. Motivated by the hypothesis that signaling networks evolved to sense and respond to environmental fluctuations may also carry a fitness cost, Kvitek and Sherlock sought to refine the characterization of the landscape of adaptive mutations in S. cerevisiae cultures under continuous, nutrient-limited growth where specific, extant signaling networks may be of limited utility.

Messages from inside

The backdrop for the new study lies in a key experiment2 published in 2008 by the Sherlock lab, in which postdoc Katy Kao challenged classical models of adaptive evolution, which postulated that adaptive clones arose serially, always deriving from the preceding, dominant adaptive lineage. In replicate experimental evolutions, cultures were seeded with equal quantities of three nearly isogenic, fluorescently-tagged (GFP, YFP, and DsRed), haploid (S288c) S. cerevisiae strains and grown under glucose-limited conditions. By tracking the abundance of fluorescently-labeled lineages (sub-populations), Kao and Sherlock were able to record coarse dynamics of population structure during adaptive evolution2. These dynamics revealed clear signals of clonal interference, where competing lineages expanded and shrank, indicating that adaptive mutations were common enough that lineages with distinct driver mutations spread and competed simultaneously, as had been previously observed in viruses3 and bacteria4. Clonal interference, which has important consequences in evolution as it alters the rate of adaptation and the fate of adaptive lineages, was thus demonstrated in eukaryotes. It should be noted that clonal interference relies on large population sizes (>10^6), a regime that is relevant to many diseases. For example, a 1 cm^3 tumor has ~10^9 cells. At diagnosis, the malignant cell population in leukemia exceeds 10^12. Bacterial infections can have similar numbers of cells, and HIV infections can contain viral counts two orders of magnitude higher.

Array-based genotyping of FACS-sorted clones (N=5) revealed that adaptive mutations were concentrated on the glucose sensing and RAS pathways2, and a follow-up study5 revealed strong negative epistasis between two of the most common targets of adaptive mutations –MTH1 and HXT6/7– as variants combining mutations in the two displayed lower fitness than wild-type. Both gain-of-function mutations –most commonly amplifications– of the high-affinity glucose transporters HXT6/7 and loss-of-function mutations in MTH1 –a negative regulator of the transporters– provided a selective advantage by increasing the amount of glucose able to enter the cell, and both were recurrent in the evolving population but never arose in the same background. A deep rift in the fitness landscape, created by intermolecular epistasis –non-additive fitness interactions between genes– separated lineages within a preferred pathway of adaptation.

These results suggested rich mutation diversity and population dynamics would underlie the observed levels of clonal interference, which the authors hypothesized could be probed with population-level, whole-genome sequencing through the course of experimental evolution. However, given the complexity of evolved yeast populations, with population sizes of ~10^9 cells, population-level sequencing introduces a number of challenges for sensitive and accurate determination of mutation alleles and frequencies, respectively. For example, even at a high sequencing depth of 1,000x, a per-base sequencing error rate of 1%, as is standard in modern sequencing technologies, would generate ~10 reads with non-reference alleles by chance. Furthermore, only small (~1 ml) population samples had been stored in the freezer, and thus sensitive (low input) library preparation methods would be required. Lastly, the ability to detect adaptive mutations from short (≤100 nt) population sequences is, by and large, restricted to single nucleotide polymorphisms (SNPs) and short sequence insertion/deletions (indels), whose allele frequencies rise above sequencing error levels.
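The error arithmetic above is easy to verify. The sketch below is my own illustration, not code from the paper: it models sequencing errors at a single site as independent binomial draws, using the depth and error rate quoted in the text.

```python
import math

def expected_error_reads(depth, error_rate):
    """Expected number of reads showing a non-reference allele at one
    site through sequencing error alone."""
    return depth * error_rate

def p_at_least_k_errors(depth, error_rate, k):
    """Binomial tail P(X >= k): chance that error alone produces at
    least k non-reference reads at a site (assumes independent errors)."""
    return sum(
        math.comb(depth, i) * error_rate**i * (1 - error_rate) ** (depth - i)
        for i in range(k, depth + 1)
    )

# At 1,000x coverage and a 1% per-base error rate, error alone yields
# ~10 non-reference reads per site -- the same signal as a true 1% allele.
print(expected_error_reads(1000, 0.01))     # ~10 reads
print(p_at_least_k_errors(1000, 0.01, 10))  # ~0.55: common by chance alone
```

In other words, without reducing the effective error rate, a true 1% variant is statistically indistinguishable from background noise at any single site.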

Uncovering and monitoring adaptive SNPs in evolving populations

Towards these goals, the authors applied recently developed experimental techniques to sequence limited amounts of genomic DNA at decreased error rates, from the previously studied2, triplicate (haploid) S. cerevisiae cultures. For each experimental evolution (E1-3), the authors performed population-level sequencing at 8 timepoints, separated by ~70 generations, spanning ~450 generations of continuous growth. In total, 24 timepoints were sequenced. Concomitant fragmentation and adapter-tagging –so-called enzymatic ‘tagmentation’– with the Nextera Tn5 transposase reduced the number of requisite DNA clean-ups, limiting nucleic acid loss and allowing robust library preparation from limited input samples.

Population-sequencing reads were mapped to a custom reference genome, corresponding to the ancestral GSY1136 strain. A fixed-barcode, wild-type library spiked into each population library permitted lane-specific base-quality calibration and non-reference allele quantification. To determine SNPs from population-level sequences, allele counts at each position were compared to reference allele counts from wild-type libraries, and SNPs were called at sites with significant (multiple hypothesis- and FDR-corrected) enrichments in non-reference alleles. Positions with q-value < 0.01 were retained, and SNPs were heuristically filtered for further quality refinement. Barcoded adapters allowed PCR-duplicate identification and removal, and paired read-overlap correction methods –as pioneered in deep mutational scanning6,7– reduced error rates enough to support variant identification for mutant alleles present in as little as 1% of the population. All but one of the SNPs with a maximum allele frequency ≥10% were validated.
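A toy version of this calling scheme can make the logic concrete. The sketch below is my simplification, not the published pipeline: it uses a plain binomial test against the control-derived error rate plus Benjamini-Hochberg FDR control, with hypothetical function names and read counts, and omits the paper's heuristic filters.

```python
import math

def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): chance that background error
    at rate p produces at least k non-reference reads out of n."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def call_snps(sites, fdr=0.01):
    """Toy population SNP caller. Each site is a tuple
    (nonref_pop, depth_pop, nonref_wt, depth_wt): test whether the
    population's non-reference count exceeds the error rate estimated
    from the spiked-in wild-type control, then apply Benjamini-Hochberg."""
    pvals = []
    for nonref, depth, wt_nonref, wt_depth in sites:
        err = max(wt_nonref / wt_depth, 1e-6)  # background error from control
        pvals.append(binom_sf(nonref, depth, err))
    # Benjamini-Hochberg: call the largest prefix of sorted p-values
    # satisfying p_(i) <= (i / m) * fdr.
    order = sorted(range(len(pvals)), key=lambda i: pvals[i])
    m, cutoff = len(pvals), 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * fdr:
            cutoff = rank
    return {i for rank, i in enumerate(order, start=1) if rank <= cutoff}

# Site 0: 30/1000 non-reference vs ~1% control error -> likely real.
# Sites 1-2: counts consistent with sequencing error alone.
sites = [(30, 1000, 10, 1000), (12, 1000, 10, 1000), (9, 1000, 11, 1000)]
print(call_snps(sites))  # {0}: only the first site is called
```

The spiked-in wild-type library is what makes this work: it supplies a per-lane, per-site estimate of the error rate against which enrichment is judged.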

What lies within: clonal interference and mutation cohorts

Applying these methods uncovered 117 mutations in 51 genes, 19 of which (37%) were recurrently mutated, suggesting parallel evolution in distinct populations and lineages. Mutations reached fixation in only one of the three populations (E3). Assuming a DNA mutation rate of 10^-10 per base, per generation, these mutations represent ~0.002% of all de novo mutations arising across populations.
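As a quick back-of-the-envelope check (my arithmetic, not the paper's), the quoted percentage can be inverted to see how many de novo mutations it implies arose, and were almost all lost, across the experiment:

```python
def implied_total_mutations(detected, detected_fraction):
    """If `detected` mutations are `detected_fraction` of all de novo
    mutations, invert to get the implied total that arose."""
    return detected / detected_fraction

# 117 detected mutations at ~0.002% of the total implies millions of
# de novo mutations arose across the three populations.
total = implied_total_mutations(117, 0.002 / 100)
print(f"{total:.2e}")  # 5.85e+06
```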

The observed mutation dynamics reveal a striking prevalence of clonal interference, whereby beneficial mutations that rose in frequency often succumbed to alternate expanding lineages prior to fixing in the population (Fig. 1). This phenomenon strongly decouples the maximum and final allele frequencies in evolving populations, and introduces complex lineage dynamics. Indeed, 63% of the identified mutations decreased in frequency from their maxima, and 36% of identified mutations –which necessarily rose to ≥1% frequency for detection– became extinct. Thus, even within a continuous environment, the fate of competing mutations cannot be predicted, as diverse adaptive solutions continue to arise within heterogeneous genetic backgrounds. These results are consistent with previous observations of clonal interference in diverse evolving populations8-10 and point to the continued, and therefore combinatorial, introduction of adaptive –as well as neutral and mal-adaptive– mutations. Not skipping a beat, the authors rolled up their sleeves and set out to address combinatorial mutations.

Figure 1. Visualizing the dynamics and linkage of mutations with frequencies above 10% in the E1 (A), E2 (B), and E3 (C) evolutions.

Genotyping individual adaptive clones by Sanger sequencing (N ≈ 10^2 clones), Kvitek and Sherlock uncovered extensive linkage among mutations above 10% frequency. Over 90% of these mutations occur in cis with other mutations. To untangle beneficial and neutral mutations, the authors deemed recurrent, independent mutations beneficial –a reasonable assumption given that most mutations are deleterious7– and revealed that all defined lineages harbor at least one such beneficial mutation. In each evolution, the final, highest-frequency lineage harbored at least three beneficial mutations, indicating that lineage success frequently requires multiple beneficial mutations and is thus non-deterministic.

Naturally, the extent of clonal interference and mutation linkage is defined by population size, the rate of mutation, and the fractions of adaptive, neutral, and mal-adaptive mutations. The latter three parameters can vary strongly among selective environments, and even during lineage progression, as demonstrated by decelerating fitness gains during adaptation11. These data invite quantitative modeling to derive the underlying adaptive mutation rate that would support the observed frequency of adaptive mutations (≥10%), taking advantage of the known mutation rate, population size, and generation times. However, such analysis would necessitate modeling the underlying fitness distribution of mutations.

The prevalence of genetic hitchhiking, whereby lineages with multiple mutations rise and fall during evolution, implies that epistasis can play a major role in shaping population dynamics and evolutionary trajectories. Consistently, recent reports12-15 have suggested widespread epistasis in genome and protein evolution. Whereas a large fraction of protein intramolecular epistasis is accounted for by a robustness/protein stability axis16, fundamental questions regarding the source of intermolecular epistasis remain. We now know that the decelerating fitness gains observed during adaptation can arise from diminishing-returns epistasis11, whereby the sequential combination of beneficial mutations leads to gradually lower fitness gains within lineages as pathway optimization proceeds. Conversely, sudden and dramatic changes can occur in evolving populations when innovative beneficial mutations arise that can exploit new ecological niches. These innovations can in turn show all-or-none epistasis17 when an evolved genetic background is a prerequisite for the adaptive phenotype. However, it is presently unclear whether intermolecular epistasis signatures are primarily due to exchange costs between adaptations to diverse and fluctuating environments, or whether epistasis occurs primarily among adaptive mutations to the same selective pressures.

Uncovering adaptive strategies

Monitoring the rise, fall, and linkage of mutations revealed critical population dynamics at play in adaptation, but it does not speak to the specific cellular strategies conferred by adaptive mutations. Here, pathway analysis revealed that 53% of mutations in recurrently mutated genes lay within three major signaling pathways: glucose signaling and transport, cAMP/PKA, and the high-osmolarity glycerol (HOG) pathway. In addition, recurrent mutations were observed in sterol metabolism and cell-cycle genes (ACE2, WHI2). Tellingly, nonsense mutations –which truncate proteins– were 7.6x more frequent than expected by chance, providing evidence that disruption of signaling networks is a common and effective adaptive strategy for increasing fitness in non-fluctuating environments.

These results demonstrate that signaling networks impose a fitness cost on cellular systems, and raise the question: what specific efficiencies are gained –or rather, what inefficiencies are removed– by disrupting signaling systems? Do the (competitive) fitness defects arise from opportunity costs of delayed response, from metabolic costs, or a combination thereof? Are phosphorylation cascades, with their copious requirements for ATP, particularly sensitive to degradation under continuous evolution? Are upstream or downstream signaling components more susceptible to disruption, and how do feedback systems affect this balance? These results suggest that collaborations between the fields of experimental evolution and systems modeling may help (1) pinpoint the specific energy gains afforded by disrupting specific signaling components and systems (e.g., the MAPK pathway), (2) determine which signaling architectures are more robust or energy efficient, and (3) refine metabolic network circuitry. In addition, imbalances in metabolic flux may underlie a substantial portion of intermolecular epistatic interactions between adaptive mutations. Thus, coupling recent developments in cellular modeling18 with laboratory evolution may prove conducive to analyses at the intersection of evolution, systems biology, and epistasis.

Significance & Future Directions

The findings summarized here speak to the power of experimental evolution. Dissection of a single experimental evolution has uncovered clonal interference in a eukaryotic system, reciprocal sign epistasis, the widespread success of lineages with multiple beneficial mutations, and the fitness cost of signaling in a continuous environment. Striking features, such as clonal interference and the success of lineages with multiple beneficial mutations, speak to the frequency of adaptive mutations, and have now been observed in several systems. In an excellent study10 applying population-level sequencing to forty evolving cultures of S. cerevisiae grown in rich medium, these features were observed as nested sweeps –whereby one mutational sweep initiates before a previous one has completed– and mutation cohorts, temporal clusters of mutations on shared genetic backgrounds. With these features established, future work on the quantification of clonal interference, adaptive mutation rates, the prevalence of epistasis, and metabolic costs will allow refined mathematical models for predictive and diagnostic analysis of evolutionary processes.

Yet significant technical challenges remain. For example, we do not at present know how much of the variation in fitness can be accounted for by SNPs. Owing to the difficulty of assessing copy-number variants (CNVs) from short sequences, this important class of adaptive mutations has remained unstudied in population-level studies performed to date. Importantly, previous experiments19 have shown that amplifications of the high-affinity glucose transporters (HXT6/7) are frequent genomic adaptations to glucose limitation, and careful dissection of adaptive amplifications20 has hinted at novel mechanisms of DNA recombination21. Such limitations, however, may eventually be overcome by (long-awaited) long-read sequencing technologies and methods for comprehensive mutation tracking, serving both to enhance structural variant detection and to enable haplotype-resolved variant tracking. These developments would allow in-depth, genome-wide inquiry into the prevalence and role of epistasis in modulating the accessibility and reproducibility of adaptive mutations.

Similarly, increased sequencing depths coupled with reduced sequencing error rates will allow examination of increasingly lower-representation genotypes, enabling ever higher-resolution analysis of population dynamics. For example, recently developed methods22 that combine rolling-circle amplification and population sequencing to achieve error rates of ~10^-6 per base have now been applied to study poliovirus evolution at unprecedented scale23. The drastically reduced error rates permitted Acevedo et al.23 to detect mutations at frequencies two orders of magnitude below the reported mutation frequencies in poliovirus populations (~2 × 10^-4 per base). Combined with the increased coverage (~200,000x) afforded by the small (~7.5 kb) genome, the approach revealed a staggering diversity of mutations, most of which are present at very low frequencies in the population (10^-3–10^-5)23.
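A toy model shows why consensus-based approaches like circle sequencing cut error rates so dramatically. The sketch below is my simplification, not the published error model: it assumes independent errors, uniform over the three wrong bases, and a consensus that requires all copies of a template to agree. Even so, it reproduces the order of magnitude.

```python
def consensus_error(p, copies=3):
    """Toy circle-sequencing consensus: accept a base call only when all
    `copies` reads of the same circularized template agree. An error
    survives only if every copy errs AND all errors hit the same wrong
    base (probability 1/3 per additional copy, assuming errors are
    uniform over the three alternative bases)."""
    return p**copies * (1.0 / 3.0) ** (copies - 1)

raw_error = 0.01  # typical per-base error of short-read sequencing
print(consensus_error(raw_error))  # ~1.1e-07, far below the raw error rate
```

Requiring agreement among independent reads of the same molecule converts a per-read error rate p into roughly p cubed, which is why sub-10^-6 consensus error rates become attainable.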

The frequent observation of adaptive, loss-of-function mutations in multiple distinct pathways suggests that the specific selective pressure studied is permissive to large numbers of adaptive mutations. As gain-of-function mutations are infrequent relative to loss-of-function mutations, we can expect the adaptive mutation rate and dynamics of evolution in other environments to differ substantially. Future applications of whole-genome, whole-population sequencing approaches with increased read-length and fidelity will provide a fruitful avenue to reveal ever more intricate mechanisms of adaptive response.

Finally, methods from the –currently human-biased– genome interpretation field are likely to provide richer analyses of the functional roles of adaptive mutations. Phenotype ontologies, mutation prioritization tools, and improved methods for assessing the impact of coding mutations (beyond first-generation programs such as SIFT and PolyPhen) and regulatory sequence mutations are as relevant to experimental evolution as to human diagnostics. In turn, we can expect findings from experimental evolution to help establish a framework for understanding the dynamics of aberrant cancer genomes, antibiotic resistance, and immune evasion. Learning to detect signatures of selection and to distinguish modes of population dynamics within these genomes may prove paramount to treatment.

Paper author: Gavin Sherlock is associate professor in the Genetics Department.

Paper author: Dan Kvitek completed his Ph.D. in 2013 in the laboratory of Gavin Sherlock. Dan now combines experimental and computational research to diagnose genetic variants at Invitae.


1. Kvitek, D. J. & Sherlock, G. Whole genome, whole population sequencing reveals that loss of signaling networks is the major adaptive strategy in a constant environment. PLoS Genet 9, e1003972 (2013).

2. Kao, K. C. & Sherlock, G. Molecular characterization of clonal interference during adaptive evolution in asexual populations of Saccharomyces cerevisiae. Nat Genet 40, 1499–1504 (2008).

3. Miralles, R., Gerrish, P. J., Moya, A. & Elena, S. F. Clonal interference and the evolution of RNA viruses. Science 285, 1745–1747 (1999).

4. Perfeito, L., Fernandes, L., Mota, C. & Gordo, I. Adaptive mutations in bacteria: high rate and small effects. Science 317, 813–815 (2007).

5. Kvitek, D. J. & Sherlock, G. Reciprocal sign epistasis between frequently experimentally evolved adaptive mutations causes a rugged fitness landscape. PLoS Genet 7, e1002056 (2011).

6. Fowler, D. M. et al. High-resolution mapping of protein sequence-function relationships. Nat Methods 7, 741–746 (2010).

7. Araya, C. L. et al. A fundamental protein property, thermodynamic stability, revealed solely from large-scale measurements of protein function. Proceedings of the National Academy of Sciences (2012). doi:10.1073/pnas.1209751109

8. de Visser, J. A. G. M. & Rozen, D. E. Clonal interference and the periodic selection of new beneficial mutations in Escherichia coli. Genetics 172, 2093–2100 (2006).

9. Toprak, E. et al. Evolutionary paths to antibiotic resistance under dynamically sustained drug selection. Nat Genet 44, 101–105 (2011).

10. Lang, G. I. et al. Pervasive genetic hitchhiking and clonal interference in forty evolving yeast populations. Nature 500, 571–574 (2013).

11. Chou, H.-H., Chiu, H.-C., Delaney, N. F., Segrè, D. & Marx, C. J. Diminishing returns epistasis among beneficial mutations decelerates adaptation. Science 332, 1190–1192 (2011).

12. Corbett-Detig, R. B., Zhou, J., Clark, A. G., Hartl, D. L. & Ayroles, J. F. Genetic incompatibilities are widespread within species. Nature 504, 135–137 (2013).

13. Breen, M. S., Kemena, C., Vlasov, P. K., Notredame, C. & Kondrashov, F. A. Epistasis as the primary factor in molecular evolution. Nature (2012). doi:10.1038/nature11510

14. Natarajan, C. et al. Epistasis among adaptive mutations in deer mouse hemoglobin. Science 340, 1324–1327 (2013).

15. Weinreich, D. M., Delaney, N. F., Depristo, M. A. & Hartl, D. L. Darwinian evolution can follow only very few mutational paths to fitter proteins. Science 312, 111–114 (2006).

16. Bershtein, S., Segal, M., Bekerman, R., Tokuriki, N. & Tawfik, D. S. Robustness-epistasis link shapes the fitness landscape of a randomly drifting protein. Nature 444, 929–932 (2006).

17. Meyer, J. R. et al. Repeatability and Contingency in the Evolution of a Key Innovation in Phage Lambda. Science 335, 428–432 (2012).

18. Karr, J. R. et al. A whole-cell computational model predicts phenotype from genotype. Cell 150, 389–401 (2012).

19. Gresham, D. et al. The repertoire and dynamics of evolutionary adaptations to controlled nutrient-limited environments in yeast. PLoS Genet 4, e1000303 (2008).

20. Araya, C. L., Payen, C., Dunham, M. J. & Fields, S. Whole-genome sequencing of a laboratory-evolved yeast strain. BMC Genomics 11, 88 (2010).

21. Brewer, B. J., Payen, C., Raghuraman, M. K. & Dunham, M. J. Origin-dependent inverted-repeat amplification: a replication-based model for generating palindromic amplicons. PLoS Genet 7, e1002016 (2011).

22. Lou, D. I. et al. High-throughput DNA sequencing errors are reduced by orders of magnitude using circle sequencing. Proceedings of the National Academy of Sciences 110, 19872–19877 (2013).

23. Acevedo, A., Brodsky, L. & Andino, R. Mutational and fitness landscapes of an RNA virus revealed through population sequencing. Nature 505, 686–690 (2014).

Learning from 69 sequenced Y chromosomes

Why the Y?

Blog author Amy Goldberg is a graduate student in Noah Rosenberg’s lab.

While mitochondria have been extensively sequenced for decades because of their short length and abundance, the Y chromosome has been under-studied. Unlike autosomal DNA, the mitochondria and (most of) the Y chromosome are inherited exclusively maternally and paternally, respectively. Therefore, they do not undergo meiotic recombination. Without recombination, mutations accumulate on a stable background, preserving a wealth of information about population history. Each background, shared through a common ancestor, is called a haplogroup. To leverage this information, Poznik et al. set out to sequence 69 males from nine diverse human populations, including a large representation of African individuals. The paper, published in Science last summer, is by Stanford graduate student David Poznik and a group led by CEHG professor Dr. Carlos Bustamante.

The structure of the Y chromosome is complex, with large heterochromatic regions, pseudo-autosomal regions that recombine with the X chromosome, and repetitive elements, making read mapping difficult. But the Y chromosome is haploid, allowing accurate variant calls at lower coverage than the autosomes, which have heterozygous sites to resolve. Using high-throughput sequencing (3.1x mean coverage) and a haploid expectation-maximization algorithm, Poznik et al. called genotypes with an error rate around 0.1%. The paper developed important methods for analyzing high-throughput sequences of the difficult Y chromosome, including determining the subset of regions within which accurate genotypes can be called.
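The advantage of haploidy for low-coverage calling is easy to illustrate. The sketch below is a simplified, fixed-error-rate version of likelihood-based haploid calling; the paper's EM algorithm additionally estimates its parameters from the data, which is not modeled here.

```python
import math

def call_haploid(counts, err=0.001):
    """Maximum-likelihood haploid genotype from per-allele read counts.
    Each read supports the true allele with probability (1 - err) and
    each of the three wrong alleles with probability err/3. With no
    heterozygotes to distinguish, even ~3x coverage gives a clear call."""
    def loglik(allele):
        return sum(
            n * (math.log(1 - err) if base == allele else math.log(err / 3))
            for base, n in counts.items()
        )
    return max(counts, key=loglik)

# Three reads supporting 'T' and one discordant read -- plausible at
# 3.1x mean coverage -- still yield an unambiguous haploid call.
print(call_haploid({"T": 3, "C": 1}))  # T
```

On a diploid autosome the same four reads would be ambiguous, since a 3:1 split is also consistent with a heterozygote; haploidy removes that hypothesis entirely.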

Reconstructing the human Y-chromosome tree

Poznik et al. constructed a phylogenetic tree of the Y chromosome using sequence data and a maximum likelihood approach.  While the overall structure of the tree was known, Poznik et al. were able to accurately calculate branch lengths based on the number of variants differing between individuals and resolve previously indeterminate finer structure.

Figure 2 of the paper: Y-chromosome phylogeny inferred from genomic sequencing.

Incredible African Diversity: One of the key findings of the paper was the depth of diversity within African lineages. While both uniparental and autosomal markers have indicated an African root for human diversity, Poznik et al. find lineages within a single population, the San hunter-gatherers, that coalesce almost at the same time as the entire tree (see haplogroup A). This indicates that African diversity and structure have existed for tens of thousands of years, and there is likely more to discover. A large sample of African populations was considered, which revealed previously unseen structure within haplogroup B2, including structure not mirrored by modern population clustering, that dates to approximately 35,000 years ago.

Evidence of population expansion: Short internal branches of the tree, such as those seen within haplogroup E and the non-African group FT, indicate periods of rapid population growth. When a population expands quickly, new variants that might otherwise drift to extinction can persist. A large number of coalescence events occur at the time of growth, as there were fewer lineages alive in the population before this time. For non-African haplogroups, this pattern is likely a remnant of the Out of Africa migration. For haplogroup E, this corresponds to the Bantu agricultural expansion.

Resolved Eurasian polytomy: Previously, the topology of the Eurasian tree separating haplogroups G-H-IJK was unresolved.  Because of the higher coverage sequencing for this study, Poznik et al. found a single variant, a C to T transition, that differentiates G from the other groups.  Haplogroup G retains the ancestral variant, while H-IJK share the derived variant and are therefore more closely related to each other.

Sequencing vs. genotyping

In contrast to previous studies, which analyzed small repetitive elements called microsatellites or small sets of single base-pair changes called SNPs, whole-genome sequencing data contains not only more information, but potentially more accurate information.  In particular, before the advent of high-throughput sequencing, SNPs were usually ascertained in a subset of individuals that did not capture worldwide diversity levels.  Therefore, diversity measures are often underestimated and biased.  Without sequence data, the branch lengths of the tree did not have a meaningful interpretation, and the depth of variation within Africa was not seen.
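The ascertainment bias described here can be quantified with a one-line calculation (my illustration, with a hypothetical panel size): a variant makes it onto a genotyping array only if the small discovery panel happens to contain both alleles.

```python
def discovery_prob(freq, panel_size):
    """Probability that a variant with population frequency `freq`
    appears polymorphic in a discovery panel of `panel_size` sampled
    chromosomes (i.e., the panel contains both alleles at least once)."""
    return 1.0 - (1.0 - freq) ** panel_size - freq**panel_size

# A small (hypothetical) 4-chromosome discovery panel finds common SNPs
# but misses most rare ones, deflating downstream diversity estimates.
print(round(discovery_prob(0.50, 4), 3))  # 0.875 -- common SNPs usually found
print(round(discovery_prob(0.02, 4), 3))  # 0.078 -- rare SNPs mostly missed
```

Because rare variants are disproportionately missed at discovery, arrays built this way systematically understate diversity in populations not represented in the panel, which is exactly the bias whole-genome sequencing avoids.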

MRCA of Human Maternal and Paternal Lineages

There was a lot of public discussion spurred by the publication of Poznik’s paper last year.  The discussion mainly focused on their result that, contrary to previous estimates, the most recent common ancestor (MRCA) of all mitochondrial DNA lived at a similar time as that of all Y chromosomes.  Previous estimates put the mitochondrial TMRCA around 200 thousand years ago, with the Y chromosome coalescing a bit over 100 thousand years ago.  These different estimates for Y and mitochondria were often obtained through different sequencing and analysis methods, and are therefore less comparable.  In particular, varying estimates of the mutation rates have led to different TMRCA estimates.  By analyzing both the Y and mitochondria in the same framework, calibrated by archeological evidence and within-species comparisons, Poznik et al. found largely overlapping confidence intervals for the TMRCA of both Y and mitochondria.

But should the coalescence times of the mitochondria and the Y chromosome be the same? Not necessarily. While discrepancies between the mitochondria and Y chromosome have often been interpreted as evidence of sex-biased population histories or sizes, strictly neutral models can predict large differences between the two as well. Because neither the analyzed part of the Y chromosome nor the mitochondria undergo recombination, each acts as a single locus – and therefore represents the history of a single lineage. For a given population history, there is a wide distribution of times at which sampled lineages coalesce, and these two loci represent only two draws from it, with largely independent histories; they may therefore differ by chance alone. Similarly, different loci across autosomal DNA have TMRCAs ranging from thousands to millions of years. Additionally, as single loci, any effects of selection would distort the entire genealogy of the Y chromosome and mitochondria.
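The chance variation between single-locus TMRCAs is easy to see by simulation. The sketch below uses the standard neutral coalescent with illustrative sample and population sizes of my own choosing, not parameters from the paper:

```python
import random

def sim_tmrca(n_lineages, pop_size, rng):
    """Simulate the time to the MRCA of n_lineages under the neutral
    coalescent: while k lineages remain, the next coalescence occurs
    after an exponential wait with mean pop_size / (k choose 2)
    generations."""
    t, k = 0.0, n_lineages
    while k > 1:
        rate = k * (k - 1) / 2 / pop_size
        t += rng.expovariate(rate)
        k -= 1
    return t

rng = random.Random(42)
# Each non-recombining locus (think Y or mtDNA) is a single draw from
# a wide TMRCA distribution, even under one shared population history.
draws = [sim_tmrca(20, 10_000, rng) for _ in range(1000)]
mean_t = sum(draws) / len(draws)
print(round(mean_t))  # near the expectation 2N(1 - 1/n) = 19,000 generations
print(round(min(draws)), round(max(draws)))  # single draws vary severalfold
```

The spread between the minimum and maximum draws makes the point in the text concrete: two loci with identical demographic histories can still yield very different TMRCAs purely by chance.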

Future directions

Human population history is far from fully fleshed out, and Poznik et al. provide a framework to leverage increasingly available high-throughput sequencing of Y chromosomes.  The method used to calculate the mutation rate and TMRCA is a valuable contribution in itself, with applications to a wide range of evolutionary and ecological questions.  This study demonstrated that we have only characterized a fraction of worldwide diversity, particularly in Africa, and that increased sampling will be critical to parsing close and far ties in human history.


Poznik GD, Henn BM, Yee MC, Sliwerska E, Euskirchen GM, Lin AA, Snyder M, Quintana-Murci L, Kidd JM, Underhill PA, Bustamante CD. Sequencing Y chromosomes resolves discrepancy in time to common ancestor of males versus females. Science. 2013 Aug 2;341(6145):562-5. doi: 10.1126/science.1237619.

Paper author David Poznik is a PhD student in Carlos Bustamante’s lab.