# A Snapshot of #bigdatamed tweets and Facebook Posts

As you can see from the curated Twitter and Facebook feed included at the link above, Stanford’s Center for Computational, Evolutionary and Human Genomics (CEHG) had a strong presence at this year’s Big Data Conference. In addition to the numerous CEHG community members in the audience (and livestreaming the event), we also had CEHG faculty members Russ Altman, Euan Ashley, Carlos Bustamante, Mildred Cho, Hank Greely, Susan Holmes, and Julia Salzman on the stage, serving as session moderators and featured speakers

# BAPGXII Saturday May 30, 2015

Stanford is hosting the 12th Bay Area Population Genomics (BAPG) meeting. The Bay Area Population Genomics meeting is a great place to (re)connect with your pop gen/genomics colleagues in the area and to present your work in a talk or a poster.

BAPGXI, held in December at UC Davis, was a great event with over 100 participants and a line up of excellent talks. Thanks to the Coop lab! You can read more here, including the storified tweets. We are excited to continue this success at Stanford!

Logistics

The meeting will take place on May 30th on the Stanford campus in the Alway building, room M106. We start at 8:30AM with breakfast and registration, Dr. Dmitri Petrov’s opening remarks will begin at 9:25am, and the first talk will be at 9:30am. The last talk (Dr. Jonathan Pritchard’s keynote) ends at 2:10pm, followed by a poster session with amazing wine, beer, and cheese! Here is a general outline of the agenda, to help you plan your day:

Breakfast and Registration in Alway Courtyard 8:30-9:25am (pick up your BAPGXII gift!)
Opening Remarks 9:25-9:30am
Talk Session 1 9:30-10:30am (20 mins per talk)
Coffee Break in Courtyard 10:30-11am
Talk Session 2 11am-12pm (20 mins per talk)
Lunch in Courtyard 12-1pm
Talk Session 3 and Keynote 1-2:10pm (2 20 min talks and 1 30 min talk)
Poster Session with Wine, Beer, and Cheese Reception at 2:10pm, ends at 3pm

Talks and Posters

Sorry. Speaker and poster slots are now full. No longer accepting sign-ups.

How to Attend BAPGXII

1. Please register here by 10am Friday, May 29th to join us at BAPGXII. Registration is free and open to all, but required.

2. Encourage your colleagues to sign up! Forward this email to your lab mailing list and watch for updates on the CEHG Facebook page and on Twitter @StanfordCEHG. Help us get the momentum going by tweeting us using #BAPGXII.

3. And finally, once you’ve signed up, all you need to do is get up early and ride-share, VTA/Caltrain or bike to our beautiful campus on May 30th. Come for the science, stay for the social! Use the Stanford campus map and this Google Map to find the Alway Building, located at 500 Pasteur Drive, Stanford, CA. Be green and consider ride-sharing: there is a dedicated tab for making travel plans in the sign up doc!

We hope to see you at Stanford!

The BAPGXII organizing committee: Bridget Algee-Hewitt (@BridgetAH), David Enard (@DavidEnard), Katie Kanagawa (@KatieKanagawa), Alison Nguyen, Dmitri Petrov (@PetrovADmitri), Susanne Tilk, and Elena Yujuico. If you have any questions, feel free to contact Bridget Algee-Hewitt at bridgeta@stanford.edu.

# Imagining Phylogenetics and Recombination as Art

Daniel Friedman is a 1st year Ph.D. student in the Ecology and Evolution program. Working from the Gordon lab, he mainly studies the evolution of collective behavior in ants. Other research interests include fractals, burritos, and metaphors. Contact: dfri@stanford.edu.

1. Phylogenetics
Ever since “I Think…”, the idea of a bifurcating tree of species relations has guided evolutionary biology. This piece of paper with ink on it plays with the idea of an “evolutionary I”, styled as an evolving Eye. Whether we perform molecular studies on the ontogenetic role of Pax6, or psychophysical explorations into the Self, we are confronted with questions of homology and convergence. Our time-reversible phylogenetic algorithms, so designed for computational simplicity, only contribute to this problem of post hoc ergo propter hoc – “after this, therefore because of this.” The Modern Synthesis was clearly a rEvolutionary moment – now are we ready for a Post-Modern Synthesis?
2. Recombination
DNA recombination is key to many biological processes. Recombination between homologous chromosomes during meiosis creates novel combinations of alleles, and to many, is the teleological “Why?” of Sex. But the reach of recombination goes far, far beyond Sex. Recombination between alleles of the same locus allows a kaleidoscope of DNA error-correcting mechanisms to proceed. And over evolutionary time scales, “errors” in recombination provide large structural creativities in the genome, such as duplication, deletions, and inversions. Recombination during immune cell maturation allows the human body to recognize an essentially infinite cohort of potential invaders. And now that recombination has been mechanistically deconvoluted, derived technologies facilitate guided DNA editing in vitro and in vivo . Recombination is molecular innovation embodied, a topological whirligig, and the workhorse of the genome.

# Fast Algorithm Infers Relationships in High-Dimensional Datasets

Post author Henry Li is a graduate student in the Wong Lab.

New research harnesses the powers of single value decomposition (SVD) and sparse learning to tackle the problem of inferring relationships between predictors and responses in large-scale, high-dimensional datasets.

Addressing problems in computation speed, assumptions of scarcity, and algorithm sensitivity

One major challenge that statisticians face when inferring relationships is that modern data is big and the underlying true relationships between predictors and responses are sparse and multilayered. To quickly establish connections in these datasets, Ma et al. utilize a combination of SVD and sparse learning, called thresholding SVD (T-SVD). This new algorithm solves many issues that plagued the Statistics and Big Data communities, such as the problems of computation speed, the assumption of sparcity, and the sensitivity of the algorithm to positive results. In their simulation study, T-SVD is shown to be better in relation to speed and sensitivity than existing methods such as the sequential extraction algorithm (SEA) and the iterative exclusive extraction algorithm (IEEA). As a result, the multilayered relationships between predictors and responses, which come in the form of multidimensional matrices, can be learned quickly and accurately.

Uncovering new regulatory networks

Demonstrating the application of T-SVD, Ma et al. showed that new biological insights can be gained from using T-SVD to analyze datasets from The Cancer Genome Atlas consortium. The authors focused on the ovarian cancer gene expression datasets, in which the sample size is much smaller than the number of regulators and responses measured in the study. As in a typical genomic experiment, tens of thousands of genes were probed for their expression levels; from pathway studies, we know that very few of these genes form control switches that govern the expression levels for the rest of the genome. Ma et al. inferred two different relationships, based on microRNA (miRNA) or long noncoding RNA (lncRNA). The authors showed that these regulatory relationships specifically match established cancer pathways very well. Geneticists now have two new regulatory networks to mine for understanding the roles of miRNAs and lncRNAs.

In short, T-SVD is an exciting algorithm that pushes the Statistics field forward by offering a new lens to look at large-scale multidimensional datasets. With this approach, statisticians and users of statistics, like geneticists, can gain new insights into existing datasets and tackle new research problems.

References

Ma, Xin, Luo Xiao, Wing Hung Wong. Learning regulatory programs by threshold SVD regression. Proc Natl Acad SCI USA. 2014 Nov 4; 111 (44). DOI 10.1073/pnas.1417808111

Paper author, Xin (Maria) Ma is a research associate in the Wong Lab.

# Afterword: CEHG Genetics and Society Symposium 2015

Founded in 2012, CEHG is a research program that fosters interdisciplinary research. Home to more than 25 faculty and more than 200 grads and postdocs, CEHG bridges the divides between various member labs across Stanford campus.

The 2015 CEHG Genetics and Society Symposium (GSS15), which took place on April 13th and 14th in Stanford’s Paul Brest Hall, was a smashing success. It featured 25 speakers from Stanford campus and the San Francisco Bay academic and scientific industry communities. Approximately 175 Stanford affiliates and non-affiliates came together to celebrate the Center’s spirit of interdisciplinary collaboration and meet with experts in the fields of computational, evolutionary and human genomics This is a significant increase from last year’s 150 attendees!

The Mission:

The Genetics and Society Symposium is integral to CEHG’s mission: it provides postdocs and graduate fellows with the opportunity to share their developing research with faculty advisors and their colleagues, encourages conversation between faculty working in diverse scientific disciplines across campus, and introduces CEHG members to speakers from around the Bay Area and beyond (and vice versa).

The Venue:

As you can see from our photos of the space and catering service, Paul Brest Hall was the perfect home for this year’s two-day symposium. The hall was spacious, the food delicious, the staff hands on, and the outdoor picnic area well suited for our lunch and coffee breaks. We enjoyed the venue so much, in fact, that CEHG staff are currently in the process of booking the space for next year!

The Speakers:

GSS15 featured four brilliant keynote speakers, each distinguished in his/her field of research.

Gene Myers and CEHG Exec Committee members Marc Feldman, Chiara Sabatti, and Carlos Bustamante

Founding director of a new Systems Biology Center at the Max-Planck Institute of Molecular Cell Biology and Genetics, Dr. Eugene (Gene) Myers presented his open-sourced research on the resurrection of de novo DNA sequencing. Best known for the development of BLAST, the most widely used tool in bioinformatics and the assembler he developed at Celera that delivered the fly, human, and mouse genomes in a three-year period, Dr. Myers participated in GSS15, courtesy of DNAnexus. Follow his blog: https://github.com/thegenemyers.

Co-founding director Carlos Bustamante and Ed Green catch up during a break at GSS15.

Assistant Professor in Biomolecular Engineering at the University of California, Santa Cruz, Richard (Ed) Green presented his research on a novel approach for highly contiguous genome assemblies, which draws on his work as an NSF Fellow at the Max Planck Institute in Leipzig, Germany and head of an analysis consortium responsible for publishing the draft genome sequence of Neanderthal. Click here for his 2014 CARTA talk, “The Genetics of Humanness: The Neanderthal and Denisovan Genomes.

Dr. Michelle Mello, Stanford Law School and School of Medicine

Michelle Mello, Professor of Law at Stanford Law School and Professor of Health Research and Policy in Stanford’s School of Medicine, presented findings from her extensive research on the ethics of data sharing. As the author of more than 140 articles and book chapters on the medical malpractice system, medical errors and patient safety, public health law, research ethics, the obesity epidemic, and pharmaceuticals, Dr. Mello provided a valuable perspective from the intersections of law, ethics, and health policy. Click here to read Dr. Mello’s SLS profile.

Dr. Ami Bhatt, Stanford Medicine

Ami Bhatt shared her passion for improving outcomes for patients with hematological malignancies in her talk, “Bugs, drugs, and cancer.” Best known for her recent work demonstrating the discovery of a novel bacterium using sequence-based analysis of a diseased human tissue, her research has been presented nationally and internationally and published in 2013 in the New England Journal of Medicine. Click here for links to Dr. Bhatt’s CAP profile and lab homepage.

We had a large group of CEHG faculty members at this year’s event, showcasing the cutting edge research being done in CEHG labs across Stanford campus and indicating considerable faculty commitment to ensuring the Center’s continuing success.

Our symposium would not be complete without our invited CEHG Fellows. These speakers were nominated by organizing committee members to present on topics relating to their CEHG-funded research projects. These young scholars embody CEHG’s continuing commitment to provide funding support to researchers as they transition from graduate studies to postdoctoral scholarships.

The Workshop:

There was standing room only when facilitators Chiara Sabatti (Associate Professor of Health Research and Policy at Stanford), Ken Lange (Chair of the Human Genetics Department at UCLA), and Suyash Shringarpure (postdoctoral scholar in Stanford’s Bustamante Lab) presented their approaches to contemporary problems in statistical genetics!

Social Media:

Did you know? CEHG is on social media!

GSS15 social media moderators, Bridget Algee-Hewitt, Jeremy Hsu, Katie Kanagawa, and Rajiv McCoy were posting live throughout both days of the event. And our efforts to reach the larger community paid off, with a total reach of 815 on Facebook and more than 7,000 impressions on Twitter!

To catch up on our GSS15 coverage, check out our Facebook page at https://www.facebook.com/StanfordCEHG?ref=hl and our Twitter feed @StanfordCEHG. Follow both to make sure you are the first to know when we post CEHG-related news and announcements.

Want to know when speaker videos from the symposium will be available on CEHG’s forthcoming youtube channel? Follow us on Facebook and Twitter!

Special Thanks:

From left to right: Bridget Algee-Hewitt, Cody Sam, Yang Li, Anand Bhaskar, and Katie Kanagawa

The GSS15 organizing committee—including Bridget Algee-Hewitt, Anand Bhaskar, Katie Kanagawa, Yang Li, and Cody Sam—would like to take this opportunity to thank CEHG Directors Carlos Bustamante and Marc Feldman, Executive Committee members Hank Greely, Dmitri Petrov, Noah Rosenberg, and Chiara Sabatti, event volunteers Alex Adams, Maude David, and Chris Gignoux, event photographer Deneb Semprum, and everyone who attended this year’s symposium.

We hope you enjoyed attending as much as we enjoyed working behind-the-scenes. We hope to see you all again at GSS16! If you are interested in volunteering for future CEHG events, please contact us at stanfordcehg@stanford.edu.

Upcoming CEHG events:

Don’t miss our popular weekly Evolgenome seminar series, which will continue through Spring term, usually on Wednesdays at noon (location varies). Lunch is always provided. Details will follow, but here is a quick overview so you can mark your calendars!

April 29: Fernando Racimo (Nielsen/Slatkin Lab)
May 6: Pleuni Pennings (UCSF)
May 20: Kelly Harkin
June 3: Sandeep Ventakaram (Petrov Lab)
June 10: Emilia Huerta-Sanchez

# A fast and accurate coalescent approximation

Blog author Suyash Shringarpure is a postdoc in Carlos Bustamante’s lab. Suyash is interested in statistical and computational problems involved in the analysis of biological data.

The coalescent model is a powerful tool in the population geneticist’s toolbox. It traces the history of a sample back to its most recent common ancestor (MRCA) by looking at coalescence events between pairs of lineages. Starting from assumptions of random mating, selective neutrality, and constant population size, the coalescent uses a simple stochastic process that allows us to study properties of genealogies, such as the time to the MRCA and the length of the genealogy, analytically and through efficient simulation. Extensions to the coalescent allow us to incorporate effects of mutation, recombination, selection and demographic events in the coalescent model. A short introduction to the coalescent model can be found here and a longer, more detailed introduction can be read here.

However, coalescent analyses can be slow or suffer from numerical instability, especially for large samples. In a study published earlier this year in Theoretical Population Biology, CEHG fellow Ethan Jewett and CEHG professor Noah Rosenberg proposed fast and accurate approximations to general coalescent formulas and procedures for applying such approximations. Their work also examined the asymptotic behavior of existing coalescent approximations analytically and empirically.

## Computational challenges with the coalescent

For a given sample, there are many possible genealogical histories, i.e., tree topologies and branch lengths, which are consistent with the allelic states of the sample. Analyses involving the coalescent therefore often require us to condition on a specific genealogical property and then sum over all possible genealogies that display the property, weighted by the probability of the genealogy. A genealogical property that is often conditioned on is $n_t$, the number of ancestral lineages in the genealogy at a time $t$ in the past. However, computing the distribution $P(n_t)$ of $n_t$ is computationally expensive for large samples and can suffer from numerical instability.

## A general approximation procedure for formulas conditioning on $n_t$

Coalescent formulas conditioning on $n_t$ typically involve sums of the form $f(x)=\sum_{n_t} f(x|n_t) \cdot P(n_t)$

For large samples and recent times, these computations have two drawbacks:

–       The range of possible values for $n_t$ may be quite large (especially if multiple populations are being analyzed) and a summation over these values may be computationally expensive.

–       Expressions for $P(n_t)$ are susceptible to round-off errors.

Slatkin (2000) proposed an approximation to the summation in $f(x)$ by a single term $f(x|E[n_t])$. This deterministic approximation was based on the observation that $n_t$ changes almost deterministically over time, even though it is a stochastic variable in theory. Thus we can write $n_t \approx E[n_t]$. From Figure 2 in the paper (reproduced here), we can see that this approximation is quite accurate. The authors prove the asymptotic accuracy of this approximation and also prove that under regularity assumptions, $f(x|E[n_t])$ converges to $f(x)$ uniformly in the limits of $t \rightarrow 0$ and $t \rightarrow \infty$ . This is an important result since it shows that the general procedure produces a good approximation for both very recent and very ancient history of the sample. Further, the paper shows how this method can be used to approximate quantities that depend on the trajectory of $n_t$ over time, which can be used to calculate interesting quantities such as the expected number of segregating sites in a genealogy.

## Approximating $E[n_t]$ for single populations

A difficulty with using the deterministic approximation is that $E[n_t]$ often has no closed-form formula, and if one exists, it is typically not easy to compute when the sample is large.

For a single population with changing size, two deterministic approximations have previously been developed (one by Slatkin and Rannala 1997, Volz et al. 2009 and one by Frost and Volz, 2010, Maruvka et al., 2011). Using theoretical and empirical methods, the authors examine the asymptotic behavior and computational complexity of these approximations and a Gaussian approximation by Griffiths. A summary of their results is in the table below.

 Method Accuracy Griffith’s approximation Accurate for large samples and recent history. Slatkin and Rannala (1997), Volz et al. (2009) Accurate for recent history and arbitrary sample size, inaccurate for very ancient history. Frost and Volz (2010), Maruvka et al. (2011) Accurate for both recent and ancient history and for arbitrary sample size. Jewett and Rosenberg (2014) Accurate for both recent and ancient history and arbitrary sample size, and for multiple populations with migration.

## Approximating $E[n_t]$ for multiple populations

Existing approaches only work for single populations of changing size and cannot account for migration between multiple populations. Ethan and Noah extend the framework for single populations to allow multiple populations with migration. The result is a system of simultaneous differential equations, one for each population. While it does not allow for analytical solutions except in very special cases, the system can be easily solved numerically for any given demographic scenario.

## Significance of this work

The extension of the coalescent framework to multiple populations with migration is an important result for demographic inference. The extended framework with multiple populations allows efficient computation of demographically informative quantities such as the expected number of private alleles in a sample, divergence times between populations.

Ethan and Noah describe a general procedure that can be used to approximate coalescent formulas that involve summing over distributions conditioned on $n_t$ or the trajectory of $n_t$ over time. This procedure is particularly accurate for studying very recent or very ancient genealogical history.

The analysis of existing approximations to $E[n_t]$ show that different approximations have different asymptotic behavior and computational complexities. The choice of which approximation to use is therefore often a tradeoff between the computational complexity of the approximation and the likely behavior of the approximation in the parameter ranges of interest.

## Future Directions

As increasingly large genomic samples from populations with complex demographic histories become available for study, exact methods either become intractable or very slow. This work adds to a growing set of approximations to the coalescent and its extensions, joining other methods such as conditional sampling distributions and the sequentially markov coalescent. Ethan and Noah are already exploring applications of these approximate methods to reconciling gene trees with species trees. In the future, I expect that these and other approximations will be important for fast and accurate analysis of large genomic datasets.

## References

[1] Jewett, E. M., & Rosenberg, N. A. (2014). Theory and applications of a deterministic approximation to the coalescent model. Theoretical population biology.

[2] Griffiths, R. C. (1984). Asymptotic line-of-descent distributions. Journal of Mathematical Biology21(1), 67-75.

[3] Frost, S. D., & Volz, E. M. (2010). Viral phylodynamics and the search for an ‘effective number of infections’. Philosophical Transactions of the Royal Society B: Biological Sciences365(1548), 1879-1890.

[4] Maruvka, Y. E., Shnerb, N. M., Bar-Yam, Y., & Wakeley, J. (2011). Recovering population parameters from a single gene genealogy: an unbiased estimator of the growth rate. Molecular biology and evolution28(5), 1617-1631.

[5] Slatkin, M., & Rannala, B. (1997). Estimating the age of alleles by use of intraallelic variability. American journal of human genetics60(2), 447.

[6] Slatkin, M. (2000). Allele age and a test for selection on rare alleles.Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences355(1403), 1663-1668.

[7] Volz, E. M., Pond, S. L. K., Ward, M. J., Brown, A. J. L., & Frost, S. D. (2009). Phylodynamics of infectious disease epidemics. Genetics183(4), 1421-1430.

Paper author Ethan Jewett is a PhD student in the lab of Noah Rosenberg.

# Learning from 69 sequenced Y chromosomes

## Why the Y?

Blog author Amy Goldberg is a graduate students in Noah Rosenberg’s lab.

While mitochondria have been extensively sequenced for decades because of their short length and abundance, the Y chromosome has been under-studied.  Unlike autosomal DNA, the mitochondria and (most of) the Y chromosome are inherited exclusively maternally and paternally, respectively.  Therefore, they do not undergo meiotic recombination.  Without recombination, mutations accumulate on a stable background, preserving a wealth of information about population history.  Each background, shared through a common ancestor, is called a haplogroup. To leverage this information, Poznik et al. set out to sequence 69 males from nine diverse human populations, including a large representation of African individuals.  The paper, published in Science last summer, is by Stanford graduate student David Poznik and a group lead by CEHG professor Dr. Carlos Bustamante.

The structure of the Y chromosome is complex, with large heterochromatic regions, pseudo-autosomal regions that recombine with the X chromosome, and repetitive elements, making mapping reads difficult.  But, the Y chromosome is haploid, allowing for accurate variant calls at lower coverage than the autosomes, which have heterozygotes.  Using high-throughput sequencing (3.1x mean coverage) and a haploid expectation-maximization algorithm, Poznik et al. called genotypes with an error rate around 0.1%. The paper developed important methods for analyzing high-throughput sequences of the difficult Y chromosome, including determining the subset of regions within which accurate genotypes can be called.

## Reconstructing the human Y-chromosome tree

Poznik et al. constructed a phylogenetic tree of the Y chromosome using sequence data and a maximum likelihood approach.  While the overall structure of the tree was known, Poznik et al. were able to accurately calculate branch lengths based on the number of variants differing between individuals and resolve previously indeterminate finer structure.

Figure 2 of the paper: Y-chromosome phylogeny inferred from genomic sequencing.

Incredible African Diversity: One of the key findings of the paper was the depth of diversity within Africans lineages.  While both uniparental and autosomal markers have indicated an African root for human diversity, Poznik et al. find lineages within a single population, the San hunter-gatherers, that coalesce almost at the same time as the entire tree (see haplogroup A). This indicates African diversity and structure has existed for tens of thousands of years, and there is likely more to discover.  A large sample of African populations were considered, which lead to previously unseen structure within haplogroup B2, including structure not mirrored by modern population clustering, that dates to approximately 35,000 years ago.

Evidence of population expansionShort internal branches of the tree, such as those seen within haplogroup E and the non-African group FT, indicate periods of rapid population growth.  When a population expands quickly, new variants that might otherwise drift to extinction can persist.  A large number of coalescence events occur at the time of growth, as there were fewer lineages alive in the population before this time.  For non-African haplogroups, this pattern is likely a remnant of the Out of Africa migration.  For haplogroup E, this corresponds to the Bantu agricultural expansion.

Resolved Eurasian polytomy: Previously, the topology of the Eurasian tree separating haplogroups G-H-IJK was unresolved.  Because of the higher coverage sequencing for this study, Poznik et al. found a single variant, a C to T transition, that differentiates G from the other groups.  Haplogroup G retains the ancestral variant, while H-IJK share the derived variant and are therefore more closely related to each other.

## Sequencing vs. genotyping

In contrast to previous studies, which analyzed small repetitive elements called microsatellites or small sets of single base-pair changes called SNPs, whole-genome sequencing data contains not only more information, but potentially more accurate information.  In particular, before the advent of high-throughput sequencing, SNPs were usually ascertained in a subset of individuals that did not capture worldwide diversity levels.  Therefore, diversity measures are often underestimated and biased.  Without sequence data, the branch lengths of the tree did not have a meaningful interpretation, and the depth of variation within Africa was not seen.

## MRCA of Human Maternal and Paternal Lineages

There was a lot of public discussion spurred by the publication of Poznik’s paper last year.  The discussion mainly focused on their result that, contrary to previous estimates, the most recent common ancestor (MRCA) of all mitochondrial DNA lived at a similar time as that of all Y chromosomes.  Previous estimates put the mitochondrial TMRCA around 200 thousand years ago, with the Y chromosome coalescing a bit over 100 thousand years ago.  These different estimates for Y and mitochondria were often obtained through different sequencing and analysis methods, and are therefore less comparable.  In particular, varying estimates of the mutation rates have led to different TMRCA estimates.  By analyzing both the Y and mitochondria in the same framework, calibrated by archeological evidence and within-species comparisons, Poznik et al. found largely overlapping confidence intervals for the TMRCA of both Y and mitochondria.

But, should the coalescence times of the mitochondria and the Y chromosome be the same? Not necessarily.  While discrepancies between the mitochondria and Y chromosome have often been interpreted as sex-biased population histories or sizes, strictly neutral models can predict large differences between the two, as well.  Because neither the analyzed part of the Y chromosome nor the mitochondria undergo recombination, each acts as a single locus – and therefore represents the history of a single lineage.  For a population, there is a wide distribution of the ages when lineages would coalesce for a given population history, and these loci represent only two with largely independent histories (given the overall population history), therefore they may differ by chance alone.  Similarly, different loci across autosomal DNA have TMRCA ranging from thousands to millions of years. Additionally, as single loci, any effects of selection would distort the entire genealogy of the Y chromosome and mitochondria.

## Future directions

Human population history is far from fully fleshed out, and Poznik et al. provide a framework to leverage increasingly available high-throughput sequencing of Y chromosomes.  The method used to calculate the mutation rate and TMRCA is a valuable contribution in itself, with applications to a wide range of evolutionary and ecological questions.  This study demonstrated that we have only characterized a fraction of worldwide diversity, particularly in Africa, and that increased sampling will be critical to parsing close and far ties in human history.

## Reference

Poznik GD, Henn BM, Yee MC, Sliwerska E, Euskirchen GM, Lin AA, Snyder M, Quintana-Murci L, Kidd JM, Underhill PA, Bustamante CD. Sequencing Y chromosomes resolves discrepancy in time to common ancestor of males versus females. Science. 2013 Aug 2;341(6145):562-5. doi: 10.1126/science.1237619.

Paper author David Poznik is a PhD student in Carlos Bustamante’s lab.

# Caught in the act: how drug-resistance mutations sweep through populations of HIV

Blog author Meredith Carpenter is a postdoc in Carlos Bustamante’s lab.

It has been over 30 years since the emergence of HIV/AIDS, yet the disease continues to kill over one million people worldwide per year [UNAIDS report]. One of the reasons that this epidemic has been so difficult to control is because HIV evolves quickly—it has a short replication time and a high mutation rate, so viruses harboring new mutations that confer drug resistance tend to arise often and spread quickly.

However, the likelihood of one of these beneficial mutations popping up and subsequently “sweeping” through the viral population—i.e., becoming more common because of the survival advantage—also depends on the underlying population genetics, much of which is still poorly understood. In a paper just published in PLoS Genetics, Pleuni Pennings, postdoc in the Petrov lab, and colleagues Sergey Kryazhimskiy and John Wakeley from Harvard tracked the genetic diversity in adapting populations of HIV to better understand how and when new mutations arise.

## Mutations and populations

Mutations are usually caused by either DNA damage (e.g., from environmental factors like UV radiation) or by a mistake during DNA replication. Because HIV is a retrovirus, meaning it must copy its RNA genome into DNA before it can be reproduced in the host cell, it is especially prone to errors that happen during the replication process. The rate that these errors occur, also called the mutation rate, is constant on a per-virus basis —for example, a specific mutation might happen in one virus in a million. As a consequence, the overall number of viruses in the population determines how many new mutations will be present, with a larger population harboring more mutations at any given time.

Whether these mutations will survive, however, is related to what population geneticists call the “effective population size” (also known as Ne), which takes into account genetic diversity. Due to a combination of factors, including the purely random destruction of some viruses, not all mutations will be preserved in the population, regardless of how beneficial they are. The Ne is a purely theoretical measure that can tell us how easily and quickly a new mutation can spread throughout a population. Because it accounts for factors that affect diversity, it is usually smaller than the actual (or “census”) population size.

Pennings and colleagues wanted to determine the Ne for HIV in a typical patient undergoing drug treatment. This is a contentious area: previous researchers examining this question using different methods, including simply summing up overall mutation numbers, came up with estimates of Ne ranging from one thousand to one million (in contrast, the actual number of virus-producing cells in the body is closer to one hundred million, but more on that later). To get a more exact estimate, Pennings took a new approach. Using previously published DNA sequences of HIV sampled from patients over the course of a drug treatment regimen, she looked at the actual dynamics of the development of drug-resistant virus populations over time.

## Swept away

Specifically, Pennings focused on selective sweeps, wherein an advantageous mutation appears and then rises in frequency in the population. Features of these sweeps can give estimates of Ne because they reveal information about the diversity present in the initial population. Pennings sought to distinguish between “hard” and “soft” selective sweeps occurring as the viruses became drug resistant. A hard sweep occurs when a mutation appears in one virus and then rises in frequency, whereas a soft sweep happens when multiple viruses independently gain different mutations, which again rise in frequency over time (see Figure 1). These two types of sweeps have distinct fingerprints, and their relative frequencies depend on the underlying effective population size—soft sweeps are more likely when a population is larger it becomes more likely for different beneficial mutations to independently arise in two different viruses. Soft sweeps also leave more diversity in the adapted population compared to hard sweeps (Figure 1).

Figure 1, an illustration of a hard sweep (left) and a soft sweep (right).

To tell these types of sweeps apart, Pennings took advantage of a specific amino acid change in the HIV gene that encodes reverse transcriptase (RT). This change can result from two different nucleotide changes, either one of which will change the amino acid from lysine to asparagine and confer resistance to drugs that target the RT protein.  Pennings used this handy feature to identify hard and soft sweeps: if she observed both mutations in the same drug-resistant population, then the sweep was soft. If only one mutation was observed, the sweep could be soft or hard, so she also factored in diversity levels to tell these apart. Pennings found evidence of both hard and soft sweeps in her study populations. Based on the frequencies of each, she estimated the Ne of HIV in the patients. Her estimate was 150,000, which is higher than some previous estimates but still lower than the actual number of virus-infected cells in the body. Pennings suggests that this discrepancy could be due to the background effects of other mutations in the viruses that gain the drug-resistance mutation—that is, even if a virus gets the valuable resistance mutation, it might still end up disappearing from the population because it happened to harbor some other damaging mutation as well. This would reduce the effective population size as measured by selective sweeps.

## Implications and future work

Pennings’ findings have several implications. The first is that HIV populations have a limited supply of resistance mutations, as evidenced by the presence of hard sweeps (which, remember, occur when a sweep starts from a single mutation). This means that even small reductions in Ne, such as those produced by combination drug therapies, could have a big impact on preventing drug resistance. The second relates to the fact that, as described above, the likelihood that a mutation will sweep the population may be affected by background mutations in the virus in which it appears. This finding suggests that mutagenic drugs, given in combination with standard antiretrovirals, could be particularly useful for reducing drug resistance.  Now, Pennings is using larger datasets to determine whether some types of drugs lead to fewer soft sweeps (presumably because they reduce Ne). She is also trying to understand why drug resistance in HIV evolves in a stepwise fashion (one mutation at a time), even if three drugs are used in combination.

Paper author Pleuni Pennings is a postdoc in the lab of Dmitri Petrov.

## References

Pennings, PS, Kryazhimskiy S, Wakeley J. Loss and recovery of genetic diversity in adapting HIV populations. 2014, PLoS Genetics.

# Using phyloseq for the reproducible analysis of high-throughput sequencing data in microbial ecology

Blog author Diana Proctor is a graduate student in David Relman’s lab.

The Problem: Data Availability & Scientific Reproducibility

A Current Biology (1) paper evaluating the accessibility of scientific data recently inspired articles and blog posts (2, 3) as well as a lively conversation on Reddit about the “alarming rate of data disappearance” (4). Solutions to the problem of disappearing data include the NIH data sharing policy, as well as data sharing policies set by scientific journals, requiring the deposition of data into public repositories.

As a trainee in David Relman’s Lab thinking about the eventual fate of the high-throughput, next generation sequencing data generated over the course of my dissertation (http://www.hyposalivation.com), this conversation about data accessibility brings to mind a related question – how can I ensure that my data are not lost as fast as the current biology paper predicts?

The Solution: Phyloseq allows microbial ecologists to make reproducible research reports

The solution to data disappearance probably needs to involve not only deposition of data into public repositories, but also the widespread use of reproducible research reports. Luckily for those microbial ecologists among us, Paul McMurdie and Susan Holmes of Stanford University developed an R-based Bioconductor package (i.e., a package for bioinformatics) called phyloseq to facilitate the reproducible statistical analysis of high throughput phylogenetic sequencing datasets, including those generated by barcoded amplicon sequencing, metabolomic, and metagenomic experiments (5, 6). Phyloseq, initially released in 2012, was recently updated by McMurdie & Holmes, and described in an April 2013 publication (6).

Phyloseq Key Features

Phyloseq allows the user to import a species x sample data matrix (aka, an OTU Table) or data matrices from metagenomic, metabolomic, and/or other –omics type experiments into the R computing environment. Previous R extensions, such as OTUbase, also have the capacity to import these data matrices into R, but phyloseq is unique in that it allows the user to integrate the OTU Table, the phylogenetic tree, the “representative sequence” fasta file, and the metadata mapping file into a single “phyloseq-class” R object. The microbial ecologist can then harness all the statistical and graphical tools available in R, including Knitr, R-Markdown and ggplot2 (among others), to generate reproducible research reports with beautiful graphics, as detailed below. To see the report McMurdie used to prepare the phyloseq publication, visit this link: http://www.hyposalivation.com/wp-content/uploads/2014/01/phyloseq_plos1_2012-source-doc.html.

1. Phyloseq incidentally allows the user to curate data

When phyloseq imports the myriad phylogenetic sequencing data objects into R, it scrutinizes the data, making sure that the OTU Table matches the metadata mapping file, the phylogenetic tree, and the representative sequence labels. If not, the user gets an error. If the data descriptors are congruent, a single phyloseq object can be created, which can then be saved along with the R code used to create the object. I have found that this enables me to curate my data – consolidating all the data objects (OTU Table, mapping file, phylogenetic tree, etc.) describing a single experiment into a single multi-level data frame.

2. Phyloseq gives the user the analytical power of R

Importantly, by importing data into the R-computing environment, one may easily perform beta diversity analysis using any or all of over 40 different ecological distance metrics before performing virtually any ordination under the sun. Several alpha diversity metrics are implemented in phyloseq, as well. Finally, after getting the data into R, it’s easy to perform more sophisticated analyses than has previously been possible with this type of dataset, such as k-tables analysis (7), using R’s repertoire of extension packages.

3. Phyloseq makes standardization of sequence data pretty simple

Of particular note, the authors have included in phyloseq several methods to standardize and/or normalize high throughput sequence data. Most of us of course realize the need for data standardization (as evidenced by our reliance on rarefaction), but the tools to easily standardize data, aside from rarefaction, have been lacking (8). The authors of phyloseq have equipped us with several methods (one new!) to standardize our microbial census data, as well as the code needed to accomplish the task (https://github.com/joey711/phyloseq/wiki/Vignettes).

4. Phyloseq makes subsetting large datasets easy

One of my favorite uses for the phyloseq package is that it allows me to easily subset my dataset. In my work, I study the spatial variation of oral microbial communities. I have taken samples from all teeth from the mouths of just a handful of research subjects, but I have samples for certain teeth from all subjects. Phyloseq makes it easy for me to take a complete OTU Table, and subset it on only those teeth that were sampled in all subjects. Similarly, I can subset my OTU Table on a single bacterial phylum or on a single species, or on any covariate in my metadata mapping file, using a single line of R code.

5. Phyloseq enables the user to generate reproducible graphics

The authors of phyloseq created several custom ggplot2 (9) functions, enabling the phyloseq user, with just a few lines of code, to generate all of the most common graphics used in microbial census research (e.g., heatmaps, networks, ordination plots, phylogenetic trees, stacked bar plots for abundance measurements, etc.). Examples of these plots are shown in Figure 1 (though many other possibilities are supported, which can be seen here: http://joey711.github.io/phyloseq/tutorials-index).

Fig 1A. The NMDS ordination plot shows the separation of samples by weighted UniFrac distance for the Global Patterns dataset. Human-associated communities appear to cluster towards the right side of NMDS1 while non-human associated communities cluster towards the left.

Fig 1B. The dodged boxplots show three alpha diversity metrics (Observed species, Chao1, and ACE) on the Y-axis with data classified on the X-axis as either a human-associated or a non-human associated microbial community. This plot shows that non-human associated communities, in general, appear to be much more diverse than human-associated communities.

6. Phyloseq allows covariate data to be visualized with the phylogenetic tree

In particular, phyloseq solves very well the problem of visualizing the phylogenetic tree – it allows the user to project covariate data (such as sample habitat, host gender, etc.) onto the phylogenetic tree, so that relationships between microbes, microbial communities, and the habitat from which they were derived can easily be seen. As an example, the relative abundance of taxa in samples was projected onto the phylogenetic tree along with the environment from which the samples were derived along with the bacterial order in Figure 2. I’ve not seen any other application that allows similar visualizations of the tree, and bootstrapping is also supported. For additional examples, refer to the phyloseq tutorial (http://joey711.github.io/phyloseq/plot_tree-examples.html).

Fig 2: An example of using phyloseq to visualize phylogenetic trees along with covariate data using the Global Patterns dataset. In this figure, the sample type is shown in color, the shapes are bacterial Order, and the size of the shapes indicates the relative abundance of the taxon in the sample.

7. Data & code & results can be saved together to improve scientific reproducibility

One of the key features of phyloseq is that it provides researchers who have access to any system where R is able to run with a framework (i.e., R, R-markdown, Knitr & Rstudio) to perform reproducible statistical analysis of high throughput sequencing data. Using Phyloseq, Rstudio, R-markdown, and Knitr, it’s possible to see in a single .html file the data used to generate a set of figures alongside the code that was used to generate those figures. I now keep a collection of reproducible research reports as part of my lab notebook, and I look forward to being able to publish the final report for my first study along with my first scientific manuscript. For an example, please see the phyloseq tutorials, which were also generated using this approach (http://joey711.github.io/phyloseq/import-data.html).

8. Phyloseq is easy to learn.

When I first began working with Phyloseq, after taking a class taught by the author Susan Holmes, I knew some basic R commands from an undergraduate statistics class. Working with phyloseq made learning R easy for me. Since Phyloseq has a built in set of datasets one can use, it’s easy to reproduce the figures published in the phyloseq paper, as a stepping-stone for creating figures of one’s own.

Conclusion

An R-based package called Phyloseq makes it easy to analyze high throughput microbial census data, visualize the data, and perform reproducible statistical analysis. Phyloseq should facilitate conversations between researchers who publish data and the consumers of it with its emphasis on reproducible research. This should help those of us in the infancy of microbiome research ensure that our data do not disappear as quickly as Vine et al. currently predicts.

The paper and the phyloseq package are co-authored by Paul McMurdie and Professor Susan Holmes (in the picture).

REFERENCES

2.         Stromberg J. smithsonianmag.com: 2014.]. Available from: http://blogs.smithsonianmag.com/science/2013/12/the-vast-majority-of-raw-data-from-old-scientific-studies-may-now-be-missing/.

3.         Noorden EGRV. Scientists losing data at a rapid rate. Nature News. 2014.

4.         bobmanyun. reddit.com: 2014 1-7-2014. [cited 2014]. Available from: http://www.reddit.com/r/science/comments/1tb2d3/scientific_data_is_disappearing_at_alarming_rate/.

# Taking studies of regulatory evolution to the next level: translation

Carlo Artieri, a postdoc in the group of Hunter Fraser, wrote this blog post. The paper is written by Carlo and Hunter.

Carlo Artieri writes about his new paper: Evolution at two levels of gene expression in yeast which is in press in Genome Research.

Understanding the molecular basis of regulatory variation within and between species has become a major focus of modern genetics. For instance, the majority of identified human disease-risk alleles lie in non-coding regions of the genome, suggesting that they affect gene regulation (Epstein 2009). Furthermore, it has been argued that regulatory changes have played a dominant role in explaining uniquely human attributes (King and Wilson 1975). However, our knowledge of gene regulatory evolution is based almost entirely on studies of mRNA levels, despite both the greater functional importance of protein abundance, and evidence that post-transcriptional regulation is pervasive. The availability of high-throughput methods for measuring mRNA abundance, coupled to the lack of comparable methods at the protein level have contributed to this focus; however, a new method known as ribosome profiling, or ‘riboprofiling’ (Ingolia et al. 2009), has enabled us to study the evolution of translation in much greater detail than was possible before. This method involves the construction of two RNA-seq libraries: one measuring mRNA abundance (the ‘mRNA’ fraction), and the second capturing the portion of the transcriptome that is actively being translated by ribosomes (the ‘Ribo’ fraction). On average, the abundance of genes within the Ribo fraction should be proportional to that of the mRNA fraction. Genes with increased translational efficiency are identified when Ribo fraction abundance is higher than that of the mRNA fraction, whereas reduced translational efficiency is inferred when the opposite is observed.

## Riboprofiling of yeast hybrids

We performed riboprofiling on hybrids of two closely related species of budding yeast, Saccharomyces cerevisiae and S. paradoxus, (~5 million years diverged). In hybrids, the parental alleles at a locus share the same trans cellular environment; therefore in the absence of cis-regulatory divergence in transcription, both alleles should be expressed at equal levels. Conversely, cis-regulatory divergence will produce unequal expression of alleles (termed allele-specific expression, or ‘ASE’). Cis-regulatory divergence at the translational level is detected when ASE in the mRNA fraction does not equal that measured in the Ribo fraction, indicating independent divergence across levels. We also performed riboprofiling on the two parental strains, as differences in the expression of orthologs between parental species that cannot be explained by the allelic differences in the hybrids can be attributed to trans divergence. Therefore, by measuring differences in the magnitudes of ASE between the two riboprofiling fractions in the hybrids and the parents, we identified independent cis and trans regulatory changes in both mRNA abundance and translational efficiency.

## Results

We found that both cis and trans regulatory divergence in translational efficiency is widespread, and of comparable magnitude to divergence at the mRNA level – indicating that we miss much regulatory evolution by focusing on mRNA in isolation. Moreover, we observed an overwhelming bias towards divergence in opposing parental directions, indicating that while many orthologs had higher mRNA abundance in one parent, they often showed increased translational efficiency in the other parent. This suggests that stabilizing selection acts to maintain more similar protein levels between species than would be expected by comparing mRNA abundances alone.

## Translational divergence not associated with TATA boxes

Interestingly, while we confirmed the results of previous studies indicating that both cis and trans regulatory divergence at the mRNA level are associated with the presence of TATA boxes and nucleosome free regions in promoters, no such relationship was found for translational divergence, indicating that these regulatory systems have different underlying architectures.

## Evidence for polygenic selection at two levels

We also searched for evidence of polygenic selection in and between both regulatory levels by applying a recently developed modification of Orr’s sign test (Orr 1998; Fraser et al. 2010; Bullard et al. 2010). Under neutral divergence, no pattern is expected with regards to the parental direction of up or down-regulating alleles among orthologs within a functional group (e.g., a pathway or multi-gene complex). However, a significant bias towards one parental lineage is evidence of lineage-specific selection. This analysis uncovered evidence of polygenic selection at both regulatory levels in a number of functional groups. In particular, genes involved in tolerance to heavy metals were enriched for reinforcing divergence in mRNA abundance and translation favoring S. cerevisiae. Increased tolerance to these metals has been observed in S. cerevisiae (Warringer et al. 2011), suggesting that domesticated yeasts have experienced a history of polygenic adaptation across regulatory levels allowing them to grow on metals such as copper.

Finally, using data from the Ribo fraction, we also uncovered multiple instances of conserved stop-codon readthrough, a mechanism via which the ribosome ‘ignores’ the canonical stop codon and produces a C-terminally extended peptide. Only two cases of C-terminal extensions have previously been observed in yeast, though in one such case, PDE2, extension of the canonical protein plays a functional role in regulating cAMP levels (Namy et al. 2002). Our data suggests that this mechanism may occur in dozens of genes, highlighting yet another post-transcriptional mechanism leading to increased proteomic diversity.

## Conclusion

By applying a novel approach to a long-standing question, our analysis has revealed that post-transcriptional regulation is abundant, and likely as important as transcriptional regulation. We argue that partitioning the search for the locus of selection into the binary categories of ‘coding’ vs. ‘regulatory’ overlooks the many opportunities for selection to act at multiple regulatory levels along the path from genotype to phenotype.

## References

Bullard JH, Mostovoy Y, Dudoit S, Brem RB. 2010. Polygenic and directional regulatory evolution across pathways in Saccharomyces. Proc Natl Acad Sci USA 107: 5058-5063.

Epstein DJ. 2009. Cis-regulatory mutations in human disease. Brief Funct Genomic Proteomic 8: 310–316.

Fraser HB, Moses AM, Schadt EE. 2010. Evidence for widespread adaptive evolution of gene expression in budding yeast. Proc Natl Acad Sci USA 107: 2977-2982.

Ingolia NT, Ghaemmaghami S, Newman JR, Weissman JS. 2009. Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling. Science 324:218-223.

King MC, Wilson AC. 1975. Evolution at two levels in humans and chimpanzees. Science 188: 107-116.

Namy O, Duchateau-Nguyen G, Rousset JP. 2002. Translational readthrough of the PDE2 stop codon modulates cAMP levels in Saccharomyces cerevisiae. Mol Microbiol 43: 641-652.

Orr HA. 1998. Testing natural selection vs. genetic drift in phenotypic evolution using quantitative trait locus data. Genetics 149: 2099-2104.

Warringer J, Zörgö E, Cubillos FA, Zia A, Gjuvsland A, Simpson JT, Forsmark A, Durbin R, Omholt SW, Louis EJ, Liti G, Moses A, Blomberg A. 2011. Trait variation in yeast is defined by population history. PLoS Genet 7 :e1002111.