Using phyloseq for the reproducible analysis of high-throughput sequencing data in microbial ecology

Blog author Diana Proctor is a graduate student in David Relman’s lab.

The Problem: Data Availability & Scientific Reproducibility

A Current Biology (1) paper evaluating the accessibility of scientific data recently inspired articles and blog posts (2, 3) as well as a lively conversation on Reddit about the “alarming rate of data disappearance” (4). Solutions to the problem of disappearing data include the NIH data sharing policy, as well as data sharing policies set by scientific journals, requiring the deposition of data into public repositories.

As a trainee in David Relman’s lab thinking about the eventual fate of the high-throughput, next-generation sequencing data generated over the course of my dissertation (http://www.hyposalivation.com), I find that this conversation about data accessibility brings to mind a related question: how can I ensure that my data are not lost as fast as the Current Biology paper predicts?

The Solution: Phyloseq allows microbial ecologists to make reproducible research reports

The solution to data disappearance probably needs to involve not only deposition of data into public repositories, but also the widespread use of reproducible research reports. Luckily for those microbial ecologists among us, Paul McMurdie and Susan Holmes of Stanford University developed an R-based Bioconductor package (i.e., a package for bioinformatics) called phyloseq to facilitate the reproducible statistical analysis of high throughput phylogenetic sequencing datasets, including those generated by barcoded amplicon sequencing, metabolomic, and metagenomic experiments (5, 6). Phyloseq, initially released in 2012, was recently updated by McMurdie & Holmes, and described in an April 2013 publication (6).

Phyloseq Key Features

Phyloseq allows the user to import a species x sample data matrix (also known as an OTU Table) or data matrices from metagenomic, metabolomic, and/or other -omics experiments into the R computing environment. Previous R extensions, such as OTUbase, can also import these data matrices into R, but phyloseq is unique in that it allows the user to integrate the OTU Table, the phylogenetic tree, the “representative sequence” FASTA file, and the metadata mapping file into a single “phyloseq-class” R object. The microbial ecologist can then harness all the statistical and graphical tools available in R, including knitr, R Markdown, and ggplot2 (among others), to generate reproducible research reports with beautiful graphics, as detailed below. To see the report McMurdie used to prepare the phyloseq publication, visit this link: http://www.hyposalivation.com/wp-content/uploads/2014/01/phyloseq_plos1_2012-source-doc.html.

1. Phyloseq incidentally allows the user to curate data

When phyloseq imports the various phylogenetic sequencing data objects into R, it scrutinizes the data, making sure that the labels in the OTU Table match those in the metadata mapping file, the phylogenetic tree, and the representative sequences. If they do not, the user gets an error. If the data descriptors are congruent, a single phyloseq object can be created, which can then be saved along with the R code used to create it. I have found that this enables me to curate my data by consolidating all the data objects (OTU Table, mapping file, phylogenetic tree, etc.) describing a single experiment into a single multi-component R object.
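The consistency check described above can be sketched in a few lines. This is plain Python rather than phyloseq's R code, and the function name `check_congruent` and the dictionary layout are hypothetical choices made for illustration; the sketch only mirrors the behaviour described in the text, which is to refuse to combine components whose labels disagree.

```python
def check_congruent(otu_table, sample_metadata, tree_tips):
    """Combine the components only if their labels agree (hypothetical helper).

    otu_table:       dict mapping OTU id -> dict mapping sample id -> read count
    sample_metadata: dict mapping sample id -> dict of covariates
    tree_tips:       iterable of OTU ids found at the tips of the tree
    """
    otu_ids = set(otu_table)
    # Collect every sample label that appears anywhere in the OTU table.
    sample_ids = set()
    for counts in otu_table.values():
        sample_ids.update(counts)

    if sample_ids != set(sample_metadata):
        mismatched = sorted(sample_ids ^ set(sample_metadata))
        raise ValueError(f"sample labels disagree: {mismatched}")
    if otu_ids != set(tree_tips):
        mismatched = sorted(otu_ids ^ set(tree_tips))
        raise ValueError(f"OTU labels disagree: {mismatched}")

    # Labels are congruent, so bundle everything into one object.
    return {
        "otu_table": otu_table,
        "sample_data": sample_metadata,
        "tree_tips": sorted(tree_tips),
    }
```

Calling the helper with a metadata table that is missing a sample raises a `ValueError` naming the offending labels, which is the kind of early failure that makes the bundled object trustworthy later on.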

2. Phyloseq gives the user the analytical power of R

Importantly, once data are imported into the R computing environment, one may easily perform beta diversity analysis using any or all of over 40 different ecological distance metrics before performing virtually any ordination under the sun. Several alpha diversity metrics are implemented in phyloseq as well. Finally, after getting the data into R, it’s easy to perform more sophisticated analyses than have previously been possible with this type of dataset, such as k-tables analysis (7), using R’s repertoire of extension packages.
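To make "distance metric" and "alpha diversity metric" concrete, here is a minimal sketch of one example of each: the Bray-Curtis dissimilarity between two samples, and the observed richness and Shannon index of a single sample. The text does not say which metrics to use, so these three are picked here purely as familiar examples, and the code is plain Python for illustration rather than phyloseq's own R implementation.

```python
import math


def bray_curtis(counts_a, counts_b):
    """Bray-Curtis dissimilarity between two samples over the same OTUs."""
    if len(counts_a) != len(counts_b):
        raise ValueError("samples must cover the same OTUs")
    total = sum(counts_a) + sum(counts_b)
    if total == 0:
        raise ValueError("both samples are empty")
    # Reads shared between the samples: the smaller count for each OTU.
    shared = sum(min(a, b) for a, b in zip(counts_a, counts_b))
    return 1.0 - 2.0 * shared / total


def observed_richness(counts):
    """Number of OTUs with at least one read in the sample."""
    return sum(1 for c in counts if c > 0)


def shannon(counts):
    """Shannon diversity index (natural log) of a single sample."""
    total = sum(counts)
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in proportions)
```

Identical samples give a Bray-Curtis dissimilarity of 0 and samples with no OTUs in common give 1; a matrix of such pairwise values is what an ordination method takes as input.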

3. Phyloseq makes standardization of sequence data pretty simple

Of particular note, the authors have included in phyloseq several methods to standardize and/or normalize high throughput sequence data. Most of us of course realize the need for data standardization (as evidenced by our reliance on rarefaction), but the tools to easily standardize data, aside from rarefaction, have been lacking (8). The authors of phyloseq have equipped us with several methods (one new!) to standardize our microbial census data, as well as the code needed to accomplish the task (https://github.com/joey711/phyloseq/wiki/Vignettes).
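For readers who have not met these ideas before, the sketch below contrasts rarefaction with one simple alternative, converting counts to proportions. It is plain Python written for illustration; the function names are hypothetical, and neither function is claimed to be one of the specific methods shipped with phyloseq, which the text does not enumerate.

```python
import random


def to_proportions(counts):
    """Scale one sample's counts so that they sum to 1."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no reads")
    return [c / total for c in counts]


def rarefy(counts, depth, seed=0):
    """Randomly subsample one sample, without replacement, to `depth` reads."""
    if depth > sum(counts):
        raise ValueError("depth exceeds the sample's read count")
    # Expand the counts into one entry per read, labelled by OTU index.
    reads = [otu for otu, c in enumerate(counts) for _ in range(c)]
    picked = random.Random(seed).sample(reads, depth)
    rarefied = [0] * len(counts)
    for otu in picked:
        rarefied[otu] += 1
    return rarefied
```

Note that `rarefy` throws away every read beyond `depth`, whereas `to_proportions` keeps all of the data but discards the information about how deeply each sample was sequenced.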

4. Phyloseq makes subsetting large datasets easy

One of my favorite uses for the phyloseq package is that it allows me to easily subset my dataset. In my work, I study the spatial variation of oral microbial communities. I have taken samples from all teeth from the mouths of just a handful of research subjects, but I have samples for certain teeth from all subjects. Phyloseq makes it easy for me to take a complete OTU Table, and subset it on only those teeth that were sampled in all subjects. Similarly, I can subset my OTU Table on a single bacterial phylum or on a single species, or on any covariate in my metadata mapping file, using a single line of R code.
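The two kinds of subsetting described above, by sample covariate and by taxonomy, can be sketched as follows. In phyloseq itself this is done in R with functions such as `subset_samples` and `subset_taxa`; the Python helpers below (`keep_samples`, `keep_taxa`) and the toy data are hypothetical and only illustrate the idea of filtering all components consistently.

```python
def keep_samples(otu_table, sample_metadata, predicate):
    """Keep only the samples whose covariates satisfy `predicate`."""
    kept = {s for s, cov in sample_metadata.items() if predicate(cov)}
    sub_otu = {
        otu: {s: n for s, n in counts.items() if s in kept}
        for otu, counts in otu_table.items()
    }
    sub_meta = {s: cov for s, cov in sample_metadata.items() if s in kept}
    return sub_otu, sub_meta


def keep_taxa(otu_table, taxonomy, predicate):
    """Keep only the OTUs whose taxonomic ranks satisfy `predicate`."""
    return {
        otu: counts
        for otu, counts in otu_table.items()
        if otu in taxonomy and predicate(taxonomy[otu])
    }


# Toy data: two OTUs counted in three tooth samples.
otu_table = {
    "OTU1": {"S1": 12, "S2": 0, "S3": 4},
    "OTU2": {"S1": 3, "S2": 9, "S3": 0},
}
sample_metadata = {
    "S1": {"tooth": "molar"},
    "S2": {"tooth": "incisor"},
    "S3": {"tooth": "molar"},
}
taxonomy = {
    "OTU1": {"Phylum": "Firmicutes"},
    "OTU2": {"Phylum": "Bacteroidetes"},
}

molar_otu, molar_meta = keep_samples(
    otu_table, sample_metadata, lambda cov: cov["tooth"] == "molar"
)
firmicutes = keep_taxa(
    otu_table, taxonomy, lambda ranks: ranks["Phylum"] == "Firmicutes"
)
```

Filtering the OTU table and the metadata in the same call is the point of the exercise: the two can never drift out of step, which is what keeping everything in one object buys you.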

5. Phyloseq enables the user to generate reproducible graphics

The authors of phyloseq created several custom ggplot2 (9) functions, enabling the phyloseq user, with just a few lines of code, to generate all of the most common graphics used in microbial census research (e.g., heatmaps, networks, ordination plots, phylogenetic trees, stacked bar plots for abundance measurements, etc.). Examples of these plots are shown in Figure 1 (though many other possibilities are supported, which can be seen here: http://joey711.github.io/phyloseq/tutorials-index).

Fig 1A. The NMDS ordination plot shows the separation of samples by weighted UniFrac distance for the Global Patterns dataset. Human-associated communities appear to cluster towards the right side of NMDS1 while non-human associated communities cluster towards the left.

Fig 1B. The dodged boxplots show three alpha diversity metrics (Observed species, Chao1, and ACE) on the Y-axis with data classified on the X-axis as either a human-associated or a non-human associated microbial community. This plot shows that non-human associated communities, in general, appear to be much more diverse than human-associated communities.

6. Phyloseq allows covariate data to be visualized with the phylogenetic tree

In particular, phyloseq solves very well the problem of visualizing the phylogenetic tree: it allows the user to project covariate data (such as sample habitat, host gender, etc.) onto the phylogenetic tree, so that relationships between microbes, microbial communities, and the habitat from which they were derived can easily be seen. As an example, Figure 2 projects the relative abundance of taxa in each sample onto the phylogenetic tree, together with the environment from which the samples were derived and the bacterial order. I’ve not seen any other application that allows similar visualizations of the tree, and bootstrapping is also supported. For additional examples, refer to the phyloseq tutorial (http://joey711.github.io/phyloseq/plot_tree-examples.html).

Fig 2: An example of using phyloseq to visualize phylogenetic trees along with covariate data using the Global Patterns dataset. In this figure, the sample type is shown in color, the shapes are bacterial Order, and the size of the shapes indicates the relative abundance of the taxon in the sample.

7. Data & code & results can be saved together to improve scientific reproducibility

One of the key features of phyloseq is that it gives researchers on any system where R can run a framework (i.e., R, R Markdown, knitr, and RStudio) for performing reproducible statistical analysis of high-throughput sequencing data. Using phyloseq, RStudio, R Markdown, and knitr, it’s possible to see in a single .html file the data used to generate a set of figures alongside the code that was used to generate those figures. I now keep a collection of reproducible research reports as part of my lab notebook, and I look forward to being able to publish the final report for my first study along with my first scientific manuscript. For an example, please see the phyloseq tutorials, which were also generated using this approach (http://joey711.github.io/phyloseq/import-data.html).

8. Phyloseq is easy to learn.

When I first began working with phyloseq, after taking a class taught by phyloseq co-author Susan Holmes, I knew some basic R commands from an undergraduate statistics class. Working with phyloseq made learning R easy for me. Since phyloseq comes with a set of built-in datasets, it’s easy to reproduce the figures published in the phyloseq paper as a stepping-stone to creating figures of one’s own.

Conclusion

An R-based package called phyloseq makes it easy to analyze and visualize high-throughput microbial census data and to perform reproducible statistical analysis. With its emphasis on reproducible research, phyloseq should facilitate conversations between researchers who publish data and those who consume it. This should help those of us in the infancy of microbiome research ensure that our data do not disappear as quickly as Vines et al. (1) currently predict.

The paper and the phyloseq package are co-authored by Paul McMurdie and Professor Susan Holmes (in the picture).

REFERENCES

1. Vines TH, Albert AY, Andrew RL, Debarre F, Bock DG, Franklin MT, et al. The Availability of Research Data Declines Rapidly with Article Age. Current Biology. 2013. doi: 10.1016/j.cub.2013.11.014. PubMed PMID: 24361065.

2. Stromberg J. smithsonianmag.com: 2014. Available from: http://blogs.smithsonianmag.com/science/2013/12/the-vast-majority-of-raw-data-from-old-scientific-studies-may-now-be-missing/.

3. Gibney E, Van Noorden R. Scientists losing data at a rapid rate. Nature News. 2014.

4. bobmanyun. reddit.com: 2014 1-7-2014. [cited 2014]. Available from: http://www.reddit.com/r/science/comments/1tb2d3/scientific_data_is_disappearing_at_alarming_rate/.

5. McMurdie PJ, Holmes S. Phyloseq: a bioconductor package for handling and analysis of high-throughput phylogenetic sequence data. Pac Symp Biocomput. 2012:235-46. Epub 2011/12/17. PubMed PMID: 22174279; PubMed Central PMCID: PMC3357092.

6. McMurdie PJ, Holmes S. phyloseq: an R package for reproducible interactive analysis and graphics of microbiome census data. PLoS One. 2013;8(4):e61217. doi: 10.1371/journal.pone.0061217. PubMed PMID: 23630581; PubMed Central PMCID: PMC3632530.

7. Thioulouse J. Simultaneous analysis of a sequence of paired ecological tables: A comparison of several methods. Ann Appl Stat. 2011;5(4):2300-25. doi: 10.1214/10-AOAS372.

8. McMurdie PJ, Holmes S. Waste Not, Want Not: Why Rarefying Microbiome Data is Inadmissible. arXiv. 2013. Epub 10/2013.

9. Wickham H. ggplot2: Elegant Graphics for Data Analysis: Springer; 2009. 213 p.

The fruit fly and its microbiome

Philipp Messer is a research associate in the Petrov lab

This post was written by Philipp Messer.

Although fruit flies are one of the most important model organisms in genetics, evolution, and immunology, surprisingly little is known about their associated microorganisms (their microbiome). This is all the more surprising if you consider that the microbiome can strongly affect quantitative traits in flies, for example their growth rate and cold tolerance. Furthermore, the natural environment of fruit flies, rotting fruit, is very rich in microorganisms.

All organisms interact with associated microbes

Because microbes can influence the phenotype of organisms, we expect such interactions to be subject to natural selection. Genes involved in pathogen defense are indeed amongst the fastest evolving genes. But interactions with microbes do not always just lead to an evolutionary arms race between microbes and their hosts; they can also facilitate major evolutionary innovations. Prominent examples of such innovations are the light organ of the bobtail squid, which arose through a symbiotic relationship between squids and bioluminescent bacteria, and cellulose digestion in termites, which relies on microbes in their guts. Hence, to improve our understanding of the evolution of fruit flies, we need to better understand how they interact and coevolve with their associated microorganisms.

In their paper “Host species and environmental effects on bacterial communities associated with Drosophila in the laboratory and in the natural environment”, Fabian Staubach and his colleagues at Stanford and the Max Planck Institute for Evolutionary Biology in Plön shed light on some of the major questions regarding Drosophila-associated microbes. Beyond finding out which bacteria are present in flies, they assess the relative roles of host species and environmental effects on bacterial communities, detect candidate natural pathogens, and find interesting results regarding lab-of-origin effects on the fly microbial community.

The microbiome of fruitflies

We need more studies like this

These results are not only highly relevant for everyone working with Drosophila, but are also a strong reminder that we cannot understand any model organism without taking its associated microbiota into account. We therefore need more microbiome studies like that of Staubach et al. to identify the microbes that coevolve with their hosts and to understand how the genomes of hosts and microbes interact in the evolutionary process. I would not be surprised if interactions between microbes and their hosts turn out to be among the biggest selective forces in many organisms.

The paper is a fun and easy read and can be found here. Fabian was a postdoc in the Petrov lab from 2010 to 2013 and has just moved to the University of Freiburg in Germany to start his own group, where he plans to pursue his interest in deepening our understanding of the role of microbes in adaptation.

Citation: Staubach F, Baines JF, Künzel S, Bik EM, Petrov DA (2013) Host Species and Environmental Effects on Bacterial Communities Associated with Drosophila in the Laboratory and in the Natural Environment. PLoS ONE 8(8): e70749. doi:10.1371/journal.pone.0070749

Fabian Staubach studies the microbiome of fruitflies.