Blog author Diana Proctor is a graduate student in David Relman’s lab.
The Problem: Data Availability & Scientific Reproducibility
A Current Biology (1) paper evaluating the accessibility of scientific data recently inspired articles and blog posts (2, 3) as well as a lively conversation on Reddit about the “alarming rate of data disappearance” (4). Solutions to the problem of disappearing data include the NIH data sharing policy, as well as data sharing policies set by scientific journals, requiring the deposition of data into public repositories.
As a trainee in David Relman’s lab, I think often about the eventual fate of the high-throughput, next-generation sequencing data generated over the course of my dissertation (http://www.hyposalivation.com). This conversation about data accessibility brings to mind a related question: how can I ensure that my data are not lost as fast as the Current Biology paper predicts?
The Solution: Phyloseq allows microbial ecologists to make reproducible research reports
The solution to data disappearance probably needs to involve not only deposition of data into public repositories, but also the widespread use of reproducible research reports. Luckily for the microbial ecologists among us, Paul McMurdie and Susan Holmes of Stanford University developed an R-based Bioconductor package (i.e., a package for bioinformatics) called phyloseq to facilitate the reproducible statistical analysis of high-throughput phylogenetic sequencing datasets, including those generated by barcoded amplicon sequencing, metabolomic, and metagenomic experiments (5, 6). Phyloseq, initially released in 2012, was recently updated by McMurdie & Holmes and described in an April 2013 publication (6).
Phyloseq Key Features
Phyloseq allows the user to import a species x sample data matrix (aka, an OTU Table), or data matrices from metagenomic, metabolomic, and/or other -omics experiments, into the R computing environment. Previous R extensions, such as OTUbase, can also import these data matrices into R, but phyloseq is unique in that it allows the user to integrate the OTU Table, the phylogenetic tree, the “representative sequence” fasta file, and the metadata mapping file into a single “phyloseq-class” R object. The microbial ecologist can then harness all the statistical and graphical tools available in R, including knitr, R Markdown, and ggplot2 (among others), to generate reproducible research reports with beautiful graphics, as detailed below. To see the report McMurdie used to prepare the phyloseq publication, visit this link: http://www.hyposalivation.com/wp-content/uploads/2014/01/phyloseq_plos1_2012-source-doc.html.
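As a rough illustration of importing several files into one object, the sketch below reads the small example files that, as far as I know, ship in the "extdata" folder of the phyloseq package. The file names are assumptions taken from the package tutorials, so the code checks that they were found before importing.

```r
library(phyloseq)

# Locate the example files bundled with phyloseq (file names are assumed
# from the package tutorials; system.file() returns "" if one is missing).
biom_file   <- system.file("extdata", "rich_dense_otu_table.biom", package = "phyloseq")
tree_file   <- system.file("extdata", "biom-tree.phy", package = "phyloseq")
refseq_file <- system.file("extdata", "biom-refseq.fasta", package = "phyloseq")
stopifnot(nzchar(biom_file), nzchar(tree_file), nzchar(refseq_file))

# Import the OTU table (plus the taxonomy and sample metadata embedded in the
# "rich" biom file), the tree, and the representative sequences in one call.
ps_biom <- import_biom(
  biom_file,
  treefilename   = tree_file,
  refseqfilename = refseq_file,
  parseFunction  = parse_taxonomy_greengenes
)

# Printing the object lists every component it now holds.
ps_biom
```

For data produced by other pipelines, phyloseq provides analogous importers such as import_qiime() and import_mothur().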
1. Phyloseq incidentally allows the user to curate data
When phyloseq imports the myriad phylogenetic sequencing data objects into R, it scrutinizes the data, making sure that the labels in the OTU Table match those in the metadata mapping file, the phylogenetic tree, and the representative sequences. If not, the user gets an error. If the data descriptors are congruent, a single phyloseq object can be created, which can then be saved along with the R code used to create the object. I have found that this enables me to curate my data by consolidating all the data objects (OTU Table, mapping file, phylogenetic tree, etc.) describing a single experiment into a single multi-component R object.
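A minimal sketch of this consolidation step, using made-up toy data, might look like the following. All counts, sample labels, and metadata columns (Subject, Tooth) are hypothetical. The explicit stopifnot() label checks are an added precaution, because how strictly the constructor treats partially mismatched labels may depend on the phyloseq version.

```r
library(phyloseq)

# Toy count matrix: 3 OTUs (rows) x 4 samples (columns); values are made up.
counts <- matrix(
  c(10, 0, 3, 7,
     5, 2, 0, 1,
     0, 8, 4, 6),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("OTU", 1:3), paste0("S", 1:4))
)

# Hypothetical sample metadata; row names must match the sample labels above.
meta <- data.frame(
  Subject = c("A", "A", "B", "B"),
  Tooth   = c("molar", "incisor", "molar", "incisor"),
  row.names = colnames(counts)
)

# Hypothetical taxonomy; row names must match the OTU labels above.
taxa <- matrix(
  c("Firmicutes",    "Streptococcus",
    "Firmicutes",    "Veillonella",
    "Bacteroidetes", "Prevotella"),
  nrow = 3, byrow = TRUE,
  dimnames = list(rownames(counts), c("Phylum", "Genus"))
)

# Check label agreement explicitly before combining the components.
stopifnot(setequal(colnames(counts), rownames(meta)))
stopifnot(setequal(rownames(counts), rownames(taxa)))

# Combine everything into a single phyloseq object.
ps <- phyloseq(
  otu_table(counts, taxa_are_rows = TRUE),
  sample_data(meta),
  tax_table(taxa)
)
ps
```

A tree (phy_tree) or representative sequences (refseq) could be passed to the same phyloseq() call as additional components.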
2. Phyloseq gives the user the analytical power of R
Importantly, once the data are in the R computing environment, one may easily perform beta diversity analysis using any or all of over 40 different ecological distance metrics before performing virtually any ordination under the sun. Several alpha diversity metrics are implemented in phyloseq as well. Finally, after getting the data into R, it’s easy to use R’s repertoire of extension packages to perform more sophisticated analyses than have previously been possible with this type of dataset, such as k-tables analysis (7).
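As a hedged sketch of what this looks like in practice, the code below uses the GlobalPatterns dataset bundled with phyloseq. The choice of Bray-Curtis distance, PCoA, and the two alpha diversity measures is arbitrary and only meant to show the pattern.

```r
library(phyloseq)
data(GlobalPatterns)

# Beta diversity: pairwise Bray-Curtis dissimilarity between samples.
bray <- distance(GlobalPatterns, method = "bray")

# Ordination: principal coordinates analysis on the same dissimilarity.
pcoa <- ordinate(GlobalPatterns, method = "PCoA", distance = "bray")

# Alpha diversity: one row per sample, one column per requested measure.
alpha <- estimate_richness(GlobalPatterns, measures = c("Observed", "Shannon"))
head(alpha)
```

Swapping the method or distance strings is all that is needed to try another ordination or another supported metric.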
3. Phyloseq makes standardization of sequence data pretty simple
Of particular note, the authors have included in phyloseq several methods to standardize and/or normalize high-throughput sequence data. Most of us of course realize the need for data standardization (as evidenced by our reliance on rarefaction), but the tools to easily standardize data, aside from rarefaction, have been lacking (8). The authors of phyloseq have equipped us with several methods (one new!) to standardize our microbial census data, as well as the code needed to accomplish the task (https://github.com/joey711/phyloseq/wiki/Vignettes).
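To make the idea concrete, here is a small sketch of two of the simpler options on the bundled GlobalPatterns dataset: converting counts to within-sample proportions, and rarefying to an even depth. It does not show the newer method the authors propose, and the seed value is arbitrary.

```r
library(phyloseq)
data(GlobalPatterns)

# Library sizes differ widely between samples, which motivates standardization.
range(sample_sums(GlobalPatterns))

# Option 1: convert counts to within-sample proportions.
gp_prop <- transform_sample_counts(GlobalPatterns, function(x) x / sum(x))

# Option 2: rarefy every sample to the smallest library size.
# Shown for comparison only; reference 8 argues against this practice.
gp_rare <- rarefy_even_depth(GlobalPatterns, rngseed = 711, verbose = FALSE)

# After rarefying, every sample has the same total count.
unique(sample_sums(gp_rare))
```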
4. Phyloseq makes subsetting large datasets easy
One of my favorite things about the phyloseq package is how easily it lets me subset my dataset. In my work, I study the spatial variation of oral microbial communities. I have sampled every tooth in the mouths of just a handful of research subjects, but I have samples for certain teeth from all subjects. Phyloseq makes it easy for me to take a complete OTU Table and subset it to only those teeth that were sampled in all subjects. Similarly, I can subset my OTU Table on a single bacterial phylum, on a single species, or on any covariate in my metadata mapping file, using a single line of R code.
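The one-line subsetting described above can be sketched with the bundled GlobalPatterns dataset. The SampleType and Phylum columns belong to that dataset; a tooth-level variable such as the one in my own mapping file would be used the same way but is not part of this example.

```r
library(phyloseq)
data(GlobalPatterns)

# Subset on a sample covariate: keep only the human-derived sample types.
gp_human <- subset_samples(GlobalPatterns, SampleType %in% c("Feces", "Skin", "Tongue"))

# Subset on taxonomy: keep only OTUs assigned to a single phylum.
gp_bact <- subset_taxa(GlobalPatterns, Phylum == "Bacteroidetes")

# The OTU table, taxonomy, tree, and metadata are all trimmed together.
gp_human
gp_bact
```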
5. Phyloseq enables the user to generate reproducible graphics
The authors of phyloseq created several custom ggplot2 (9) functions, enabling the phyloseq user, with just a few lines of code, to generate all of the most common graphics used in microbial census research (e.g., heatmaps, networks, ordination plots, phylogenetic trees, stacked bar plots for abundance measurements, etc.). Examples of these plots are shown in Figure 1 (though many other possibilities are supported, which can be seen here: http://joey711.github.io/phyloseq/tutorials-index).
Fig 1A. The NMDS ordination plot shows the separation of samples by weighted UniFrac distance for the Global Patterns dataset. Human-associated communities appear to cluster towards the right side of NMDS1 while non-human associated communities cluster towards the left.
Fig 1B. The dodged boxplots show three alpha diversity metrics (Observed species, Chao1, and ACE) on the Y-axis with data classified on the X-axis as either a human-associated or a non-human associated microbial community. This plot shows that non-human associated communities, in general, appear to be much more diverse than human-associated communities.
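Plots in the spirit of Figure 1 can be sketched as follows. This is not the exact code behind the figure: to keep the example quick it uses Bray-Curtis rather than weighted UniFrac for the ordination, and the "human" grouping variable is constructed here following the phyloseq tutorials.

```r
library(phyloseq)
library(ggplot2)
data(GlobalPatterns)

# Flag human-associated samples (grouping borrowed from the phyloseq tutorials).
is_human <- get_variable(GlobalPatterns, "SampleType") %in%
  c("Feces", "Mock", "Skin", "Tongue")
sample_data(GlobalPatterns)$human <- factor(is_human)

# Ordination plot: NMDS on Bray-Curtis (swapped in for weighted UniFrac).
nmds  <- ordinate(GlobalPatterns, method = "NMDS", distance = "bray")
p_ord <- plot_ordination(GlobalPatterns, nmds, color = "SampleType", shape = "human")

# Alpha diversity plot: three richness estimates, split by the human flag.
p_rich <- plot_richness(GlobalPatterns, x = "human", color = "SampleType",
                        measures = c("Observed", "Chao1", "ACE"))

# Both are ordinary ggplot2 objects, so they can be restyled with the usual layers.
p_ord + theme_bw()
p_rich + theme_bw()
```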
6. Phyloseq allows covariate data to be visualized with the phylogenetic tree
In particular, phyloseq solves very well the problem of visualizing the phylogenetic tree: it allows the user to project covariate data (such as sample habitat, host gender, etc.) onto the tree, so that relationships between microbes, microbial communities, and the habitats from which they were derived can easily be seen. As an example, Figure 2 projects three things onto the phylogenetic tree: the relative abundance of each taxon in each sample, the environment from which each sample was derived, and the bacterial order. I’ve not seen any other application that allows similar visualizations of the tree, and bootstrapping is also supported. For additional examples, refer to the phyloseq tutorial (http://joey711.github.io/phyloseq/plot_tree-examples.html).
Fig 2: An example of using phyloseq to visualize phylogenetic trees along with covariate data using the Global Patterns dataset. In this figure, the sample type is shown in color, the shapes are bacterial Order, and the size of the shapes indicates the relative abundance of the taxon in the sample.
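A tree annotated in this way can be sketched with a single plot_tree() call. The example below follows the phyloseq tutorial rather than the exact code behind Figure 2: it restricts the data to one small phylum so the tree stays readable, and it maps shape to Family instead of Order because that subset contains too few orders to be informative.

```r
library(phyloseq)
data(GlobalPatterns)

# Keep a single small phylum so the tree is readable (arbitrary choice).
gp_chl <- subset_taxa(GlobalPatterns, Phylum == "Chlamydiae")

# Each tip gets one point per sample in which the OTU was observed:
# color = sample type, shape = taxonomic family, size = abundance.
p_tree <- plot_tree(gp_chl,
                    color = "SampleType",
                    shape = "Family",
                    size = "abundance",
                    label.tips = "Genus")
p_tree
```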
7. Data & code & results can be saved together to improve scientific reproducibility
One of the key features of phyloseq is that it gives researchers on any system that can run R a framework (i.e., R, R Markdown, knitr, and RStudio) for performing reproducible statistical analysis of high-throughput sequencing data. Using phyloseq, RStudio, R Markdown, and knitr, it’s possible to see in a single .html file the data used to generate a set of figures alongside the code that generated them. I now keep a collection of reproducible research reports as part of my lab notebook, and I look forward to being able to publish the final report for my first study along with my first scientific manuscript. For an example, please see the phyloseq tutorials, which were also generated using this approach (http://joey711.github.io/phyloseq/import-data.html).
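A bare-bones version of this workflow, written as a script rather than through the RStudio interface, might look like the sketch below. The file names (GlobalPatterns.rds, report.Rmd) and the contents of the report are hypothetical, and the rendering step only runs if the rmarkdown package and pandoc are available.

```r
library(phyloseq)
data(GlobalPatterns)

work_dir <- tempdir()

# 1. Save the curated phyloseq object next to the report (hypothetical file name).
rds_path <- file.path(work_dir, "GlobalPatterns.rds")
saveRDS(GlobalPatterns, rds_path)

# 2. Write a minimal R Markdown source whose chunks load the data and plot it.
fence <- strrep("`", 3)  # the three backticks that open and close a chunk
rmd_lines <- c(
  "---",
  "title: \"Example phyloseq report\"",
  "output: html_document",
  "---",
  "",
  paste0(fence, "{r richness, message=FALSE, warning=FALSE}"),
  "library(phyloseq)",
  "ps <- readRDS(\"GlobalPatterns.rds\")",
  "plot_richness(ps, x = \"SampleType\", measures = \"Shannon\")",
  fence,
  "",
  paste0(fence, "{r session}"),
  "sessionInfo()",
  fence
)
rmd_path <- file.path(work_dir, "report.Rmd")
writeLines(rmd_lines, rmd_path)

# 3. Render to a single .html file that shows the code next to its output.
if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) {
  html_path <- rmarkdown::render(rmd_path, quiet = TRUE)
}
```

In RStudio, the equivalent of step 3 is pressing the Knit button with the .Rmd file open.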
8. Phyloseq is easy to learn
When I first began working with phyloseq, after taking a class taught by the author Susan Holmes, I knew only some basic R commands from an undergraduate statistics class. Working with phyloseq made learning R easy for me. Since phyloseq ships with a set of built-in datasets, it’s easy to reproduce the figures published in the phyloseq paper as a stepping-stone to creating figures of one’s own.
An R-based package called phyloseq makes it easy to analyze high-throughput microbial census data, visualize the data, and perform reproducible statistical analysis. With its emphasis on reproducible research, phyloseq should facilitate conversations between the researchers who publish data and those who consume it. This should help those of us working in the infancy of microbiome research ensure that our data do not disappear as quickly as Vines et al. (1) predict.
The paper and the phyloseq package are co-authored by Paul McMurdie and Professor Susan Holmes (in the picture).
1. Vines TH, Albert AY, Andrew RL, Debarre F, Bock DG, Franklin MT, et al. The Availability of Research Data Declines Rapidly with Article Age. Current Biology. 2013. doi: 10.1016/j.cub.2013.11.014. PubMed PMID: 24361065.
2. Stromberg J. smithsonianmag.com: 2014. Available from: http://blogs.smithsonianmag.com/science/2013/12/the-vast-majority-of-raw-data-from-old-scientific-studies-may-now-be-missing/.
3. Gibney E, Van Noorden R. Scientists losing data at a rapid rate. Nature News. 2014.
4. bobmanyun. reddit.com: 2014 1-7-2014. [cited 2014]. Available from: http://www.reddit.com/r/science/comments/1tb2d3/scientific_data_is_disappearing_at_alarming_rate/.
5. McMurdie PJ, Holmes S. Phyloseq: a bioconductor package for handling and analysis of high-throughput phylogenetic sequence data. Pac Symp Biocomput. 2012:235-46. Epub 2011/12/17. PubMed PMID: 22174279; PubMed Central PMCID: PMC3357092.
6. McMurdie PJ, Holmes S. phyloseq: an R package for reproducible interactive analysis and graphics of microbiome census data. PLoS One. 2013;8(4):e61217. doi: 10.1371/journal.pone.0061217. PubMed PMID: 23630581; PubMed Central PMCID: PMC3632530.
7. Thioulouse J. Simultaneous analysis of a sequence of paired ecological tables: A comparison of several methods. Ann Appl Stat. 2011;5(4):2300-25. doi: 10.1214/10-AOAS372.
8. McMurdie PJ, Holmes S. Waste Not, Want Not: Why Rarefying Microbiome Data is Inadmissible. arXiv. 2013. Epub 10/2013.
9. Wickham H. ggplot2: Elegant Graphics for Data Analysis: Springer; 2009. 213 p.