Computational, evolutionary and human genomics at Stanford

Uncovering functional variation in humans by genome and transcriptome sequencing

Tuuli Lappalainen is currently a postdoc in the Bustamante lab, but the work described here was done when she was working at the University of Geneva. In January 2014 she will start her own lab as a faculty member of the New York Genome Center. This post was previously published on Genomes Unzipped.

In a paper published in Nature in September, we describe results of the largest study to date integrating RNA and genome sequencing data from multiple human populations, and provide a comprehensive map of how genetic variation affects the transcriptome. This was achieved by RNA-sequencing of individuals that are part of the 1000 Genomes sample set, thus adding a functional dimension to the most important catalogue of human genomes. In this blog post I will discuss how our findings shed light on genetic associations to disease.

As genome-wide studies are providing an increasingly comprehensive catalog of genetic variants that predispose to various diseases, we are faced with a huge challenge: what do these variants actually do in the cell? Understanding the biological mechanisms underlying diseases is essential to develop interventions, but traditional molecular biology follow-up is not really feasible for the thousands of discovered GWAS loci. Thus, we need high-throughput approaches for measuring genetic effects at the cellular level, which is an intermediate between the genome and the disease. The cellular trait most amenable to such analysis is the transcriptome, which we can now measure reliably and robustly by RNA-sequencing (as shown by our companion paper in Nature Biotechnology).

In this project, several European institutes of the Geuvadis Consortium sequenced mRNA and small RNA from lymphoblast cell lines from 465 individuals that are in the 1000 Genomes sample set. The idea of gene expression analysis of genetic reference samples is not new (see e.g. papers by Stranger et al., Pickrell et al. and Montgomery et al.), but the bigger scale and better quality that are now possible enable the discovery of exciting new biology, as demonstrated by other recent RNA-seq papers as well (e.g. Battle et al. and Gutierrez-Arcelus et al.).

Regulatory variants underlying GWAS signals

Our first striking observation was that over one half of measured genes are affected by common genetic variants in human populations, known as expression quantitative trait loci or eQTLs. Regulatory associations are not like GWAS studies where you are lucky to find a handful of significant hits; regulatory variation is literally (almost) everywhere – it’s the rule, not the exception.
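For readers who have not run an eQTL test, here is a minimal sketch of the basic idea: regress a gene's expression level on genotype dosage (0, 1 or 2 copies of an allele) and ask whether the association is stronger than expected by chance. This is an illustration on simulated data, not the pipeline used in the paper; the function names, the simulated effect size and the permutation scheme are all assumptions made for the example.

```python
import math
import random


def eqtl_fit(genotypes, expression):
    """Least-squares slope and Pearson r of expression on genotype dosage."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_e = sum(expression) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    syy = sum((e - mean_e) ** 2 for e in expression)
    sxy = sum((g - mean_g) * (e - mean_e)
              for g, e in zip(genotypes, expression))
    slope = sxy / sxx                  # expression change per allele copy
    r = sxy / math.sqrt(sxx * syy)     # strength of the association
    return slope, r


def permutation_p(genotypes, expression, n_perm=1000, seed=0):
    """Empirical p-value: shuffle expression to break any real association."""
    rng = random.Random(seed)
    _, r_obs = eqtl_fit(genotypes, expression)
    shuffled = list(expression)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        _, r_perm = eqtl_fit(genotypes, shuffled)
        if abs(r_perm) >= abs(r_obs):
            hits += 1
    return (hits + 1) / (n_perm + 1)


def simulate(n=200, effect=1.0, seed=1):
    """Hypothetical data: dosage 0/1/2 and expression = effect * dosage + noise."""
    rng = random.Random(seed)
    genotypes = [rng.choice([0, 1, 2]) for _ in range(n)]
    expression = [effect * g + rng.gauss(0.0, 1.0) for g in genotypes]
    return genotypes, expression


if __name__ == "__main__":
    g, e = simulate()
    slope, r = eqtl_fit(g, e)
    print("slope per allele:", round(slope, 3), "r:", round(r, 3))
    print("permutation p-value:", permutation_p(g, e))
```

A real analysis repeats a test like this for many variants near each gene and corrects for multiple testing and for confounders such as population structure, which this sketch leaves out.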

The vast majority of these regulatory variants won’t have any effect on the phenotype at the individual level, but some of them do. The first obvious question was how many known GWAS variants are eQTLs in our study, and indeed quite a few of them are – 16%. So does this prove that in all these GWAS regions we have identified the regulatory change as the cellular mechanism that drives the disease? Unfortunately the answer is no. Regulatory associations are so common that the expected overlap just by chance is as high as 11%. This means that your favorite GWAS variant having a significant regulatory association is very far from sufficient proof of it being the biological mechanism of the disease or trait. The same applies to overlap with, for example, ENCODE annotations, by the way. This is not overcautious small print. We’ve basically reversed the problem of having hardly any clue of functional mechanisms to having too many putative functions. We’ve found the haystack.
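To make the size of the haystack concrete, the two percentages above can be compared directly. The short sketch below only does the arithmetic on the 16% observed and 11% chance figures quoted in this post; the function name is hypothetical and nothing here reproduces the enrichment analysis in the paper.

```python
def overlap_enrichment(observed_fraction, chance_fraction):
    """Compare an observed GWAS-eQTL overlap with the overlap expected by chance."""
    fold = observed_fraction / chance_fraction     # how many times above chance
    excess = observed_fraction - chance_fraction   # overlap beyond chance
    return fold, excess


if __name__ == "__main__":
    # 16% of GWAS variants are eQTLs; 11% would be expected by chance alone.
    fold, excess = overlap_enrichment(0.16, 0.11)
    print("fold enrichment:", round(fold, 2))
    print("excess overlap (fraction of GWAS variants):", round(excess, 2))
```

Read this way, the observed overlap is only about one and a half times the chance expectation, and only around five percentage points of GWAS variants overlap eQTLs beyond what chance would give, which is why overlap alone says little about any single locus.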

How can we solve this problem? Luckily, there are statistical methods to analyze the two association signals in the same genomic region to find out if the gene expression association is likely to be causal to the disease association. You still can’t be 100% sure, but that is much smaller print. And we do find an enrichment of such a signal, as in previous studies – telling us that regulatory changes are enriched for being causal biological mechanisms underlying GWAS signals.

From associated regions to causal variants

We can take this analysis an important step further to pinpoint likely causal variants. Thus far, nearly all association studies have used data from SNP arrays that measure only a subset of all the common variants. This works fine for identifying more or less broad regions of the genome that have a variant somewhere that changes the function of the genome such that it predisposes to the trait in question. However, usually there’s no clue what the precise causal variant is and what its exact properties are.

The first step in finding the causal variants is getting genome sequencing data, which is what we have in our study. We show that we have pretty good power to pinpoint causal regulatory variants in many of the loci, which is great news for understanding mechanisms of genome regulation. This has a cool application for dozens of GWAS loci that are driven by a regulatory association: by discovering the putative causal regulatory variant from our association data, we’re at the same time pinpointing the likely causal GWAS variant as well. Thus, combining genome sequencing and cellular phenotype data can not only give us information on the biological mechanisms underlying GWAS associations, but also identify the likely causal variants.

Figure: The example of the DGKD gene illustrates the power of our eQTL data to map causal functional variants. The plots show the eQTL association landscape, where the top eQTL variant rs201966773 is the most likely causal variant. This variant is a 2bp insertion that is not genotyped by any SNP arrays, and overlaps several regulatory elements close to the transcript start site of some of the transcripts of the gene. The rs838705 variant marked in red is a GWAS variant associated with calcium levels – and the eQTL analysis suggests that rs201966773 is the most likely causal variant for this association signal.

Conclusions and future perspectives

In this study we have integrated genome and transcriptome sequencing data to understand the landscape of functional variation in human populations. In addition to our scientific discoveries, this is an extremely valuable open-access data set for the human genetics community, as it links directly to the 1000 Genomes data that is used by nearly all human genetics projects. Since our pre-publication data release in November 2012, the data set has already been downloaded thousands of times, and we’ve put a lot of effort into open data sharing by having a browser and even opening our project wiki for the public.

This paper is a big step forward, but we’re still far from a full understanding of how genetic variation affects the transcriptome and how this affects human disease. One important challenge is to understand the cellular effects of rare and loss-of-function variants, which we address only briefly in this paper. Furthermore, other projects such as GTEx are describing transcriptome variation and its genetic causes in a large variety of human tissues. Many of the contributors to the Nature paper are participating in this work as well, including myself, and the Bustamante, Montgomery and Koller labs from Stanford – so stay tuned.

This study and other projects that analyze cellular phenotypes in the general human population are providing the baseline of the general population spectrum of functional genetic variation and transcriptome variation, which is essential to be able to distinguish the cases where things go wrong and cause disease. At the same time, as we move forward with basic research, it is important to push for clinical applications to target cellular perturbations leading to disease, and develop approaches for personalized transcriptomics to better interpret personalized genomes.

Reference: Tuuli Lappalainen, Michael Sammeth, Marc R. Friedländer, Peter A. C. ’t Hoen, Jean Monlong, Manuel A. Rivas, Mar Gonzàlez-Porta, Natalja Kurbatova, Thasso Griebel, Pedro G. Ferreira, Matthias Barann, Thomas Wieland, Liliana Greger, Maarten van Iterson, Jonas Almlöf, Paolo Ribeca, Irina Pulyakhina, Daniela Esser, Thomas Giger, Andrew Tikhonov, Marc Sultan, Gabrielle Bertier, Daniel G. MacArthur, Monkol Lek, Esther Lizano, Henk P. J. Buermans, Ismael Padioleau, Thomas Schwarzmayr, Olof Karlberg, Halit Ongen, Helena Kilpinen, Sergi Beltran, Marta Gut, Katja Kahlem, Vyacheslav Amstislavskiy, Oliver Stegle, Matti Pirinen, Stephen B. Montgomery, Peter Donnelly, Mark I. McCarthy, Paul Flicek, Tim M. Strom, The Geuvadis Consortium, Hans Lehrach, Stefan Schreiber, Ralf Sudbrak, Ángel Carracedo, Stylianos E. Antonarakis, Robert Häsler, Ann-Christine Syvänen, Gert-Jan van Ommen, Alvis Brazma, Thomas Meitinger, Philip Rosenstiel, Roderic Guigó, Ivo G. Gut, Xavier Estivill & Emmanouil T. Dermitzakis (2013). Transcriptome and genome sequencing uncovers functional variation in humans. Nature, 501 (7468), 506-11. PMID: 24037378