Learning from 69 sequenced Y chromosomes

Why the Y?

Blog author Amy Goldberg is a graduate student in Noah Rosenberg's lab.

While mitochondrial genomes have been extensively sequenced for decades because of their short length and abundance, the Y chromosome has been under-studied.  Unlike autosomal DNA, the mitochondria and (most of) the Y chromosome are inherited exclusively maternally and paternally, respectively.  Therefore, they do not undergo meiotic recombination.  Without recombination, mutations accumulate on a stable background, preserving a wealth of information about population history.  Each background, shared through a common ancestor, is called a haplogroup.  To leverage this information, Poznik et al. set out to sequence 69 males from nine diverse human populations, including a large representation of African individuals.  The paper, published in Science last summer, is by Stanford graduate student David Poznik and a group led by CEHG professor Dr. Carlos Bustamante.

The structure of the Y chromosome is complex, with large heterochromatic regions, pseudo-autosomal regions that recombine with the X chromosome, and repetitive elements, all of which make mapping reads difficult.  However, the Y chromosome is haploid, which allows accurate variant calls at lower coverage than on the autosomes, where heterozygous genotypes must also be distinguished.  Using high-throughput sequencing (3.1x mean coverage) and a haploid expectation-maximization algorithm, Poznik et al. called genotypes with an error rate around 0.1%.  The paper developed important methods for analyzing high-throughput sequences of the difficult Y chromosome, including determining the subset of regions within which accurate genotypes can be called.
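
To see why haploidy helps at low coverage, here is a minimal Python sketch of a haploid genotype call at a single site. It is not the expectation-maximization algorithm from the paper: the function name `haploid_posterior`, the per-read error rate of 1%, and the fixed prior of 0.1% for the alternate allele are all illustrative assumptions. The point is only that, with two possible genotypes instead of three, a few concordant reads already give a confident call.

```python
import math

def haploid_posterior(ref_reads, alt_reads, error_rate=0.01, prior_alt=0.001):
    """Posterior probability that a haploid site carries the alternate allele.

    error_rate and prior_alt are illustrative assumptions, not values from the paper.
    """
    # Log-likelihood of the observed reads under each of the two haploid genotypes.
    # (The binomial coefficient is the same for both and cancels out.)
    log_lik_ref = ref_reads * math.log(1 - error_rate) + alt_reads * math.log(error_rate)
    log_lik_alt = alt_reads * math.log(1 - error_rate) + ref_reads * math.log(error_rate)

    # Combine with the prior and normalise in a numerically stable way.
    log_post_ref = log_lik_ref + math.log(1 - prior_alt)
    log_post_alt = log_lik_alt + math.log(prior_alt)
    top = max(log_post_ref, log_post_alt)
    weight_ref = math.exp(log_post_ref - top)
    weight_alt = math.exp(log_post_alt - top)
    return weight_alt / (weight_ref + weight_alt)

# Three reads, all showing the alternate allele: about the mean coverage in the study.
three_alt = haploid_posterior(ref_reads=0, alt_reads=3)
# A single alternate read is not enough to overcome the prior in this toy model.
one_alt = haploid_posterior(ref_reads=0, alt_reads=1)
print(three_alt, one_alt)
```

In an EM setting the prior would be re-estimated from the data rather than held fixed, which this sketch does not attempt.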

Reconstructing the human Y-chromosome tree

Poznik et al. constructed a phylogenetic tree of the Y chromosome using sequence data and a maximum likelihood approach.  While the overall structure of the tree was known, Poznik et al. were able to accurately calculate branch lengths based on the number of variants differing between individuals and resolve previously indeterminate finer structure.

Figure 2 of the paper: Y-chromosome phylogeny inferred from genomic sequencing.

Incredible African Diversity: One of the key findings of the paper was the depth of diversity within African lineages.  While both uniparental and autosomal markers have indicated an African root for human diversity, Poznik et al. find lineages within a single population, the San hunter-gatherers, that coalesce almost at the same time as the entire tree (see haplogroup A).  This indicates that African diversity and structure have existed for tens of thousands of years, and there is likely more to discover.  A large sample of African populations was considered, which led to the discovery of previously unseen structure within haplogroup B2, including structure not mirrored by modern population clustering, that dates to approximately 35,000 years ago.

Evidence of population expansion: Short internal branches of the tree, such as those seen within haplogroup E and the non-African group FT, indicate periods of rapid population growth.  When a population expands quickly, new variants that might otherwise drift to extinction can persist.  A large number of coalescence events occur at the time of growth, as there were fewer lineages alive in the population before this time.  For non-African haplogroups, this pattern is likely a remnant of the Out of Africa migration.  For haplogroup E, this corresponds to the Bantu agricultural expansion.

Resolved Eurasian polytomy: Previously, the topology of the Eurasian tree separating haplogroups G-H-IJK was unresolved.  Because of the higher-coverage sequencing in this study, Poznik et al. found a single variant, a C to T transition, that differentiates G from the other groups.  Haplogroup G retains the ancestral variant, while H-IJK share the derived variant and are therefore more closely related to each other.

Sequencing vs. genotyping

In contrast to the data used in previous studies, which analyzed small repetitive elements called microsatellites or small sets of single base-pair changes called SNPs, whole-genome sequencing data contain not only more information, but potentially more accurate information.  In particular, before the advent of high-throughput sequencing, SNPs were usually ascertained in a subset of individuals that did not capture worldwide diversity levels.  Therefore, diversity measures based on those SNPs are often underestimated and biased.  Without sequence data, the branch lengths of the tree did not have a meaningful interpretation, and the depth of variation within Africa was not seen.

MRCA of Human Maternal and Paternal Lineages

The publication of Poznik’s paper last year spurred a lot of public discussion.  The discussion mainly focused on the result that, contrary to previous estimates, the most recent common ancestor (MRCA) of all mitochondrial DNA lived at a similar time to that of all Y chromosomes.  Previous estimates put the mitochondrial time to the MRCA (TMRCA) around 200 thousand years ago, with the Y chromosome coalescing a bit over 100 thousand years ago.  These different estimates for Y and mitochondria were often obtained through different sequencing and analysis methods, and are therefore not directly comparable.  In particular, varying estimates of the mutation rates have led to different TMRCA estimates.  By analyzing both the Y and mitochondria in the same framework, calibrated by archeological evidence and within-species comparisons, Poznik et al. found largely overlapping confidence intervals for the TMRCA of both Y and mitochondria.

But should the coalescence times of the mitochondria and the Y chromosome be the same? Not necessarily.  While discrepancies between the mitochondria and Y chromosome have often been interpreted as evidence of sex-biased population histories or sizes, strictly neutral models can also predict large differences between the two.  Because neither the analyzed part of the Y chromosome nor the mitochondria undergo recombination, each acts as a single locus and therefore represents the history of a single lineage.  For a given population history, there is a wide distribution of times at which lineages coalesce, and these two loci represent only two draws from that distribution, with largely independent histories (given the overall population history).  They may therefore differ by chance alone.  Similarly, different loci across autosomal DNA have TMRCAs ranging from thousands to millions of years.  Additionally, because each is a single locus, any effects of selection would distort the entire genealogy of the Y chromosome or the mitochondria.
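
How wide is that distribution? A rough sense can be had from a small simulation of the standard neutral coalescent with constant population size. This is a textbook model chosen for illustration, not the demographic framework fitted by Poznik et al., and the function name `simulate_tmrca` and the number of replicates are assumptions. Time is measured in coalescent units (for a haploid, uniparentally inherited locus, one unit corresponds to the effective population size in generations).

```python
import random

def simulate_tmrca(n_lineages, rng):
    """Draw one TMRCA, in coalescent time units, for a sample of n lineages."""
    total_time = 0.0
    for k in range(n_lineages, 1, -1):
        # While k lineages remain, pairs coalesce at total rate k choose 2.
        rate = k * (k - 1) / 2
        total_time += rng.expovariate(rate)
    return total_time

rng = random.Random(1)
n_samples = 69        # matches the number of sequenced males in the study
n_replicates = 10000  # arbitrary choice for this illustration

draws = sorted(simulate_tmrca(n_samples, rng) for _ in range(n_replicates))
mean_tmrca = sum(draws) / len(draws)
low, high = draws[250], draws[9749]  # central 95% of simulated TMRCAs

print("mean TMRCA:", mean_tmrca)
print("central 95% interval:", low, high)

# Two independent draws stand in for two unlinked single loci, such as the
# Y chromosome and the mitochondria under identical neutral histories.
print("two independent loci:", simulate_tmrca(n_samples, rng), simulate_tmrca(n_samples, rng))
```

Under this model the theoretical mean is 2(1 - 1/n) coalescent units, and the interval printed above spans a several-fold range, which is the sense in which two single loci can differ substantially by chance.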

Future directions

Human population history is far from fully fleshed out, and Poznik et al. provide a framework to leverage increasingly available high-throughput sequencing of Y chromosomes.  The method used to calculate the mutation rate and TMRCA is a valuable contribution in itself, with applications to a wide range of evolutionary and ecological questions.  This study demonstrated that we have only characterized a fraction of worldwide diversity, particularly in Africa, and that increased sampling will be critical to parsing close and far ties in human history.


Poznik GD, Henn BM, Yee MC, Sliwerska E, Euskirchen GM, Lin AA, Snyder M, Quintana-Murci L, Kidd JM, Underhill PA, Bustamante CD. Sequencing Y chromosomes resolves discrepancy in time to common ancestor of males versus females. Science. 2013 Aug 2;341(6145):562-5. doi: 10.1126/science.1237619.

Paper author David Poznik is a PhD student in Carlos Bustamante's lab.

Genomic analyses of ancestry of Caribbean populations

Blog author Rajiv McCoy is a graduate student in the lab of Dmitri Petrov.

In the Author Summary of their paper, “Reconstructing the Population Genetic History of the Caribbean”, Andrés Moreno-Estrada and colleagues point out that Latinos are often falsely depicted as a homogeneous ethnic or cultural group.  In reality, however, Latinos, including inhabitants of the Caribbean basin, represent a diverse mixture of previously separate human populations, such as indigenous groups, European colonists, and West Africans brought over during the Atlantic slave trade.  This mixing process, which geneticists call “admixture”, left a distinct footprint on genetic variation within and between Caribbean populations.  By surveying genotypes of 330 Caribbean individuals and comparing them to a database of variation from more than 3000 individuals from European, African, and Native American populations, Moreno et al. explore the genomic outcomes of this complex admixture process and reveal intriguing demographic patterns that could not be obtained from the historical record alone.  The paper, featured in the latest edition of PLOS Genetics, represents a collaborative project with co-senior authorship by Stanford CEHG professor Carlos Bustamante and Professor Eden Martin from the University of Miami Miller School of Medicine.

Reconstructing the demographic history of admixed populations

Because parental DNA is only moderately shuffled before being incorporated into gametes (the process of meiotic recombination), admixture results in discrete genomic segments that can be traced to a particular ancestral population.  In early generations after the onset of admixture, these segments are large.  However, after many generations of recombination, the segments become quite small.  By investigating the distribution of sizes of these ancestry “tracts”, Moreno and colleagues inferred the timing of various waves of migration and admixture.  For Caribbean Island populations, they infer that European gene flow first occurred ~16-17 generations ago, which closely matches the historical record of ~500 years, assuming ~30 years per generation.  In contrast, for neighboring mainland populations from Colombia and Honduras, they find that European gene flow occurred in waves, starting more recently (~14 generations ago).
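
The arithmetic behind these dates is easy to check, and a back-of-the-envelope model shows why tract sizes carry timing information. In the Python sketch below, the conversion to years uses the ~30 years per generation stated above. The tract-length function is a deliberately crude assumption (a single pulse of admixture, with tract boundaries accumulating at roughly one per Morgan per generation, so the mean tract length is about 100/T centimorgans after T generations); it is not the model fitted in the paper, and both function names are made up for illustration.

```python
YEARS_PER_GENERATION = 30  # assumption stated in the text

def generations_to_years(generations, years_per_generation=YEARS_PER_GENERATION):
    """Convert a time in generations to a time in years."""
    return generations * years_per_generation

def rough_mean_tract_length_cm(generations):
    """Very rough mean ancestry tract length in centimorgans.

    Simplifying assumption (not the paper's model): one pulse of admixture,
    with recombination breaking tracts at about one crossover per Morgan per
    generation, giving a mean length of about 100 / T centimorgans.
    """
    return 100.0 / generations

for t in (14, 16, 17):
    print(t, "generations:", generations_to_years(t), "years,",
          "mean tract about", rough_mean_tract_length_cm(t), "cM")
```

With these assumptions, 16 and 17 generations correspond to 480 and 510 years, bracketing the ~500 years of the historical record, and older admixture leaves shorter tracts than more recent admixture.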

Identifying sub-continental ancestry of admixed individuals

Those familiar with human population genetics will recognize principal component analysis (PCA), which transforms a matrix of correlated observed genotypes into a set of uncorrelated variables where the first component explains the most possible variance, the second component explains the second most variance, and so on.  Individuals’ transformed genotypes can be plotted on the first two principal components, and when performed on a worldwide scale, distinct clusters appear which represent populations of ancestry.  On conventional PCA plots, admixed individuals fall between their different ancestral populations, as they possess sets of genotypes diagnostic of multiple ancestral groups.  As virtually all Caribbean individuals are admixed to some degree, this pattern is apparent for Caribbean populations (see Figure 1B of the paper).
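
The "admixed individuals fall in between" pattern can be reproduced with a toy simulation. The Python sketch below uses only the standard library: it simulates two hypothetical source groups with independent allele frequencies, plus a group whose alleles are drawn from either source with equal probability, and then finds the first principal component by power iteration. The group labels, sample sizes, and allele-frequency ranges are all assumptions for illustration, and real analyses would use a dedicated PCA implementation.

```python
import random

def simulate_genotypes(rng, n_per_group=20, n_snps=300):
    """Simulate diploid genotypes (0, 1 or 2) for groups A, B and an admixed group."""
    freq_a = [rng.uniform(0.1, 0.9) for _ in range(n_snps)]
    freq_b = [rng.uniform(0.1, 0.9) for _ in range(n_snps)]

    def individual(prob_from_a):
        genotype = []
        for pa, pb in zip(freq_a, freq_b):
            count = 0
            for _ in range(2):  # two allele copies per SNP
                freq = pa if rng.random() < prob_from_a else pb
                count += 1 if rng.random() < freq else 0
            genotype.append(count)
        return genotype

    labels, rows = [], []
    for label, prob_from_a in (("A", 1.0), ("B", 0.0), ("admixed", 0.5)):
        for _ in range(n_per_group):
            labels.append(label)
            rows.append(individual(prob_from_a))
    return labels, rows

def first_pc_scores(rows, n_iter=200):
    """Return each individual's position on PC1, up to scale and sign."""
    n, m = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in rows]
    # Individual-by-individual Gram matrix; its leading eigenvector is
    # proportional to the individuals' scores on the first principal component.
    gram = [[sum(a * b for a, b in zip(x, y)) for y in centered] for x in centered]
    vec = [float(i + 1) for i in range(n)]  # arbitrary starting vector
    for _ in range(n_iter):
        product = [sum(g * v for g, v in zip(gram_row, vec)) for gram_row in gram]
        norm = sum(x * x for x in product) ** 0.5
        vec = [x / norm for x in product]
    return vec

rng = random.Random(0)
labels, rows = simulate_genotypes(rng)
scores = first_pc_scores(rows)
group_means = {
    group: sum(s for l, s in zip(labels, scores) if l == group) / labels.count(group)
    for group in ("A", "B", "admixed")
}
print(group_means)
```

In this toy setup the two source groups sit at opposite ends of the first component and the admixed group lands between them, mirroring the pattern described above.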


While interesting, this means that the sub-continental ancestry of these admixed individuals is difficult to ascertain.  An individual may want to know which Native American, West African, and European populations contribute to his or her ancestry, and this analysis does not have sufficient resolution to answer these questions.

Moreno and colleagues therefore devised a new version of PCA called ancestry-specific PCA (ASPCA), which extracts genomic segments assigned to Native American, West African, and European ancestry, then analyzes these segments separately, dealing with the large proportions of missing data that result.  In the case of Native American ASPCA, they observe two overlapping clusters.  The first represents mostly Colombians and Hondurans, who cluster most closely with indigenous groups from Western Colombia and Central America and have a greater overall proportion of Native American ancestry.  The second cluster represents mostly Cubans, Dominicans, and Puerto Ricans, who cluster most closely with Eastern Colombian and Amazonian indigenous groups.  This makes sense in light of the fact that Amazonian populations from the Lower Orinoco Valley settled on rivers and streams, which could have facilitated their migration.  Because indigenous ancestry proportions were relatively consistent and closely clustered across different Caribbean Islands, the authors posit that there was a single pulse of expansion of Amazonian natives across the Caribbean prior to European arrival, along with gene flow among the islands.
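
The extraction step, and the missing data it creates, can be pictured with a very small Python sketch. It assumes that each position on a haplotype has already been assigned an ancestry label by some upstream local-ancestry method; the labels (`"EUR"`, `"NAT"`, `"AFR"`), the example haplotype, and the function names are hypothetical, and the subsequent PCA on incomplete data, which is the hard part of ASPCA, is not shown.

```python
def mask_to_ancestry(alleles, ancestry, target):
    """Keep alleles on segments assigned to `target`; mark everything else as missing."""
    return [allele if label == target else None
            for allele, label in zip(alleles, ancestry)]

def missing_fraction(masked):
    """Fraction of positions that were masked out."""
    return sum(value is None for value in masked) / len(masked)

# Hypothetical haplotype: one allele (0 or 1) per SNP, plus one ancestry label per SNP.
alleles = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]
ancestry = ["EUR", "EUR", "EUR", "NAT", "NAT", "AFR", "AFR", "AFR", "AFR", "EUR"]

native_only = mask_to_ancestry(alleles, ancestry, "NAT")
print(native_only)
print(missing_fraction(native_only))
```

Here only two of the ten positions carry the target ancestry, so most of the haplotype is missing after masking, which illustrates why handling missing data is central to the method.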

In the case of European ASPCA, Moreno et al. found that Caribbean samples clustered closest to, but clearly distinct from, present-day individuals from the Iberian Peninsula in Southern Europe.  In fact, the differentiation between this “Latino-specific component” and Southern Europe is at least as great as the differentiation between Northern and Southern Europe.  The authors hypothesize that this is due to very small population sizes among European colonists, which would have introduced noise into patterns of genomic variation through the process of random genetic drift.

Finally, the authors demonstrate that Caribbean populations have a higher proportion of African ancestry compared to mainland American populations, a result of admixture during and after the Atlantic slave trade.  Surprisingly, the authors found that all samples tightly clustered with present-day Yoruba samples from Nigeria rather than being dispersed throughout West Africa.  However, because other analyses suggested that there might have been two major waves of migration from West Africa, the authors decided to analyze “old” and “young” blocks of African ancestry separately.  This analysis revealed that “older” segments are primarily derived from groups from the Senegambia region of Northwest Africa, while “younger” segments likely trace to groups from the Gulf of Guinea and Equatorial West Africa (including the Yoruba).

Conclusions and perspectives

This groundbreaking study has immediate implications for the field of personalized medicine, especially due to the discovery of a distinct Latino-specific component of European ancestry.  The hypothesis that European colonists underwent a demographic bottleneck (a process termed the “founder effect”) has expected consequences for the frequency of damaging mutations contributing to genetic disease. The observation of extensive genetic differences among Caribbean populations also argues for more such studies characterizing genetic variation on a smaller geographic scale. The newly developed ASPCA method will surely be valuable for other admixed populations.  In addition to medical implications, studies such as this help dispel simplistic notions of race and ethnicity and inform cultural identities based on unique and complex demographic history.

Citation: Moreno-Estrada A, Gravel S, Zakharia F, McCauley JL, Byrnes JK, et al. (2013) Reconstructing the Population Genetic History of the Caribbean. PLoS Genet 9(11): e1003925. doi:10.1371/journal.pgen.1003925

Paper author Andrés Moreno-Estrada is a research associate in the lab of Carlos Bustamante.
