Saturday, March 29, 2008

Why do genome-wide scans fail?

The successes of genome-wide association studies (GWAS) in identifying genetic risk factors for common diseases have been heavily publicised in the mainstream media - barely a week goes by these days that we don't hear about another genome scan that has identified new risk genes for diabetes, lupus, cardiac disease, or any of the other common ailments of Western civilisation.

Some of this publicity is well-founded: for the first time in human history, we have the power to identify the precise genetic differences between human beings that contribute to variation in disease susceptibility. If we can document all of the factors, both genetic and environmental, that result in common disease we will be able to target early interventions to the individuals who are most susceptible. Every GWAS success brings us closer to the long-awaited era of personalised medicine.

But while the media trumpet the successes of genome scans, little attention is paid to their failures. The fact remains that despite the hundreds of millions of dollars spent on genome-wide association studies, most of the genetic variance in risk for most common diseases remains undiscovered. Indeed, some common diseases with a strong heritable component, such as bipolar disease, have remained almost completely resistant to GWAS.

Where is this heritable risk hiding? It now seems likely that it's lurking in a number of different places, with the fraction of the risk in each category varying from disease to disease. This post serves as a generic list of the dark regions of the genome currently inaccessible to GWAS, with some discussion of the techniques that will likely prove useful in mapping risk variants in these areas. I'll be referring back here over the next few months as I discuss both successful and unsuccessful gene-hunting expeditions in a variety of diseases. I consider this list a work in progress and would welcome any suggestions from readers on expanding and refining it.

Alleles with small effect sizes
The problem: The ability to simultaneously examine hundreds of thousands of variants throughout the genome is both the strength and the weakness of the GWAS approach. The power of GWAS is that they provide a relatively unbiased examination of the entire genome for common risk variants; their weakness is that in doing so, they swamp the signal from true risk variants with statistical noise from the vast numbers of markers that aren't associated with disease. To separate true signals from noise, researchers have to set an exceptionally high threshold that a marker needs to exceed before it is accepted as a likely disease-causing candidate. That reduces the problem of false positives, but it also means that any true disease markers with small effects are lost in the background noise.
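To put rough numbers on that threshold problem, here is a minimal sketch. The marker count and the 5% scan-wide error rate are illustrative assumptions rather than figures from any particular study, and the Bonferroni correction used here is just one common, conservative way of setting the bar.

```python
# Illustrative assumptions only: neither number comes from a specific study.
N_MARKERS = 500_000        # markers tested in one hypothetical scan
FAMILY_WISE_ALPHA = 0.05   # tolerated chance of any false positive across the scan


def expected_false_positives(p_threshold: float, n_null_markers: int) -> float:
    """Expected number of non-associated markers that pass a per-marker threshold."""
    return p_threshold * n_null_markers


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-marker p-value threshold that holds the scan-wide error rate at alpha."""
    return alpha / n_tests


if __name__ == "__main__":
    # At the conventional p < 0.05, tens of thousands of null markers pass by chance.
    print(expected_false_positives(0.05, N_MARKERS))
    # The corrected per-marker threshold is on the order of 1e-7.
    print(bonferroni_threshold(FAMILY_WISE_ALPHA, N_MARKERS))
```

A marker with a genuinely small effect has to clear that corrected threshold, which is why so many true signals stay buried.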

The solution: This seems to be one problem that will need to be solved, at least to some extent, with sheer brute force. By increasing the numbers of samples in their disease and control groups, researchers will steadily dial down the statistical noise from non-associated markers until even disease genes with small effects stand out above the crowd. As the cost of genotyping (and sequencing) tumbles ever downward, such an approach will become more and more feasible; however, the logistical challenge of collecting large numbers of carefully-ascertained patients will always be a serious obstacle.
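The brute-force argument can be sketched with a simple power calculation. This is an approximation under stated assumptions: a two-sided z-test comparing allele frequencies between cases and controls, an allele frequency of 0.3 in controls, a per-allele odds ratio of 1.1, and a threshold of 1e-7. All of those values are hypothetical, and real studies use more careful models.

```python
from math import sqrt
from statistics import NormalDist


def case_allele_freq(control_freq: float, odds_ratio: float) -> float:
    """Allele frequency in cases implied by the control frequency and an allelic odds ratio."""
    odds = odds_ratio * control_freq / (1 - control_freq)
    return odds / (1 + odds)


def approx_power(n_cases: int, n_controls: int, control_freq: float,
                 odds_ratio: float, alpha: float) -> float:
    """Approximate power of a two-sided z-test on allele frequencies.

    Each person contributes two alleles; the negligible opposite tail is ignored.
    """
    p0 = control_freq
    p1 = case_allele_freq(p0, odds_ratio)
    se = sqrt(p1 * (1 - p1) / (2 * n_cases) + p0 * (1 - p0) / (2 * n_controls))
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(p1 - p0) / se - z_crit)


if __name__ == "__main__":
    # Hypothetical small-effect variant: power climbs from near zero to near one.
    for n in (1_000, 10_000, 100_000):
        print(n, round(approx_power(n, n, 0.3, 1.1, 1e-7), 3))
```

Under these assumptions a thousand cases and controls give essentially no chance of detection, while a hundred thousand of each make it close to certain.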

Rare variants
The problem: Current genome scan technology relies heavily on the "common disease, common variant" (CDCV) assumption, which states that the genetic risk for common disease is mostly attributable to a relatively small number of common genetic variants. This is largely an assumption of convenience: firstly, our catalogue of human genetic variation (built up by efforts such as the HapMap project) is largely restricted to common variants, since rare variants are much harder to identify; and secondly, chip-makers have restrictions on how many different SNPs they can analyse on a single chip, so the natural tendency has been to cram in the high-frequency variants that capture the largest proportion of genetic variation per probe. There is also some theoretical justification for this assumption based on models of human demographic history, but these models are themselves based on numerous assumptions, and the argument may not apply equally to all common human diseases.

In any case, everyone agrees that some non-trivial fraction of the genetic risk of common diseases will be the result of rare variants, and the latest results from GWAS in a variety of diseases have failed to provide unambiguous support for the CDCV hypothesis. Whatever the proportion of variance that turns out to be explained by rare variants, current GWAS technologies are essentially powerless to unravel it.

The solution: Increasing sample sizes may help a little, but the fundamental problem is the inability of current chips to tag rare variation. Short-term, the solution will be higher-density SNP chips incorporating lower frequency variants identified by large-scale sequencing projects like the 1000 Genomes Project. However, such approaches will have diminishing returns: as chip-makers lower the frequency of the variants on their chips, the number of probes that will have to be added to capture a reasonable fraction of total genetic variation will increase exponentially, with each new probe adding only a minute increase in power.

Ultimately, the answer lies in large-scale sequencing, which will provide a complete catalogue of every variant in the genomes of both patients and controls. The problem here is not so much the sequencing itself - the costs of sequencing are currently plummeting due to massive investment in rapid sequencing technologies - but in the interpretation. Whole new analytical techniques will be required to convert these data into useful information.

Population differences
The problem: Over the last 50 to 100 thousand years modern humans have enthusiastically colonised much of the world's landmass. Each wave of expansion has carried with it a fraction of the genetic variation of its ancestral population, along with a few novel variants acquired through mutation. In each new habitat encountered, natural selection has acted to increase the frequency of variants that provided an advantage, and cull those that were harmful, while the rest of the genome passively gained and lost genetic variation. The end result is a set of human populations that, while extremely similar across the genome as a whole, can carry quite different sets of genetic variants relevant to disease. In addition, the correlation between markers close together in the genome (known as linkage disequilibrium) can also differ between populations, so that a marker that is tightly correlated with a disease variant in one population may be only weakly associated in other groups.

These differences have profound implications for disease gene mapping efforts. As a result of this variation, markers that are associated with disease in one population can never be assumed to show the same associations in other human groups (this will be especially true for rare variants, of course). Current GWAS have been dominated by subjects of Western European ancestry, and our understanding of genetic risk variants in non-European populations is almost non-existent. In addition, these differences mean that mixing people with different ancestries together in a disease cohort can seriously confound the identification of causative genes - in certain situations, such mixing can greatly increase the risk of false positive findings.
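The false-positive risk from mixing ancestries can be shown with a few lines of arithmetic. In this sketch the two populations, their disease prevalences and their allele frequencies are all invented for illustration; the key assumption is that the allele has no effect on disease within either population.

```python
def pooled_allele_freqs(populations):
    """Return (case_freq, control_freq) for a cohort pooled across populations.

    Each population is a dict with 'share' (fraction of the source population),
    'prevalence' (disease risk) and 'allele_freq'. Within each population the
    allele is assumed to be unrelated to disease.
    """
    case_weight = sum(p["share"] * p["prevalence"] for p in populations)
    control_weight = sum(p["share"] * (1 - p["prevalence"]) for p in populations)
    case_freq = sum(
        p["share"] * p["prevalence"] * p["allele_freq"] for p in populations
    ) / case_weight
    control_freq = sum(
        p["share"] * (1 - p["prevalence"]) * p["allele_freq"] for p in populations
    ) / control_weight
    return case_freq, control_freq


if __name__ == "__main__":
    # Hypothetical populations: B has both more disease and more of the allele.
    mixed = [
        {"share": 0.5, "prevalence": 0.02, "allele_freq": 0.1},
        {"share": 0.5, "prevalence": 0.10, "allele_freq": 0.5},
    ]
    case_freq, control_freq = pooled_allele_freqs(mixed)
    print(round(case_freq, 3), round(control_freq, 3))
```

Cases are drawn disproportionately from the population with more disease, so the pooled cases carry more of the allele than the pooled controls even though the allele does nothing. Run on a single population, the two frequencies come out identical.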

The solution: For GWAS results to be universally applicable, they will need to be performed in cohorts from a wide range of populations. Data-sets such as the HapMap project, the Human Genome Diversity Panel and the powerful new 1000 Genomes Project will provide information about the patterns of genetic variation in diverse populations that is needed to design the assays for GWAS. A greater challenge will be collecting the large numbers of ancestry-homogeneous samples - both well-validated disease patients and healthy controls - required for GWAS approaches to be successful. This problem is likely to be particularly acute for African populations, where linkage disequilibrium is lower and genetic diversity much higher than in other regions (thus requiring larger numbers of markers and individuals to identify disease variants); and of course, in Africa and much of the rest of the world, local governments typically have much more pressing issues than genome scans to spend their limited health budgets on.

Epistatic interactions
The problem: Most current genetic approaches assume that genetic risk is additive - in other words, that the presence of two risk factors in an individual will increase risk by the sum of the two factors by themselves. However, there's no reason to expect that this will always be the case. Epistatic interactions, in which combined risk is greater (or less) than the sum of the risk from individual genes, are difficult to identify with genome scans and even harder to untangle. If epistasis is strong, then just a few genes - each with a weak effect by itself, well below the threshold of a scan - could in concert explain a large chunk of genetic risk. Such a situation would be largely invisible to current approaches.
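A toy example makes the additive assumption, and its failure, concrete. The risk values and the carrier frequency below are made up, and the marginal calculation assumes the two variants are inherited independently of one another.

```python
def additive_expectation(risk):
    """Risk for carriers of both variants if the two effects simply add.

    `risk` maps (carries_a, carries_b) to absolute disease risk.
    """
    baseline = risk[(False, False)]
    effect_a = risk[(True, False)] - baseline
    effect_b = risk[(False, True)] - baseline
    return baseline + effect_a + effect_b


def epistatic_deviation(risk):
    """How far the observed double-carrier risk departs from the additive expectation."""
    return risk[(True, True)] - additive_expectation(risk)


def marginal_risks_for_a(risk, freq_b_carriers):
    """Risks for carriers and non-carriers of A when B is ignored.

    Assumes A and B are independent in the population.
    """
    f = freq_b_carriers
    carriers = risk[(True, False)] * (1 - f) + risk[(True, True)] * f
    non_carriers = risk[(False, False)] * (1 - f) + risk[(False, True)] * f
    return carriers, non_carriers


if __name__ == "__main__":
    # Hypothetical risks: each variant alone barely matters, together they matter a lot.
    risk = {
        (False, False): 0.010,
        (True, False): 0.011,
        (False, True): 0.011,
        (True, True): 0.100,
    }
    print(round(additive_expectation(risk), 4), round(epistatic_deviation(risk), 4))
    print(marginal_risks_for_a(risk, freq_b_carriers=0.02))
```

In this invented case the double carriers have roughly ten times the baseline risk, yet a scan that looks at variant A alone sees only a modest difference between carriers and non-carriers.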

The solution: Large sample sizes, and clever analytical techniques. I'm not going to attempt a more detailed answer as this area is well outside my knowledge zone - but fortunately, it's an active area of research (see, for instance, the Epistasis Blog). I'd welcome any comments from people who know more about epistasis than I do about the likely scope of this problem and the methods that will be used to resolve it.

Copy number variation
The problem: One of the great surprises of the last five years has been the discovery of widespread, large-scale insertions and deletions of DNA, known as copy number variations (CNVs), in even healthy genomes. CNVs are now known to account for a substantial fraction of human genetic variation, and have been shown to play a role in variation in human gene expression and in human evolution. It seems highly likely that CNVs will be responsible for a non-trivial proportion of common disease risk.

However, our understanding of these variants is still in its infancy. The chips currently used in GWAS, which interrogate single base-pair differences between individuals (known as SNPs), can be used to detect a small proportion of CNVs indirectly (by looking for distortions of signal intensity or inheritance patterns), and may effectively "tag" a fraction of the remainder (by using SNPs that are very close to the CNV, and therefore tend to be inherited along with it). However, the vast majority of copy number variation remains invisible to current GWAS technology.

The solution: High-resolution tiling arrays - chips containing millions of probes, each of which binds to a small region of the genome - can be used to explore CNVs in some areas of the genome, but they break down for the large fraction of the genome containing repetitive elements. Ultimately, the complete detection of CNVs from patients and controls will require whole-genome sequencing, preferably using methods with much longer read lengths than the current crop of rapid sequencing technologies.

Epigenetic inheritance
The problem: Not all inherited information is carried in the DNA sequence of the genome; a child also receives "epigenetic" information from its parents in the form of chemical modifications of DNA that can alter the expression of genes - and thus physical traits - without changing the sequence. Although epigenetic inheritance is known to occur, the degree to which it influences human physical variation and disease risk is essentially unknown.

All existing technologies used in GWAS are based on DNA sequence, and thus don't detect epigenetic variation. It is even invisible to full-genome sequencing.

The solution: It first needs to be established that epigenetically inherited variations do actually contribute a non-trivial fraction of human disease risk. If so, techniques currently being developed to identify these variants in a high-throughput fashion could be used to perform EWAS (epigenome-wide association studies).

Disease heterogeneity
The problem: Some "diseases" are actually simply collections of symptoms, which may stem from multiple, distinct genetic causes. Lumping patients with fundamentally different conditions into a single patient cohort for a GWAS is a recipe for failure: even if there are strong genetic risk factors for each one of the separate conditions, each of these will be drowned out by the noise from the other, unrelated diseases. The problem is that for some diseases - particularly mental illnesses, where causation lurks deep within the complex and poorly-understood human brain - the knowledge and tools required to separate patients into distinct sub-categories simply may not exist yet.
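The dilution effect described above is easy to quantify in a simplified setting. This sketch assumes that a risk allele is enriched only in one sub-type of the disease, and that patients with the other sub-types carry it at the same frequency as healthy controls; the frequencies are hypothetical.

```python
def pooled_case_freq(subtype_freq: float, background_freq: float,
                     subtype_fraction: float) -> float:
    """Allele frequency in a case cohort where only a fraction has the relevant sub-type.

    Cases outside the sub-type are assumed to match the background (control) frequency.
    """
    return subtype_fraction * subtype_freq + (1 - subtype_fraction) * background_freq


def rough_sample_size_inflation(subtype_fraction: float) -> float:
    """Rough factor by which sample size must grow to keep the same power.

    The detectable frequency difference shrinks in proportion to the sub-type
    fraction, and required sample size scales with one over the difference
    squared (treating the variance as roughly unchanged).
    """
    return 1.0 / subtype_fraction ** 2


if __name__ == "__main__":
    # Hypothetical: allele at 40% in the sub-type, 30% elsewhere, sub-type is 1 case in 4.
    print(pooled_case_freq(0.40, 0.30, 0.25))
    print(rough_sample_size_inflation(0.25))
```

With only a quarter of patients belonging to the relevant sub-type, a ten-point difference in allele frequency shrinks to about two and a half points, and the cohort would need to be something like sixteen times larger to compensate.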

The solution: The geneticists can't fix this one - it will take a combined effort from clinicians and medical researchers to break down complex diseases into useful diagnostic categories, which can then each be subjected to separate genetic analysis. In the cancer arena, conditions previously lumped together as one entity have now been separated using new technologies such as gene expression arrays; similar approaches will no doubt prove fruitful in a range of other diseases, although the inaccessibility of brain tissue will make it more difficult to apply such approaches to mental illness.

The future of genetic association studies
Current chip-based technologies for genome-wide analysis, while having some success in identifying the lowest-hanging genetic fruit for many common diseases, seem to have already started to run up against barriers that are unlikely to be overcome by simply increasing sample sizes. These technologies should really be regarded as little more than a place-holder for whole-genome sequencing, which should become affordable enough to use for large-scale association studies within 3-5 years.

The application of cheap, rapid sequencing technology is likely to generate a harvest of new disease genes that far exceeds the yield of current GWAS, by providing simultaneous access to both the rare variants and copy number variations that are inaccessible to current chip-based approaches. However, building a more complete catalogue of the heritable variants that drive common disease risk will require more than just cheap sequencing: it will also take advances in clinical diagnostics to better sub-categorise patients into homogeneous groups, as well as new and powerful analytical approaches to cope with the torrent of sequence data, and to efficiently identify epistatic interactions between disease variants. To have any chance of picking out variants of small effect from whole-genome sequencing data, sample sizes will have to be enormous - massive cohorts currently being assembled, such as the 500,000-person UK Biobank and a similar NIH-funded study currently in the works, will provide essential raw material for the selection of participants. Naturally, to be applicable to humanity as a whole, cohorts will need to be gathered separately from many different human populations.

Finally, epigenetic variation remains a wild-card of uncertain significance, which will need to be tackled with a different set of high-throughput technologies (although it's likely that many of these will feed on advances in high-throughput sequencing).

Although I probably sound pretty negative about GWAS, I want to emphasise that the current problems are the result of technological limitations that will soon disappear. Barring global catastrophe, within the lifetimes of most of those reading this post we will have a near-complete catalogue of the genetic variants influencing the risk of most of the common diseases that plague the industrialised world (and, hopefully, many of those that plague the rest of humanity). Together with parallel advances in medical science, this catalogue will provide an unprecedented ability to predict, treat and potentially completely eliminate a host of common diseases. It will also bring social and ethical challenges of unprecedented magnitude - but that's a topic for another post...



7 comments:

Steve Murphy MD said...

Great post!
It highlights the shortcomings of our science. I don't, however, want it to lessen any of the REAL genomic medicine tools we already have...including PgX
-Steve
www.thegenesherpa.blogspot.com

Larry said...

Very thorough! I would challenge you on your assertion that selection is a major contributor to human population differences. Humans are pretty young, 100,000 years, 5,000 generations is not a lot of time for selection to make its mark on the genome unless the selection level is intense. Yes, there are loci that have been shaped by selection but they are few. Drift has produced many differences in allele frequencies but the mechanism that produced these differences, chance, is more mundane than natural selection (and has no "direction").

I suspect that the main reason that studies may not replicate is due to differences in environment between populations. Living conditions, diet and other cultural factors clearly differ between populations and can change quite rapidly (at least as compared to evolutionary change).

Daniel said...

Hi Larry,

I probably should have fleshed that claim out a bit, or maybe that's a topic for another post.

But briefly, the traditional view (that selection hasn't acted much on modern humans in the last 50,000 years) has taken a hefty beating over the last few years, and is now generally accepted to be wrong among human geneticists. Recent genome-wide scans for the genetic signatures of natural selection (for instance here, here and here) have provided compelling evidence that modern humans were in fact subject to massive and recent selective pressures, affecting hundreds (and possibly thousands) of genes. One study (which still needs to be validated) has even suggested that evolution has been more rapid in the last 40,000 years than at any other time in the history of the human lineage!

Importantly, most of these signals of selection are restricted to a subset of human populations, indicating selection for adaptation to specific new environments. It seems likely that a combination of increasing population size (which increases the availability of beneficial mutations) and exposure to massive environmental changes (both climatic and man-made, e.g. dietary shifts due to agriculture) has driven this rapid evolutionary change.

Now, it's true that genetic variation between human populations is not massive - all humans are quite similar, at a genetic level - but what these studies suggest is that the differences that do exist have been driven by selection to a significant degree, and are thus enriched for functional variants. These variants are a priori more likely to be involved in common disease.

Daniel said...

Just to be clear: I'm certainly not arguing that all differences in allele frequencies between populations are the result of selection (you'll note in my post that I stated that "the rest of the genome passively gained and lost genetic variation"); indeed, the majority of differences will certainly be due to drift. However, selection has certainly played a substantial role in population divergence that was under-appreciated until quite recently - and this will likely be particularly relevant for disease-associated genes.

Heide said...

Very concise and complete summary! Thanks!

di said...

This is a really nice write-up. Thank you!

Gerry said...

Nice review.

However, to suggest that bipolar disease has a strong heritable component is up for debate.

In fact, many recent studies have raised concerns over the tremendous over-diagnosis of this disease.

Maybe it is a multiplicity of combinations of susceptibility genes recently discovered for major depressive disorder and mania, as well as what may be more important environmental factors. This is especially true given the very large array of different symptoms that fall under the disease category of 'bipolar disease'.