A team led by J. Craig Venter from the J. Craig Venter Institute has just published another paper on J. Craig Venter's favourite topic: J. Craig Venter.
This study follows up on last year's publication of the complete sequence of Venter's genome, this time reporting a detailed analysis of a small but quite informative fraction of the genome: the exome, which consists of all of the pieces of DNA (called exons) that directly code for protein molecules.
The exome is a favoured target of geneticists. There are two major reasons for this: firstly, the exome is enriched for functional sequence, whereas non-coding DNA has a much higher fraction of non-functional junk; and secondly, we understand protein-coding DNA much better than we do non-coding DNA. If a novel mutation alters a protein sequence, we have algorithms that can predict (with moderate accuracy) how likely it is to alter the function of the cell. In contrast, for most mutations in non-coding DNA we have almost no way to predict whether they are functional or not. So, like the drunkard looking for his keys under the lamp-post because the light is better there, geneticists are inclined to look hardest at the regions where they actually have some chance of finding something they can understand.
Venter's mutations
The article (which is open access, so you can read it yourself) has a number of interesting factoids about Venter's protein-coding genome that are highly relevant to personal genomics:
- The authors identified 10,389 variants predicted to alter protein sequences;
- Of these, most are common (they estimate that 80-85% are present at a frequency of over 5% in the general population);
- About 1,500 of these variants are likely to actually significantly alter protein function, based on the SIFT prediction algorithm - these are the variants most likely to play a role in shaping human variation and common disease risk;
- A variant is twice as likely to be functionally damaging if it is rare (frequency less than 5%) than if it is common (frequency over 5%);
- Several quite unambiguously protein-damaging mutations were also found (74 would introduce an abnormal "stop" signal, while others create "frame-shifts" that alter large regions of an encoded protein), but many of these fall in genes with poor annotation that may well be non-functional;
- Venter carries seven known disease-associated variants, all present in only one copy (i.e. heterozygous);
- The interpretation of all of these data in terms of making actual health predictions is remarkably problematic, an ominous sign for the ~20 wealthy folks getting their genome sequenced by Knome this year.
Even if a gene is known to be involved in disease, it is difficult to understand if a variant in the gene will have a phenotypic effect. We found that 99% of the [protein-altering variants] in disease genes could not be characterized by current literature. Different mutations in the same gene can cause different phenotypic effects [49], thus making it difficult to interpret possible phenotypes. Furthermore, some variants have phenotypic effects only under certain environments (see SOD2 and BDNF in Table 2 and [48]). Also, when looking at complex phenotypes, multiple variants in coding and non-coding regions are likely to be involved [63]–[66]. This genetic complexity, as well as exposure to various environmental factors, will need to be taken into account in assessing risk for various diseases.
In other words, it will be quite some time before we can use a genome sequence to make realistic predictions about overall health (except for the unlucky few who carry mutations unambiguously associated with disease, such as a CAG repeat expansion in the HTT gene - in which case the predictions will tend to be dire). The next few years will be interesting times indeed for personal genomics companies, as their ability to generate oodles of genetic data with cheap sequencing increases exponentially faster than their capacity to explain what the data actually mean.
The challenge of rare variants
I want to draw particular attention to the implications of point 4 above (the fact that rare mutations are the most likely to alter protein function, and thus to have an effect on disease risk). The evolutionary basis for this association is trivially clear: if a variant has a serious negative effect on health then in most cases natural selection will keep it at a low frequency in the population, since really sick people tend to have fewer kids. Disease-causing variants can reach high frequencies under certain conditions (if they also provide benefits under certain situations, or if the disease only hits its victims after they've already reproduced, for instance) but all else being equal, evolution's scythe means that you're far more likely to find disease-causing variants at the rare rather than the common end of the spectrum.
The reason this is so problematic is that rare disease-causing variants are also the hardest to find and characterise. I've mentioned a few times that the current crop of genome-wide association studies (GWAS), while reasonably well-powered to detect common disease-causing variants, have virtually no ability to find rare causal variants - even if these variants explain the majority of disease risk. This probably goes some way to explaining why even massive GWAS are capturing only a small proportion of the overall genetic risk for most common diseases.
This arises primarily because the chips used in current GWAS only efficiently "tag" common variants. However, even once this technological barrier is lifted it will still be fiendishly difficult to assign function to rare variants: because there will be many millions of these variants, each at a low frequency, the sample sizes required to find those few associated with disease risk will be mind-bogglingly large - we're talking cohorts of millions of people, all with large-scale sequence data and well-collected information on environment and health. I have no doubt such studies will eventually be done, but it will take many years before we see the results.
And of course, even with such massive cohorts, the rarest variants (those restricted to single families, or even just a few isolated individuals) will still slip through the statistical cracks - but such variants may well be the most important features in the genome sequence of any given individual, the ones disrupting that crucial tumour-suppressor gene or messing with neurotransmitter expression levels. If you have one of these nasty variants, you'll want to know about it, and you'll want to know what it does.
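To put rough numbers on the sample-size argument above, here is a minimal back-of-the-envelope sketch. It is not taken from the paper: it assumes a simple allelic case-control comparison (a two-proportion z-test on allele counts), equal numbers of cases and controls, a genome-wide significance threshold of 5e-8, 80% power and an illustrative odds ratio of 1.5. The function name `cases_needed` and all of its defaults are hypothetical choices made for illustration only.

```python
from math import ceil, sqrt
from statistics import NormalDist


def cases_needed(control_freq, odds_ratio, alpha=5e-8, power=0.8):
    """Approximate number of cases (with an equal number of controls)
    needed to detect a risk allele with a two-proportion z-test on
    allele counts. All defaults are illustrative assumptions."""
    p0 = control_freq
    # Allele frequency in cases implied by the allelic odds ratio.
    p1 = odds_ratio * p0 / (1 + p0 * (odds_ratio - 1))
    p_bar = (p0 + p1) / 2

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided threshold
    z_beta = NormalDist().inv_cdf(power)

    # Standard sample-size formula for comparing two proportions,
    # giving the number of alleles required in each group.
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p0 * (1 - p0) + p1 * (1 - p1))
    ) ** 2
    alleles_per_group = numerator / (p1 - p0) ** 2

    # Each person contributes two alleles.
    return ceil(alleles_per_group / 2)


if __name__ == "__main__":
    for freq in (0.2, 0.05, 0.01, 0.001, 0.0001):
        print(f"allele frequency {freq}: ~{cases_needed(freq, 1.5):,} cases")
```

Under these assumptions the required number of cases grows roughly in inverse proportion to allele frequency: a few thousand suffice for a common variant, while a variant carried by one chromosome in ten thousand pushes the requirement into the millions. A real study design would also have to account for genotyping error, population structure and the burden of testing millions of rare variants, none of which this sketch models.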
Beyond genetics
Ultimately, geneticists will have to deal with such variants using non-genetic methods. For instance, for many genes it may eventually be possible to create experimental assays that allow researchers to rapidly test whether a novel variant disrupts protein function; the mouse embryonic stem cell assays that can be used to test novel variants in the breast cancer gene BRCA2 are a proof of principle, as well as a demonstration of just how challenging this process will be.
More broadly and ambitiously, we need to build and refine models of how human beings operate at a molecular level, integrating data from many fields of biology. If we understand which proteins interact within which cells, how these interactions influence protein dynamics, and where the binding sites for each interaction lie, we will have a much better chance of inferring the effect of an isolated change in protein sequence on overall cellular function and thus human health. Moving beyond the exome into non-coding DNA will require even more subtle and complex models including protein-DNA binding, the regulation of DNA modification and conformation, and the effects of non-coding RNA.
In other words, ultimate personal genomics - the extraction of every byte of useful predictive information out of an individual's genome sequence - will require nothing less than an atomic-level understanding of the operation of the human machine. Now that is an effort I'd like to see Google throw its weight behind...
(Venter image from Wikimedia Commons.)
Ng, P.C., Levy, S., Huang, J., Stockwell, T.B., Walenz, B.P., Li, K., Axelrod, N., Busam, D.A., Strausberg, R.L., Venter, J.C., Schork, N.J. (2008). Genetic Variation in an Individual Human Exome. PLoS Genetics, 4(8), e1000160. DOI: 10.1371/journal.pgen.1000160
1 comment:
Daniel,
Fantastic post. The problem with this is that scalability could happen much quicker than utility... which is why Google will throw its weight behind just getting the sequence... the rest will be up to us...
-Steve
www.thegenesherpa.blogspot.vom