Thursday, February 28, 2008

23andMe: give us information to "help future generations"

Esther Dyson, a director at 23andMe, has a new post on 23andMe's blog The Spittoon. Dyson acknowledges how little we currently know about the functional effects of most variable sites in our genome, and explains how 23andMe plans to address that issue:
To learn more, researchers need to collect thousands of genetic profiles – and the health data connected with each of them – to find correlations between the two. That leads to a second goal of 23andMe – to collect a large database of genetic information and then come back to you over time with invitations to provide specific health data and participate in research.

As David Hamilton has previously pointed out, the underlying business model of most of the current personal genomics companies is almost certainly not reliant on the money that customers pay for their genome scans (the profit margin on each scan is likely quite small, although I guess it helps to pay the bills). Instead, the plan is to aggregate genetic and trait data from customers and use it to find new genetic associations. These new associations can be fed back to customers, but they can also be used to create a data-set that might be of value to biotech and pharmaceutical companies. That's almost certainly where the personal genomics companies are hoping the real money is.

This plan has the potential to be a win for both customers and companies, as Dyson explains:

We’re not asking you to do this for purely altruistic reasons - either on our part or on yours. We’re a profit-seeking company, even though our founders and employees – and directors! – all share the vision of better understanding of everyone’s genomic make-up. As for you, the research results your data help produce could translate directly into benefits for you, or at least for your children, grandchildren and friends.

I've previously expressed my doubts about how useful the information generated by 23andMe will be, at least if they rely purely on self-reported trait data from online surveys (of course, they have other options). But perhaps I'm underestimating the power of incentives: customers want accurate new associations to be added to the 23andMe database to increase the value of their $1000 purchase, which might be enough to make them take extra care with the information they put in themselves.

Of course, the ultimate would be a collaboration between 23andMe and Google's medical record storage service - access (with consent, of course) to both DNA samples and detailed health records from thousands of people would be a phenomenal engine for commercial success, and hopefully also a bonus to the research subjects. However, I suspect that the outraged screams from privacy advocates would be audible from space.

Wednesday, February 27, 2008

Why do we have common risk variants for metabolic diseases?

I've had a half-finished post on this article sitting in my "to blog" pile for some time now, until finally a post by Yann prodded me into actually finishing it off.

The hypothesis underlying this study is straightforward: common variants in human genes associated with metabolic diseases arose due to recent adaptation to new climates.

This hypothesis rests on a chain of logic that it's worth spelling out in full. Basically, as modern humans migrated out of our warm African homeland into novel environments outside of Africa, natural selection favoured genetic variants that allowed them to adapt to these environments. This much is certainly true; such variants are responsible for many of the more visible characteristics that differentiate human populations, most notably skin colour (PDF) and body shape. What the authors of this study go on to hypothesise (quite reasonably) is that selection for adaptation to novel non-African climates would also have acted on a set of other, less visible characteristics: metabolic traits related to things like energy balance and nutrient retention.

Gathering the data
To test this hypothesis the authors needed a set of genes that were likely to play a role in influencing these traits. This list contained three types of genes: (1) a set of 39 "seed genes" that emerged from a quantitative literature review for genes associated with metabolic diseases (type 2 diabetes, obesity, hypertension and lipid abnormalities) - the assumption being that genes and genetic variants that are involved in metabolic diseases may also have played a role in metabolic adaptations to climate; (2) a set of 35 other genes that may be functionally related to the seed genes, identified by an algorithm that looked at known gene-gene or protein-protein interactions; and (3) a further 8 "wild card" genes "with strong evidence for involvement in metabolic syndrome phenotypes".

The authors then used data from the HapMap project to identify 873 "tag SNPs" - genetic variants that capture most of the common genetic variation within these genes. In addition, they chose 210 "control SNPs" from non-protein-coding regions of the genome that were considered a priori unlikely to be targets of selection. All of these variants were then analysed in 964 individuals from 52 populations from the Human Genome Diversity Panel (yes, the same panel that was analysed in those two massive studies published last week), as well as two other populations from Africa.

That's a huge amount of genotyping work, but to test their hypothesis the authors needed one more set of data: information about the climate that each of their 54 populations evolved in. Unfortunately it's very hard to know the exact values of different climate variables over the last 50-100,000 years, so the authors substituted in modern values (more on this later) for six major variables: rainfall, humidity, minimum, maximum and average temperature, and short wave radiation flux. Using the magic of statistics, they could reduce these six variables down to just four summary parameters, called summer PC1 and PC2, and winter PC1 and PC2.
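The "PC" in those parameter names presumably stands for principal component. As a rough sketch of how several correlated climate variables can be collapsed into a single summary axis, the pure-Python snippet below standardises three invented winter variables and extracts the first principal component by power iteration. The data, the choice of variables, and the helper names (`standardize`, `first_pc`) are all my own assumptions for illustration; the paper's actual procedure may well differ.

```python
from math import sqrt

def standardize(column):
    """Centre a variable on zero and scale it to unit sample variance."""
    n = len(column)
    mean = sum(column) / n
    sd = sqrt(sum((v - mean) ** 2 for v in column) / (n - 1))
    return [(v - mean) / sd for v in column]

def first_pc(columns, iterations=200):
    """Return (loadings, fraction of variance explained) for PC1.

    Hypothetical helper: uses simple power iteration on the
    correlation matrix of the input variables.
    """
    z = [standardize(c) for c in columns]
    n, k = len(z[0]), len(z)
    # Correlation matrix of the standardised variables (trace equals k).
    corr = [[sum(a * b for a, b in zip(z[i], z[j])) / (n - 1)
             for j in range(k)] for i in range(k)]
    vec = [1.0 / sqrt(k)] * k
    for _ in range(iterations):
        nxt = [sum(corr[i][j] * vec[j] for j in range(k)) for i in range(k)]
        norm = sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    # Rayleigh quotient gives the eigenvalue for the converged vector.
    eigenvalue = sum(vec[i] * sum(corr[i][j] * vec[j] for j in range(k))
                     for i in range(k))
    return vec, eigenvalue / k

# Invented winter values for five populations (degrees Celsius).
winter_min = [-20.0, -5.0, 5.0, 15.0, 22.0]
winter_max = [-8.0, 6.0, 14.0, 25.0, 31.0]
winter_mean = [-14.0, 0.0, 10.0, 20.0, 27.0]

loadings, explained = first_pc([winter_min, winter_max, winter_mean])
print("PC1 loadings:", [round(v, 3) for v in loadings])
print("variance explained by PC1:", round(explained, 3))
```

Because the three toy variables rise and fall together, a single component captures nearly all of their variation, which is the reason this kind of reduction is attractive when the raw climate variables are strongly correlated.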

Mining the data
The underlying principle of the analysis is quite intuitive: for each SNP, compare the allele frequency in each population with the climate variables experienced by that population, and see if there is any correlation - for example, a genetic variant that has been selected for cold tolerance should be present at a higher frequency in populations living in colder climates.
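To make that principle concrete, here is a minimal sketch that computes a plain Pearson correlation between one SNP's allele frequency and one climate variable across populations. It is not the authors' pipeline: the six population values are invented, and `pearson_r` is a hypothetical helper name.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / sqrt(var_x * var_y)

# Invented illustrative data: one entry per population.
winter_min_temp_c = [22.0, 18.0, 10.0, 2.0, -5.0, -15.0]
allele_freq = [0.05, 0.10, 0.20, 0.35, 0.50, 0.70]

r = pearson_r(winter_min_temp_c, allele_freq)
print(f"r = {r:.3f}")  # strongly negative: the allele is commoner where winters are colder
```

A raw correlation like this can't distinguish selection from shared population history, since neighbouring populations tend to share both climate and ancestry, so it only makes sense relative to some neutral baseline.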

The authors apply this basic principle in several different ways, and basically show that their metabolism genes as a whole show stronger correlations with climate variables than do the 210 control SNPs. That means that some of the variation in these genes can't be explained simply by the historical movement of modern humans, but is likely to be partially the product of natural selection for climate adaptation.
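One simple way to express the comparison with control SNPs is an empirical p-value: for each candidate SNP, ask what fraction of control SNPs show a climate correlation at least as strong. The sketch below assumes that approach; the correlation values are made up, `empirical_p` is a hypothetical helper, and the paper's own tests are not necessarily of this form.

```python
def empirical_p(candidate_stat, control_stats):
    """Fraction of control SNPs whose |statistic| is at least the candidate's."""
    if not control_stats:
        raise ValueError("need at least one control SNP")
    as_extreme = sum(1 for c in control_stats if abs(c) >= abs(candidate_stat))
    return as_extreme / len(control_stats)

# Invented climate correlations for ten control SNPs and two candidate SNPs.
control_r = [0.05, -0.12, 0.20, -0.08, 0.15, 0.31, -0.22, 0.02, 0.11, -0.18]
candidate_r = {"snp_a": -0.45, "snp_b": 0.10}

for name, r_value in candidate_r.items():
    print(name, empirical_p(r_value, control_r))
```

Here the hypothetical `snp_a` is more extreme than every control, while `snp_b` sits comfortably inside the control distribution. With only ten controls the smallest non-zero p-value is 0.1, which is why a large control set (210 SNPs in the study) matters.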

The authors then point out some interesting examples of correlation within their dataset: for instance, a protein-altering variant within the leptin gene, which is known to play a major role in regulating appetite, energy balance and (crucially) the generation of heat by muscle, is associated with winter climate variables, while a non-coding variant within the TCF7L2 gene, the best-replicated gene associated with type 2 diabetes, is more weakly associated with summer climate.

However, the strongest associations were seen for the RAPTOR gene, which plays a role in "nutrient signalling, mitochondrial oxygen consumption and oxidative capacity". Variation in the RAPTOR gene is strongly correlated with both latitude and with winter climate variables; the pie charts on the map below show the frequency of one particular variant in this gene in the 54 surveyed populations, laid over a colour-coded map of winter maximum temperature. You can see immediately that the black variant in the pie charts has become more common in the colder parts of the world.



Altogether, the authors make a convincing case for variation in metabolism-related genes being correlated with climate variables to a greater degree than would be expected by chance. That suggests that the guiding hand of natural selection has played a role in shaping the pattern of variation in these genes as modern humans moved from continent to continent over the last 100,000 years.

Some caveats
In any study of this complexity there will be something to complain about, and I intend to seize the opportunity to do so! One of my criticisms is fairly trivial, and the other is potentially more serious.

The first potential problem - one that the authors acknowledge early on - is the use of modern climate data as a proxy for the climates experienced by our ancestors, which have changed considerably over the period that modern humans have been moving around outside Africa (there was that whole Ice Age thing, for example!). This probably has relatively little impact on the results reported here: what matters most is the relative climate in different parts of the world, which I suspect hasn't changed that much - whether it's 18,000 BC or 2008 AD, Africa is still hotter than northern China. However, it would be interesting to know if estimates of historical climate variables are high-resolution enough to use in this type of study - any historical climate experts out there care to comment?

A second, more serious problem is alluded to by the authors in the final paragraph of their discussion: "it is unclear whether all the signals of spatially varying selection reported here are the result of adaptations to climate rather than other environmental variables". In other words, climate variables are known to be closely associated with other features of the environment, and these other features could in fact be the underlying drivers of the effects on the metabolic genes. The authors mention two such possible confounding factors: the diversity of parasitic and infectious disease species, and resource availability, both of which obviously vary with latitude and with climate variables such as temperature and rainfall. For instance, the map below (from this paper, via GNXP) shows the global distribution of vector-borne pathogens (infectious disease carried by non-human animals, e.g. malaria) - there's a non-coincidental concordance with the climate map shown above.



What's my point? Simply that a substantial proportion of the selection observed in this study may be due to some of these confounding variables, rather than to climate per se. Given that we're talking about metabolic genes, differential resource availability (and thus altered diet and food-growing practices) is a particularly powerful alternative explanation for the observed correlations, but effects of pathogens can't be completely ruled out.

This criticism is by no means disastrous for the study - selection is still interesting, whatever its cause - and the authors raise this criticism themselves, but I think the point could have been driven home more clearly. The title of the paper, for example, implies to me that adaptation to climate is the major selective factor here, and that's simply not a conclusion that can be drawn from these data.

What next?
This study has done an admirable job of opening up this topic for discussion and analysis, but there's a lot more to be done in this area. The first step will be to examine in detail the evolutionary history of these genes in a worldwide panel of populations, which will require someone to perform genome-wide genotyping of a whole bunch of humans - oh wait, somebody already did that! I think we can safely assume that the massive genetic surveys published last week are already being pored over for evidence of local positive selection in any number of genes. A careful analysis of signatures of selection around the genes identified in this study would be a great help in untangling the issues here.

Secondly, we need more genome-wide association studies looking specifically at the traits that are most relevant to our recent evolutionary history. This study looked at some carefully picked candidate genes. A much better study further down the line would be to look at a more objective set: all of the genetic variants associated with obesity, type 2 diabetes and other metabolic diseases based on massive genome-wide studies in multiple populations. We'll have a pretty good early version of this set within the next few years.

Citation: Hancock, A.M., Witonsky, D.B., Gordon, A.S., Eshel, G., Pritchard, J.K., Coop, G., Di Rienzo, A. (2008). Adaptations to Climate in Candidate Genes for Common Metabolic Disorders. PLoS Genetics, 4(2), e32. DOI: 10.1371/journal.pgen.0040032

Tuesday, February 26, 2008

Ann Turner compares 23andMe and deCODEme

On the off chance that you haven't already seen it, check out genetic genealogist Ann Turner's extremely useful summary and comparison of the services offered by 23andMe and deCODEme, kindly hosted by Eye on DNA. The take-home message:

In the meantime, I think the tests are most suitable for those willing to explore the next frontier, with all its unknowns and with the possibility of less expensive tests coming online within the next few years. The cost-benefit analysis will be tricky for everyone, not just for my own little niche. But if you decide to proceed, I don’t think you can go wrong with either company.
For what it's worth, I think you're better off waiting a few years for large-scale sequencing to become affordable. But if you're really keen to be an early adopter, both companies have their pros and cons: 23andMe is altogether slicker and easier to use, while deCODEme offers almost twice the number of markers in their test (and is thus slightly more likely to have coverage for new disease markers that emerge over the next few years).

I suspect 23andMe's more intuitive interface would make them a better purchase for the lay user, while those with a bit of bioinformatics expertise might prefer deCODEme's denser SNP set for carrying out targeted data-mining (as Ann Turner is considering for identifying a hereditary deafness gene in her own family).

Sunday, February 24, 2008

Knome signs up first two paying clients for whole-genome sequencing

Yesterday's press release from Knome has generated surprisingly little interest, but it's actually a pretty big deal: the company, in collaboration with the Beijing Genomics Institute, will be beginning whole-genome sequencing for its first two paying clients within the next few months. As the release says, these will be "the first individuals in the world to have their genome sequenced by a personal genomics firm".

The two clients have (wisely) chosen to remain anonymous at this stage. In return for around $350,000 in cold hard cash, they'll both be receiving "both sequencing and a comprehensive analysis from a team of leading geneticists, clinicians and bioinformaticians".

As I've noted before, the interpretation of whole-genome sequencing is complicated by the fact that no-one has a clue about the functional effects of most variations in the genome, and I wonder if these first clients will feel that they receive anywhere near enough useful information to warrant that hefty price tag.

It's true that over the next few years there will be much better systems developed for predicting functional effects, and these customers' sequences will be ready and waiting to take advantage of this progress (whereas the genotyping data provided by the current crop of personal genomics companies will become increasingly obsolete). However, while this progress in interpretation is being made the cost of whole-genome sequencing will simultaneously be dropping by orders of magnitude. From a pure cost-benefit perspective the two customers would almost certainly be better off simply waiting for a few years, for a time when the cost of sequencing and the value generated by new analytical techniques start to meet half-way.

Of course, their loss is our gain. The willingness of wealthy early adopters to pay excessive amounts for untested technology is a big driver of progress: Knome (and everyone else keenly watching this experiment) will learn a great deal about the process of sequencing and interpreting genome sequences as a result.

And so, anonymous customers, I salute you: your willingness to spend large amounts of money for limited information will help to make my genome sequence cheaper and more useful, three to five years from now!

Saturday, February 23, 2008

23andMe helps to fund worldwide survey of genetic diversity

I'll be posting more about the two massive genome-wide surveys of human genetic variation that were published this week - one in Nature and another in Science - once I've had some time to digest the vast amount of information they contain (the Nature paper comes with an overwhelming 66 pages of close-packed supplementary information!)

For now, I just wanted to point out that the more comprehensive of the studies (the Science one), which looked at 650,000 genetic markers in more than 1,000 individuals from 51 populations, was partially funded by 23andMe - as the company quite modestly notes at the bottom of a recent post on their official blog, The Spittoon. That's a smooth move on 23andMe's part - funding the study while allowing free access to this valuable data-set builds good-will in the scientific community, while 23andMe secures a very useful source of information for their ancestry comparisons.

If you want to get hold of the raw genotype data for the studies, grab them from the CEPH website. In fact, these data were freely available for several months prior to their official publication (I've been using them in my own research) - serious kudos to the researchers for granting such open access to their hard-earned data.

23andMe demo account goes live

23andMe has just announced free access to a demo account. So far I'm impressed by the slickness of the web interface (particularly compared to the demo account from deCODEme, so memorably labelled "underwhelming" by David Hamilton). The ancestry section in particular is easy to navigate and contains a wealth of information on different populations and geographical regions.

Well worth a look, particularly if you're considering actually shelling out for the full service.

Update 26/2/08: David Hamilton favourably reviews the 23andMe offering over at VentureBeat.

Thursday, February 21, 2008

How many harmful mutations do you carry?

According to a new article in Nature (full text for subscribers only), the answer depends on your ancestry: Europeans carry more potentially fitness-reducing mutations than individuals from Africa.

The authors support this claim with sequencing data from more than 10,000 genes, obtained from 20 individuals of European ancestry and 15 African-Americans. As far as I can tell the same data were used in a 2005 Nature paper by the same group to address a completely different question, which I note simply because getting two Nature papers from the same data is a rare and wondrous thing!

Here, the authors examine the patterns of genetic variation within the protein-coding regions of their 10,000 genes. This variation can be broken down into two major classes: variations that don't change the sequence of the encoded protein, and are thus unlikely to affect fitness; and variations which do alter protein sequence. The protein-changing variants were further classified into three groups, based on how likely they are to disrupt the function of the encoded protein: benign, possibly damaging, or probably damaging.

The authors find some potentially interesting differences between the European and African samples: although the total amount of variation is substantially lower in Europeans, the proportion of it that rates as potentially harmful is higher. This is consistent with the notion that modern Europeans are derived from a relatively small population that migrated out of Africa between 50 and 100 thousand years ago. This "bottleneck" resulted in a loss of genetic variation, while the subsequent expansion as Europeans colonised new and fertile lands resulted in variants that have slightly negative effects on fitness "surfing" to disproportionately high frequencies.

The differences between the populations are actually pretty small, and this interpretation of the data is controversial (as noted in a Nature news article). But I'll leave that debate for another blogger - what interests me right now are the implications of these results for future analysis of large-scale sequencing data.

The article notes that the average individual in their data was heterozygous (that is, carried one "good" copy and one "bad" copy) for more than 400 possibly or probably damaging variants, and homozygous (that is, carried two "bad" copies) for more than 90 of these variants. Bearing in mind that these data are based on around half of the genes contained in our genome, this means that when you have your entire genome sequenced (as we are all likely to, at some point in the next twenty or so years) you will find that you carry one copy of more than 800 potentially damaging variants, and two copies of almost 200.
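The extrapolation is just a linear scaling, sketched below. The observed counts and the "around half" coverage figure come from the article as quoted above; the assumption that damaging variants are spread evenly across sequenced and unsequenced genes is mine.

```python
# Per-individual lower bounds quoted from the article ("more than 400", "more than 90").
het_observed = 400   # heterozygous possibly/probably damaging variants
hom_observed = 90    # homozygous possibly/probably damaging variants

# The sequenced genes represent "around half" of the genes in the genome.
fraction_of_genes_covered = 0.5

# Assumption: damaging variants occur at the same rate in the unsequenced genes.
het_genome = het_observed / fraction_of_genes_covered
hom_genome = hom_observed / fraction_of_genes_covered

print(f"heterozygous, whole genome: more than {het_genome:.0f}")
print(f"homozygous, whole genome:   more than {hom_genome:.0f}")
```

That gives more than 800 heterozygous and more than 180 homozygous variants, and since 90 was itself a lower bound, "almost 200" is a fair summary.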

Among these potentially damaging variants there will be massive differences in the risk of disease for you and for your children: many will have little or no effect on health even if you carry two bad copies, whereas some may be severe enough to cause serious disease even if you carry only one bad copy (like a Huntington's disease mutation).

We know from studies of inbreeding that all of the variants will add up to around two to five "lethal equivalents" - which could mean five variants that would each be 100% lethal if you happened to carry two copies, or 500 variants that each carry a 1% chance of death (with a more likely scenario being somewhere in between!). With our current understanding of molecular and cellular biology we can confidently predict the final effects on health of only a tiny fraction of these variations - and that's not even considering the substantial proportion of potentially nasty variation occurring outside protein-coding regions, for which it is currently essentially impossible to predict function. Within five to ten years, we will have the capacity to cheaply sequence your entire genome and find all of these variations, but we will be unable to predict the likely health impacts of the vast majority of them.
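The "lethal equivalents" bookkeeping is easy to check with a few lines of arithmetic: each variant contributes its probability of causing death when carried in two copies, and the contributions are summed. The sketch below uses the two extreme scenarios from the paragraph above plus an invented in-between mix; `lethal_equivalents` is a hypothetical helper name.

```python
def lethal_equivalents(variant_classes):
    """Sum of (number of variants x chance of death when homozygous)."""
    return sum(count * p_death for count, p_death in variant_classes)

# The two extremes described in the text.
five_fully_lethal = [(5, 1.0)]
five_hundred_mild = [(500, 0.01)]

# An invented in-between scenario, purely for illustration.
mixed = [(2, 1.0), (10, 0.1), (200, 0.01)]

for label, scenario in [("5 x 100%", five_fully_lethal),
                        ("500 x 1%", five_hundred_mild),
                        ("mixed", mixed)]:
    print(label, round(lethal_equivalents(scenario), 2))
```

All three scenarios add up to five lethal equivalents, which is exactly why an inbreeding-based estimate on its own says so little about which individual variants matter.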

It's hard enough providing genetic counselling for a family carrying one serious disease mutation. How will counsellors deal with a patient carrying around 1,000 potentially deleterious mutations, most of which we don't really understand?

Added in edit: Those interested in further exploration of the implications of these data for recent human evolution should check out John Hawks. Thomas Mailund also has an excellent post discussing this paper in detail.

Lohmueller, K.E., Indap, A.R., Schmidt, S., Boyko, A.R., Hernandez, R.D., Hubisz, M.J., Sninsky, J.J., White, T.J., Sunyaev, S.R., Nielsen, R., Clark, A.G., Bustamante, C.D. (2008). Proportionally more deleterious genetic variation in European than in African populations. Nature, 451(7181), 994-997. DOI: 10.1038/nature06611

New personal genomics company looks for rare disease mutations

Blaine Bettinger of The Genetic Genealogist breaks the news of the launch of a new personal genomics company, DNATraits.

The twist: rather than scan your entire genome for common variations, like the Me Two (23andMe and deCODEme), DNATraits allows you to choose the genetic tests you want from an online catalog: for instance, an introductory panel allows you to look for variants that cause four rare diseases, at a cost of $300. A lazy back-of-the-envelope calculation based on the carrier frequencies quoted by the company suggests that more than 8% of people of European origin will carry

New HIV susceptibility locus identified

I don't need to convince anyone that HIV is a major health concern, particularly in Africa and (increasingly) East Asia. It's long been known that susceptibility to HIV infection varies between individuals, with some people engaging in high-risk activities for long periods of time nonetheless remaining uninfected. In addition, individuals infected with HIV show considerable variation in the rate at which the disease progresses to full-blown AIDS. This variation is naturally of interest to researchers looking for treatment options: if they can figure out why some people are naturally resistant to the virus, that may help to find ways to help other people fight off infection.

Variation in susceptibility to HIV is at least partly genetic, but the specific protective variants identified to date - most notably the CCR5 Δ32 polymorphism, and variation in the HLA region - are thought to capture only a small proportion of the total genetic variation. For instance, a recent genome-wide scan for genetic variants that influence viral load in people infected with HIV who have not yet progressed to full-blown AIDS identified variants that explain only 15% of the total variation in viral load. It's likely that several other genetic variants of moderate effect are still out there, and with that in mind a new study published in PLoS Biology used a different approach to look for other genetic regions that influence HIV susceptibility.

In their first series of experiments, the authors avoided the messiness of dealing with live human beings - who differ from one another with respect to their behaviour, their degree of exposure to HIV and the strains of HIV they encounter, in addition to their genes. Instead, they looked at a simplified model of HIV infection involving white blood cells (B and T cells) grown in the lab.

The study first used B cells derived from 198 individuals from 15 multi-generation families. They infected each of the cell lines with a modified version of HIV and tested how susceptible each line was to infection. As expected, there was considerable variation in susceptibility, and that variation was substantially genetic - the authors estimate that around half of the variance in susceptibility was genetically determined.

The authors then tested markers from throughout the genomes of these cell lines to find the specific regions of the genome that correlated with this variation in susceptibility. This approach identified a novel region on chromosome 8, around the SNP marker known as rs2572886: the "susceptible" version of this polymorphism was associated with a 1.6-fold increase in susceptibility to HIV infection in a separate experiment on T cells (the usual hosts of HIV in the human body).

Of course, the big question is whether an increased susceptibility in isolated cells translates into increased susceptibility in real live humans. The authors attempted to address this, but their results are tentative at best: of the two (relatively small) groups of humans they examined, one showed a slightly greater increase in viral load and a slightly greater decrease in white blood cell count over time for carriers of the "susceptible" version, while the second cohort showed no association at all. Pooling the two cohorts resulted in no significant association overall.

No doubt we will see follow-up studies in larger human groups over the next couple of years that should provide more definitive answers - until then, the authors admit, "this association should be considered suggestive".

The authors make a valiant but ultimately fruitless attempt to pin down the function of their polymorphism, which unfortunately rests in the middle of a large region of the genome containing no known genes. They show somewhat sketchy evidence that the region containing their variant physically associates with regions in other, quite distant genes, suggesting that it may play a role in regulating the expression of these genes. Unfortunately these genes are expressed at low levels, making it difficult to test this hypothesis directly, and experiments to look at the effects of disrupting these genes on HIV infection susceptibility were inconclusive.

How important is this new marker compared to previously identified genetic determinants? Thoughtfully, the authors actually provide estimates of the proportion of variance explained by each of the known markers, using their own data-set. This comparison certainly puts their results in perspective: their new marker explains less than 1% of the variance in white blood cell count and viral load, while the three major markers identified in the previous genome-wide scan explain between 5.8 and 9.6% of the variance each. The well-known CCR5 Δ32 polymorphism also ranks fairly poorly, explaining between 0.4 and 1.9% of the variance.

Where is the remaining variation hiding? It's likely to rest in two places that are largely inaccessible to the current generation of SNP chips: rare variants of moderate to large effect, and copy number variations (CNVs - that is, insertions and deletions of regions of DNA). Accessing this variation will require a combination of sequencing and the use of new chips for detecting CNVs. Ideally, such studies should also be carried out in individuals of African ancestry, since African populations are currently experiencing the greatest impact of HIV, and it's highly likely that susceptibility alleles will differ between African and European groups.

While the results of this study are far from conclusive, the authors deserve praise for a clever and thorough experimental strategy, combining genome-wide data from cultured cells, validation studies in human cohorts, and functional studies. This combined approach is likely to become more and more common as the "low-hanging fruit" - that is, the common variants with large effect on disease risk - are picked off by simple genome-wide association studies, and journal reviewers start to demand higher levels of evidence to support new associations.

(Image of HIV-infected white cells courtesy of Stanford University.)

Loeuillet, C., Deutsch, S., Ciuffi, A., Robyr, D., Taffé, P., Muñoz, M., Beckmann, J.S., Antonarakis, S.E., Telenti, A. (2008). In Vitro Whole-Genome Analysis Identifies a Susceptibility Locus for HIV-1. PLoS Biology, 6(2), e32. DOI: 10.1371/journal.pbio.0060032

Saturday, February 16, 2008

Wednesday, February 13, 2008

Strategies for phenotype collection

In a post at 23andMe's blog The Spittoon that I also blogged about yesterday, there's an interesting paragraph:
But at 23andMe we believe, as one of the conference speakers noted, that the bigger challenge right now is collecting so-called phenotypic information. Phenotype is all the physical and behavioral stuff your genotype can affect, such as height, eye color and disease susceptibility. Both genotype and environment influence phenotype, and the research challenge is to gather and interpret the connection between the two. We can then make more detailed and accurate predictions from your genome.
I couldn't agree more: the collection of detailed, accurate phenotype data is one of the major limiting factors in modern genetic association studies (in comparison, collecting genotype data at hundreds of thousands of different variable sites is a walk in the park). If 23andMe wants to be able to generate their own in-house genetic associations to sell on to biotech and pharmaceutical companies, they will need to ensure that their phenotype collection process is thorough and rigorous.

The problem: how on Earth does 23andMe intend to collect this level of data from their existing customers, who are scattered all over the planet, and unlikely to be willing to be dragged into a medical centre for a detailed check-up (or to share their private medical data with a faceless corporation)? As I've said before, my instinct is that online surveys are a generally noisy source of data that are unlikely to satisfy the requirements of big pharma.

I guess there are a few alternative strategies for 23andMe and other personal genomics companies looking to generate their own association data:
  1. Careful, targeted recruitment of new customers suffering from specific common diseases, perhaps through offering discounted services to disease advocacy groups;
  2. Use their existing customer base as a way to demonstrate expertise in genotyping and interpretation, which can later be leveraged to sell genotyping services to corporations and/or research groups.
It will be very interesting to see which strategies they end up pursuing.

Tuesday, February 12, 2008

23andMe looks towards a sequencing future

Right now, personal genomics companies like the Me Two (23andMe and deCODEme) and their less well-advertised competitor SeqWright offer to give you your DNA sequence at up to one million positions throughout your genome - less than 0.05% of the total. While this approach is actually surprisingly informative about patterns of common genetic variation throughout the genome, it still provides a limited window into your genome as a whole.

Precisely how limited this window is has become clear from the recent results of large genome-wide association studies for common diseases like lupus or diabetes. While the successes of these studies have been well-publicised - dozens of new genetic variants that can be used to predict future risk of disease - the publicity has glossed over a slightly dirty little secret: the common genetic variation surveyed by chip-based approaches captures a relatively small proportion of the total genetic risk for most common diseases.

Where is the rest of the disease risk hiding? A large proportion of this risk is likely conferred by a large number of rare variants, each of which may be restricted to just a few families, but which add up to a huge amount of total risk. Such variants will be completely invisible to chip-based genotyping methods since they are not "tagged" by any of the common variations detected by the chips. The only realistic way to detect such variants will be through large-scale sequencing - determining the sequence at every position in the genome (or at least a substantial fraction of it).

So how long will it be before sequencing technologies can be brought down to a cost that personal genomics customers are willing to bear, as opposed to the $350,000 genome sequence currently offered by Knome? This is a difficult question to answer, as David Hamilton from VentureBeat explains in a great recent analysis centred around an article in the NY Times. But my best guess: we will see the first sequencing-based forays into personal genomics (possibly sequencing just a few dozen important genes) within the next twelve months, and I would be very surprised if whole-genome sequencing doesn't reach the broad personal genomics market (i.e. at a cost of less than $5000) well within the next three years. Given the competition in this area, and the money being pumped into development by both governments and private consortia, it's a fair bet that the technology will move fast.

Existing personal genomics companies are also well aware of the need to move fast to stay on top of the shifting technology and keep their grip on the market. In a recent blog entry, 23andMe's DarrenP spells out how cheap sequencing will change personal genomics, and explicitly foretells the entry of 23andMe into the sequencing market:
By some estimates, the cost of sequencing a human genome could be a few thousand dollars by 2014.

23andMe is already riding this wave. A dozen years ago it would have cost about $600,000 to examine the 580,000 points, known as SNPs, that we include in our $999 service. Eventually we’ll be able to give you your complete sequence for that price.

That may be somewhat disappointing for 23andMe's existing customers, who will watch their $1000 genetic data become rapidly obsolete over the next few years - but this is an experience familiar to anyone who buys a new computer or other high-tech device only to watch it superseded by cheaper, more powerful alternatives within a few months. In addition, I'd guess that 23andMe will offer a sequencing discount to current customers to help hold onto their share of the market.

Of course, the interpretation of large-scale sequencing data will bring its own set of challenges. A common genetic variant on a chip that is associated with, say, an elevated risk of prostate cancer, is comparatively easy to interpret: if you have the variant, you're at higher risk. But what if your androgen receptor gene turns out to contain a rare mutation in its regulatory region that might alter the expression of the gene? Because the mutation is rare, there's unlikely to be any solid data on its effect on disease risk. Multiply that uncertainty by the hundreds of variants of questionable functional effect that will likely be found in any genome, and the end result for a customer is likely to be confusion rather than enlightenment.

Nonetheless, the rapidly dropping cost of sequencing will revolutionise personal genomics - and as David says, the jostling for position over the next few years will certainly be a heck of a lot of fun to watch.

Sunday, February 3, 2008

23andMe: testing saliva biomarkers?

Added in edit: No, they're not. And I would have known that very quickly if I'd just done a little extra research - see end of post!

In the comments to my last post, David Hamilton from VentureBeat points to an article he wrote in November last year about 23andMe's likely business plan (put simply: selling aggregated genetic and trait data to big pharma). I agree that this is certainly likely to be the major source of cash-flow that the company is relying on. However, I'm somewhat sceptical that big pharma will be willing to pay huge amounts for a data-set based purely on self-reported traits (i.e. the current customer data-set).

Of course, there is another possibility that just occurred to me: could 23andMe be using those saliva samples (see kit to left) to test for saliva biomarkers, in addition to extracting DNA? Maybe that explains why 23andMe collects 2 mL of saliva, whereas deCODEme does the same job with a simple cheek swab?

To explain a little more clearly: "biomarker" is a generic term for molecules (which could be proteins, sugars, or small chemicals such as drugs) contained in a tissue sample that can be used to provide information about a human donor's health status. Saliva contains many potentially informative biomarkers, some of which have previously been used to test for the presence of the autoimmune disease Sjögren's syndrome, periodontal disease (i.e. gum infections), and even breast cancer. The technology is still new, but presumably saliva samples could be stored safely for long periods of time and then retested when new biomarkers are discovered and validated. Such an analysis would give 23andMe direct, potentially clinically relevant information that may be much more reliable than self-reported data, and thus more useful to pharmaceutical and biotech corporations.

The big question is: would useful saliva biomarkers be stable enough to survive transport from users' homes to 23andMe, especially given the high bacterial load present in saliva? The producer of the kits used by 23andMe doesn't provide much information relevant to this question (except to say that the kit results in "lower bacterial content"), and I don't know anywhere near enough about this field to even hazard an informed guess. Perhaps someone who knows more than me about saliva proteins can let me know how unlikely this is?

Added in edit: Nope. In the 23andMe consent form it says:
The laboratory processing your saliva sample will analyze your DNA to determine your genetic information. The laboratory will not analyze your saliva for any biological or chemical components, markers or agents other than your DNA.
Damn. Now the interesting question is: why aren't they doing it?

Saturday, February 2, 2008

Google will own it all?

In my recent post "Researchers forced to share", I noted that NIH-funded researchers will now be forced to immediately submit their genetic association data to an online database, dbGaP, with other researchers having free access and the right to publish new analyses of the data following a nine-month grace period. In the comments, Steve Murphy pessimistically replies:
Why care? Google will own it all soon enough.
I know Steve has a big axe to grind in this arena (and is happy to grind it out loud), but is there a grain of truth here? Will Google/23andMe manage to capture and control a substantial proportion of the genomic data generated over the next decade or so?

Unless the world of science is miraculously transformed within the next few years, the answer is no.

Don't get me wrong: if 23andMe succeeds in attracting customers (as I suspect it will), then they will quickly set about combining genetic data from customers with self-reported information about diseases and other physical traits - "anything from symptoms of autism to shoe size". This will eventually give 23andMe (and, potentially, Google) access to a moderately to extremely large data-set with which to find new associations between genetic markers and a range of traits. (How large, of course, will depend on exactly how successful 23andMe is at attracting customers.)

But there will be some major caveats with these data. Most importantly:
  1. the 23andMe data is likely to come almost exclusively from upper-middle-class, and probably mainly white, suburbanites (unless they repeat their Davos free-kit splurge in Buffalo County, South Dakota, which seems unlikely!), and
  2. as far as I can tell their phenotype data will be entirely self-reported.
Genetic associations vary between human populations - in fact, this is discussed at length in a 23andMe white paper (PDF) - so associations found in wealthy Americans and Europeans can't be easily extrapolated to the rest of the world's population. Self-reported data is generally considered to be a poor substitute for direct measurement, so 23andMe's physical trait values will contain at least some "noise" due to customer wishful thinking or carelessness - which probably won't completely prevent associations from being found, but will certainly make them harder to find.
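To put a rough shape on that last point, here is a minimal sketch in Python of how misreported trait status can dilute a genetic association. Everything in it is an illustrative assumption rather than anything 23andMe has described: the function names, the allele frequencies, the 10% trait prevalence and the symmetric misreporting model (the same fraction of people with and without the trait report the wrong status) are all made up for the example.

```python
def odds_ratio(freq_cases, freq_controls):
    """Allelic odds ratio from the risk-allele frequency in each group."""
    return (freq_cases / (1 - freq_cases)) / (freq_controls / (1 - freq_controls))


def observed_odds_ratio(freq_cases, freq_controls, prevalence, error_rate):
    """Odds ratio seen when trait status is self-reported with errors.

    Assumption (hypothetical model): a fraction `error_rate` of people with
    the trait report not having it, and the same fraction of people without
    the trait report having it.
    """
    # Share of self-reported cases who truly have the trait.
    true_share_in_cases = (prevalence * (1 - error_rate)) / (
        prevalence * (1 - error_rate) + (1 - prevalence) * error_rate
    )
    # Share of self-reported controls who truly have the trait.
    true_share_in_controls = (prevalence * error_rate) / (
        prevalence * error_rate + (1 - prevalence) * (1 - error_rate)
    )
    # Each self-reported group is a mixture of true cases and true controls,
    # so its allele frequency is a weighted average of the two true values.
    mixed_cases = (true_share_in_cases * freq_cases
                   + (1 - true_share_in_cases) * freq_controls)
    mixed_controls = (true_share_in_controls * freq_cases
                      + (1 - true_share_in_controls) * freq_controls)
    return odds_ratio(mixed_cases, mixed_controls)


# Illustrative numbers only: risk allele at 30% in true cases and 25% in
# true controls, for a trait that 10% of customers actually have.
for error_rate in (0.0, 0.05, 0.1, 0.2):
    observed = observed_odds_ratio(0.30, 0.25, 0.10, error_rate)
    print(f"misreport rate {error_rate:.0%}: observed odds ratio {observed:.3f}")
```

Under these assumptions the observed odds ratio shrinks towards 1 as the misreporting rate rises, which means a larger sample is needed to detect the same underlying effect. The signal is weakened rather than erased, unless the self-reports are no better than a coin toss.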

Given these limitations, although I think we will see some novel and interesting associations emerge from 23andMe over the next five years or so, I don't expect to see huge breakthroughs in the genetics of human health. In other words, we might see a new genetic variant associated with left-handedness or a fondness for broccoli - but it's unlikely that 23andMe will be able to find genes linked to type 2 diabetes that haven't already been scooped up by previous genome-wide scans.

Massive surveys like the UK BioBank, on the other hand, have vastly larger sample sizes, cover a much broader range of ethnicities and socioeconomic groups, and have access to direct clinical measurements. The major bodies funding these projects, such as the UK's Wellcome Trust or the NIH, are increasingly committing the research groups they fund to policies of free data release. This means that the identity and nature of the genetic risk variants identified in these studies will be freely available to anyone with an internet connection.

At least for the foreseeable future, 23andMe will be relying heavily on the results of these external studies rather than its own in-house data to provide information to its customers. And so long as these research groups make their data freely available, communities like SNPedia will be building databases and free tools into which people can input their own genetic data (generated by private companies or by researchers) in order to learn about their genomes. Google won't "own" this information, and neither will anyone else.