Saturday, June 28, 2008

How much data is a human genome? It depends how you store it.

Andrew from Think Gene has finally prompted me to write a post I've been working on sporadically for a month or so. The question is pretty simple: in the not-too-distant future you and I will have had our entire genomes sequenced (except perhaps those of you in California) - so how much hard drive space will our genomes take up?

Andrew calculates that a genome will take up about two CDs' worth of data, but that's only if it's stored in one possible format (a text file storing one copy of each and every DNA letter in your sequence). There are other ways you might want to keep your genome depending on what your purpose is.

The executive summary
For those who don't want to read through the tedious details that follow, here's the take-home message: if you want to store the data in a raw format for later re-analysis, you're looking at between 2 and 30 terabytes (one terabyte = 1,000 gigabytes). A much more user-friendly format, though, would be as a file containing each and every DNA letter in your genome, which would take up around 1.5 gigabytes (small enough for three genomes to fit on a standard data DVD). Finally, if you have very accurate sequence data and access to a high-quality reference genome you can squeeze your sequence down to around 20 megabytes.

The details
For the first two formats I'll assume that someone is having their genome sequenced using one of today's cutting-edge sequencing technologies, the Illumina 1G platform. The 1G platform and its rivals, Roche's 454 and Applied Biosystems' SOLiD, are the instruments that are currently being used to sequence over 1,000 individuals for the international 1000 Genomes Project; if you were to have your genome sequenced right now it would almost certainly be using one of these platforms.

The Illumina technology basically sequences DNA as a huge number of short (36-letter) fragments, called reads. Because read lengths are so short and the system has a fairly high error rate, assembling an entire genome would require what's called 30x coverage - which basically means each base in the genome is sequenced an average of 30 times.

Once the reads have been generated, they are assembled into a complete genome with the help of a universal reference genome, the sequence created by the Human Genome Project from a mixture of DNA from several individuals. Even with this high level of coverage there is still considerable uncertainty involved in the process of re-assembling the genomes from very short fragments, and both the algorithms used to perform this assembly and the reference genome are being constantly improved. Thus for the moment there may be some advantage in storing your data in a raw format, so that in a few months' time you can take advantage of better software and a more complete reference genome to reconstruct your own sequence in a more complete fashion.

For the third and fourth formats, I've moved into the future: basically, I'm assuming that we now have access to affordable sequencing technology that can generate extremely long and accurate reads from a single molecule of DNA. That would allow you to reconstruct your entire genome - both sets of chromosomes, one from your mother and one from your father - with very high confidence. In that case you no longer need to store your raw data, and we can instead start thinking about the most efficient possible way to keep your entire genome on disk.

Note that in what follows, for the sake of simplicity I am ignoring the effects of data compression algorithms. It's likely that you could shrink down these data-sets (especially the image files) by quite a bit using even straightforward compression.

Anyway, enough background. Let's get started.

1. For hard-core data junkies only: raw image files
To put it very simply, the Illumina 1G platform sequences your DNA by first smashing it up into millions of fragments, binding those fragments to a surface, and then feeding in a series of As, Cs, Gs and Ts. As these bases are incorporated into the DNA fragments they set off flashes of light that are captured by a very high-resolution camera, resulting in a series of pretty coloured images such as the one on the left (which is actually a montage of four images, one for each base). Each of those spots represents a separate fragment of DNA, captured at the moment that a single base (A, C, G or T, each labelled with a different colour) is read from that fragment. By building up a series of these images the machine accumulates the sequence of the first ~36 bases of those fragments in the image, after which the sequence quality starts to drop off.

Almost as soon as these images are generated they are fed into an algorithm that processes them, creating a set of text files containing the sequence of each of the fragments. The image files are then almost always discarded. Why are they discarded? Because, as you will see in a minute, storing the raw image data from each run in even a moderate-scale sequencing facility quickly becomes prohibitively expensive - in fact, several people have suggested to me that it would be cheaper to just repeat the sequencing than to store these data long-term.

How much data? Each tile of an Illumina machine will give you accurate sequence information for around 25,000 DNA fragments. A separate image is obtained for each of the four bases, with each "snap-shot" comprising around 2 MB of data. That is 8 MB per sequencing cycle spread over 25,000 fragments, which comes to a total of 320 bytes/base. For an entire 3 billion base genome with 30x coverage, that comes out as around 28.8 terabytes of data. That's almost 30,000 gigabytes!
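For anyone who wants to check the arithmetic, here is a rough sketch in Python. It uses only the figures quoted above; the 3 billion base (haploid) genome length and the decimal units (1 MB = 1,000,000 bytes) are my assumptions, and the variable names are purely illustrative.

```python
# Back-of-envelope check of the raw image estimate.
# Assumptions (not stated explicitly in the post): decimal units and a
# haploid genome length of 3 billion bases.
IMAGES_PER_CYCLE = 4              # one image per base (A, C, G, T)
BYTES_PER_IMAGE = 2_000_000       # ~2 MB per snapshot
FRAGMENTS_PER_TILE = 25_000       # usable fragments per tile
GENOME_BASES = 3_000_000_000      # assumed haploid genome length
COVERAGE = 30                     # each base sequenced ~30 times

# Every cycle reads one base from every fragment on the tile.
bytes_per_base = IMAGES_PER_CYCLE * BYTES_PER_IMAGE / FRAGMENTS_PER_TILE
total_bytes = bytes_per_base * GENOME_BASES * COVERAGE

print(f"{bytes_per_base:.0f} bytes/base")
print(f"{total_bytes / 1e12:.1f} terabytes")
```

Changing any of the constants (say, a denser tile or a lower coverage) shows how sensitive the total is to each assumption.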

Why store your genome like this? Well, either you believe that image-processing algorithms are likely to improve in the near-future, thus allowing you to squeeze a few more bases out of your data; or you have a huge bunch of data servers lying idle that you want to do something with; or you're just a data junkie. However, your actual sequence data is not readily accessible in this format, so you'd also want to be keeping at least a roughly assembled version of your genome around to examine as new information about risk variants becomes available.

2. For DIY assemblers: storing individual reads
I mentioned above that those monstrous image files are rapidly converted into text files containing the sequence of each of your ~36-base reads. The files that are generally used here are called Sequence Read Format (SRF) files, which are used to store the most likely base at each position in the read along with other associated data (such as quality scores).

How much data? It depends what sort of quality information you keep: at the high end you'd be looking at around 22 bytes/base (1.98 terabytes total) to store raw trace data, while at the low end you could just store sequence plus confidence values for around 1 byte/base (90 gigabytes total). That's starting to become feasible - you could now store your genome data on an affordable portable hard drive.
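Both totals follow from the same 90 billion sequenced bases (3 billion bases at 30x coverage, which is my reading of the numbers rather than something spelled out above). A minimal sketch, with a hypothetical helper name:

```python
# Storage needed for the individual reads at a given cost per base.
# Assumption: 3 billion bases at 30x coverage = 90 billion sequenced bases.
SEQUENCED_BASES = 3_000_000_000 * 30

def read_storage_bytes(bytes_per_base: int) -> int:
    """Total bytes needed to store every sequenced base."""
    return SEQUENCED_BASES * bytes_per_base

for label, cost in [("raw trace data", 22), ("sequence + confidence", 1)]:
    print(f"{label}: {read_storage_bytes(cost) / 1e9:,.0f} gigabytes")
```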

Why store your genome like this? This is a pretty efficient way to store your raw read data while you wait for improvements in both the reference human genome sequence (which is still far from complete) and assembly algorithms. As with the previous format, though, you'd also want to store your sequence in a more readily accessible assembled sequence so you could actually use it.

3. Your genome, your whole genome, and nothing but your genome
OK, now let's gaze a few years into the future, and assume (fairly safely) that new technologies for generating accurate, long reads of single DNA molecules have become available. This means you can stitch your entire genome together very easily, allowing you to store the whole 6 billion bases of it in a text file - this is the type of data storage approach that Andrew discussed in his post. In essence, you're storing every single base in your genome as a separate character in a massive, 6 billion letter long text file.

How much data? Each DNA base can be stored in two bits of data, so your complete genome (both sets of chromosomes) tallies up to around 1.5 gigabytes of data. If you wanted to store some associated confidence scores for each base (indicating how likely it is that you sequenced that section of your DNA correctly) that might take you up to 1 byte/base, or a total of around 6 gigabytes. Either way, you could now fit your genome on a cheap USB thumb drive.
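To make the "two bits per base" idea concrete, here is a small sketch of one possible packing scheme. The code assignment (A=0, C=1, G=2, T=3) and the function names are my own choices for illustration, and the sketch assumes a sequence with no ambiguous bases (no "N"), which real data would need to handle.

```python
# Pack DNA bases four to a byte, two bits each.
# Assumptions: only A, C, G and T occur; the A=0..T=3 coding is arbitrary.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def pack(seq: str) -> bytes:
    """Pack a DNA string into bytes, four bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODES[base] << (2 * (i % 4))
    return bytes(out)

def unpack(data: bytes, n: int) -> str:
    """Recover the first n bases from packed data."""
    return "".join(BASES[(data[i // 4] >> (2 * (i % 4))) & 3] for i in range(n))

seq = "GATTACA"
packed = pack(seq)
print(len(seq), "bases ->", len(packed), "bytes")
print(unpack(packed, len(seq)))

# Scaling up: 6 billion bases at 2 bits each.
print(6_000_000_000 * 2 // 8 / 1e9, "gigabytes")
```

The length has to be stored separately because the last byte may be only partly filled.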

Why store your genome like this? This is probably the easiest possible format to store your genome in - it contains all the information you need to compare your sequence with someone else's, or to find out if you have that rare mutation in your GABRA3 gene that you saw on the news last night. It's everything you need and nothing you don't.

Now, most sensible people will probably be content with their 1.5 GB genome, especially as data storage becomes ever cheaper. But a few will want to squash it down further, particularly if they're storing lots of genomes (like a large sequencing facility, or your insurance company). In that case they can go one step further by taking advantage of the fact that at the DNA level all of us are very much alike.

4. The minimal genome: exploiting the universal reference sequence
I don't know who you are, but I do know that if you lined up our genomes you would find that we have a lot in common - almost all of the bases in our genomes are absolutely identical. Indeed, for any two randomly selected humans you will find, on average, that around 99.5% of their DNA is precisely the same (although the precise pieces that are different will of course differ from person to person). We can use this commonality to compress our genomes further using a clever trick: if we have a very good universal human reference sequence, we can ignore all the parts of our genome that match it, and only store the differences.

In practice, then, your personal genome sequence will comprise (1) a header, stating which reference sequence to compare to (this would ideally be a reference sequence from your own ethnic group), and (2) a set of instructions providing all the information required to transform that reference sequence into your own 6 billion base genome.

For convenience, I'll assume that the reference is stored as a single contiguous text file containing all 46 chromosomes joined together. To make your genome, a software package will start at one end of the reference sequence; each instruction will tell it to move a certain number of bases through the genome, and then change the sequence at that position in a specific way (it could either change that base to something else, insert new sequence, or delete the base entirely). In this way, sequentially running your personal instruction set will convert the reference sequence into your own genome, base by base.
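The procedure just described can be sketched in a few lines of Python. The instruction format here, a tuple of (bases to skip, operation, argument), is entirely hypothetical and chosen for readability; a real format would be a compact binary encoding.

```python
# Rebuild a personal sequence from a reference plus a list of edits.
# Hypothetical instruction format: (skip, op, arg), where
#   "sub" replaces the base at the current position with arg,
#   "ins" inserts the string arg before the current position,
#   "del" deletes arg (an integer) bases starting at the current position.
def apply_instructions(reference: str, instructions) -> str:
    out = []
    pos = 0
    for skip, op, arg in instructions:
        out.append(reference[pos:pos + skip])  # copy the unchanged stretch
        pos += skip
        if op == "sub":
            out.append(arg)
            pos += 1
        elif op == "ins":
            out.append(arg)
        elif op == "del":
            pos += arg
        else:
            raise ValueError(f"unknown operation: {op}")
    out.append(reference[pos:])  # copy whatever follows the last edit
    return "".join(out)

reference = "ACGTACGTAC"
edits = [(2, "sub", "T"), (3, "ins", "GG"), (1, "del", 2)]
print(apply_instructions(reference, edits))
```

Storing the distance from the previous edit, rather than an absolute position, is what keeps each instruction small: the gaps between differences are short, while absolute positions in a 6 billion base genome need more than four bytes each.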

How much data? This one is tricky because we still don't have a great idea of exactly how many differences exist between people. I'm going to make some rough guesses using this paper, which compares the genome of Craig Venter to the sequence generated by the Human Genome Project, and assuming that this paper under-represents the total number of variable sites by around 30% (due to missed heterozygotes and poor coverage of repetitive areas). For a diploid genome (i.e. one containing two copies of each chromosome, one from each parent) this gives an estimate of around 6 million single base polymorphisms, about 90,000 polymorphisms changing multiple bases, and about 1.8 million insertions, deletions and inversions.

Now, the instructions for each of the single base polymorphisms can be stored as 1.5 bytes each on average (enough space to store both the distance from the previous polymorphism, and a new base). Multiple-base polymorphisms will be perhaps 2 bytes each, allowing for the storage of a few additional changed bases. Deletions might be around 3 bytes to store the length of the deleted region. Insertions will be more complicated: if they simply duplicate existing material they might only take up 3 or 4 bytes, but if they involve the insertion of brand new material they will be much larger (3 bytes, plus 1 byte for every 4 new bases inserted). From the Venter genome the average insertion size is 11.3 bases, so let's say insertions take up 7 bytes on average.

Making some more assumptions and tallying everything up I get a total data-set on the order of 20 megabytes. In other words, you could fit your genome and the sequences of about 34 of your friends onto a single CD.
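One way to reproduce a total in this range is sketched below. The variant counts and per-variant byte costs are the ones quoted above; the even split of the 1.8 million structural changes between deletions and insertions (with inversions ignored) and the 700 MB CD capacity are my own assumptions for illustration, not necessarily the ones used in the original calculation.

```python
# Rough tally of the "differences only" genome.
# Assumptions: the 1.8 million insertions/deletions/inversions are split
# evenly between deletions and insertions; a CD holds 700 MB.
variants = {
    # name: (count, average bytes per instruction)
    "single-base polymorphisms": (6_000_000, 1.5),
    "multi-base polymorphisms": (90_000, 2),
    "deletions": (900_000, 3),
    "insertions": (900_000, 7),
}

total_bytes = sum(count * cost for count, cost in variants.values())
print(f"{total_bytes / 1e6:.2f} megabytes")

# Using the rounded 20 MB figure, how many genomes fit on one CD?
print(700 // 20, "genomes per CD")
```

Under these assumptions the single-base polymorphisms and the insertions account for most of the space, so the estimate is most sensitive to those two counts.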

Why store your genome like this? If you have a fetish for efficiency, or if you have a whole lot of genomes you need to store, this is the system to use. Of course, it relies on having access to a universally accessible reference sequence of high quality - and you would probably want to recalculate it whenever a new and better reference became available.

Squeezing your genome even further
Want to get even more genomes per gigabyte? Here's one efficiency measure you might want to consider: use databases of genetic variation. These might be especially useful for large, common insertions (rather than storing the entire sequence of the insertion, you can simply have a pointer to a database entry that stores this sequence).

Acknowledgments: the raw numbers and calculations in this post owe a lot to David Carter, Tom Skelly and James Bonfield from the Sanger Institute and Zamin Iqbal from the European Bioinformatics Institute, UK. Thanks guys!

Subscribe to Genetic Future.

Thursday, June 26, 2008

Gene Essence: what bad personal genomics looks like

[Update 11/01/11: Gene Essence appears to no longer be selling genetic testing, and its website has been replaced by a holding page.]

I mentioned yesterday that one of the companies recently targeted by the California health department was an outfit called Gene Essence. Gene Essence is a 23andMe-style genome scan service launched back at the end of March (covered by Hsien) by Biomarker Pharmaceuticals, a company that aims to develop "genomic and proteomic aging-intervention technologies" to "provide interventions that will extend a healthy human lifespan by slowing the process of aging, and delaying the onset of age-related diseases."

I haven't heard anything about Gene Essence since its launch was announced, so I was curious to see how the service had evolved. The first signs were good: the company is using the Affymetrix 6.0 SNP array, a solid platform that provides information on around 1 million genetic variants throughout the genome - although I can't find any information on where the testing is actually being performed.

Unfortunately, it's all downhill from there. The company thoughtfully provides a demo report allowing customers to see what they'll be getting for their $1,195 - and based on what that shows me I can't imagine anyone purchasing this service, especially given that 23andMe and deCODEme both offer infinitely superior products at a lower price.

To see just how bad Gene Essence's service is, first check out the 23andMe and deCODEme demo versions, here and here. Now take a look at the demo page where Gene Essence displays your "genetic susceptibility" for a set of different common diseases. Apparently the demo sample has a genetic susceptibility of 100% for atherosclerosis; but does that mean he is 100% likely to get atherosclerosis, that he has the maximum possible genetic risk for the disease, or simply that he has the riskier version of all of the known polymorphisms associated with the disease? It's impossible to tell from the page itself - and if you click on the bar for that condition you end up on this absolutely incomprehensible "detail page", which is no help at all.

OK, so let's try reading the manual. The "How to Read This Report" page is unhelpful, but the "Sample Reports" page provides some useful detail - and after a bit of digging it becomes clear that the "genetic susceptibility" score is an indicator of how many of the known risk variants a person carries, scaled by the relative effect of each variant on disease risk. We're supposed to be using the "adjusted trend" column, which "takes into account the fact that for each condition, a different proportion of the population will have a genetic trend value lower than yours". This doesn't make it any clearer what this actually means in terms of risk prediction; but rather than provide a useful clarification the page goes on to lay out a series of generic disclaimers (e.g. "SNPs are simply markers for the disease or condition and do not necessarily carry any predictive value in terms of assessing one’s susceptibility to a given disease").

In other words, the company provides a series of alarming and confusing predictions, and then simply tells you they don't necessarily mean anything.

There's no estimate of the individual's actual risk of the disease (as you would find in a 23andMe, deCODEme or Navigenics report), no indication of the fraction of the total variance in disease risk that is captured by the polymorphisms in your report, and no reference to whether the described associations are actually reliable and well-validated (for instance, it gives you information on 14 different variants associated with bipolar disease, not one of which has actually been independently replicated). That renders the information essentially meaningless and potentially seriously misleading for the typical customer.

By this stage anyone that Gene Essence somehow convinced to purchase their product would be desperately looking for a way to get some return on their $1,195 purchase. Well, they could always download their raw data and run it through Promethease, which would give them access to the information on each of their genetic variants from the public SNPedia database - except when they click on the "Complete SNP Data" page they'll find no straightforward way to download their entire data-set, but rather a box in which they can type a SNP identification number and download their genotypes one by one. A million of those is going to take some time, even if this feature was actually working (which it isn't, at least for the demo account).

In summary: based on the demo (which is all a potential customer would have to go on) this service is currently seriously bad; I find it hard to imagine that any potential personal genomics customer would pick it over its vastly more professional-looking, rigorous and user-friendly rivals. The fact that Gene Essence actually charges more for its service than 23andMe is utterly absurd. This is an example of what happens when a company tries to jump on the genome scan bandwagon without investing sufficiently in the knowledge-base and user interface required to present extremely complex data to a customer with a limited understanding of genetics.

This hardly inspires any further confidence in the company:
A person who declined to identify herself and who answered the phone at the number listed on Gene Essence's Web site said she didn't know anything about that business or Robert Danielzadeh, identified by the state as its chief executive.
I suggested yesterday that the best outcome in California might be a compromise, in which respectable personal genomics companies are allowed to continue operating (with slightly increased regulatory oversight) while amateurish efforts like Gene Essence slide quietly out of existence. While I don't agree with Steve Murphy that a doctor's explicit permission should be required to authorise a genome scan, there should be a minimum standard of validation and information reporting - and the current version of Gene Essence falls well below that standard.



Wednesday, June 25, 2008

Some surrender, some fight on: genetic testing companies respond to the California letter

[Image caption: The health department prepares to storm 23andMe headquarters]

Yesterday marked the deadline set by the California Department of Public Health for thirteen genetic testing companies to halt direct-to-consumer marketing and demonstrate their compliance with state regulations. Briefly, the state insists that genetic testing be conducted using specially licensed laboratories, and - more controversially - that all tests be ordered through a clinician (a proposal that stirred up howls of outrage from the human genetics blogosphere).

All thirteen targeted companies are now listed on the health department website. In addition to the "big three" genome-scan companies (23andMe, deCODEme and Navigenics) and the boutique genome-sequencing company Knome (which offers a whole genome sequence for a cool $350,000) there is a bizarre menagerie of less well-known companies. Three of them (Salugen, Sciona and Suracell) are members of the maligned "nutrigenetics" industry, while New Hope Medical Center appears to be a generic unconventional medicine provider that's dabbling in genetics. Gene Essence is a weak attempt by Biomarker Pharmaceuticals to jump on the genome-scan bandwagon. DNATraits, which I've mentioned previously, and Portugal's CGC Genetics both market tests for a variety of rare single-gene diseases. Finally there are two specialised testers: HairDX, which tests polymorphisms associated with hair loss, and Smart Genetics, which focuses on APOE testing for Alzheimer's risk.

You can download the letters from the health department's website, although they're hardly riveting reading. I'm surprised to see deCODEme on the list since residents of California (and several other states) are explicitly restricted from using the service's genetic risk calculator. (As an irrelevant aside, I imagine that 23andMe's Linda Avey was slightly annoyed by the consistent misspelling of her last name, and the burly CEO of deCODE would have been less than impressed with being referred to as "Ms. Stefansson"!)

The responses to the health department letter have ranged from immediate surrender to preparations for battle. Wired reports that both baldness specialist HairDX and nutrigenetics company Sciona have dropped out of the direct-to-consumer market in California rather than face the legal wrath of the CDPH; SeqWright, which offers a pretty half-hearted genome scan service, packed its bags without even receiving a letter. In contrast, 23andMe and Navigenics - which both presumably have the financial backing to sustain a legal wrangle - are planning to stay. Wired reports on a statement released to them by San Francisco-based 23andMe, which clearly has no intention of losing a foothold in its home state:
We believe we are in compliance with California law and are continuing to operate in California at this time. Our testing is conducted in an independent CLIA-certified laboratory and we utilize the services of a California licensed physician. However, we would like to have continued discussions with the Department regarding the appropriate regulation of this unique industry.
23andMe only recently moved to a lab that was certified under the Clinical Laboratory Improvement Amendments of 1988 (CLIA), presumably as an attempt to head off this sort of regulatory crack-down. So basically, they're arguing that they're not in breach of the rules and will continue to test California residents until the law changes.

Navigenics, meanwhile, has sent a detailed defence of their position to the health department - and it starts with a much more audacious argument:
In a letter sent to the Health Department obtained by Wired.com, the company argues that it does not actually perform genetic tests, and therefore should not be regulated as a clinical laboratory under California state law.

Instead, Navigenics argues it merely applies algorithms to DNA data it receives from tests performed by a third-party, a licensed laboratory.

[...]

"Nothing in the definition of a clinical laboratory test supports a conclusion that the interpretation of the data resulting from such a test is itself a test," Navigenics wrote in its response.
Because the company out-sources the actual genome scan to a certified lab run by Affymetrix, it never actually touches the DNA - it only handles the information derived from that DNA. Wired comments:
Navigenics is arguing that once the state-licensed lab turns a biological sample into digital data, DNA is no longer within the purview of health department laboratory regulation. Navigenics is just an information service, combining scientifically-published genetic disease correlation data with personal genotype data. [my emphasis]
Apparently Navigenics' lawyers think this intriguing argument has a chance of winning over the health department; but just in case that doesn't work, they have a more conventional second line of defense that sounds pretty similar to 23andMe's argument above. Navigenics has been using a CLIA-certified lab since it launched in April and has emphasised its use of a pet physician who rubber-stamps every ordered test, which it feels is sufficient to place it on the right side of Californian regulations.

We'll see what the health department makes of these arguments. I can imagine several ways this might all pan out (not being a lawyer, I won't pretend I have much idea how likely any of them are).

Firstly, the more respectable, well-backed companies (the big three and Knome) may well be able to hack out a compromise deal with the department and continue operating under only slightly more stringent regulations, while the bottom-dwellers (e.g. the nutrigenetics companies) are forced to move their business to more accommodating regulatory climates. This is probably the best outcome we can hope for - consumers still have the freedom to analyse their own genome without asking their doctor's permission, the department walks away with a sense that the industry has been cleaned up, and the science-free scammers that blight the industry are dealt an important blow.

Secondly, insanely, the department could insist that customers need face-to-face consultation with a real live doctor to order any genetic tests. If they stuck to their guns, you could basically kiss goodbye to the fledgling personal genomics industry in California: faced with the obstacle of seeking a doctor's approval to have their scan (rather than five minutes on the web with a credit card) most potential customers simply won't bother. If this legislation spread to other states the industry as a whole would be set back by years.

Finally, the department might only push for very strong regulation of any health-related genetic test but leave other forms of genetic testing (ancestry and non-disease traits) alone. The result of this would be ironic: Navigenics, which adopted a serious clinical-centred demeanour from the beginning to steer clear of charges of frivolity that might bring regulatory attention, would be hit much harder than the more relaxed 23andMe. Navigenics has very deliberately avoided doing anything that looked remotely like 23andMe's brand of "recreational genomics" - but that has left them with nothing to fall back on should regulation clamp down hard on health-focused genetics. 23andMe, on the other hand, could probably get by purely on the basis of the non-disease components of their package.

Anyway, the next week will be a pretty interesting window in the young life of the personal genomics industry - the decisions made now will likely have major effects on the way the industry evolves over the next few years, in the lead-up to the era of cheap whole-genome sequencing. Stay tuned...

(Thanks to a reader for pointing out the "Ms Stefansson" reference.)



Tuesday, June 24, 2008

The market for personal genetic testing

Helix Health's Steve Murphy argues, on the basis of a recent poll of 550 "upscale business professionals":
Only [those who respond that they are "very likely" to get a genetic test for disease risk "in the next few years"] will get the test [...] So, I remain certain, the market for these tests is 5%
Right now, public awareness of direct-to-consumer genetic testing is pretty minimal. That's changing fast, though: anyone tracking Google news alerts will have seen the recent spikes in mainstream media coverage of the genetic testing industry, particularly 23andMe. As people hear more and more about genetic testing it will cease being scary and new, and (like IVF or screening for Down syndrome) become familiar and legitimised.

Meanwhile, over the next five years two things will happen: (1) the massive decrease in the cost of sequencing will bring large-scale genetic analysis within the grasp of the average upper middle-class consumer; and (2) our understanding of the genetics of common disease will increase exponentially, rapidly increasing the clinical (and recreational) value of genome sequence data. Thus the cost-to-benefit ratio of personal genomics will shrink incredibly quickly, even as frequent media coverage creates wider familiarity with the concepts and jargon of genetic testing, and simultaneously a host of influential early adopters publicly discuss their genetic testing experiences.

Basically, the medical and social benefits of having your genome sequenced will make this option steadily more attractive while declining costs will make it steadily more affordable. That's a recipe for a market explosion.

So, let's grant Steve's argument that the true size of the market right now is 5% of "upscale business professionals". That's going to be a pretty damn healthy market once Oprah starts talking about how 23andMe changed her life and the guy two cubicles down from you is bragging about his new genome sequence.

Steve isn't getting this, though, I suspect because he keeps seeing it through his clinician prism. For instance, he claims:
It is just like a referral to see another doctor.....if you aren't feeling ill, only the very likely will ever go see that specialist.....It is called the attrition rate and is commonly understood in medical care......only 20% of your "presymptomatic ill" ever go see the referral.
That's because standard medical testing is so incredibly boring. Genome scans and sequences, on the other hand, are cool. They're based on fancy new technology; you've read about them in Wired; and they tell you interesting things about yourself. People will want to take these tests.

And, if the medical establishment gets its act together and starts proving that clinicians can actually value-add to genetic health data (rather than pissing potential clients off with regressive regulations), people will want to take their test results to their doctor to put them in a broader health context. That's win-win - but only if doctors do things the right way.



Friday, June 20, 2008

The adventure gene

Last week I mentioned a study suggesting that a genetic variant associated with attention deficit disorder (ADD) is beneficial for individuals living in unsettled, nomadic groups but detrimental to those in modern sedentary societies.

The Economist has an interesting article on the study that includes the figure on the left, which is based on data from a 1999 study of 2,320 individuals from 39 populations. Basically, it shows the distance that each population has migrated over the last 1,000-30,000 years on the x-axis, and the frequency of DRD4 "long alleles" (versions of the gene closely related to the ADD-associated version) on the y-axis.

There is an intriguing correlation, suggesting that the "novelty-seeking" behaviour associated with ADD may have extended into a desire to explore new territories.

Razib has been discussing DRD4 quite a bit recently over on GNXP, including a potential gene-society interaction influencing political beliefs.

(Thanks to Simon, who needs to start his own blog ASAP.)



Demonstrate evolutionary innovation, win $20,000

The InnoCentive website is a kind of ideas marketplace where "Seekers" pose detailed descriptions of theoretical or technical problems in their field and offer financial incentives for "Solvers" to figure them out.

One of this week's challenges would be of interest to all the theoretical and experimental evolutionary biologists out there, especially given the $20,000 reward money on offer:
During the evolution of life on Earth new biological features have emerged in a process that continues without reaching an obvious maximum level of organized complexity, even now. The Seeker is interested to know if this apparently open-ended evolutionary innovation would be possible in a quarantined system. If so, can it be demonstrated? The goal of this theoretical challenge is to come up with an acceptable demonstration design to provide a positive result.

The Challenge – Design the best way to achieve a demonstration of open-ended evolutionary innovation (OEEI) in a quarantined system (QS) whose outcome can be judged without dispute. The demonstration may be proposed to take place in a biological culture, a computer model, or any medium that can do the job.
You'll need to register as a Solver to read the details of the challenge (which requires agreeing to not disclose the information therein). The Seeker has also set up a Google Groups page to discuss the issue, where someone has already suggested that this process was well-demonstrated a fortnight ago by Richard Lenski's group.

Of course, it all depends on your definition of "innovation" - and I hope I'm not crossing any legal boundaries by saying that the Seeker's definition in the detailed challenge is far from satisfying.



Thursday, June 19, 2008

Wired on regulation of DTC genetic testing

The Wired website is abuzz with news and opinions regarding the California Public Health Department's recent letter to direct-to-consumer genetic testing companies, which advised them to cease testing customers until they have demonstrated compliance with state regulations - importantly, including the provision that all "clinical laboratory tests" (apparently including a genome scan) be ordered through a doctor.

Such regulation goes against the general Wired philosophy of unfettered access to technology, so the response is not positive. Wired began the day with a list of the Top 10 Reasons that Regulators Should not Hinder Genetic Testing, followed with an exclusive release of the letter that started it all, a report on a 3-hour health department conference call on the issue last Friday, and then a summary of the whole debacle.

I'm not going to go on about this topic (this will hopefully be my last post on it for a while) but the links above are well worth a scan for anyone concerned about the future of the personal genomics industry.



Wednesday, June 18, 2008

Cat-fight over California



A couple of days ago I mentioned California's regulatory smackdown of direct-to-consumer genetic testing companies, and the triumphant response of Helix Health's Steve Murphy to the news. Steve has long argued that physicians need to play a role in any genetic testing with potential health consequences - unsurprisingly, given that this is exactly what his business model depends on. California's regulatory response struck Steve as exactly what the doctor ordered:
That's right, in order to test, you are soon going to have to document informed consent and evaluation by physician or extender. So anyone not doing what Helix Health is doing will soon find themselves on the wrong side of the law....again.
Well, Steve's tone certainly got him some attention. In the comments to his post, Steve's response to the news was described as "rent-seeking" and "arrogant"; and today, Wired blogger Thomas Goetz criticises Steve's "paternalistic tone" and argues strongly against a physician-centred vision for the future of the personal genomics industry:
Having been tested by both 23andMe and Navigenics, I can say that, yes, it's complicated. But frankly I don't need a doctor, and I don't want a doctor, to facilitate my understanding of what my DNA means.

[...]

This is not a dark art, province of the select few, as many physicians would have it. This is data. This is who I am. Frankly, it's insulting and a curtailment of my rights to put a gatekeeper between me and my DNA.
Deepak Singh from bbgm is similarly outraged by the regulators, given the ignorance of the average medico in matters genetical:
This is my data and it’s my decision. Regardless of what you think about the services and their utility or lack thereof, it’s ludicrous to think that doctors, most of whom know less about genetics than I do, need to make a call on this. It’s a personal decision. [my emphasis*]
I naturally side with the freedom-of-genetic-information folks on this issue. As I said in my last post (in a comment that was echoed by David Ewing Duncan in Wired), what we're seeing here is to some extent a turf war, with the medical establishment trying to use legislation to claw back deference and power that used to be theirs automatically. Unfortunately for them, we're already living in a world in which people justifiably feel a strong sense of ownership over their genetic information; that's not going to change, and doctors will simply have to adapt to the new rules of play.

However, at the same time I want to emphasise the dangers of heading too far in the opposite direction. Three hours reading the 23andMe website and Googling "type 2 diabetes" does not make you an expert on this disease. Seriously - they may not know what a SNP is, but when it comes to diseases doctors know stuff that you can't get from Google.

Now, I don't believe that I should need a doctor's permission to sequence my own genome, and I see no need for laws compelling me to see one afterwards. Nonetheless, it is crucial that readers don't walk away from this discussion feeling that the medical profession is suddenly irrelevant to the shiny, high-technology era of health genomics. Clinicians will be absolutely critical when it comes to making personal genomics medically useful - or at least they will be once the medical establishment stops trying to make itself relevant through legislation, and instead focuses on providing doctors who are capable of engaging with genomic data.

If they manage this, in a few years' time nearly every personal genomics customer will be reviewing their sequence data with a medical professional: not because the law says they have to, but because they want to.

It's unfortunate that Steve's clear conflict of interest undermines his credibility here (as most of his critics have pointed out, some savagely) since his position is not baseless: clinicians do have an important role to play. He just needs to realise that brute legislative force is not the right way to achieve this end.


* It's worth noting that the knowledge of genetics of most of the people commenting on this topic (certainly including Deepak) is waaaaay better than the average personal genomics customer - I think we all need to carefully calibrate our outrage to allow for the fact that the regulators are aiming at people who know much less about this stuff than the typical genetics blogger does. We should still be annoyed; just slightly less than we actually are.



Tuesday, June 17, 2008

Finer mapping of European ancestry using personal genomic data

Dienekes has put together a neat little tool that uses variation data from either 23andMe or deCODEme to estimate whether a subject's ancestry stems mainly from Northwest European, Southeast European, or Ashkenazi Jewish populations, using a set of published markers that are strongly differentiated between these population clusters. There's more detail on how the tool operates here; to use it, you'll first need to download the R software environment from here.

Like the Promethease tool, this is a great example of how the human genomics community can help genetic testing customers wring the most out of the rich, complex trove of data in their own genome.



Growth in commercial disease gene tests

From GeneTests, via OpenHelix:

[Figure: Lab_Test_Growth, the GeneTests chart of growth in disease gene tests and testing laboratories]

That's pretty impressive, but just wait: within the year, next-generation sequencing will make it commercially viable to sequence thousands of genes at once; and a year or two after that it will be more cost-effective to just sequence the entire genome and be done with it, effectively maxing out the curve at the level of "every disease for which the causative genes are known". After that, the rate of growth in the curve will be determined by the rate at which new disease genes can be discovered and characterised.

You can see that the number of testing laboratories is already plateauing - it will be interesting to see how quickly this declines over the next few years as large sequencing-based companies either engulf the boutique specialist laboratories or drive them out of business. There are ruthless economies of scale in the human disease genomics business, both in terms of sequencing infrastructure and the costs of assembling reliable knowledge bases for interpretation, so it will be increasingly difficult for smaller companies to stay competitive.



Sunday, June 15, 2008

California cracks down on genetic testing companies

California has sent cease-and-desist letters to 13 direct-to-consumer genetic testing companies, "ordering them to immediately stop offering genetic tests to state residents". The companies haven't been named yet, although Navigenics has admitted to being among them, and is arguing that they are doing nothing wrong. Steve Murphy of Helix Health, who has been calling for tighter regulation of these companies for months, is predictably triumphal.

Naturally, this is just the beginning - to a large extent what's going on here is a turf war between proponents of the old-school medical regulation model and upstart advocates of the free information paradigm of the Google generation. Expect to see more regulatory punches thrown over the next few months, to Steve's continued delight.



Wednesday, June 11, 2008

The 1000 Genomes Project: battle-ground for next-gen sequencers

The 1000 Genomes Project is an ambitious international venture, launched back in January, that seeks to leverage advances in DNA sequencing technology to create a map of human genetic variation with unprecedented resolution.

The formal scientific goal of the project is "to produce a catalog of [genetic] variants that are present at 1 percent or greater frequency in the human population across most of the genome, and down to 0.5 percent or lower within genes". This will involve the generation of mind-boggling amounts of sequence data: according to Paul Flicek from the European Bioinformatics Institute, who has been coordinating data storage and transfer for the project, the final data-set will probably take up close to 1 PB (one petabyte, i.e. one million gigabytes)!

The scale of this project has only become feasible over the last year or two with the appearance of so-called "next-generation" sequencing platforms: new technologies that are capable of reading billions of bases of DNA in a single run, over just a few days. Although the field is rapidly heating up, right now there are three commercial platforms in the market: the 454 system provided by Roche, Applied Biosystems' SOLiD technology, and Illumina's Genome Analyzer (previously known as Solexa). Each of the platforms has its positive and negative points, but those aren't relevant for this post: suffice it to say that there is currently a serious arms race between these three products, with each of the respective companies eyeing the lucrative medical sequencing market just around the corner.

This arms race appears to be great news for the 1000 Genomes Project. As an Applied Biosystems press release spells out today, the major companies are literally giving away sequencing capacity to the Project in return for product exposure - all three of them have now committed to sequencing 75 billion bases, the equivalent of sequencing a complete human genome 25 times over. In the press release, Project steering committee co-chair Richard Durbin says:

"It is a win-win arrangement for all involved. The companies will gain an exciting opportunity to test their technologies on hundreds of samples of human DNA, and the project will obtain data and insight to achieve its goals in a more efficient and cost-effective manner than we could without their help."
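The "25 times over" figure is simple arithmetic, and it's easy to check. Here's a minimal sketch in Python, assuming a haploid human genome of roughly 3 billion bases (my round number, not one from the press release); the constant and function names are purely illustrative.

```python
# Back-of-the-envelope check of the donated sequencing capacity.
# ASSUMPTION: a haploid human genome is ~3 billion bases (round figure).
GENOME_SIZE_BASES = 3_000_000_000

# From the press release: each company has committed 75 billion bases.
COMMITTED_BASES_PER_COMPANY = 75_000_000_000
NUM_COMPANIES = 3  # Roche (454), Applied Biosystems (SOLiD), Illumina


def fold_coverage(total_bases, genome_size=GENOME_SIZE_BASES):
    """Return how many times over a genome of the given size is covered."""
    return total_bases / genome_size


per_company = fold_coverage(COMMITTED_BASES_PER_COMPANY)
combined = fold_coverage(COMMITTED_BASES_PER_COMPANY * NUM_COMPANIES)

print(f"Per company: {per_company:.0f}x a single genome")
print(f"All three combined: {combined:.0f}x a single genome")
```

Under that assumption each company's donation works out to 25 genome-equivalents, and the three together to 75. Note that this counts raw bases only; it says nothing about how that sequence will be spread across the Project's samples.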

The stakes are extremely high here. Involvement in the 1000 Genomes Project gives these companies an opportunity to prove their technology to the researchers in major sequencing facilities who will ultimately be some of their biggest customers, while at the same time providing some valuable public relations stories - Illumina got some good coverage with their "first African genome" back in February, for instance. The platforms that can prove themselves early in the game will have a serious edge over later competitors, simply by being well-established in the large facilities by the time young upstarts like Pacific Biosciences even have a product on the market. And when you think about just how much money there will be in the medical sequencing business within the next few years, Durbin's "exciting opportunity" starts to seem like classic British understatement.

Anyway, you can expect things to get pretty serious over the next few months. The involvement of all three platforms in the 1000 Genomes Project (both as providers of free sequence, and in terms of machines in the participating facilities) provides a level playing field for comparing the throughput, accuracy, ease of use and running costs of the three competitors. They will be engaged in the sequencing activity that matters the most right now: assembling entire human genomes as cheaply, accurately and quickly as possible.

Meanwhile, those of us interested in personal genomics get to sit back and watch the fireworks, knowing that it doesn't really matter who wins this race - so long as it's being run, our genome sequences are becoming steadily more affordable.



Tuesday, June 10, 2008

The adaptive origins of attention deficit disorder

Razib from Gene Expression describes a potentially fascinating study on a variant of the DRD4 gene, which was first shown to be associated with attention deficit hyperactivity disorder (ADHD) more than ten years ago. (It's worth emphasising, by the way, that DRD4 is just one of the many genes likely to be involved in this complex trait). Interestingly, the same variant has also been reported to show a signature of recent positive selection in some human populations, suggesting that the behavioural "problems" displayed by modern individuals with ADHD may actually result from a mis-match between the environment our hunter-gatherer ancestors were adapted to and the bizarre, restrictive environment of Homo suburbanensis.

The full article hasn't actually been released yet, so we're all forced to play the now-familiar game of "science by press release" based on an article in ScienceDaily. Apparently, researchers directly tested the mis-match hypothesis by looking at the effects of the ADHD version of the gene on body mass index, a crude measure of nutrition levels, in men from the Ariaal tribe in Kenya. Some of the members of this tribe are nomadic while others live in settled communities. In agreement with the predictions of the mis-match hypothesis, those with the ADHD version of DRD4 were fatter (i.e. healthier) in nomadic populations, but skinnier (unhealthier) in settled groups.

The explanation of the results by the study's lead author:

"The DRD4/7R allele [i.e. the ADHD version of the gene] has been linked to greater food and drug cravings, novelty-seeking, and ADHD symptoms. It is possible that in the nomadic setting, a boy with this allele might be able to more effectively defend livestock against raiders or locate food and water sources, but that the same tendencies might not be as beneficial in settled pursuits such as focusing in school, farming or selling goods".

In other words, behaviour that would result in a rapid trip to the headmaster (followed by the psychiatrist) in stable, industrialised society may actually have been extremely useful in the relatively uncertain world of the hunter-gatherer tribesman.

Of course, simply knowing that ADHD is "natural" doesn't necessarily make it any easier to solve the broader problem of how society should be dealing with individuals with ADHD. In an ideal world, education would be tailored to the unique learning demands of individual students, resulting in the maximisation of each child's potential skills. However, in a world with limited resources for education, is society's current approach (medicate them until they shut up and learn like the other kids) the only workable solution? Or can we figure out a way to restructure society such that "obsolete" ADHD tendencies become useful again?



Sunday, June 1, 2008

Brain scanning vs personal genomics

Personal genomics companies like 23andMe, deCODEme and Navigenics have taken substantial media flak recently over their limited ability to make useful disease risk predictions based on genome scan data.

There's certainly some truth to that accusation, but all three of these companies have been generally good at conveying this uncertainty to their customers. In particular, I've been amazed by the tendency of deCODEme to play down the usefulness of their tests in terms of disease prediction. I've previously mentioned deCODE's Kari Stefansson's admission at Cold Spring Harbor Laboratory that deCODEme is "marketing these tests without any claim that they will impact on people's lives"; a couple of weeks ago I attended a seminar on personal genomics in Cambridge, UK, where deCODEme's Agnar Helgason volunteered that "what we can offer at the moment is pretty meagre". Navigenics and 23andMe tend to avoid such frank admissions, but their predictions are still very carefully phrased in statistical terms.

In any case, the accuracy of predictions based on personal genomics starts to look much more impressive when it's compared to some of the other 'science-based' prediction industries out there. A recent article in Wired has a fairly scathing review of one such field: the use of functional brain scans to predict risks of mental illness, personality traits, dishonesty, political views and consumer behaviour.

Given the $3,300 price tag for one of the services on offer (from a company that is "dedicated to optimizing the brain-life connection in our patients and people worldwide", according to their website), this is a pretty expensive piece of foolishness. No doubt some patients benefit from the service, but it's likely that they would have gained just as much from a visit to a good counsellor without the fancy brain scans, at a small fraction of the cost.

Long-term readers will know that I don't recommend current genome scans - my suggestion is that potential customers save their money for a few years, by which time large-scale sequencing will be affordable, and we will know much more about disease-associated genetic variants - but if I had to pick a fancy technology to waste my money on I'd go with a genome scan over a brain scan any day.

