Andrew from Think Gene has finally prompted me to write a post I've been working on sporadically for a month or so. The question is pretty simple: in the not-too-distant future you and I will have had our entire genomes sequenced (except perhaps those of you in California) - so how much hard drive space will our genomes take up?
Andrew calculates that a genome will take up about two CDs' worth of data, but that's only if it's stored in one possible format (a text file storing one copy of each and every DNA letter in your sequence). There are other ways you might want to keep your genome, depending on what your purpose is.
The executive summary
For those who don't want to read through the tedious details that follow, here's the take-home message: if you want to store the data in a raw format for later re-analysis, you're looking at between 2 and 30 terabytes (one terabyte = 1,000 gigabytes). A much more user-friendly format, though, would be as a file containing each and every DNA letter in your genome, which would take up around 1.5 gigabytes (small enough for three genomes to fit on a standard data DVD). Finally, if you have very accurate sequence data and access to a high-quality reference genome you can squeeze your sequence down to around 20 megabytes.
The details
For the first two formats I'll assume that someone is having their genome sequenced using one of today's cutting-edge sequencing technologies, the Illumina 1G platform. The 1G platform and its rivals, Roche's 454 and Applied Biosystems' SOLiD, are the instruments currently being used to sequence over 1,000 individuals for the international 1000 Genomes Project; if you were to have your genome sequenced right now it would almost certainly be on one of these platforms.
The Illumina technology basically sequences DNA as a huge number of short (36-letter) fragments, called reads. Because read lengths are so short and the system has a fairly high error rate, assembling an entire genome would require what's called 30x coverage - which basically means each base in the genome is sequenced an average of 30 times.
Once the reads have been generated, they are assembled into a complete genome with the help of a universal reference genome, the sequence created by the Human Genome Project from a mixture of DNA from several individuals. Even with this high level of coverage there is still considerable uncertainty involved in re-assembling a genome from very short fragments, and both the algorithms used to perform this assembly and the reference genome are being constantly improved. Thus for the moment there may be some advantage in storing your data in a raw format, so that in a few months' time you can take advantage of better software and a more complete reference genome to reconstruct your own sequence more completely.
For the third and fourth formats, I've moved into the future: basically, I'm assuming that we now have access to affordable sequencing technology that can generate extremely long and accurate reads from a single molecule of DNA. That would allow you to reconstruct your entire genome - both sets of chromosomes, one from your mother and one from your father - with very high confidence. In that case you no longer need to store your raw data, and we can instead start thinking about the most efficient possible way to keep your entire genome on disk.
Note that in what follows, for the sake of simplicity I am ignoring the effects of data compression algorithms. It's likely that you could shrink down these data-sets (especially the image files) by quite a bit using even straightforward compression.
Anyway, enough background. Let's get started.
1. For hard-core data junkies only: raw image files
To put it very simply, the Illumina 1G platform sequences your DNA by first smashing it up into millions of fragments, binding those fragments to a surface, and then feeding in a series of As, Cs, Gs and Ts. As these bases are incorporated into the DNA fragments they set off flashes of light that are captured by a very high-resolution camera, resulting in a series of pretty coloured images such as the one on the left (which is actually a montage of four images, one for each base). Each of those spots represents a separate fragment of DNA, captured at the moment that a single base (A, C, G or T, each labelled with a different colour) is read from that fragment. By building up a series of these images the machine accumulates the sequence of the first ~36 bases of those fragments in the image, after which the sequence quality starts to drop off.
Almost as soon as these images are generated they are fed into an algorithm that processes them, creating a set of text files containing the sequence of each of the fragments. The image files are then almost always discarded. Why are they discarded? Because, as you will see in a minute, storing the raw image data from each run in even a moderate-scale sequencing facility quickly becomes prohibitively expensive - in fact, several people have suggested to me that it would be cheaper to just repeat the sequencing than to store these data long-term.
How much data? Each tile of an Illumina machine will give you accurate sequence information for around 25,000 DNA fragments. A separate image is obtained for each of the four bases, with each "snap-shot" comprising around 2 MB of data. Four 2 MB images shared between 25,000 fragments comes to a total of 320 bytes/base. For an entire genome (around 3 billion bases) with 30x coverage, that comes out as around 28.8 terabytes of data. That's almost 30,000 gigabytes!
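For anyone who wants to check the arithmetic, here is a rough sketch of the calculation above. The figures are the ones quoted in this post; the decimal definition of a megabyte and the 3 billion base genome length are my own assumptions for the sake of the estimate.

```python
# Back-of-the-envelope estimate of raw image storage, using the post's figures.
IMAGE_BYTES = 2_000_000               # ~2 MB per snapshot (decimal megabytes assumed)
IMAGES_PER_CYCLE = 4                  # one image per base: A, C, G and T
FRAGMENTS_PER_TILE = 25_000           # fragments with accurate sequence per tile
HAPLOID_GENOME_BASES = 3_000_000_000  # assumption: roughly 3 billion bases
COVERAGE = 30                         # each base sequenced ~30 times on average

# Every sequencing cycle reads one base from each fragment on the tile,
# so the four images are shared between 25,000 bases.
bytes_per_base = IMAGE_BYTES * IMAGES_PER_CYCLE / FRAGMENTS_PER_TILE

total_bytes = bytes_per_base * HAPLOID_GENOME_BASES * COVERAGE
total_terabytes = total_bytes / 1e12

print(f"{bytes_per_base:.0f} bytes/base, {total_terabytes:.1f} TB in total")
```

Changing `COVERAGE` or `FRAGMENTS_PER_TILE` shows how sensitive the total is to the platform's yield per image.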
Why store your genome like this? Well, either you believe that image-processing algorithms are likely to improve in the near-future, thus allowing you to squeeze a few more bases out of your data; or you have a huge bunch of data servers lying idle that you want to do something with; or you're just a data junkie. However, your actual sequence data is not readily accessible in this format, so you'd also want to be keeping at least a roughly assembled version of your genome around to examine as new information about risk variants becomes available.
2. For DIY assemblers: storing individual reads
I mentioned above that those monstrous image files are rapidly converted into text files containing the sequence of each of your ~36-base reads. The files that are generally used here are called Sequence Read Format (SRF) files, which are used to store the most likely base at each position in the read along with other associated data (such as quality scores).
How much data? It depends what sort of quality information you keep: at the high end you'd be looking at around 22 bytes/base (1.98 terabytes total) to store raw trace data, while at the low end you could just store sequence plus confidence values for around 1 byte/base (90 gigabytes total). That's starting to become feasible - you could now store your genome data on an affordable portable hard drive.
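The same kind of rough sum works for read storage. This sketch assumes, as above, a genome of about 3 billion bases sequenced at 30x coverage; the function name is just for illustration.

```python
# Storage needed for individual reads at different levels of per-base detail.
HAPLOID_GENOME_BASES = 3_000_000_000  # assumption: roughly 3 billion bases
COVERAGE = 30
SEQUENCED_BASES = HAPLOID_GENOME_BASES * COVERAGE  # 90 billion bases in total


def read_storage_bytes(bytes_per_base: int) -> int:
    """Total bytes needed if every sequenced base costs `bytes_per_base`."""
    return SEQUENCED_BASES * bytes_per_base


high_end = read_storage_bytes(22)  # raw trace data
low_end = read_storage_bytes(1)    # sequence plus confidence values

print(f"high end: {high_end / 1e12:.2f} TB")
print(f"low end:  {low_end / 1e9:.0f} GB")
```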
Why store your genome like this? This is a pretty efficient way to store your raw read data while you wait for improvements in both the reference human genome sequence (which is still far from complete) and assembly algorithms. As with the previous format, though, you'd also want to store your sequence in a more readily accessible assembled form so you could actually use it.
3. Your genome, your whole genome, and nothing but your genome
OK, now let's gaze a few years into the future, and assume (fairly safely) that new technologies for generating accurate, long reads of single DNA molecules have become available. This means you can stitch your entire genome together very easily, allowing you to store the whole 6 billion bases of it in a text file - this is the type of data storage approach that Andrew discussed in his post. In essence, you're storing every single base in your genome as a separate character in a massive, 6 billion letter long text file.
How much data? Each DNA base can be stored in two bits of data, so your complete genome (both sets of chromosomes) tallies up to around 1.5 gigabytes of data. If you wanted to store some associated confidence scores for each base (indicating how likely it is that you sequenced that section of your DNA correctly) that might take you up to 1 byte/base, or a total of around 6 gigabytes. Either way, you could now fit your genome on a cheap USB thumb drive.
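To see where the two-bits-per-base figure comes from, here is a minimal sketch of one way to pack bases four to a byte. The particular encoding (A=0, C=1, G=2, T=3) is an arbitrary choice of mine, and the sketch assumes a sequence containing only those four letters, with no ambiguous bases.

```python
# Pack DNA bases into 2 bits each (four bases per byte) and unpack them again.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}  # arbitrary illustrative encoding
BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Pack a string of A/C/G/T into bytes, four bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODES[base] << (2 * (i % 4))
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Recover the first `length` bases from packed bytes."""
    return "".join(
        BASES[(data[i // 4] >> (2 * (i % 4))) & 3] for i in range(length)
    )


packed = pack("GATTACA")
print(len(packed), unpack(packed, 7))

# The same ratio applied to a 6 billion base diploid genome:
genome_bytes = 6_000_000_000 * 2 // 8
print(f"{genome_bytes / 1e9:.1f} GB")
```

Note that the sequence length has to be stored separately, since the final byte may be only partly filled.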
Why store your genome like this? This is probably the easiest possible format to store your genome in - it contains all the information you need to compare your sequence with someone else's, or to find out if you have that rare mutation in your GABRA3 gene that you saw on the news last night. It's everything you need and nothing you don't.
Now, most sensible people will probably be content with their 1.5 GB genome, especially as data storage becomes ever cheaper. But a few will want to squash it down further, particularly if they're storing lots of genomes (like a large sequencing facility, or your insurance company). In that case they can go one step further by taking advantage of the fact that at the DNA level all of us are very much alike.
4. The minimal genome: exploiting the universal reference sequence
I don't know who you are, but I do know that if you lined up our genomes you would find that we have a lot in common - almost all of the bases in our genomes are absolutely identical. Indeed, for any two randomly selected humans you will find, on average, that around 99.5% of their DNA is precisely the same (although the particular pieces that differ will of course vary from person to person). We can use this commonality to compress our genomes further using a clever trick: if we have a very good universal human reference sequence, we can ignore all the parts of our genome that match it, and only store the differences.
In practice, then, your personal genome sequence will comprise (1) a header, stating which reference sequence to compare to (this would ideally be a reference sequence from your own ethnic group), and (2) a set of instructions providing all the information required to transform that reference sequence into your own 6 billion base genome.
For convenience, I'll assume that the reference is stored as a single contiguous text file containing all 46 chromosomes joined together. To make your genome, a software package will start at one end of the reference sequence; each instruction will tell it to move a certain number of bases through the genome, and then change the sequence at that position in a specific way (it could either change that base to something else, insert new sequence, or delete the base entirely). In this way, sequentially running your personal instruction set will convert the reference sequence into your own genome, base by base.
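Here is a toy sketch of how such an instruction set might be replayed against a reference. The instruction layout (a skip distance, an operation name and an argument) and the operation names `sub`, `ins` and `del` are hypothetical, invented purely for illustration; a real format would pack these into a few bytes each rather than using Python tuples.

```python
from typing import List, Tuple, Union

# Hypothetical instruction layout: (bases to skip, operation, argument).
#   "sub": argument is the replacement base(s)
#   "ins": argument is the new sequence to insert
#   "del": argument is the number of reference bases to drop
Instruction = Tuple[int, str, Union[str, int]]


def apply_instructions(reference: str, instructions: List[Instruction]) -> str:
    """Walk along the reference once, applying each edit in order."""
    out = []
    pos = 0  # current position in the reference
    for skip, op, arg in instructions:
        out.append(reference[pos:pos + skip])  # copy the unchanged stretch
        pos += skip
        if op == "sub":
            out.append(arg)
            pos += len(arg)   # the replaced reference bases are consumed
        elif op == "ins":
            out.append(arg)   # nothing in the reference is consumed
        elif op == "del":
            pos += arg        # skip over the deleted reference bases
        else:
            raise ValueError(f"unknown operation: {op}")
    out.append(reference[pos:])  # copy whatever follows the last edit
    return "".join(out)


reference = "ACGTACGTAC"
personal = apply_instructions(
    reference,
    [(2, "sub", "T"), (3, "ins", "GG"), (1, "del", 2)],
)
print(personal)  # ACTTACGGGC
```

Because each instruction stores a distance from the previous edit rather than an absolute position, the numbers stay small, which is what makes the compact byte counts discussed below plausible.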
How much data? This one is tricky because we still don't have a great idea of exactly how many differences exist between people. I'm going to make some rough guesses using this paper, which compares the genome of Craig Venter to the sequence generated by the Human Genome Project, and assuming that this paper under-represents the total number of variable sites by around 30% (due to missed heterozygotes and poor coverage of repetitive areas). For a diploid genome (i.e. one containing two copies of each chromosome, one from each parent) this gives an estimate of around 6 million single base polymorphisms, about 90,000 polymorphisms changing multiple bases, and about 1.8 million insertions, deletions and inversions.
Now, the instructions for each of the single base polymorphisms can be stored as 1.5 bytes each on average (enough space to store both the distance from the previous polymorphism, and a new base). Multiple-base polymorphisms will be perhaps 2 bytes each, allowing for the storage of a few additional changed bases. Deletions might be around 3 bytes to store the length of the deleted region. Insertions will be more complicated: if they simply duplicate existing material they might only take up 3 or 4 bytes, but if they involve the insertion of brand new material they will be much larger (3 bytes, plus 1 byte for every 4 new bases inserted). From the Venter genome the average insertion size is 11.3 bases, so let's say insertions take up 7 bytes on average.
Making some more assumptions and tallying everything up I get a total data-set on the order of 20 megabytes. In other words, you could fit your genome and the sequences of about 34 of your friends onto a single CD.
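One way to reproduce a total of this size is sketched below. The variant counts and per-variant byte costs are the ones given above, but the even split of the 1.8 million structural changes between insertions and deletions is purely a guess of mine for illustration (inversions are lumped in with them), so treat the result as an order-of-magnitude check rather than the actual calculation behind the 20 megabyte figure.

```python
# Rough tally of the "differences only" genome, using the estimates above.
SNP_COUNT = 6_000_000
SNP_BYTES = 1.5
MULTI_BASE_COUNT = 90_000
MULTI_BASE_BYTES = 2
INDEL_COUNT = 1_800_000   # insertions, deletions and inversions together
INSERTION_BYTES = 7
DELETION_BYTES = 3

# Assumption (not from the post): half of the indels are insertions,
# the other half deletions; inversions are not costed separately.
insertions = INDEL_COUNT // 2
deletions = INDEL_COUNT - insertions

total_bytes = (
    SNP_COUNT * SNP_BYTES
    + MULTI_BASE_COUNT * MULTI_BASE_BYTES
    + insertions * INSERTION_BYTES
    + deletions * DELETION_BYTES
)
print(f"{total_bytes / 1e6:.2f} MB")

# Genomes per 700 MB CD, using the rounded 20 MB figure.
genomes_per_cd = 700 // 20
print(genomes_per_cd)  # you plus 34 friends
```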
Why store your genome like this? If you have a fetish for efficiency, or if you have a whole lot of genomes you need to store, this is the system to use. Of course, it relies on having access to a universally accessible reference sequence of high quality - and you would probably want to recalculate it whenever a new and better reference became available.
Squeezing your genome even further
Want to get even more genomes per gigabyte? Here's one efficiency measure you might want to consider: use databases of genetic variation. These might be especially useful for large, common insertions (rather than storing the entire sequence of the insertion, you can simply have a pointer to a database entry that stores this sequence).
Acknowledgments: the raw numbers and calculations in this post owe a lot to David Carter, Tom Skelly and James Bonfield from the Sanger Institute and Zamin Iqbal from the European Bioinformatics Institute, UK. Thanks guys!







