You may have heard stories about well-known people who have released their genomes for public use. I would like to convince you that you no longer need a lot of money, or to be a public figure, in order to do that. Companies like 23andMe and Navigenics let you get your genome tested for a modest price and deliver the results via a password-protected website. The problem is that our current understanding of what these results mean is rather limited. Open collaboration platforms for citizen science using genomic data may therefore be a step forward in helping people understand their genetic testing results. Initiatives like DIYgenomics are already working on this concept.
You may wonder why releasing one’s genome is useful. The answer is that, in practical terms, it is not. However, I consider the concept of being able to do so a very interesting one. After all, one’s genome data on its own is hardly informative, but when compared with information like known genes, pathways or even other people’s genomes, it becomes much more interesting and opens up the possibility of real discoveries.
With this post I hope to show that genomes can now be put on the web in a standard format like the Distributed Annotation System (DAS), where people can share them and integrate them with other public data sources that map to genome coordinates. DAS is an environment that is open source, decentralized and unregulated. So what is different here from what is being done already? Why is this significant? I can think of at least three reasons. 1) Flexibility: pretty much any genome annotation can be put up; 2) Integration capabilities: anything can be combined with anything else as long as both share the same coordinate system; and 3) Data outsourcing: data is stored and maintained elsewhere by the DAS source owners. Here is my story:
Last year I decided to get a 23andMe kit to have my genome analyzed. After the results were delivered, I downloaded the data in raw format, which consists of more than 0.5 million SNPs (single nucleotide polymorphisms) mapped to the NCBI36 genome assembly.
I wanted to experiment with this data from a bioinformatics point of view, so I decided to put my “genome” on the web for public access. Well, almost. I did not put up my real genome; I created a randomly shuffled version of it (i.e. one that bears no recognizable trace of the real data). I put up this unreal data to make a point of principle.
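The post does not say how the shuffling was done, so the following is only a minimal sketch of one possible approach, not the method actually used here. It assumes the raw export is tab-separated text with four columns (SNP id, chromosome, position, genotype) and comment lines starting with `#`. The sample rows and the function names `parse_raw` and `shuffle_genotypes` are hypothetical. SNP ids and coordinates stay in place, while the genotype calls are permuted across SNPs.

```python
import random

# Hypothetical sample rows in an assumed 23andMe-style raw layout:
# rsid <TAB> chromosome <TAB> position <TAB> genotype
SAMPLE_RAW = (
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs0000001\t1\t1200\tAG\n"
    "rs0000002\t1\t1750\tCC\n"
    "rs0000003\t2\t3400\tTT\n"
    "rs0000004\tX\t5100\tGG\n"
)


def parse_raw(text):
    """Parse raw text into (rsid, chromosome, position, genotype) tuples."""
    records = []
    for line in text.splitlines():
        # Skip blank lines and comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        rsid, chrom, pos, genotype = line.split("\t")
        records.append((rsid, chrom, int(pos), genotype))
    return records


def shuffle_genotypes(records, seed=None):
    """Return new records with genotypes randomly permuted across SNPs.

    Ids and coordinates are left untouched, so the result still maps
    cleanly onto the genome assembly but no longer reflects the donor.
    """
    rng = random.Random(seed)
    genotypes = [genotype for _, _, _, genotype in records]
    rng.shuffle(genotypes)
    return [
        (rsid, chrom, pos, new_genotype)
        for (rsid, chrom, pos, _), new_genotype in zip(records, genotypes)
    ]


if __name__ == "__main__":
    for row in shuffle_genotypes(parse_raw(SAMPLE_RAW), seed=1):
        print(*row, sep="\t")
```

A permutation like this keeps the overall genotype frequencies of the file, which may or may not be acceptable depending on how much anonymity is required.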
Anyone in the world can thus access my randomly shuffled genome using a URL like this:
In the above URL, the token “segment=” is followed by a chromosome name (1-22, X or Y), then a colon, then the start and end positions separated by a comma. Try the URL with different chromosome names and coordinates and see what results you get!
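To make the segment syntax concrete, here is a small sketch that assembles such a query string. The base address `http://example.org/das/shuffled_genome` is a placeholder rather than the real server, and the `/features?` path is assumed from the usual DAS convention for feature requests. The helper name `das_features_url` is hypothetical.

```python
# Placeholder base address of a DAS source; NOT the real server for this post.
BASE_URL = "http://example.org/das/shuffled_genome"

# Chromosome names accepted by the segment parameter: 1-22, X and Y.
VALID_CHROMOSOMES = {str(n) for n in range(1, 23)} | {"X", "Y"}


def das_features_url(base_url, chromosome, start, end):
    """Build a features request of the form ...?segment=CHR:START,END."""
    chromosome = str(chromosome)
    if chromosome not in VALID_CHROMOSOMES:
        raise ValueError(f"unknown chromosome: {chromosome!r}")
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    # Colon and comma are written literally, as in the segment syntax above.
    return f"{base_url}/features?segment={chromosome}:{start},{end}"


if __name__ == "__main__":
    print(das_features_url(BASE_URL, 7, 1000000, 2000000))
```

Validating the chromosome name and the coordinate order up front avoids sending requests that the server would have to reject anyway.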
In the above figure you see several columns, showing the SNP ID, the start and end positions, the genotype under the “Notes” heading and a link to the SNP’s corresponding entry in dbSNP.
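The same columns can be pulled out of the response programmatically. The sketch below assumes the server answers with a DASGFF-style XML document in which each `FEATURE` carries `START`, `END`, `NOTE` and `LINK` children. The embedded sample document, its ids and its link addresses are made up for illustration and are not real output from this source.

```python
import xml.etree.ElementTree as ET

# Made-up response in an assumed DASGFF-style layout (illustration only).
SAMPLE_DASGFF = """<DASGFF>
  <GFF version="1.0" href="http://example.org/das/demo/features">
    <SEGMENT id="1" start="1000" stop="2000">
      <FEATURE id="rs0000001" label="rs0000001">
        <TYPE id="SNP">SNP</TYPE>
        <START>1200</START>
        <END>1200</END>
        <NOTE>AG</NOTE>
        <LINK href="http://example.org/snp/rs0000001">dbSNP</LINK>
      </FEATURE>
      <FEATURE id="rs0000002" label="rs0000002">
        <TYPE id="SNP">SNP</TYPE>
        <START>1750</START>
        <END>1750</END>
        <NOTE>CC</NOTE>
        <LINK href="http://example.org/snp/rs0000002">dbSNP</LINK>
      </FEATURE>
    </SEGMENT>
  </GFF>
</DASGFF>"""


def parse_features(xml_text):
    """Return one dict per FEATURE: id, coordinates, genotype note and link."""
    root = ET.fromstring(xml_text)
    rows = []
    for segment in root.iter("SEGMENT"):
        for feature in segment.iter("FEATURE"):
            link = feature.find("LINK")
            rows.append({
                "chromosome": segment.get("id"),
                "snp_id": feature.get("id"),
                "start": int(feature.findtext("START")),
                "end": int(feature.findtext("END")),
                "genotype": feature.findtext("NOTE"),
                # LINK is optional, so fall back to None when it is absent.
                "link": link.get("href") if link is not None else None,
            })
    return rows


if __name__ == "__main__":
    for row in parse_features(SAMPLE_DASGFF):
        print(row["snp_id"], row["start"], row["end"], row["genotype"], row["link"])
```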
Now that this genome is in a standard format, it can easily be integrated with any other publicly available data in DAS. As of this writing (26th August 2010) there are 139 data sources available in the DAS registry mapped to human genome coordinates. I may not be interested in them all, but this is certainly one of the greatest repositories of genomic data gathered in one place. Leading providers of publicly available genomic DAS sources include Ensembl, the Database of Genomic Variants and ENCODE. The possible combinations of this data provide a range of possibilities for interrogating biological hypotheses that is probably unparalleled.
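Integration of this kind comes down to matching features by shared coordinates. As a rough sketch, assuming SNPs and annotations from two sources have already been loaded into simple dictionaries, an overlap join could look like the code below. The gene labels, positions and the function name `overlapping_features` are invented for the example.

```python
# Invented example data: SNPs from one source, annotations from another,
# both expressed on the same coordinate system.
SNPS = [
    {"snp_id": "rs0000001", "chromosome": "1", "start": 1200, "end": 1200},
    {"snp_id": "rs0000002", "chromosome": "1", "start": 1750, "end": 1750},
    {"snp_id": "rs0000003", "chromosome": "2", "start": 3400, "end": 3400},
]
ANNOTATIONS = [
    {"label": "GENE_A", "chromosome": "1", "start": 1000, "end": 1500},
    {"label": "GENE_B", "chromosome": "2", "start": 3000, "end": 4000},
]


def overlapping_features(snps, annotations):
    """Pair each SNP with every annotation whose interval overlaps it."""
    hits = []
    for snp in snps:
        for ann in annotations:
            same_chromosome = snp["chromosome"] == ann["chromosome"]
            # Two closed intervals overlap when each starts before the other ends.
            overlaps = ann["start"] <= snp["end"] and snp["start"] <= ann["end"]
            if same_chromosome and overlaps:
                hits.append((snp["snp_id"], ann["label"]))
    return hits


if __name__ == "__main__":
    for snp_id, label in overlapping_features(SNPS, ANNOTATIONS):
        print(f"{snp_id} falls within {label}")
```

The nested loop is fine for a handful of features; for a full genome an interval index or sorted sweep would be a more sensible choice.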
Now this shuffled genome is available for public use via a DAS web service. It will probably not be the last one to be put up and soon real 23andMe genomes will follow.