Much like Bar Refaeli and Leonardo DiCaprio, DNA sequencing and cloud computing go hand in hand.
I had a very interesting conversation with a friend yesterday about DNA sequencing and cloud computing.
My friend is leading one of the largest cancer genome research projects in the world (and yes, he is extremely bright).
It appears that there has been great progress in DNA sequencing technology, which is based on chemical processes. The pace is much faster than Moore’s law. As a result, budgets are shifting from the chemistry side to the computational side.
In the past, the budget would be 90% for biology and 10% for analyzing the data coming out of the DNA.
As sequencing costs have fallen by orders of magnitude, there is more and more data (a single patient’s genome data is one terabyte).
The more data there is, the more computing power is needed to analyze it, and hence the budget split becomes 50-50.
Each computation can take up to 24 hours, running on a 100-core mini grid.
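To put a rough number on that, here is a small back-of-the-envelope sketch. It only multiplies the two figures quoted above; the function name and the batch of 50 runs are my own illustration, not numbers from my friend's project.

```python
# Worst-case figures quoted above for a single computation.
HOURS_PER_RUN = 24    # "up to 24 hours"
CORES_PER_RUN = 100   # size of the mini grid


def core_hours(runs: int = 1) -> int:
    """Worst-case core-hours needed for `runs` computations of this size."""
    return runs * HOURS_PER_RUN * CORES_PER_RUN


if __name__ == "__main__":
    print(core_hours())     # a single computation
    print(core_hours(50))   # hypothetical batch of 50 computations
```

Core-hours are a handy unit here because they are roughly what an IaaS provider bills for, regardless of how the work is spread over machines.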
In theory, such tasks are a great fit for cloud computing IaaS (Infrastructure as a Service) platforms or even PaaS (Platform as a Service) solutions with MapReduce capabilities. This EC2 Bioinformatics post provides interesting examples.
In practice there are three main challenges:
- Since cancer research facilities need this server power every day, it is cheaper for them to build the solutions internally.
- To make things even more challenging, the highest cost in most clouds is the bandwidth in and out of the cloud. It would cost $150 to store one patient’s data on Amazon S3, but $100-$170 to transfer it into S3.
- Even if the cost gap can be mitigated, there can be regulatory problems with the privacy of patients’ data. After all, it is one person’s entire DNA we are talking about. Encryption would probably be too expensive, but splitting and randomizing the data can probably solve this hurdle.
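The bandwidth point in the second bullet is easy to check with a little arithmetic. The sketch below uses only the per-patient dollar figures quoted there; the function names are hypothetical, and it ignores everything else a real cloud bill would include (compute, requests, how long the data is kept).

```python
from typing import Tuple

# Per-patient figures quoted in the list above (US dollars).
STORAGE_COST = 150
TRANSFER_COST_LOW = 100
TRANSFER_COST_HIGH = 170


def upload_and_store_cost(patients: int) -> Tuple[int, int]:
    """Return the (low, high) total cost of transferring and storing `patients` genomes."""
    low = patients * (STORAGE_COST + TRANSFER_COST_LOW)
    high = patients * (STORAGE_COST + TRANSFER_COST_HIGH)
    return low, high


def transfer_share(transfer_cost: float) -> float:
    """Fraction of the per-patient bill that goes to bandwidth rather than storage."""
    return transfer_cost / (transfer_cost + STORAGE_COST)


if __name__ == "__main__":
    print(upload_and_store_cost(1))
    print(transfer_share(TRANSFER_COST_LOW), transfer_share(TRANSFER_COST_HIGH))
```

With these figures, getting the data into the cloud accounts for somewhere between 40% and a bit over half of the combined storage-plus-transfer bill, which is why bandwidth dominates the discussion.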
So, where do clouds make the most sense for this kind of biological research?
One use case is testing a new, improved algorithm. The researchers then want to run the algorithm on all the existing data, not just the new data.
They need to compare the results of the new algorithm with those of the old algorithms on the same data set. They also need to finish the paper in time for the submission deadline.
In such scenarios there is a huge burst of computation, needed on static data, within a very short period of time. Moreover, if the data can be stored on a shared cloud and used by researchers from across the world, then data transport would not be so expensive in the overall calculation.
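A minimal sketch of why this burst case favors the cloud, assuming the worst-case 24 hours on 100 cores per computation mentioned earlier and perfectly parallel work (which is optimistic). The 200 data sets, the 72-hour deadline and the 10 sharing groups are made-up inputs for illustration only.

```python
import math

# Worst case quoted earlier: 24 hours on a 100-core grid per computation.
CORE_HOURS_PER_RUN = 24 * 100


def cores_needed(datasets: int, deadline_hours: float) -> int:
    """Cores required to re-run every data set before the deadline,
    assuming the work parallelizes perfectly (an optimistic assumption)."""
    total_core_hours = datasets * CORE_HOURS_PER_RUN
    return math.ceil(total_core_hours / deadline_hours)


def transfer_cost_per_group(transfer_cost: float, groups: int) -> float:
    """One-time upload cost split evenly across the groups sharing the data."""
    return transfer_cost / groups


if __name__ == "__main__":
    # Hypothetical burst: 200 data sets, 72 hours until the deadline.
    print(cores_needed(200, 72))
    # Hypothetical sharing: a $170 upload used by 10 research groups.
    print(transfer_cost_per_group(170, 10))
```

The first number is far beyond a 100-core in-house grid, yet it is only needed for a few days. The second shows how the upload cost shrinks once the same copy of the data serves many groups.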
These ideas are fascinating and will hopefully drive new solutions, cures and treatments for cancer.