Cloud Computing and the DNA Data Race
In the race between DNA sequencing throughput and computer speed, sequencing is winning by a mile. Sequencing throughput is currently around 200 to 300 billion bases per run on a single sequencing machine, and is improving at a rate of about fivefold per year. In comparison, computer performance generally follows 'Moore's Law', doubling only every 18 to 24 months. As the gap in performance widens, the question of how to design higher-throughput analysis pipelines becomes crucial. One option is to enhance and refine the algorithms to make better use of a fixed amount of computing power. Unfortunately, algorithmic breakthroughs of this kind, like scientific breakthroughs, are difficult to plan or foresee. The most practical option is to develop methods that make better use of multiple computers and processors in parallel. This presentation will describe some of my recent work using the distributed programming environment Hadoop/MapReduce in conjunction with cloud computing to dramatically accelerate several important computations in genomics, including short read mapping and genotyping, sequencing error correction, and de novo assembly of large genomes.
Michael Schatz is a computational biologist and assistant professor in the Simons Center for Quantitative Biology at Cold Spring Harbor Laboratory. He brings 8 years of experience developing free and open source informatics tools for comparative genomics and genome assembly, which have been applied to genomes across the tree of life, including viruses and microbes, human parasites, plants, fungi, insects, birds, and mammals. In recent years, Dr. Schatz has pioneered the use of the large-scale data analysis system Hadoop/MapReduce for DNA sequence analysis, publishing the widely recognized algorithms CloudBurst and Crossbow, the first of their kind for parallel short read mapping and SNP genotyping.
Current projects include identifying de novo mutations associated with autism, and structural variations in esophageal cancer.
Michael Schatz is a new member of the faculty at Cold Spring Harbor Laboratory in the Center for Quantitative Biology. His research interests include high performance computing and parallel algorithm design for problems in computational biology and genomics. He received his Ph.D. and M.S. in Computer Science from the University of Maryland in 2010 and 2008, respectively, and his B.S. in Computer Science from Carnegie Mellon University in 2000. More information about Michael's research and publications is available at http://www.cbcb.umd.edu/~mschatz.