Amazon Web Services and the US National Institutes of Health Announce That the Largest Catalog of Human Genetic Variation Is Now Available in the Cloud

Amazon Web Services (AWS) and the United States National Institutes of Health (NIH) announced at the White House Big Data Summit that the complete 1000 Genomes Project is now available on AWS as a public data set. Today’s announcement makes the largest collection of human genetic data available to researchers worldwide, free of charge. The 1000 Genomes Project is an international research effort coordinated by a consortium of 75 companies and organizations to establish the most detailed catalog of human genetic variation. The project has grown to 200 terabytes of genomic data, including DNA sequenced from more than 1,700 individuals, which researchers can now access on AWS for use in disease research. The project aims to include the genomes of more than 2,600 individuals from 26 populations around the world, and the NIH will continue to add the remaining genome samples to the public data set this year. To access the 1000 Genomes Project data, visit http://aws.amazon.com/1000genomes.

The National Institutes of Health is part of the U.S. Department of Health and Human Services, and serves as one of the data coordinators for the 1000 Genomes Project. “Previously, researchers wanting access to public data sets such as the 1000 Genomes Project had to download them from government data centers to their own systems, or have the data physically shipped to them on discs. This process took a long time, and that’s assuming a lab had the bandwidth to download the data and sufficient storage and compute infrastructure to hold and analyze the data once they had it,” said Lisa D. Brooks, Ph.D., Program Director for the Genetic Variation Program, National Human Genome Research Institute, a part of NIH. “We are happy that the 1000 Genomes Project data are on AWS to give researchers anywhere in the world a simple way to access the data so they can put the data to work in their research.”

“Putting the data in the AWS cloud provides a tremendous opportunity for researchers around the world who want to study large-scale human genetic variation but lack the computer capability to do so,” said Richard Durbin, Ph.D., co-director of the 1000 Genomes Project and joint head of human genetics at the Wellcome Trust Sanger Institute, Hinxton, England.

Public Data Sets on AWS provide a centralized repository of public data stored in Amazon Simple Storage Service (Amazon S3) and Amazon Elastic Block Store (Amazon EBS). The data can be accessed directly from AWS services such as Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Elastic MapReduce (Amazon EMR), so organizations no longer need to move the data in-house and then procure enough technology infrastructure to analyze it effectively. AWS’s highly scalable compute resources power big data and high-performance computing applications such as those found in science and research. NASA’s Jet Propulsion Laboratory, Langone Medical Center at New York University, Unilever, Numerate, Sage Bionetworks and Ion Flux are among the organizations using AWS for scientific discovery and research. AWS stores the public data sets at no charge to the community. Researchers pay only for the additional AWS resources they need for further processing or analysis of the data.
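As an illustration of what direct access to a public data set in Amazon S3 could look like, here is a minimal Python sketch that lists a few objects in a bucket over anonymous HTTPS. It is not taken from the announcement. It assumes the bucket permits anonymous listing, and the helper names (`listing_url`, `parse_keys`, `list_public_objects`) and the bucket name in the usage comment are hypothetical choices for the example.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used by Amazon S3 list responses.
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"


def listing_url(bucket, prefix="", max_keys=10):
    """Build an anonymous S3 'list objects' (version 2) request URL."""
    query = urllib.parse.urlencode(
        {"list-type": "2", "prefix": prefix, "max-keys": str(max_keys)}
    )
    return f"https://{bucket}.s3.amazonaws.com/?{query}"


def parse_keys(xml_data):
    """Extract (key, size in bytes) pairs from an S3 list response."""
    root = ET.fromstring(xml_data)
    return [
        (item.findtext(f"{S3_NS}Key"), int(item.findtext(f"{S3_NS}Size")))
        for item in root.iter(f"{S3_NS}Contents")
    ]


def list_public_objects(bucket, prefix="", max_keys=10):
    """Fetch and parse one page of a bucket listing.

    Only works if the bucket allows unauthenticated listing (an assumption).
    """
    with urllib.request.urlopen(listing_url(bucket, prefix, max_keys)) as resp:
        return parse_keys(resp.read())


# Usage (requires network access; the bucket name is an assumption):
# for key, size in list_public_objects("1000genomes"):
#     print(key, size)
```

Running code like this on an Amazon EC2 instance keeps the analysis next to the data, so nothing has to be downloaded to a local data center first.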

“It took more than 10 years, and billions of dollars to sequence and publish the very first human genome. Recent advances in genome sequencing technology have enabled researchers to tackle projects like the 1000 Genomes by collecting far more data, faster. This has created a growing need for powerful and instantly available technology infrastructure to analyze that data,” said Deepak Singh, Ph.D., Principal Product Manager, Amazon Web Services. “We’re excited to help scientists gain access to this important data set by making it available to anyone with access to the Internet. This means researchers and labs of all sizes and budgets have access to the complete 1000 Genomes Project data and can immediately start analyzing and crunching the data without the investment it would normally require in hardware, facilities and personnel. Researchers can focus on advancing science, not provisioning the resources required for their research.”