BIG DATA
Cornell Expands Database for Plant Proteomics
Cornell University announced the expansion of the Plant Proteome Database (PPDB) to include the whole plant proteome. “Most cellular functions are carried out by proteins,” said Klaas J. van Wijk, Cornell Associate Professor, Plant Biology. “Therefore, knowing the complete set of expressed proteins, their subcellular localization, and their interactions is essential.”
Proteomics—the systematic analysis of large sets of proteins—relies on mass spectrometers, combined with the availability of sequenced genomes and modern bioinformatics tools. Initiated as a joint project of the Klaas J. van Wijk Lab and Qi Sun of the Cornell Computational Biology Service Unit (CBSU), the Plant Proteome Database provides scientists with an integrated resource for experimentally identified proteins in the key species of maize and Arabidopsis. The PPDB bioinformatics infrastructure was generated using the resources of the Cornell Center for Advanced Computing (CAC).
“Internal BLAST alignments link maize and Arabidopsis information,” explained Sun. “Experimental identification is based on in-house mass spectrometry of cell type-specific maize proteomes, or specific subcellular proteomes such as chloroplasts, thylakoids, nucleoids, and total leaf proteome samples of maize and Arabidopsis.” So far, more than 5000 accessions have been identified in both maize and Arabidopsis. In addition, more than 80 published Arabidopsis proteome data sets from subcellular compartments or organs are stored in the PPDB and linked to each locus. More than 1500 Arabidopsis proteins have been manually assigned a subcellular location on the basis of mass spectrometry-derived information and the literature, with a strong emphasis on plastid proteins.
New features added to the Plant Proteome Database include searchable posttranslational modifications and searchable experimental proteotypic peptides and spectral count information for each identified accession based on in-house experiments. Various search methods are provided to extract more than 40 data types for each accession and to extract accessions for different functional categories or curated subcellular localizations. Protein report pages for each accession provide comprehensive overviews, including predicted protein properties, with hyperlinks to the most relevant databases.
The PPDB is continuously updated with new in-house experiments, as well as external data sets. Cornell is working closely with other national research community databases, such as TAIR (The Arabidopsis Information Resource) and Gramene (the open-source data resource for comparative grass genomics), and with colleagues around the world, to distribute PPDB data and provide efficient links.
Researchers may access PPDB content through its Web interface. A paper published in the January 2009 database issue of Nucleic Acids Research describes the database software and tools.
The Plant Proteome Database was funded by the National Science Foundation, U.S. Department of Energy, New York State Foundation for Science, Technology and Innovation (NYSTAR), Cornell University, and supporting foundations and corporate partners.
The Computational Biology Service Unit is part of the Cornell University Life Sciences Core Laboratories Center (CLC), which provides genomics, proteomics, imaging, IT and informatics shared research resources and services to Cornell and to investigators at other academic institutions and commercial enterprises. The Cornell Center for Advanced Computing (CAC) is a leader in high-performance computing system, application, and data solutions that enable research success.