
Genomes to Life Contractor-Grantee Workshop III
February 6-9, 2005, Washington, D.C.

Genomics:GTL Program Projects

Sandia National Laboratories

14

Integrating Heterogeneous Databases and Tools for High Throughput Microbial Analysis

Nagiza Samatova* (samatovan@ornl.gov), Al Geist, Praveen Chandramohan, and Ramya Krishnamurthy

Oak Ridge National Laboratory, Oak Ridge, TN

We will describe a knowledge infrastructure that goes beyond simple archiving and retrieval of diverse data sets and provides capabilities not previously available. As part of the Genomics:GTL project Synechococcus: From Molecular Machines to Hierarchical Modeling, we have developed the technology needed to construct an integrated data infrastructure that supports advanced searches and queries across a large, diverse set of data sources, including sequence databases (COG, INTERPRO, SWISS-PROT, TIGR, JGI, PFAM, PRODOM, SMART), structure databases (PDB, COILS, SOSUI, PROSPECT), pathway databases (KEGG), protein interaction databases (BIND, DIP, MIPS), and databases of raw mass spectrometry and microarray data. We developed both a query language and integrated schema technology to enable searches and queries across these diverse databases.

We used our integrated data infrastructure to create a Synechococcus Encyclopedia (see Figure 1) that gathers the available database knowledge about this microbe. This knowledgebase integrates 23 different databases and is being used for protein complex and pathway prediction. The technology can be used to create knowledgebases for other organisms, and we have begun discussions with other GTL projects about setting up encyclopedias for their microbes.

The encyclopedia contains not only data but also tools to analyze them. This past year we have added a suite of easy-to-use web-based analysis tools to the encyclopedia. These tools, which are being developed within our GTL project, include protein function characterization, protein structure prediction, comparative analysis of protein-protein interfaces, metadata entry and browsing, pathway prediction, and electronic notebooks. Several of these tools provide transparent access to supercomputers at ORNL and around the nation. We will describe how the encyclopedia data and analysis tools were used to correctly predict the proteins making up a known membrane complex, including the membrane proteins involved. This task is presently impossible by experimental mass spectrometry analysis alone.

Biologists can now rapidly construct advanced queries that correlate and combine data from sequence annotation, protein structure, and interaction databases, and use the results in co-located analysis tools. This allows them to combine knowledge and see relationships that were previously obscured by the distributed nature and diverse data types of the biological databases.
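
As a rough illustration of the kind of query described above, the following Python sketch joins annotation, structure, and interaction records in a single statement. It is not the project's query language or schema: the table names, column names, protein identifiers, and the `membrane_partners` helper are all hypothetical, and an in-memory SQLite database stands in for the integrated data sources.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with three toy tables standing in for
    separate sources. All names and rows here are hypothetical."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE annotation (protein_id TEXT PRIMARY KEY, family TEXT);
        CREATE TABLE structure (protein_id TEXT, is_membrane INTEGER);
        CREATE TABLE interaction (protein_a TEXT, protein_b TEXT);
    """)
    conn.executemany(
        "INSERT INTO annotation VALUES (?, ?)",
        [("P1", "transporter"), ("P2", "transporter"), ("P3", "kinase")],
    )
    conn.executemany(
        "INSERT INTO structure VALUES (?, ?)",
        [("P1", 1), ("P2", 0), ("P3", 1)],
    )
    conn.executemany(
        "INSERT INTO interaction VALUES (?, ?)",
        [("P1", "P2"), ("P2", "P3")],
    )
    return conn


def membrane_partners(conn, family):
    """Return (protein, partner) pairs where the protein belongs to the
    given family, is flagged as a membrane protein, and has a recorded
    interaction partner."""
    rows = conn.execute(
        """
        SELECT a.protein_id, i.protein_b
        FROM annotation AS a
        JOIN structure AS s ON s.protein_id = a.protein_id
        JOIN interaction AS i ON i.protein_a = a.protein_id
        WHERE a.family = ? AND s.is_membrane = 1
        ORDER BY a.protein_id, i.protein_b
        """,
        (family,),
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    print(membrane_partners(conn, "transporter"))
```

The point of the sketch is only that, once the sources share a common identifier in one schema, a question spanning three kinds of data becomes a single query rather than three separate lookups.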

The presentation will include “live” demonstrations of advanced queries of the Synechococcus Encyclopedia.

Acknowledgement: This project is supported by the U.S. Department of Energy’s Genomics:GTL Program under the project “Carbon Sequestration in Synechococcus sp.: From Molecular Machines to Hierarchical Modeling” (http://www.genomes-to-life.org).

Figure 1. Synechococcus sp. Encyclopedia. Advanced query and analysis interface. Search all Synechococcus databases. Browse experimental and analysis data. Download datasets.

* Presenting author