Genomes to Life Contractor-Grantee Workshop III
February 6-9, 2005, Washington, D.C.

Genomics:GTL Program Projects

Sandia National Laboratories

17

DEB: a Data Entry and Browsing Tool for Entering and Linking Synechococcus sp. WH8102 Whole Genome Microarray Metadata from Multiple Data Sources

Arie Shoshani1* (Shoshani@lbl.gov), Victor Havin1, Vijaya Natarajan1, Tony Martino2, Jerilyn A. Timlin2, Katherine Kang3, Ian Paulsen3, Brian Palenik4, and Thomas Naughton5

1Lawrence Berkeley National Laboratory, Berkeley, CA; 2Sandia National Laboratories, Albuquerque, NM; 3The Institute for Genomic Research, Rockville, MD; 4Scripps Institution of Oceanography, La Jolla, CA; and 5Oak Ridge National Laboratory, Oak Ridge, TN

Generating and analyzing whole-genome microarray data for Synechococcus sp. WH8102 in the Sandia-led GTL project involves three collaborating institutions, each of which generates data files as well as metadata about its own operations. The Synechococcus sp. microbes are cultured at the Scripps Institution of Oceanography in San Diego; the sample pool is then sent to TIGR in Rockville, Maryland, for microarray hybridization, 2-color scanning, and analysis. The scanned files and slides are then sent to Sandia National Laboratories in Albuquerque, New Mexico, for analysis and additional scanning with a hyperspectral imaging instrument. Each institution has an independent system for tracking metadata about its part of the operation, and unfortunately these systems do not facilitate easy transfer of metadata between institutions. This situation is typical of many biology projects, and it calls for a solution.

In this sub-project we set out to develop a single system in which such metadata can be collected and linked in an orderly fashion. We developed a web-based Data Entry and Browsing (DEB) tool that captures metadata from experiments and laboratories and stores them in a database in a computer-searchable form. The key need is an easy-to-use, intuitive system that integrates the metadata of all the related activities in this project. The design of the DEB tool is based on input and insights from the biologists on the project, and as such it contains features that a biologist will find useful. The interface design mimics the familiar laboratory notebook format. The system is built on top of the Oracle database system. The main concept of the interface design is to expose the biologist to a single “object” and its attributes at a time, presenting objects as pages in a notebook that can be “turned” while providing links between the objects in a simple, intuitive fashion. An example of such a web-based screen is shown in the figure below.

The most powerful capability of the DEB system is that it is schema-driven: all the interfaces supporting the above features are generated automatically from the schema definition. New metadata schemas can therefore be used quickly to generate both the DEB interfaces and the underlying Oracle database for them, which makes the tool immediately applicable to new or changing databases. This allowed us to generate databases based on schemas designed by the biologists. Specifically, the scientists from the three sites have defined schemas for the Nucleotide Pool of microbes, for the Microarray Hybridization (based on the MIAME concepts), and for the Hyperspectral Imaging and analysis system. The design included the ability to link these schemas, allowing a researcher from any area of the project to extract metadata from the various parts of the experiment. For example, the Microarray Hybridization schema has a “probe_source” attribute that links (points) to the “nucleotide_pool_id” in the Nucleotide Pool schema.
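The schema-driven idea and the “probe_source” link described above can be illustrated with a minimal sketch. It is not the DEB implementation: it uses Python's built-in sqlite3 module as a stand-in for Oracle, and the schema description format, the “culture_site” and “hybridization_id” attributes, and the sample records are assumptions made only for illustration.

```python
import sqlite3

# Hypothetical schema description (not the DEB format):
# table name -> list of (column, SQL type, link), where a link is
# "other_table.column" or None.
SCHEMAS = {
    "nucleotide_pool": [
        ("nucleotide_pool_id", "TEXT", None),
        ("culture_site", "TEXT", None),  # assumed attribute
    ],
    "microarray_hybridization": [
        ("hybridization_id", "TEXT", None),  # assumed attribute
        ("probe_source", "TEXT", "nucleotide_pool.nucleotide_pool_id"),
    ],
}


def create_table_sql(table, columns):
    """Build a CREATE TABLE statement from one schema entry.

    This sketch assumes the first column is the primary key.
    """
    parts = []
    for i, (name, sql_type, link) in enumerate(columns):
        col = f"{name} {sql_type}"
        if i == 0:
            col += " PRIMARY KEY"
        if link:
            ref_table, ref_col = link.split(".")
            col += f" REFERENCES {ref_table}({ref_col})"
        parts.append(col)
    return f"CREATE TABLE {table} ({', '.join(parts)})"


def build_database(schemas):
    """Create an in-memory database with one table per schema entry."""
    conn = sqlite3.connect(":memory:")
    for table, columns in schemas.items():
        conn.execute(create_table_sql(table, columns))
    return conn


conn = build_database(SCHEMAS)

# Made-up sample records, for illustration only.
conn.execute("INSERT INTO nucleotide_pool VALUES ('NP-001', 'Scripps')")
conn.execute("INSERT INTO microarray_hybridization VALUES ('HYB-042', 'NP-001')")

# Follow the link from a hybridization record back to its nucleotide pool.
row = conn.execute(
    "SELECT h.hybridization_id, p.culture_site "
    "FROM microarray_hybridization AS h "
    "JOIN nucleotide_pool AS p ON h.probe_source = p.nucleotide_pool_id"
).fetchone()
print(row)
```

Adding a new schema in this sketch means adding an entry to the description; the table definition and its links follow from it without further code changes.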

Data entry to the databases is done in two different modes: 1) on-line web-based data entry, and 2) automated data uploading from another database source. The on-line mode is used by the people who culture the Nucleotide Pools (Scripps) and the people running the Hyperspectral Imaging (Sandia). The automated data uploading is used for the Microarray Hybridization metadata (TIGR), because TIGR has its own well-developed internal database system. The automated metadata loading is performed by dumping the metadata into simple formatted files (similar to spreadsheet output format) and using schema-driven tools to load the data into the common database.
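The automated loading mode described above can be sketched roughly as follows. The exact dump format used in the project is not specified here, so this example assumes a tab-delimited file with a header row; the schema dictionary, the “scan_count” column, and the sample rows are hypothetical, and sqlite3 again stands in for the Oracle database.

```python
import csv
import io
import sqlite3

# Hypothetical schema: column name -> converter applied to each text field.
HYBRIDIZATION_SCHEMA = {
    "hybridization_id": str,
    "probe_source": str,
    "scan_count": int,  # assumed attribute, for illustration only
}


def load_rows(conn, table, schema, text_file):
    """Load a tab-delimited dump into `table`, driven by `schema`.

    Returns the number of rows inserted. Raises ValueError if the file
    lacks a column that the schema requires.
    """
    reader = csv.DictReader(text_file, delimiter="\t")
    missing = set(schema) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"dump file lacks columns: {sorted(missing)}")

    columns = list(schema)
    placeholders = ", ".join("?" for _ in columns)
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"

    count = 0
    for record in reader:
        # Convert each text field using the type named in the schema.
        values = [schema[c](record[c]) for c in columns]
        conn.execute(sql, values)
        count += 1
    return count


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE microarray_hybridization "
    "(hybridization_id TEXT, probe_source TEXT, scan_count INTEGER)"
)

# Made-up dump contents standing in for a file exported by the source system.
dump = io.StringIO(
    "hybridization_id\tprobe_source\tscan_count\n"
    "HYB-042\tNP-001\t2\n"
    "HYB-043\tNP-001\t1\n"
)
loaded = load_rows(conn, "microarray_hybridization", HYBRIDIZATION_SCHEMA, dump)
total_scans = conn.execute(
    "SELECT SUM(scan_count) FROM microarray_hybridization"
).fetchone()[0]
print(loaded, total_scans)
```

Because the loader reads column names and types from the schema rather than hard-coding them, the same function could in principle serve any table whose schema is described this way.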

The main features of the DEB system are:

The importance of an easy-to-use system for capturing metadata in GTL should not be overlooked, especially as an ever-growing number of experiments are conducted and a large number of datasets are collected. The ability to generate metadata systems quickly and automatically from a schema description is essential for this evolving field, in which multiple sources of data are gathered independently. DEB is currently running on LBNL’s development server and on ORNL’s GTL project operational server. Although the system was designed for this project, its schema-driven architecture means that it can be applied to other GTL efforts.

Sandia is a multi-program laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy under contract DE-AC04-94AL85000.

* Presenting author