DOE Genomes

Table 1. GTL Data: Thousands of Times Greater than Genome Data
Experiment Templates for a Single Microbe

Class of       Time    Treatments  Conditions  Genetic   Biological   Total Biological  Proteomics Data  Metabolite  Transcription
Experiment     Points                          Variants  Replication  Samples           Volume (TB)      Data (TB)   Data (TB)

Simple         10      1           3           1         3            90                18.0             13.5        0.018
Moderate       25      3           5           1         3            1,125             225.0            168.8       0.225
Upper mid      50      3           5           5         3            11,250            2,250.0          1,687.5     2.25
Complex        20      5           5           20        3            30,000            6,000.0          4,500.0     6
Comprehensive  20      5           5           50        3            75,000            15,000.0         11,250.0    15

Profiling Methods

Proteomics: Looking at a possible 6000 proteins per microbe, assuming ~200 gigabytes per sample

Metabolites: Looking at a panel of 500 to 1000 different molecules, assuming ~150 gigabytes per sample

Transcription: 6000 genes and 2 arrays per sample, assuming ~100 megabytes per array (~200 megabytes per sample, consistent with the table values)
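The table values can be reproduced from the experiment design factors and the per-sample sizes listed under Profiling Methods. The following is a minimal sketch, not part of the original page: the names `TB_PER_SAMPLE`, `TEMPLATES`, `total_samples`, and `data_volumes_tb` are hypothetical, decimal units are assumed (1 TB = 1000 GB), and the transcription figure is read as 2 arrays at ~100 megabytes each, which is the reading that matches the table.

```python
from math import prod

# Per-sample data sizes in terabytes (decimal units assumed: 1 TB = 1000 GB).
TB_PER_SAMPLE = {
    "proteomics": 200 / 1_000,             # ~200 GB per sample
    "metabolites": 150 / 1_000,            # ~150 GB per sample
    "transcription": 2 * 100 / 1_000_000,  # assumption: 2 arrays at ~100 MB each
}

# Design factors per template, in table order:
# (time points, treatments, conditions, genetic variants, biological replicates)
TEMPLATES = {
    "Simple":        (10, 1, 3, 1, 3),
    "Moderate":      (25, 3, 5, 1, 3),
    "Upper mid":     (50, 3, 5, 5, 3),
    "Complex":       (20, 5, 5, 20, 3),
    "Comprehensive": (20, 5, 5, 50, 3),
}


def total_samples(factors):
    """Total biological samples is the product of all design factors."""
    return prod(factors)


def data_volumes_tb(factors):
    """Data volume in TB per profiling method for one experiment template."""
    n = total_samples(factors)
    return {method: n * tb for method, tb in TB_PER_SAMPLE.items()}


for name, factors in TEMPLATES.items():
    vols = data_volumes_tb(factors)
    print(
        f"{name:<14}{total_samples(factors):>8,}"
        f"{vols['proteomics']:>12,.1f}"
        f"{vols['metabolites']:>12,.1f}"
        f"{vols['transcription']:>10,.3f}"
    )
```

Because every factor multiplies the sample count, adding genetic variants (1 to 50 between the Simple and Comprehensive templates) scales all three data volumes by the same factor.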

Typically, a single significant scientific question takes the multidimensional analysis of at least 1000 biological samples.

This table shows how quickly GTL experiments will generate terabytes (10^12 bytes) of proteomic, metabolomic, and transcriptomic data. Global proteomics currently generates ~1.0 terabyte (TB) a day, with 5- to 10-fold increases expected per year. These data are not only massive in volume but also very complex, spanning many levels of scale and dimensionality.

For example, in a simple study of a microbial system under a single treatment (such as pH or toxin exposure), three different growth states may be studied, with ten samples taken over the growth of the culture. Three replicates of each of these samples will be run as part of quality-assurance protocols. This will result in a total of 90 analyses (3 growth states × 10 time points × 3 replicates) and the generation of about 18 TB of proteomics data, 13.5 TB of metabolomics data, and 0.018 TB of transcriptomics data. If, however, a more complete set of data is taken to achieve greater temporal fidelity and to better understand mechanistic response, the amount of data can grow rapidly, as the larger templates in the table show.

This growth in data output demonstrates one of the major data-management challenges of GTL. Strategies and technologies for data compression must be developed that avoid "data decimation"; this requires knowing all the information that must be extracted from raw data before any is discarded. Current proteomics efforts are employing preliminary technologies for near real-time data reduction.
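To give a feel for the stated growth in global proteomics output (~1.0 TB a day, increasing 5- to 10-fold per year), the short sketch below projects the daily rate a few years out. It assumes the fold increase compounds at a constant rate each year, which the text does not state; the function name `projected_daily_tb` is hypothetical.

```python
def projected_daily_tb(years, annual_factor, current_tb_per_day=1.0):
    """Daily output in TB after `years`, assuming constant compounding growth."""
    return current_tb_per_day * annual_factor ** years


for years in range(4):
    low = projected_daily_tb(years, 5)    # 5-fold increase per year
    high = projected_daily_tb(years, 10)  # 10-fold increase per year
    print(f"year {years}: {low:,.0f} to {high:,.0f} TB per day")
```

Under this assumption the projected range widens quickly, which is one reason data reduction close to the instrument matters.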