Genomes to Life Contractor-Grantee Workshop III
February 6-9, 2005, Washington, D.C.
Genomics:GTL Program Projects
Sandia National Laboratories
24
Deciphering Response Networks in Microbial Genomes through Data Mining and Computational Modeling
Z. Su3, P. Dam3, V. Olman3, F. Mao3, H. Wu3, X. Chen1, T. Jiang1, B. Palenik2, and Ying Xu3*(xyn@bmb.uga.edu)
1University of California, Riverside, CA; 2Scripps Institution of Oceanography, San Diego, CA; and 3University of Georgia, Athens, GA and Oak Ridge National Laboratory, Oak Ridge, TN
Deciphering the “wiring diagrams” of biological networks (including metabolic, signaling and regulatory networks) is a highly challenging problem, owing both to the lack of a general conceptual framework for how biomolecules work together as a system and to an insufficient amount of experimental data. The majority of ongoing computational research has focused on developing general methodologies for deriving “functionally equivalent” networks that are consistent with limited experimental data such as microarray kinetics data, possibly leading to network topologies that are not biologically meaningful.
We have been developing a computational framework that attempts to systematically derive the network topologies most consistent with (a) information derived through mining genomic sequences and various genomic and proteomic data and (b) the kinetics data derived from microarray gene expression experiments. The framework consists of the following three key components:
(1) Identification of genes involved in a particular biological process: To facilitate identification of genes possibly involved in a particular biological network, we first made genome-scale predictions of (a) gene functions, (b) operon structures and (c) cis regulatory elements. Gene function prediction is based on available genome annotation plus our own function prediction pipeline, which uses additional information including motif search and structure-based function prediction. Operons are predicted using our own program (see the section on operon/regulon predictions). Cis regulatory elements are predicted using our prediction program CUBIC, in conjunction with microarray data when available, through identification of conserved sequence motifs and similar gene expression patterns. Starting from an initial identification of genes possibly involved in a particular biological network, we then refine and expand the candidate gene list by comparing it with the information collected in (a), (b) and (c) above.
(2) Prediction of interaction relationships among these candidate genes: Currently we attempt to predict two types of interactions: (a) protein-protein interactions, both physical interactions and functional associations, and (b) protein-DNA interactions. Protein-protein interactions are predicted using homology search against protein interaction databases such as DIP and BIND, and also using prediction methods such as gene fusion/fission analysis and phylogenetic profile analysis. We have developed our own methods for predicting protein (transcription factor)-DNA interactions, based on both sequence and structural information. The sequence-based method relies mainly on (a) homology search against known protein-DNA complexes, (b) identification of self-regulation events, and (c) co-evolution information for transcription factors and the operons they regulate. When the 3D structures of transcription factors are available, our method can accurately predict the binding affinity between the structure and its predicted DNA binding motifs, providing a highly effective tool for protein-DNA interaction prediction. Another source of information on biomolecular interactions is the mapping of known pathways from related organisms to the target genomes. Though not all such pathway mappings will provide complete pathway models in the target genome, the molecular interactions in the predicted pathways are useful and can be used for piecing together the “complete” network model in (3).
(3) Prediction of wiring diagrams through computational optimization: We have developed two complementary methods for predicting “complete” wiring diagrams of a target network, based on the predicted gene candidates, their (partial) interaction relationships and additional information. The first method connects the partially connected pieces predicted in (2) by mapping them to the genome-scale protein-protein interaction map also predicted in (2). The idea is to find the biologically most meaningful “paths” to connect the unconnected pieces (made of protein-protein and protein-DNA interactions). An algorithm has been developed for accomplishing this. In addition, we are currently developing a new algorithm that connects all the interaction components in the way most consistent with the available microarray kinetics data, generalizing the currently popular methods. By doing so, we can obtain wiring diagrams that are consistent with both the molecular interaction information derived through data mining and the microarray kinetics data.
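The path-finding idea in the first method can be illustrated with a minimal sketch. The abstract does not state the objective actually optimized, so this example assumes that each predicted interaction carries a confidence score in (0, 1] and that the "most meaningful" path is approximated by the path with the highest product of confidences. The function name, the scoring scheme and the node names are all hypothetical.

```python
import heapq
import math


def most_confident_path(interactions, source, target):
    """Return (path, confidence) for the path whose product of edge
    confidences is largest, or (None, 0.0) if no path exists.

    interactions: iterable of (node_a, node_b, confidence), 0 < confidence <= 1.
    Assumption: interactions are treated as undirected.
    """
    # Maximizing a product of confidences is the same as minimizing
    # the sum of -log(confidence), so Dijkstra's algorithm applies.
    graph = {}
    for a, b, conf in interactions:
        cost = -math.log(conf)
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))

    best = {source: 0.0}   # lowest cost found so far for each node
    prev = {}              # predecessor on the best path
    heap = [(0.0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == target:
            break
        if cost > best.get(node, math.inf):
            continue  # stale heap entry
        for nbr, w in graph.get(node, []):
            new_cost = cost + w
            if new_cost < best.get(nbr, math.inf):
                best[nbr] = new_cost
                prev[nbr] = node
                heapq.heappush(heap, (new_cost, nbr))

    if target not in best:
        return None, 0.0
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return path, math.exp(-best[target])


# Hypothetical example: two routes from a sensor protein to a regulator.
example = [
    ("sensor", "kinase", 0.9),
    ("kinase", "regulator", 0.9),
    ("sensor", "regulator", 0.5),
]
path, confidence = most_confident_path(example, "sensor", "regulator")
print(path, round(confidence, 2))
```

In this toy input the two-step route through the kinase is preferred over the direct but low-confidence edge, which is the kind of trade-off a confidence-weighted path search makes.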
We now describe two key procedures needed to implement the above computational prediction protocol, because they are significant in their own right.
Prediction of operons and regulons: We have recently developed a computational capability for predicting operons in microbes using multiple sources of information, including (a) conserved gene neighborhoods across closely related organisms, (b) detected co-evolutionary information of genes, (c) functional relatedness of genes, and (d) intergenic distance information, plus various other types of information. The overall prediction accuracy has reached 80% based on our test results on known operons in E. coli. We have applied this prediction program, in conjunction with available microarray data, to a number of genomes including E. coli, Shewanella, Pyrococcus and Synechococcus. The prediction procedure can be outlined as follows. We run the program to produce the initial operon candidate list and then compare the predicted operons with available microarray data to check for consistency. The initial predictions are corrected if genes of the same operon exhibit significantly different expression patterns under any experimental condition, or if genes from neighboring operons have highly similar expression patterns under all known conditions and these operons are very close in the genomic sequence. In general, about 5-10% of the original predictions are corrected based on the microarray data. We expect that the prediction accuracy could approach or even exceed 90% when sufficient microarray data are available. Based on the predicted operon structures, we then predict regulons using available microarray data and genome-scale predictions of cis regulatory elements. The prediction procedure identifies operons that share similar expression patterns under the given experimental conditions and share conserved (predicted) binding motifs, and then clusters them into regulons. While this work is still at an early stage, we have identified a number of interesting regulons in the genomes to which we have applied the prediction programs.
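The microarray consistency check on predicted operons can be sketched as follows. This is an illustration only: the abstract does not say how "significantly different expression patterns" are measured, so the sketch assumes pairwise Pearson correlation across conditions with a hypothetical cutoff `min_corr`. The function names and the toy data are likewise assumptions.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a flat profile carries no co-expression signal
    return cov / (vx * vy) ** 0.5


def flag_inconsistent_operons(operons, expression, min_corr=0.6):
    """Return names of predicted operons whose genes disagree in expression.

    operons:    {operon_name: [gene_id, ...]} from the initial prediction.
    expression: {gene_id: [value per condition, ...]} from microarray data.
    min_corr:   hypothetical cutoff; an operon is flagged when any pair of
                its measured genes correlates below this value.
    """
    flagged = []
    for name, genes in operons.items():
        measured = [g for g in genes if g in expression]
        corrs = [
            pearson(expression[a], expression[b])
            for a, b in combinations(measured, 2)
        ]
        if corrs and min(corrs) < min_corr:
            flagged.append(name)
    return flagged


# Toy data: g1 and g2 rise together, g3 moves in the opposite direction.
expression = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],
    "g3": [4.0, 3.0, 2.0, 1.0],
}
operons = {"op1": ["g1", "g2"], "op2": ["g1", "g3"]}
print(flag_inconsistent_operons(operons, expression))
```

Flagged operons would then be candidates for manual or automated correction, for example by splitting them at the gene pair with the lowest correlation.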
Pathway mapping: We have recently developed a computational method and software, P-MAP, for mapping a known pathway/network from one microbial organism to another by combining homology information and genomic structure information. The basic idea is that in microbes, the genes working in the same pathway can generally be decomposed into a few operons or, in the case of complex pathways/networks, regulons. Such information has not been used effectively in pathway mapping. When mapping known pathways, we first predict all the operons in a genome using our operon prediction program. The predictions are then validated against microarray data, mainly by checking for consistency between the expression patterns of genes predicted to be in the same operon or in adjacent operons. Our evaluation indicates that the prediction accuracy is close to 90%. With this information, we then map the genes in a pathway template to the target genome in a way that simultaneously gives relatively high sequence similarity between predicted orthologous gene pairs and groups all the mapped genes into a small number of operons, preferably operons that are co-regulated according to the predicted cis regulatory elements and available microarray data. We have applied the P-MAP program to map known biological pathways in KEGG and MetaCyc to the cyanobacterial genomes and are currently mapping them to the Shewanella oneidensis MR-1 genome. Some of the mapping results can be found at http://csbl.bmg.uga.edu/WH8102.
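The trade-off between sequence similarity and operon grouping can be illustrated with a small sketch. This is not the P-MAP algorithm, whose details are not given here. It assumes a simple score, total similarity minus a hypothetical penalty per distinct operon used, and searches all assignments exhaustively, which is practical only for small templates. All names and numbers are made up for illustration.

```python
from itertools import product


def map_pathway(candidates, operon_of, operon_penalty=0.5):
    """Pick one target gene per template gene, favoring high similarity
    and few distinct operons.

    candidates:     {template_gene: [(target_gene, similarity), ...]}
    operon_of:      {target_gene: operon_name} from operon prediction.
    operon_penalty: hypothetical cost charged per distinct operon used.
    Returns (mapping, score); mapping is None if no valid assignment exists.
    """
    template_genes = list(candidates)
    best_score, best_mapping = float("-inf"), None
    # Exhaustive search over every combination of candidate orthologs.
    for combo in product(*(candidates[t] for t in template_genes)):
        targets = [gene for gene, _ in combo]
        if len(set(targets)) < len(targets):
            continue  # each target gene may be assigned only once
        similarity = sum(score for _, score in combo)
        n_operons = len({operon_of[g] for g in targets})
        score = similarity - operon_penalty * n_operons
        if score > best_score:
            best_score = score
            best_mapping = dict(zip(template_genes, targets))
    return best_mapping, best_score


# Toy example: t5 is the slightly better sequence match for pA, but t1
# sits in the same predicted operon as the only candidate for pB.
candidates = {
    "pA": [("t1", 0.90), ("t5", 0.95)],
    "pB": [("t2", 0.80)],
}
operon_of = {"t1": "op1", "t2": "op1", "t5": "op7"}
mapping, score = map_pathway(candidates, operon_of)
print(mapping, round(score, 2))
```

Here the operon-aware score selects t1 over the marginally more similar t5, showing how genomic structure can override a small difference in homology.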
Applications: We have applied this computational framework to predict the wiring diagrams of various response networks, which consist of signaling, regulatory and metabolic components. These include the carbon fixation, phosphorus assimilation and nitrogen assimilation networks in cyanobacterial genomes. Research is ongoing to apply the framework to Shewanella oneidensis MR-1.
Acknowledgement: This project is supported by the U.S. Department of Energy’s Genomics:GTL Program under the project “Carbon Sequestration in Synechococcus sp.: From Molecular Machines to Hierarchical Modeling” (http://www.genomes-to-life.org).
* Presenting author
