Genomes to Life Contractor-Grantee Workshop III
February 6-9, 2005, Washington, D.C.
Bioinformatics, Modeling, and Computation
53
PhyloScan: a New Tool for Identifying Statistically Significant Transcription Factor Binding Sites by Combining Cross-Species Evidence
Lee A. Newberg1,2*, C. Steven Carmack1, Lee Ann McCue1 (mccue@wadsworth.org), and Charles E. Lawrence3
1Wadsworth Center, Albany, NY; 2Rensselaer Polytechnic Institute, Troy, NY; and 3Brown University, Providence, RI
If transcription factor binding sites (TFBSs) are known for a particular transcription factor (TF), a motif model, or position weight matrix, can be constructed and used to scan a sequence database for additional sites, thereby predicting a regulon. However, scanning a genome in this way typically yields few statistically significant sites. The statistical significance (p-value) of a sequence match to a motif can be assessed as the probability of observing a match with a score as good or better in a randomly generated search space of identical size and nucleotide composition; the smaller the p-value, the greater the evidence that the match is not due to chance alone. Staden [1] presented an efficient method that calculates this probability exactly, and Neuwald et al. [2] described an implementation of this method. In practice, when scanning a genome or its promoter regions, it is frequently difficult to identify, at a chosen level of statistical significance, even the known TFBSs that were used to construct the motif, to say nothing of novel sites for that TF. Because significance is judged against the entire search space, only a relatively small number of TFBSs can be considered significant (low sensitivity, high specificity).
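The exact p-value calculation described above can be illustrated with a minimal sketch, written in the spirit of Staden [1] and not taken from any published implementation. It assumes integer-valued matrix scores, independent background nucleotides, and approximately independent positions in the search space. The function names and the toy matrix are hypothetical.

```python
import math
from collections import defaultdict


def score_distribution(pwm, background):
    """Exact distribution of the total score of one random site.

    pwm: list of dicts, one per motif column, mapping base -> integer score
         (integer scores are an assumption of this sketch).
    background: dict mapping base -> background probability.
    Returns a dict mapping total score -> probability.
    """
    dist = {0: 1.0}
    for column in pwm:
        new_dist = defaultdict(float)
        # Convolve the running distribution with this column's scores.
        for score, prob in dist.items():
            for base, base_prob in background.items():
                new_dist[score + column[base]] += prob * base_prob
        dist = dict(new_dist)
    return dist


def site_p_value(pwm, background, observed_score):
    """Probability that a random site scores at least observed_score."""
    dist = score_distribution(pwm, background)
    return sum(p for s, p in dist.items() if s >= observed_score)


def search_space_p_value(p_site, n_positions):
    """Probability of at least one such match among n_positions sites.

    Treats positions as independent, which is only an approximation
    for overlapping windows in a real sequence.
    """
    if p_site >= 1.0:
        return 1.0
    # Numerically stable form of 1 - (1 - p_site) ** n_positions.
    return -math.expm1(n_positions * math.log1p(-p_site))


# Hypothetical two-column matrix and uniform background.
toy_pwm = [
    {"A": 2, "C": 0, "G": 0, "T": -1},
    {"A": 0, "C": 1, "G": 0, "T": 0},
]
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
p_single = site_p_value(toy_pwm, uniform, 2)
p_genome = search_space_p_value(p_single, 1000)
```

Even a site with a small single-position p-value can lose significance once `n_positions` reaches genome scale, which is the low-sensitivity problem noted above.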
To increase the statistical power of scanning a genome sequence database with a regulatory motif, we have developed a scanning algorithm, PhyloScan, that combines evidence from matching sites found in orthologous data from several related species. Specifically, we have extended Staden’s method [1] to allow scanning of orthologous sequence data that are multiply aligned, unaligned, or a combination of the two. PhyloScan statistically accounts for the phylogenetic dependence of the species contributing aligned data and returns a p-value for the sequence match; importantly, the statistical significance is calculated directly, without employing training sets.
To evaluate this method, we chose the Escherichia coli Crp and PurR motifs and gathered genome sequence data for several gamma-proteobacteria. Among the species chosen for this study (E. coli, Salmonella enterica Typhi, Yersinia pestis, Haemophilus influenzae, Vibrio cholerae, Shewanella oneidensis, and Pseudomonas aeruginosa), only E. coli and S. enterica Typhi (S. typhi) exhibit extensive homology in the promoter regions [3]. Thus we aligned orthologous intergenic regions for these two species, and combined statistical evidence from scanning the aligned E. coli and S. typhi data with statistical evidence from scanning unaligned orthologous intergenic regions from the remaining five more distantly related species. This method enhances the identification of TFBSs in E. coli by several-fold over scanning the set of E. coli intergenic regions alone.
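The abstract does not state the rule used to combine evidence across species. As an illustration only, the sketch below uses a standard technique for independent p-values: the tail probability of their product, which is equivalent to Fisher's method. It assumes the per-species p-values are independent, so it does not model the phylogenetic dependence that PhyloScan accounts for in aligned data. The function name and example values are hypothetical.

```python
import math


def combine_independent_p_values(p_values):
    """P-value of the product of k independent p-values.

    Uses the closed form P(product <= x) = x * sum_{i<k} (-ln x)^i / i!,
    which is valid when each p-value is uniform on (0, 1) under the
    null hypothesis and the p-values are mutually independent.
    """
    k = len(p_values)
    product = math.prod(p_values)
    if product <= 0.0:
        return 0.0
    neg_log = -math.log(product)
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= neg_log / i  # (-ln x)^i / i!, built incrementally
        total += term
    return min(1.0, product * total)


# Hypothetical p-values: one from the aligned pair of species and
# three from unaligned orthologous intergenic regions.
combined = combine_independent_p_values([0.04, 0.10, 0.20, 0.08])
```

Several individually marginal p-values can yield a combined p-value well below any one of them, which is the intuition behind pooling cross-species evidence.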
[1] Staden, R. (1989) Methods for calculating the probabilities of finding patterns in sequences. Comput Appl Biosci 5:89-96.
[2] Neuwald, A. F., Liu, J. S. and Lawrence, C. E. (1995) Gibbs motif sampling: Detection of bacterial outer membrane protein repeats. Protein Sci 4:1618-32.
[3] McCue, L. A., Thompson, W., Carmack, C. S. and Lawrence, C. E. (2002) Factors influencing the identification of transcription factor binding sites by cross-species comparison. Genome Res 12:1523-32.
* Presenting author
