Genomes to Life Contractor-Grantee Workshop III
February 6-9, 2005, Washington, D.C.
Bioinformatics, Modeling, and Computation
52
Exploring Evolutionary Space
Timothy G. Lilburn¹* (tlilburn@atcc.org), Yun Bai², Yuan Zhang², James R. Cole², and George M. Garrity²
¹American Type Culture Collection, Manassas, VA and ²Michigan State University, East Lansing, MI
We developed the use of principal components analysis (PCA) to visualize the evolutionary relationships among thousands of sequences as a tool to aid in defining higher-level prokaryotic taxonomy. It not only helped define higher taxa by revealing naturally occurring clusters within the data, but also proved invaluable in highlighting errors in the classification and annotation of sequences. Such PCA projections can be viewed as maps of evolutionary space for the sequences and, by extension, for the organisms (and genomes) from which the sequences were obtained. Maps based on SSU rRNA sequences show large gaps between some phylogenetic groups. Presumably, this white space reflects constraints on the evolution of these molecules that arise from their functional requirements: sequences that would fall there either cannot occur in nature or belong to extinct species. Although PCA and other projection techniques can provide a reasonable approximation of the topology hidden within a dataset, some distortion is inevitable and can be attributed to biases in the method and to biases that may exist within the data. Previously, we demonstrated that, for a test case with a known topology and coordinate system, we could improve the accuracy of projections by using a set of uniformly distributed external benchmarks. However, neither the true topology nor the coordinate system of prokaryotic evolutionary space has been defined. Therefore, to understand the distortion, we first need to define the limits of this space. In this study, we examine the effect of a limited number (n=179) of internal reference points (benchmarks) on the transformation of the evolutionary distance data into the new coordinate system defined by PCA. We look at ways of making our maps of evolutionary space more accurate and explore why the white space exists.
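The projection step described above can be illustrated with a minimal sketch. It assumes that each sequence is represented by its vector of evolutionary distances to the 179 benchmarks and that PCA is computed by singular value decomposition of the mean-centered matrix. The function name `pca_project`, the toy data, and the two artificial clusters are illustrative assumptions, not the workflow or data used in this study.

```python
import numpy as np

def pca_project(distances, n_components=2):
    """Project rows of a (sequences x benchmarks) distance matrix onto
    the leading principal components. Hypothetical helper for illustration."""
    X = np.asarray(distances, dtype=float)
    centered = X - X.mean(axis=0)            # center each benchmark column
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[:n_components].T  # coordinates in the new system
    explained = (s ** 2) / np.sum(s ** 2)    # fraction of variance per axis
    return scores, explained[:n_components]

# Toy data (assumption): 500 sequences, 179 benchmarks, two artificial groups
# whose distances to the benchmarks differ on average.
rng = np.random.default_rng(0)
n_sequences, n_benchmarks = 500, 179
group_a = rng.normal(0.10, 0.02, size=(n_sequences // 2, n_benchmarks))
group_b = rng.normal(0.30, 0.02, size=(n_sequences // 2, n_benchmarks))
toy_distances = np.vstack([group_a, group_b])

scores, explained = pca_project(toy_distances)
```

Plotting the two columns of `scores` against each other gives a two-dimensional map of the kind discussed here; in this toy case the two groups fall on opposite sides of the first axis, with empty space between them.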
Methods explored include the generation of synthetic polychimeras, in silico random mutation, and complementation of a set of 179 proposed benchmark sequences. Our results are presented as a set of PCA plots that are evaluated in terms of their resolution and their concordance with the current taxonomy and with each other.
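As a rough illustration of the three sequence manipulations named above, the sketch below operates on aligned RNA strings. It rests on several assumptions that are not taken from this study: "complementation" is read as base-by-base complementing, a "polychimera" is read as a sequence assembled from segments of several equal-length parents at random breakpoints, and mutation is modeled as independent substitutions at a fixed per-site rate. All function names are hypothetical.

```python
import random

BASES = "ACGU"
COMPLEMENT = str.maketrans("ACGU", "UGCA")

def complement(seq):
    """Base-by-base complement; other characters (e.g. gaps) are kept."""
    return seq.translate(COMPLEMENT)

def mutate(seq, rate, rng):
    """Substitute each base with a different base with probability `rate`."""
    out = []
    for base in seq:
        if base in BASES and rng.random() < rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def polychimera(parents, n_segments, rng):
    """Join `n_segments` pieces, each copied from a randomly chosen parent.
    Parents must be aligned, i.e. of equal length."""
    length = len(parents[0])
    if any(len(p) != length for p in parents):
        raise ValueError("parents must have equal (aligned) length")
    breakpoints = sorted(rng.sample(range(1, length), n_segments - 1))
    bounds = [0] + breakpoints + [length]
    pieces = [rng.choice(parents)[start:end]
              for start, end in zip(bounds, bounds[1:])]
    return "".join(pieces)

# Toy parents (assumption): random 60-base sequences standing in for benchmarks.
rng = random.Random(42)
parents = ["".join(rng.choice(BASES) for _ in range(60)) for _ in range(5)]
chimera = polychimera(parents, n_segments=4, rng=rng)
mutant = mutate(parents[0], rate=0.05, rng=rng)
complemented = complement(parents[0])
```

Sequences produced this way could then be placed on the PCA map alongside real sequences to see whether they fall in occupied regions or in the white space.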
* Presenting author
