Genomes to Life Contractor-Grantee Workshop III
February 6-9, 2005, Washington, D.C.
Genomics:GTL Program Projects
Sandia National Laboratories
15
Toward Comprehensive Analysis of MS/MS Data Flows
Andrey Gorin* (agor@ornl.gov), Nikita D. Arnold, Robert M. Day, and Tema Fridman
Oak Ridge National Laboratory, Oak Ridge, TN
Tandem mass spectrometry (MS/MS) is a powerful tool applied across several Genomics:GTL projects to a variety of challenging proteomics tasks: the search for modified proteins, characterization of whole-cell proteomes, and identification of the components of protein molecular machines. Despite the great variety of biological drivers, the computational algorithms used “under the hood” face exactly the same challenges, and the existing limitations of such algorithms are reproduced across many experimental designs. In ion trap devices under common conditions, only ~20% of MS/MS spectra lead to peptide identifications that are worth considering, and misidentification rates remain high.
In a certain range of score values the problem presents a tug-of-war: raising the reliability threshold (e.g., the SEQUEST X-correlation value) rapidly decreases the fraction of spectra that can be identified, while lowering it produces identifications from the “grey area”, which are of dubious quality. Algorithmically, the only way out is to increase information extraction from tandem MS data. If we could somehow retrieve the total information content of a given spectrum, its fate could be decided unambiguously, depending on our capacity to learn from it. Such a capability could also be useful for a number of other interesting proteomics applications. The difficulties, of course, start with defining something as unusual as the information content of a peptide spectrum.
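The threshold trade-off described above can be made concrete with a minimal sketch. The scores and correctness labels below are invented for illustration only (they are not data from this study), and the function name `threshold_tradeoff` is hypothetical.

```python
from typing import List, Tuple

# Hypothetical (score, identification_is_correct) pairs.
# These values are made up for illustration; they are NOT from the study.
SPECTRA: List[Tuple[float, bool]] = [
    (4.1, True), (3.6, True), (3.3, True), (2.9, True), (2.7, False),
    (2.5, True), (2.3, False), (2.0, False), (1.8, True), (1.5, False),
]


def threshold_tradeoff(spectra: List[Tuple[float, bool]],
                       threshold: float) -> Tuple[float, float]:
    """Return (fraction of spectra accepted, precision among accepted)."""
    accepted = [ok for score, ok in spectra if score >= threshold]
    if not accepted:
        # Nothing passes the threshold, so precision is undefined.
        return 0.0, float("nan")
    coverage = len(accepted) / len(spectra)
    precision = sum(accepted) / len(accepted)
    return coverage, precision


if __name__ == "__main__":
    for t in (1.5, 2.2, 3.2):
        coverage, precision = threshold_tradeoff(SPECTRA, t)
        print(f"threshold {t:.1f}: accepted {coverage:.0%}, "
              f"precision {precision:.0%}")
```

In this toy data set, a stricter threshold accepts fewer spectra but a larger share of the accepted identifications is correct, which is the tug-of-war in miniature.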
Recently we proposed the Probability Profile Method (PPM), a classification algorithm that infers the identities of individual spectral peaks by examining their spectral neighborhoods under the “microscope” of Bayesian statistics. PPM results take the form of probabilistic statements, such as “peak number 123 is a b-ion with probability 0.85”. Efficient identification of the “noble” b- and y-ion peaks dramatically simplifies the construction of de novo tags (partial peptides) for a particular spectrum. Relatively simple algorithmic advances allowed us to build PPM-chain, a tool for de novo protein tagging based on our methodology. During this study we realized that the traditional separation of MS computational algorithms into database search and de novo is very misleading. PPM-chain can be used in SEQUEST-emulation mode, taking full advantage of the known protein database, but at the same time it has unique algorithmic capabilities, which include classical full-length de novo sequencing (it is not yet very good at the latter task).
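To illustrate how a probabilistic statement of this kind can arise, here is a minimal sketch of Bayes' rule applied to a single peak. It is not the PPM algorithm itself: the class list, the priors, the single neighborhood feature, and all numbers are assumptions chosen for illustration.

```python
from typing import Dict


def posterior(priors: Dict[str, float],
              likelihoods: Dict[str, float]) -> Dict[str, float]:
    """Bayes' rule: P(class | evidence) for every candidate peak class."""
    joint = {c: priors[c] * likelihoods[c] for c in priors}
    total = sum(joint.values())
    return {c: p / total for c, p in joint.items()}


# Hypothetical prior probabilities that a randomly chosen peak
# belongs to each class (assumed values).
priors = {"b-ion": 0.2, "y-ion": 0.2, "other": 0.6}

# Hypothetical likelihoods of one observed neighborhood feature,
# P(feature present | class). Here the feature is imagined to be a
# companion peak about 28 Da below the peak in question, as an a-ion
# would be relative to a b-ion (assumed values).
likelihoods = {"b-ion": 0.6, "y-ion": 0.1, "other": 0.05}

if __name__ == "__main__":
    for peak_class, p in posterior(priors, likelihoods).items():
        print(f"P({peak_class} | feature) = {p:.2f}")
```

A realistic classifier would combine many such neighborhood features; the sketch only shows how one piece of evidence shifts the probability toward the b-ion label.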
While capable of emulating SEQUEST, our program works on entirely different mathematical and algorithmic principles. The laborious comparison between theoretical and experimental spectra is the main consumer of CPU time in database look-up algorithms; correspondingly, their running time typically scales linearly with the size of the search space, which grows exponentially in many situations (e.g., with the number of PTMs considered for each peptide). In the de novo approach, almost all of the work is done up front, on the experimental spectra: peak labeling, tag finding, and tag scoring. The need for the database comes very late in the process and involves very few candidate sequences and very simple procedures, which can be skipped altogether for spectra with too little informational content (no connectable peaks) or with a lot of it (direct de novo identification).
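As a rough illustration of the exponential growth mentioned above, the sketch below counts the candidate forms of one peptide under a simple assumed model: each modifiable site independently is either unmodified or carries one of a fixed number of modification types. The model and the function name `modified_forms` are assumptions, not a description of any particular search program.

```python
def modified_forms(n_sites: int, n_mod_types: int = 1) -> int:
    """Number of distinct modification patterns for one peptide.

    Assumed model: each of the n_sites positions is either unmodified
    or carries exactly one of n_mod_types modifications, independently
    of the other positions.
    """
    return (n_mod_types + 1) ** n_sites


if __name__ == "__main__":
    for sites in (1, 2, 4, 8):
        print(f"{sites} modifiable sites, 2 modification types: "
              f"{modified_forms(sites, 2)} candidate forms")
```

Because a database look-up scores a theoretical spectrum for every candidate, its cost grows in step with this count, whereas the up-front de novo work on the experimental spectrum does not depend on it.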
De novo identification also has inherent flexibility, which is reflected in its output. For a given spectrum and given specifications for the de novo tag (e.g., a minimum length of 3 residues), PPM-chain has three possible outcomes: (1) “no tag”: no satisfactory de novo tag could be constructed for the spectrum; (2) “tag-no-match”: there are good de novo tags, but they do not match anything in the available database; (3) “answer”: a satisfactory de novo tag is found and mapped to a protein in the database. In contrast, database look-up programs return the best match with an attached score, which decreases gradually from the confidently identified spectra toward definite identification failures. In that case a poor match caused by the database (e.g., because the database protein contains a sequencing error) is hard to distinguish from one caused by the mediocre informational content of the spectrum (e.g., due to poor fragmentation). Such a mix-up leads to all kinds of “grey area” situations, where valuable information, often indicative of unusual and interesting biological events, can be irrecoverably lost.
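The three outcomes can be sketched as a small decision procedure. This is a simplified illustration, not the PPM-chain implementation: tags are matched as plain substrings (ignoring flanking masses, tag scores, and reversed reading direction), and the protein names and sequences are invented.

```python
from enum import Enum
from typing import Dict, List, Optional, Tuple


class Outcome(Enum):
    NO_TAG = "no tag"
    TAG_NO_MATCH = "tag-no-match"
    ANSWER = "answer"


def classify(tags: List[str], database: Dict[str, str],
             min_len: int = 3) -> Tuple[Outcome, Optional[str]]:
    """Classify one spectrum from its de novo tags (simplified)."""
    # Keep only tags that meet the minimum-length specification.
    good_tags = [t for t in tags if len(t) >= min_len]
    if not good_tags:
        return Outcome.NO_TAG, None
    # The database is consulted only now, for a handful of tags.
    for tag in good_tags:
        for name, sequence in database.items():
            if tag in sequence:
                return Outcome.ANSWER, name
    return Outcome.TAG_NO_MATCH, None


# Hypothetical two-protein database (made-up sequences).
DATABASE = {
    "protein_A": "MKVLAAGIVGLSTR",
    "protein_B": "MSDQEAKPSTEDLGDKK",
}

if __name__ == "__main__":
    for tags in (["GL"], ["WWW"], ["AGIV"]):
        outcome, protein = classify(tags, DATABASE)
        print(tags, "->", outcome.value, protein)
```

Keeping “tag-no-match” as a separate outcome is what preserves the spectra that a best-match score would otherwise blur into the grey area.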
We compared PPM-chain and SEQUEST using a data sample obtained on the 54 ribosomal proteins of Rhodopseudomonas palustris, produced by Dr. Michael Strader at the ORNL Center for Molecular and Cellular Systems. We explored the results of both programs for three spectral sets separated by SEQUEST X-correlation score: “high quality” (>3.2), “medium” (from 2.2 to 3.2), and “low quality” (<2.2) spectra. For the high quality subset, the “no tag” outcome was obtained for only 21 spectra (1.4%), and out of 1263 “answer” results the SEQUEST identification was confirmed for 99.9% of spectra. The “tag-no-match” outcome was observed in 216 cases (14%), and this fraction increased in the medium and low quality subsets (~38% in both). The “answer” outcomes were still in excellent agreement with the SEQUEST identifications (99% and 96% precision, respectively). The fraction of “no tag” cases grew sharply, to 18% for the medium and 57% for the low confidence sets, reflecting the absence of differentiating information in many of the spectra belonging there.
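The high quality counts quoted above can be cross-checked with a few lines of arithmetic. The sketch assumes that the three outcomes partition the high quality subset, so that its size is the sum of the three counts; that total is not stated explicitly in the text.

```python
# Outcome counts for the high quality subset, as quoted in the text.
counts = {"no tag": 21, "tag-no-match": 216, "answer": 1263}

# Assumption: every high quality spectrum falls into exactly one outcome.
total = sum(counts.values())

percentages = {name: round(100 * n / total, 1) for name, n in counts.items()}

if __name__ == "__main__":
    print("total spectra:", total)
    for name, pct in percentages.items():
        print(f"{name}: {pct}%")
```

Under that assumption the computed shares agree with the quoted 1.4% and 14% to within rounding.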
The result suggests an interesting speculation about possible sources of SEQUEST errors: it is plausible that a large fraction of such errors is due to the absence of the underlying correct answer from the database. In such cases the returned match is bound to be incorrect, but it may still have a relatively high X-correlation value. In our approach such spectra immediately become candidates for further study, such as a post-translational modification (PTM) search or further de novo processing.
In summary, our testing indicates the following:
(1) Even with the existing technology (which certainly could be improved), reliable de novo tags can be constructed for a large majority of MS/MS spectra, and for virtually all high quality spectra.
(2) When the de novo solution is compatible with the database, it is almost always the same as the one provided by SEQUEST. This confirms the high reliability of SEQUEST identifications in cases where the expected peptides are present in the protein database.
(3) There is a significant fraction of spectra (~33% for medium X-correlation values, ~50% for low X-correlation values) where PPM-chain finds good de novo tags that are not compatible with anything in the target database. Some of these tags very likely reflect complex and interesting biological phenomena, where PTMs and point mutations block the possibility of finding the correct answers in “plain vanilla” database searches.
In future work we plan to apply PPM-chain to comprehensive data extraction from proteomics samples, aiming at low-abundance proteins as well as interesting biological facts such as PTMs and mutated proteins. Our results strongly suggest that this approach will not only increase the output of useful information but will also eliminate a significant part of the incorrect identifications, further improving the quality of the corresponding proteomics studies.
This work was funded in part by the US Department of Energy's Genomics:GTL program (genomicsgtl.energy.gov) under two projects, “Carbon Sequestration in Synechococcus Sp.: From Molecular Machines to Hierarchical Modeling” (www.genomes-to-life.org) and “Center for Molecular and Cellular Systems” (www.ornl.gov/GenomestoLife).
* Presenting author
