Protein inference based on peptides identified from tandem mass spectra
Protein inference is a critical computational step in proteomics. It lays the foundation for further structural and functional analysis of proteins, on which new medicines or technologies can be based. Today, mass spectrometry (MS) is the technique of choice for large-scale inference of proteins in proteomics. In MS-based protein inference, three levels of data are generated: (1) tandem mass spectra (MS/MS); (2) peptide sequences and their scores or probabilities; and (3) protein sequences and their scores or probabilities. Accordingly, the protein inference problem can be divided into three computational phases: (1) process MS/MS to improve the quality of the data and facilitate subsequent peptide identification; (2) postprocess peptide identification results from existing algorithms that match MS/MS to peptides; and (3) infer proteins by assembling identified peptides. Addressing these computational problems constitutes the main content of this thesis.

The processing of MS/MS data mainly includes denoising, quality assessment, and charge state determination. Here, we address charge state determination for MS/MS data acquired with low-resolution collision-induced dissociation. Spectra of multiply charged peptides are usually searched several times, once for each possible charge state. This strategy not only increases the overall database search time but also yields more false positives. Hence, it is advantageous to determine the charge states of such spectra before the database search. A new approach is proposed to determine the charge states of low-resolution MS/MS. Four novel and discriminant features are adopted to describe each MS/MS and are used in a Gaussian mixture model to distinguish doubly and triply charged peptides. The results show that this method assigns charge states to low-resolution MS/MS more accurately than existing methods.

Many search engines are available for peptide identification.
However, their results usually have a high false positive rate (FPR), which carries many false identifications into protein inference. It is therefore necessary to postprocess peptide identification results. The most commonly used approach is statistical analysis, which not only makes it possible to compare and combine the results from different search engines but also facilitates subsequent protein inference. We propose a new method that estimates the accuracy of peptide identification with logistic regression (LR) and exemplify it with Sequest scores. Each peptide is characterized by the regularized Sequest scores ΔCn∗ and Xcorr∗. The score regularization is formulated as an optimization problem under two assumptions: smoothing consistency between sibling peptides and fitting consistency between the original scores and the new scores. The results show that the proposed method robustly assigns accurate probabilities to peptides and has very high discrimination power, higher than that of PeptideProphet, for distinguishing correctly from incorrectly identified peptides.

Given identified peptides and their probabilities, protein inference is conducted by assembling these peptides. Existing methods for this MS-based protein inference problem fall into two groups: two-stage methods and unified frameworks that identify peptides and infer proteins together. In two-stage methods, protein inference is based on, but separated from, peptide identification, whereas in a unified framework the two are integrated. In this study, we propose a unified framework for protein inference and develop an iterative method accordingly to infer proteins based on Sequest peptide identification. The statistical analysis of peptide identification is performed with the LR method introduced above.
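The score regularization mentioned above can be illustrated with a minimal sketch. The abstract does not state the objective function, so the quadratic form used here is an assumption: a fitting term that keeps each new score close to its original score, plus a smoothing term that pulls the scores of sibling peptides together. The names `regularize_scores`, `adjacency`, `lam`, and `n_iter` are hypothetical and are not taken from the thesis.

```python
def regularize_scores(scores, adjacency, lam=1.0, n_iter=200):
    """Smooth peptide scores over a sibling graph (illustrative only).

    Assumed objective, minimised over the new scores f:
        sum_i (f_i - s_i)^2 + (lam / 2) * sum_{i,j} W_ij * (f_i - f_j)^2
    where s are the original scores and W is a symmetric 0/1 sibling
    matrix with a zero diagonal. Setting the gradient to zero gives
        f_i * (1 + lam * d_i) = s_i + lam * sum_j W_ij * f_j,
    with d_i the number of siblings of peptide i. The loop below solves
    this system by Jacobi iteration.
    """
    n = len(scores)
    degree = [sum(adjacency[i]) for i in range(n)]
    new = list(scores)
    for _ in range(n_iter):
        new = [
            (scores[i] + lam * sum(adjacency[i][j] * new[j] for j in range(n)))
            / (1.0 + lam * degree[i])
            for i in range(n)
        ]
    return new


# Peptides 0 and 1 are siblings; peptide 2 has no sibling.
original = [3.0, 1.0, 2.0]
siblings = [[0, 1, 0],
            [1, 0, 0],
            [0, 0, 0]]
smoothed = regularize_scores(original, siblings, lam=1.0)
```

In this toy case the two sibling scores move toward each other while the isolated peptide keeps its original score. With `lam=0` the smoothing term vanishes and the original scores are returned unchanged.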
Protein inference and peptide identification are iterated within one framework by adding feedback from protein inference to peptide identification. The feedback is a list of high-confidence proteins, which is used to update the adjacency matrix between peptides; this matrix is in turn used in the regularization of peptide scores. The results show that the proposed method infers more true positive proteins and outputs fewer false positive proteins than ProteinProphet at the same FPR. The coverage of inferred proteins also increases significantly, owing to the selection of multiple peptides for each MS/MS spectrum and the improvement of their scores by the feedback from the inferred proteins.
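The feedback step can be pictured with a small sketch of how a list of high-confidence proteins might update the peptide adjacency matrix. The abstract does not define the update rule, so the rule used here is an assumption: two peptides are linked when they share at least one protein from the high-confidence list. The function name `sibling_adjacency` and the toy protein identifiers are hypothetical.

```python
def sibling_adjacency(peptide_proteins, confident_proteins):
    """Build a symmetric 0/1 adjacency matrix between peptides.

    peptide_proteins: list of sets, one per peptide, holding the IDs of
        the proteins whose sequences contain that peptide.
    confident_proteins: iterable of protein IDs currently considered
        high-confidence (the feedback from protein inference).
    Assumed rule: peptides i and j are adjacent when they share at
    least one high-confidence protein.
    """
    confident = set(confident_proteins)
    n = len(peptide_proteins)
    adjacency = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if peptide_proteins[i] & peptide_proteins[j] & confident:
                adjacency[i][j] = adjacency[j][i] = 1
    return adjacency


# Four candidate peptides and the proteins that contain them.
peptide_proteins = [{"P1"}, {"P1", "P2"}, {"P2"}, {"P3"}]

# First pass: only P1 is inferred with high confidence.
first_pass = sibling_adjacency(peptide_proteins, ["P1"])

# Later pass: P2 joins the list, so peptides 1 and 2 become linked.
later_pass = sibling_adjacency(peptide_proteins, ["P1", "P2"])
```

Under this assumed rule, each round of protein inference can add or remove links between peptides, which changes the graph used the next time peptide scores are regularized.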
Degree: Doctor of Philosophy (Ph.D.)
Committee: Zhang, Chris; Purves, Randy; McQuillan, Ian; Alhajj, Reda; Sarty, Gordon; Schreyer, David
Copyright Date: December 2012
tandem mass spectra
Gaussian mixture model