Pre-processing of tandem mass spectra using machine learning methods
Protein identification has been more helpful than before in the diagnosis and treatment of many diseases, such as cancer, heart disease and HIV. Tandem mass spectrometry is a powerful tool for protein identification. In a typical experiment, proteins are broken into small amino acid oligomers called peptides. By determining the amino acid sequence of several peptides of a protein, its whole amino acid sequence can be inferred. Therefore, peptide identification is the first step and a central issue for protein identification. Tandem mass spectrometers can produce a large number of tandem mass spectra which are used for peptide identification. Two issues should be addressed to improve the performance of current peptide identification algorithms. Firstly, nearly all spectra are noise-contaminated. As a result, the accuracy of peptide identification algorithms may suffer from the noise in spectra. Secondly, the majority of spectra are not identifiable because they are of too poor quality. Therefore, much time is wasted attempting to identify these unidentifiable spectra. The goal of this research is to design spectrum pre-processing algorithms to both speedup and improve the reliability of peptide identification from tandem mass spectra. Firstly, as a tandem mass spectrum is a one dimensional signal consisting of dozens to hundreds of peaks, and majority of peaks are noisy peaks, a spectrum denoising algorithm is proposed to remove most noisy peaks of spectra. Experimental results show that our denoising algorithm can remove about 69% of peaks which are potential noisy peaks among a spectrum. At the same time, the number of spectra that can be identified by Mascot algorithm increases by 31% and 14% for two tandem mass spectrum datasets. Next, a two-stage recursive feature elimination based on support vector machines (SVM-RFE) and a sparse logistic regression method are proposed to select the most relevant features to describe the quality of tandem mass spectra. Our methods can effectively select the most relevant features in terms of performance of classifiers trained with the different number of features. Thirdly, both supervised and unsupervised machine learning methods are used for the quality assessment of tandem mass spectra. A supervised classifier, (a support vector machine) can be trained to remove more than 90% of poor quality spectra without removing more than 10% of high quality spectra. Clustering methods such as model-based clustering are also used for quality assessment to cancel the need for a labeled training dataset and show promising results.
DegreeMaster of Science (M.Sc.)
CommitteeZhang, W. J. (Chris); Saadat Mehr, Aryan; McQuillan, Ian
Tandem mass spectra; Machine Learning; Denoising;