Parallel algorithms for real-time peptide-spectrum matching
Tandem mass spectrometry is a powerful experimental tool used in molecular biology to determine the composition of protein mixtures, and it has become a standard technique for protein identification. Owing to the rapid development of mass spectrometry technology, instruments can now produce large numbers of mass spectra for peptide identification, and the growing data size demands efficient software tools. In a tandem mass spectrometry experiment, peptide ion selection algorithms generally select only the most abundant peptide ions for further fragmentation, so the low-abundance proteins in a sample are rarely identified. To address this problem, researchers developed the notion of a 'dynamic exclusion list', which records recently selected peptide ions and ensures that they are not selected again for a certain time. Other peptide ions thus get more opportunity to be selected and identified, which allows peptides of lower abundance to be identified. A better method, however, is to also incorporate the identification results into the 'dynamic exclusion list' approach, and this requires a real-time peptide identification algorithm. In this thesis, we introduce methods to improve the speed of peptide identification so that the 'dynamic exclusion list' approach can use the peptide identification results without affecting the throughput of the instrument. Our work is based on RT-PSM, a real-time program for peptide-spectrum matching with statistical significance. We profile RT-PSM and find that the peptide-spectrum scoring module is the most time-consuming portion. Guided by the profiling results, we introduce methods to parallelize the peptide-spectrum scoring algorithm, and we propose two parallel algorithms using different technologies. The first performs parallel peptide-spectrum matching using SIMD instructions.
We implement and test the SIMD algorithm on the Intel SSE architecture; the test results show an 18-fold speedup on the entire process. The second parallel algorithm is developed using NVIDIA CUDA technology. We describe two CUDA kernels based on different algorithms, compare their performance, and integrate the more efficient one into RT-PSM. Timing measurements show a 190-fold speedup on the scoring module and a 26-fold speedup on the entire process. Profiling the CUDA version again shows that the scoring module has been optimized to the point where it is no longer the most time-consuming module in the CUDA version of RT-PSM. In addition, we evaluate the feasibility of creating a metric index to reduce the number of candidate peptides. We describe the evaluation methods and show that general indexing methods are unlikely to be feasible for RT-PSM.
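The basic dynamic exclusion list idea described in the abstract can be illustrated with a minimal Python sketch. It is not taken from RT-PSM: the class name, the 30-second exclusion window, the m/z tolerance, and the top-N selection rule are all assumptions chosen only for illustration, and the sketch covers the plain time-based exclusion, not the identification-aware variant.

```python
from dataclasses import dataclass, field


@dataclass
class DynamicExclusionList:
    """Hypothetical exclusion list keyed by precursor m/z (not RT-PSM's structure)."""
    exclusion_seconds: float = 30.0  # assumed exclusion window
    mz_tolerance: float = 0.01       # assumed m/z matching tolerance
    _entries: list = field(default_factory=list, init=False)  # (mz, expiry) pairs

    def _purge(self, now):
        # Drop entries whose exclusion window has elapsed.
        self._entries = [(mz, t) for mz, t in self._entries if t > now]

    def is_excluded(self, mz, now):
        self._purge(now)
        return any(abs(mz - m) <= self.mz_tolerance for m, _ in self._entries)

    def add(self, mz, now):
        self._entries.append((mz, now + self.exclusion_seconds))


def select_precursors(peaks, exclusion, now, top_n=2):
    """Pick the top_n most intense ions that are not currently excluded.

    peaks is a list of (mz, intensity) pairs; selected ions are added to
    the exclusion list so that later scans favour other ions.
    """
    chosen = []
    for mz, _intensity in sorted(peaks, key=lambda p: p[1], reverse=True):
        if len(chosen) == top_n:
            break
        if not exclusion.is_excluded(mz, now):
            exclusion.add(mz, now)
            chosen.append(mz)
    return chosen
```

With the same peaks seen in consecutive scans, the two most abundant ions are selected first, the lower-abundance ions are selected while the first two are excluded, and the abundant ions become eligible again once the window expires.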
Degree: Master of Science (M.Sc.)
Supervisor: McQuillan, Ian; Wu, FangXiang
Committee: Kim, Theodore; Kusalik, Tony; Teng, Daniel
Copyright Date: December 2010