Identifying protein complexes and disease genes from biomolecular networks
With advances in high-throughput measurement techniques, large-scale biological data, such as protein-protein interaction (PPI) data, gene expression data, gene-disease association data, and cellular pathway data, have been and will continue to be produced. These data contain valuable information for understanding the mechanisms of biological systems and have proved useful for developing new methods for disease diagnosis, disease treatment, and drug design. This study focuses on two main research topics: (1) identifying protein complexes and (2) identifying disease genes from biomolecular networks.

First, protein complexes are groups of proteins that interact with each other at the same time and place within living cells. They are molecular entities that carry out cellular processes. Identifying protein complexes is essential to understanding the organization of proteins and the mechanisms of biological systems. Many previous algorithms are designed on the assumption that protein complexes are densely connected sub-graphs in PPI networks. In this research, a dense sub-graph detection algorithm that uses clique seeds and graph entropy is first developed following this assumption. Although the proposed algorithm generates a large number of reasonable predictions and its F-score is higher than that of many previous algorithms, it still cannot identify many known protein complexes. We then analyze the characteristics of known yeast protein complexes and find that not all of them exhibit dense structures in PPI networks. Many have a star-like structure, a special case of the core-attachment structure that many previous core-attachment-structure-based algorithms cannot identify. To increase the prediction accuracy of protein complex identification, a multiple-topological-structure-based algorithm is proposed to identify protein complexes from PPI networks.
Four single-topological-structure-based algorithms are first employed to detect raw predictions with clique, dense, core-attachment, and star-like structures, respectively. A merging and trimming step is then adopted to generate final predictions based on the topological information or GO annotations of the raw predictions. A comprehensive review of protein complex identification, from static PPI networks to dynamic PPI networks, is also given in this study.

Second, genetic diseases often involve the dysfunction of multiple genes. Various types of evidence have shown that genes associated with similar diseases tend to lie close to one another in various biomolecular networks. Identifying disease genes by integrating multiple data sources is indispensable to understanding the genetic mechanisms of many genetic diseases. However, the number of known disease genes related to similar genetic diseases is often small, and it is not easy to capture the intricate gene-disease associations from such a small number of known samples. Moreover, different kinds of biological data are heterogeneous, and no widely accepted criterion is available to standardize them to the same scale. In this study, a flexible and reliable multiple data integration algorithm is first proposed to identify disease genes based on the theory of Markov random fields (MRF) and the method of Bayesian analysis. A novel global-characteristic-based parameter estimation method and an improved Gibbs sampling strategy are introduced so that the proposed algorithm can tune the parameters of different data sources automatically. However, because of its Markov property, the proposed algorithm considers only the information of direct neighbors when formulating the relationships among genes, ignoring the contribution of indirect neighbors in biomolecular networks.
To overcome this drawback, a kernel-based MRF algorithm is further proposed to take advantage of the global characteristics of biological data via graph kernels. The kernel-based MRF algorithm outperforms many previous disease gene identification algorithms in terms of the area under the receiver operating characteristic curve (AUC score). However, it is very time-consuming, since the Gibbs sampling process of the algorithm has to maintain a long Markov chain for every single gene. Finally, to reduce the computational time of the MRF-based algorithm, a fast, high-performance logistic-regression-based algorithm is developed for identifying disease genes from biomolecular networks. Numerical experiments show that the proposed algorithm outperforms many existing methods in terms of both AUC score and running time.

To summarize, this study has developed several computational algorithms for identifying protein complexes and disease genes from biomolecular networks. The proposed algorithms compare favorably with many existing algorithms in the literature.
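
The dense sub-graph search mentioned in the abstract (clique seeds grown under a graph entropy criterion) can be illustrated with a minimal sketch. It assumes a commonly used definition of graph entropy: each vertex contributes the binary entropy of the fraction of its neighbors that lie inside the candidate cluster, and lower totals indicate a cleaner module boundary. The function names, the greedy add-only growth rule, and the toy network are hypothetical choices for illustration; the thesis's own algorithm may differ in its details.

```python
import math


def graph_entropy(adj, cluster):
    """Sum, over all vertices, of the binary entropy of the fraction of
    each vertex's neighbors that fall inside `cluster` (assumed definition)."""
    total = 0.0
    for nbrs in adj.values():
        if not nbrs:
            continue  # isolated vertices contribute nothing
        p_in = len(nbrs & cluster) / len(nbrs)
        for p in (p_in, 1.0 - p_in):
            if p > 0.0:
                total -= p * math.log2(p)
    return total


def grow_from_seed(adj, seed):
    """Greedily add the neighboring vertex that lowers graph entropy the
    most; stop when no addition lowers it (hypothetical growth rule)."""
    cluster = set(seed)
    best = graph_entropy(adj, cluster)
    while True:
        frontier = set().union(*(adj[v] for v in cluster)) - cluster
        scored = [(graph_entropy(adj, cluster | {u}), u) for u in sorted(frontier)]
        if not scored:
            break
        score, u = min(scored)
        if score >= best:
            break
        cluster.add(u)
        best = score
    return cluster, best


# Toy PPI network (made up): two 4-cliques joined by the single edge d-e.
ppi = {
    "a": {"b", "c", "d"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"a", "b", "c", "e"},
    "e": {"d", "f", "g", "h"},
    "f": {"e", "g", "h"},
    "g": {"e", "f", "h"},
    "h": {"e", "f", "g"},
}

# Start from the triangle a-b-c as the clique seed.
cluster, entropy = grow_from_seed(ppi, {"a", "b", "c"})
print(sorted(cluster), round(entropy, 4))
```

On this toy network the search absorbs `d` and then stops, because adding `e` would pull in a vertex whose neighbors are mostly outside the cluster and raise the total entropy. A sparse, star-like complex would not be recovered by such a density-driven criterion, which is consistent with the limitation the abstract describes.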
Degree: Doctor of Philosophy (Ph.D.)
Committee: Zhang, Wenjun (Chris); Kusalik, Anthony; Keil, Mark
Copyright Date: November 2014
multiple data integration