Two Novel Methods for Clustering Short Time-Course Gene Expression Profiles
As genes with similar expression patterns are very likely to share the same biological function, cluster analysis has become an important tool for understanding and predicting gene functions from gene expression profiles. In many situations, each gene expression profile contains only a few data points. Directly applying traditional clustering algorithms to such short gene expression profiles does not yield satisfactory results, so clustering algorithms designed for short gene expression profiles are needed. In this thesis, two novel methods are developed for clustering short gene expression profiles. The first method, called the network-based clustering method, deals with the shortness of the profiles by generating a gene co-expression network using conditional mutual information (CMI), which measures the non-linear relationship between two genes and also accounts for indirect gene relationships in the presence of other genes. The network-based clustering method consists of two steps. First, a gene co-expression network is constructed from short gene expression profiles using a path consistency algorithm (PCA) based on the CMI between genes. Then, gene functional modules are identified in terms of cluster cohesiveness. The network-based clustering method is evaluated on 10 large-scale Arabidopsis thaliana short time-course gene expression profile datasets in terms of gene ontology (GO) enrichment analysis, and compared with an existing method called Clustering with Overlapping Neighbourhood Expansion (ClusterONE). Gene functional modules identified by the network-based clustering method for the 10 datasets return target GO p-values as low as 10^-24, whereas the original ClusterONE yields insignificant results. In order to cluster gene expression profiles more specifically, a second clustering method, namely the protein-protein interaction (PPI) integrated clustering method, is developed.
It is designed for clustering short gene expression profiles by integrating gene expression profile patterns and curated PPI data. The method consists of the following three steps: (1) generate a number of predefined profile patterns according to the number of data points in the profiles, and assign each gene to the predefined profile to which its expression profile is most similar; (2) integrate curated PPI data to refine the initial clustering result from (1); (3) combine similar clusters from (2) to gradually reduce the number of clusters using a hierarchical clustering method. The PPI-integrated clustering method is evaluated on 10 large-scale A. thaliana datasets using GO enrichment analysis, and by comparison with an existing method called Short Time-series Expression Miner (STEM). Target gene functional clusters identified by the PPI-integrated clustering method for the 10 datasets return GO p-values as low as 10^-62, whereas STEM returns GO p-values as low as 10^-38. In addition to the method development, the clusters obtained by the two proposed methods are further analyzed to identify cross-talk genes under five stress conditions in root and shoot tissues. A list of potential abiotic stress-tolerant genes is identified.
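Step (1) of the PPI-integrated method can be illustrated with a minimal Python sketch. The abstract does not say how the predefined patterns are built or which similarity measure is used, so everything here is an assumption for illustration only: patterns start at 0 and change by -1, 0, or +1 between consecutive time points, similarity is the Pearson correlation, and the function names `generate_patterns` and `assign_to_patterns` are hypothetical.

```python
import itertools
import numpy as np


def generate_patterns(n_points, steps=(-1, 0, 1)):
    """Enumerate candidate patterns (assumed form): start at 0, move one step per time point."""
    patterns = []
    for deltas in itertools.product(steps, repeat=n_points - 1):
        if any(deltas):  # skip the flat pattern; its correlation is undefined
            patterns.append(np.concatenate(([0], np.cumsum(deltas))))
    return np.array(patterns, dtype=float)


def _standardize(rows):
    """Center each row and scale it to unit length, so dot products equal Pearson correlations."""
    centered = rows - rows.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # constant rows get correlation 0 with every pattern
    return centered / norms


def assign_to_patterns(expr, patterns):
    """For each gene (row of expr), return the index of the most correlated pattern."""
    corr = _standardize(expr) @ _standardize(patterns).T
    return corr.argmax(axis=1)


# Toy data (made up): two genes measured at four time points.
patterns = generate_patterns(4)
expr = np.array([
    [0.0, 1.1, 2.0, 2.9],     # steadily increasing
    [0.0, -0.9, -1.0, -2.1],  # decreasing with a plateau
])
labels = assign_to_patterns(expr, patterns)
for gene, label in zip(expr, labels):
    print(gene, "->", patterns[label])
```

The genes sharing a label form the initial clusters that steps (2) and (3) would then refine with PPI data and merge hierarchically; those steps are not sketched here because the abstract gives no detail on how they operate.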
Degree: Master of Science (M.Sc.)
Supervisor: Wu, Fang-Xiang; Selvaraj, Gopalan; Gray, Gordon
Copyright Date: January 2014
Gene expression profiles
Conditional mutual information
GO enrichment analysis.