Exploring the Behaviour of the Hidden Markov Model on CpG Island Prediction
DNA can be represented abstractly as a language with only four nucleotides, denoted by the letters A, C, G, and T, yet the arrangement of those four letters plays a major role in determining the development of an organism. Understanding the significance of certain arrangements of nucleotides can unlock the secrets of how the genome achieves its essential functionality. Regions of DNA particularly enriched with cytosine (C nucleotides) and guanine (G nucleotides), especially the CpG dinucleotide, are frequently associated with biological function related to gene expression, and concentrations of CpGs referred to as "CpG islands" are known to collocate with regions upstream from gene coding sequences within the promoter region. The pattern of occurrence of these nucleotides, relative to adenine (A nucleotides) and thymine (T nucleotides), lends itself to analysis by machine-learning techniques such as Hidden Markov Models (HMMs) to predict the areas of greater enrichment. HMMs have been applied to CpG island prediction before, but often without an awareness of how the outcomes are affected by the manner in which the HMM is applied. Two main findings of this study are: 1. The outcome of an HMM is highly sensitive to the setting of the initial probability estimates. 2. Without the appropriate software techniques, HMMs cannot be applied effectively to large data such as whole eukaryotic chromosomes. Both of these factors are rarely considered by users of HMMs, but are critical to a successful application of HMMs to large DNA sequences. In fact, these shortcomings were discovered through a close examination of published results of CpG island prediction using HMMs, and if left unaddressed they can lead to an incorrect implementation and application of HMM theory. A first-order HMM is developed and its performance compared to two other historical methods, the Takai and Jones method and the UCSC method from the University of California Santa Cruz.
The HMM is then extended to second order to acknowledge that pairs of nucleotides define CpG islands rather than single nucleotides alone, and the second-order HMM is evaluated in comparison to the other methods. The UCSC method is found to be based on properties that are not related to CpG islands, and thus is not a fair comparison to the other methods. Of the other methods, the first-order HMM method and the Takai and Jones method are comparable in the tests conducted, but the second-order HMM method demonstrates superior predictive capabilities. However, these results are valid only when the high sensitivity of the outcomes to the initial estimates is taken into account and a suitable set of estimates is found that provides the most appropriate results. The first-order HMM is applied to the problem of producing synthetic data that simulates the characteristics of a DNA sequence, including the specified presence of CpG islands, based on the model parameters of a trained HMM. HMM analysis is applied to the synthetic data to explore its fidelity in generating data with similar characteristics, as well as to validate the predictive ability of an HMM. Although this test fails to meet expectations, a second test using a second-order HMM to produce simulated DNA data using frequency distributions of CpG island profiles exhibits highly accurate predictions of the pre-specified CpG islands, confirming that when the synthetic data are appropriately structured, an HMM can be an accurate predictive tool. One outcome of this thesis is a set of software components (CpGID 2.0 and TrackMap) capable of efficient and accurate application of an HMM to genomic sequences, together with visualization that allows quantitative CpG island results to be viewed in conjunction with other genomic data.
CpGID 2.0 is an adaptation of a previously published software component that has been extensively revised, and TrackMap is a companion product that works with the results produced by the CpGID 2.0 program. Executing these components allows one to monitor output aspects of the computational model such as number and size of the predicted CpG islands, including their CG content percentage and level of CpG frequency. These outcomes can then be related to the input values used to parameterize the HMM.
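As a rough illustration of how the input values used to parameterize an HMM shape its predictions, the following minimal sketch decodes a short sequence with a two-state HMM (background versus island) using the Viterbi algorithm. It is not the CpGID 2.0 implementation: the state labels, the emission probabilities, the `stay` parameter, and the function name `viterbi` are all assumptions chosen for illustration, and the model emits single nucleotides independently rather than modelling the CpG dinucleotide.

```python
import math

# Hypothetical state labels: B = background, I = CpG island.
STATES = ("B", "I")

# Illustrative emission probabilities (assumed, not taken from the thesis):
# the island state favours C and G, the background state favours A and T.
EMIT = {
    "B": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "I": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}


def viterbi(seq, stay=0.9, start=None):
    """Return the most probable state path for seq as a string of 'B'/'I'.

    stay is the assumed probability of remaining in the current state;
    it must lie strictly between 0 and 1 so that every log is defined.
    """
    if not seq:
        return ""
    start = start or {"B": 0.5, "I": 0.5}
    trans = {s: {t: (stay if s == t else 1.0 - stay) for t in STATES}
             for s in STATES}

    # score[s] = best log-probability of any path that ends in state s.
    score = {s: math.log(start[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # one dict of back-pointers per position after the first

    for ch in seq[1:]:
        new_score, pointers = {}, {}
        for t in STATES:
            prev = max(STATES, key=lambda s: score[s] + math.log(trans[s][t]))
            pointers[t] = prev
            new_score[t] = (score[prev] + math.log(trans[prev][t])
                            + math.log(EMIT[t][ch]))
        back.append(pointers)
        score = new_score

    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


if __name__ == "__main__":
    toy = "AT" * 5 + "CGCG" + "AT" * 5
    print(viterbi(toy, stay=0.9))   # the four-base CG run is labelled I
    print(viterbi(toy, stay=0.99))  # the same run is absorbed into B
```

With these assumed values, raising `stay` from 0.9 to 0.99 is enough to make the short CG-rich run disappear from the decoded path. This toy case varies the parameters used for decoding, not the initial estimates used for training, but it gives a sense of how strongly the predicted islands can depend on the chosen parameters.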
Degree: Master of Science (M.Sc.)
Supervisor: Kusalik, Tony; Harkness, Troy
Committee: McQuillan, Ian; Wu, FangXiang
Copyright Date: April 2013
Hidden Markov Model