Comparison of DNA sequence assembly algorithms using mixed data sources
DNA sequence assembly is one of the fundamental areas of bioinformatics. It involves reconstructing a genome sequence from its DNA fragments ("reads") by aligning and merging the fragments. Sequencing technologies differ: some produce long DNA reads and others produce shorter reads, and there are sequence assembly programs designed specifically for each type of raw sequencing data. This work experiments with these different types of assembly software in order to compare their performance on the type of data for which they were designed, on data for which they were not designed, and on mixed data. Such results are useful for establishing good procedures and tools for sequence assembly in the current genomic environment, where read data of different lengths are available. This work also investigates the effect of the presence or absence of quality information on the results produced by sequence assemblers. Five strategies were used for assembling mixed data sets, and testing was done using a collection of real and artificial data sets for six bacterial organisms. The results show that DNA sequence assemblers vary widely in their ability to handle data from various sequencing technologies, especially data other than the kind they were designed for. For example, the long-read assemblers PHRAP and MIRA produced good results from assembling 454 data. The results also show the importance of having an effective methodology for assembling mixed data sets. Combining contiguous sequences (contigs) obtained from short-read assemblers with long DNA reads, and then assembling this combination using long-read assemblers, was found to be the most appropriate approach for assembling mixed short and long reads.
Assembling the mixed data sets gave better results than separately assembling the individual data from each sequencing technology. DNA sequence assemblers that do not depend on the availability of quality information were used to test the effect of the presence of quality values when assembling data. The results show that good results were produced in most of the assemblies, regardless of whether quality information was available. In more general terms, this work shows that the methodology used to assemble DNA sequences from mixed data sources strongly affects the results obtained, and that a good choice of methodology can help reduce the effort spent on a DNA sequence assembly project.
Degree: Master of Science (M.Sc.)
Committee: Sharpe, Andrew; Ziola, Barry; Mcquillan, Ian
Copyright Date: April 2010
Next generation sequencing technologies
DNA sequence assembly