Bioinformatics challenges of high-throughput SNP discovery and utilization in non-model organisms
A current trend in biological science is the increased use of computational tools for both the production and analysis of experimental data. This is especially true in the field of genomics, where advancements in DNA sequencing technology have dramatically decreased the time and cost associated with DNA sequencing resulting in increased pressure on the time required to prepare and analyze data generated during these experiments. As a result, the role of computational science in such biological research is increasing. This thesis seeks to address several major questions with respect to the development and application of single nucleotide polymorphism (SNP) resources in non-model organisms. Traditional SNP discovery using polymerase chain reaction (PCR) amplification and low-throughput DNA sequencing is a time consuming and laborious process, which is often limited by the time required to design intron-spanning PCR primers. While next-generation DNA sequencing (NGS) has largely supplanted low-throughput sequencing for SNP discovery applications, the PCR based SNP discovery method remains in use for cost effective, targeted SNP discovery. This thesis seeks to develop an automated method for intron-spanning PCR design which would remove a significant bottleneck in this process. This work develops algorithms for combining SNP data from multiple individuals, independent of the DNA sequencing platforms, for the purpose of developing SNP genotyping arrays. Additionally, tools for the filtering and selection of SNPs will be developed, providing start to finish support for the development of SNP genotyping arrays in complex polyploids using NGS. The result of this work includes two automated pipelines for the design of intron-spanning PCR primers, one which designs a single primer pair per target and another that designs multiple primer pairs per target. These automated pipelines are shown to reduce the time required to design primers from one hour per primer pair using the semi-automated method to 10 minutes per 100 primer pairs while maintaining a very high efficacy. Efficacy is tested by comparing the number of successful PCR amplifications of the semi- automated method with that of the automated pipelines. Using the Chi-squared test, the semi-automated and automated approaches are determined not to differ in efficacy. Three algorithms for combining SNP output from NGS data from multiple individuals are developed and evaluated for their time and space complexities. These algorithms were found to be computationally efficient, requiring time and space linear to the size of the input. These algorithms are then implemented in the Perl language and their time and memory performance profiled using experimental data. Profiling results are evaluated by applying linear models, which allow for predictions of resource requirements for various input sizes. Additional tools for the filtering of SNPs and selection of SNPs for a SNP array are developed and applied to the creation of two SNP arrays in the polyploid crop Brassica napus. These arrays, when compared to arrays in similar species, show higher numbers of polymorphic markers and better 3-cluster genotype separation, a viable method for determining the efficacy of design in complex genomes.
DegreeDoctor of Philosophy (Ph.D.)
SupervisorMcQuillan, Ian; Parkin, Isobel
CommitteeKusalik, Anthony; Horsch, Michael; Robinson, Stephen J.; Udall, Joshua
Copyright DateOctober 2014
SNP Genotyping Arrays
Non-model organism informatics