Abstract
Genotyping is the process of determining which genotypes (DNA sequences) an individual has at specific locations in the genome. The traditional approach to determining these genotypes is variant calling. However, variant calling is computationally intensive, as it requires the individual's genome to be aligned to a reference genome, which is an expensive process. Thus, alignment-free alternatives were developed that, while less accurate, are significantly faster than alignment-based methods because they skip the alignment step. These alignment-free methods rely on identifying important k-mers (strings of k bases) for a species and then looking for them in individual genomes. These important k-mers are referred to as variant signatures, as they signify the presence of a variant. Finding these variant signatures requires computationally intensive preprocessing of data on known genetic variation for the species. For the human genome, the 1000 Genomes Project provides this vast knowledge base on genetic variation, to the great benefit of alignment-free genotyping.
KAGE is a recent alignment-free genotyper that is competitive in both accuracy and speed. Compared to other existing solutions, such as Malva and PanGenie, KAGE genotypes both faster and more accurately. However, while KAGE performs impressively when genotyping, this is not the case for the preprocessing of k-mers and variant signatures. Analyzing the vast amount of variant data to find and index all relevant k-mers is time-consuming, which makes it impractical to construct new indexes or update existing ones. As such, efficient solutions to these preprocessing steps would significantly improve the practicality of alignment-free solutions such as KAGE. This thesis explores performance improvements for these preprocessing steps, resulting in KIVS, a high-performance Python module for k-mer and variant signature analysis.
KIVS achieves both high performance and usability by implementing its core in C++ and wrapping it in an easy-to-use Python interface. To further improve performance, the genome and its possible variations are represented by an optimized graph that uses 2-bit encoding. While made with KAGE integration in mind, KIVS is a standalone module that can be used by other genotyping implementations as well.
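To make the notions of k-mers and 2-bit encoding concrete, the following is a minimal Python sketch of how the k-mers of a short sequence could be packed into integers using two bits per base. It is an illustration under stated assumptions, not the KIVS implementation or its interface: the function names `encode_kmers` and `decode_kmer` and the base-to-bits mapping (A=0, C=1, G=2, T=3) are hypothetical choices made for this example.

```python
# Illustrative sketch only; these helpers are hypothetical and are not the KIVS API.
ENCODING = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed base-to-bits mapping


def encode_kmers(sequence, k):
    """Return every k-mer in `sequence` as a 2-bit-packed integer."""
    if k <= 0 or len(sequence) < k:
        return []
    mask = (1 << (2 * k)) - 1  # keeps only the lowest 2*k bits
    kmers = []
    value = 0
    for i, base in enumerate(sequence):
        # Shift in the new base and drop the base that left the window.
        value = ((value << 2) | ENCODING[base]) & mask
        if i >= k - 1:
            kmers.append(value)
    return kmers


def decode_kmer(value, k):
    """Turn a 2-bit-packed integer back into its k-base string."""
    bases = "ACGT"
    return "".join(bases[(value >> (2 * (k - 1 - j))) & 3] for j in range(k))
```

With this packing, any k-mer with k up to 32 fits in a single 64-bit word, so k-mers can be compared and hashed as plain integers rather than as strings.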