Abstract
Bacterial classification is an important field within microbiology, for both evolutionary and medical reasons. Before Carl Woese introduced the use of ribosomal RNA sequences for phylogenetic comparison, bacterial classification was based on various phenotypic methods. Today, attention is focused primarily on constructing supertrees (phylogenetic trees generated from multiple genes) and on whole-genome comparison. Still, problems resulting from non-orthologous gene replacement and interference by lateral gene transfer make this task far from trivial.
This study concerns the classification of bacteria using the distribution and frequency of selected 10-mer oligonucleotides in complete genome sequences. These frequencies are to be detected by an oligonucleotide microarray, and the resulting pattern compared to a reference in order to classify a particular organism. In this way it will be possible to compare many bacterial genomes with each other and organize them according to their patterns. Prior to this thesis, a set of programs had been developed for extracting informative oligonucleotides from genome sequence data, based on their entropy. This study aims to evaluate this method using an in silico approach. Different subsets of bacterial genome sequences were used to select sets of informative 10-mer oligonucleotides. In order to test the method, a program simulating a microarray was written to generate output suitable for further analysis. In this virtual microarray program, 10-mer oligonucleotide frequencies from the genomes to be classified were computed and combined with a set of informative oligonucleotides. The output was then used to construct dendrograms with the microarray analysis program J-Express. These dendrograms were compared by visual inspection to phylogenetic reference trees made by conventional methods. The reference phylogenetic analysis was conducted on the 16S rRNA genes and on sequences encoding the ATP synthase alpha chain, the prolyl-tRNA synthetase and the methionyl-tRNA synthetase. Our results indicate that the method provides excellent resolution for discriminating bacteria at the species and strain levels, but that its resolution is not particularly good at the genus level.
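The pipeline described above (counting oligonucleotides, selecting informative ones by entropy, and simulating a microarray readout) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the programs used in the thesis: the function names are hypothetical, and the entropy score used here (Shannon entropy of the count values an oligonucleotide takes across the genomes) is an assumed stand-in, since the abstract does not define the actual selection criterion.

```python
from collections import Counter
from math import log2


def kmer_counts(seq, k=10):
    """Count overlapping k-mers in seq, skipping windows with non-ACGT symbols."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def shannon_entropy(values):
    """Shannon entropy in bits of the distribution obtained by normalising values."""
    total = sum(values)
    if total == 0:
        return 0.0
    return -sum((v / total) * log2(v / total) for v in values if v > 0)


def select_informative(genomes, k=10, n_probes=100):
    """Rank k-mers by an ASSUMED entropy score and return the top n_probes.

    The score is the entropy of the count values a k-mer takes across genomes:
    a k-mer with the same count in every genome scores 0 (uninformative).
    """
    profiles = {name: kmer_counts(seq, k) for name, seq in genomes.items()}
    all_kmers = set().union(*profiles.values())

    def score(kmer):
        observed = Counter(profile[kmer] for profile in profiles.values())
        return shannon_entropy(list(observed.values()))

    # Ties are broken alphabetically so the selection is deterministic.
    return sorted(all_kmers, key=lambda km: (-score(km), km))[:n_probes]


def virtual_microarray(genomes, probes, k=10):
    """Return {genome name: [count of each probe]}, a stand-in for spot intensities."""
    matrix = {}
    for name, seq in genomes.items():
        counts = kmer_counts(seq, k)
        matrix[name] = [counts[p] for p in probes]
    return matrix


if __name__ == "__main__":
    # Toy sequences with k=3 for readability; the study itself uses 10-mers.
    toy = {"a": "ACGTACGT", "b": "ACGTTTTT", "c": "GGGGGGGG"}
    probes = select_informative(toy, k=3, n_probes=2)
    print(probes)
    print(virtual_microarray(toy, probes, k=3))
```

The resulting genome-by-probe matrix is the kind of table that a clustering tool could turn into a dendrogram; the export format expected by J-Express is not covered here.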