Retrieval of Genomic Data using PyTables

Skifjeld, Henrik Glasø

Master thesis

Year

2014

Abstract

As genomic data becomes more widely available, the need for analysis tools such as the Genomic HyperBrowser increases. At the core of this tool is a data model known as GTrackCore, a module for storage and retrieval of genomic data. GTrackCore currently stores its data in NumPy memmaps. A drawback of this model is that it uses multiple files to represent a single genomic track. The purpose of this work is to investigate whether the PyTables library, which builds on the HDF5 data format, is suited to replace the current data model of GTrackCore. To this end, large parts of GTrackCore are re-implemented to use the PyTables library and a single HDF5 file for each genomic track, as opposed to NumPy memmaps, and the performance of the two versions is compared. The thesis focuses primarily on the retrieval part of GTrackCore and on how to exploit the advantages of the PyTables library in such a setting. The findings show that storing the data in a single HDF5 file makes data retrieval slower in some cases and faster in others. Furthermore, GTrackCore is, as of now, only structured to make use of a few of the advantages of the PyTables library.
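
The contrast between the two storage layouts can be illustrated with a minimal sketch. It is not taken from GTrackCore: the column names (`starts`, `ends`), the `.dat` file suffix, and all helper functions are hypothetical and serve only to show one file per column with NumPy memmaps versus one HDF5 file per track with PyTables.

```python
import os
import tempfile

import numpy as np
import tables


def write_track_memmaps(directory, columns):
    """Memmap-style layout: one raw file on disk per column of the track."""
    paths = {}
    for name, values in columns.items():
        path = os.path.join(directory, name + ".dat")  # hypothetical naming
        mm = np.memmap(path, dtype=values.dtype, mode="w+", shape=values.shape)
        mm[:] = values
        mm.flush()
        del mm  # release the mapping so the file is closed
        paths[name] = path
    return paths


def read_slice_memmap(path, dtype, length, start, stop):
    """Read elements [start, stop) of one column; dtype and length must be known."""
    mm = np.memmap(path, dtype=dtype, mode="r", shape=(length,))
    result = np.array(mm[start:stop])  # copy out before dropping the mapping
    del mm
    return result


def write_track_hdf5(path, columns):
    """HDF5 layout: every column is an array node inside a single file."""
    with tables.open_file(path, mode="w") as h5:
        for name, values in columns.items():
            h5.create_array(h5.root, name, values)


def read_slice_hdf5(path, name, start, stop):
    """Read elements [start, stop) of one column from the single HDF5 file."""
    with tables.open_file(path, mode="r") as h5:
        return h5.get_node(h5.root, name)[start:stop]
```

In the memmap layout the reader has to know each column's dtype and length in advance, whereas the HDF5 file stores that metadata alongside the data. Which layout retrieves faster depends on the access pattern, in line with the mixed results reported above.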