Storage of Genomic Data using PyTables

Rongved, Brynjar Grønhaug

Master thesis

Year

2014

Abstract

Recent years have seen an increasing demand for systems that can perform fast genome-wide analyses. An important component of such a system is its data model. The data model of the Genomic HyperBrowser analysis system has recently been extracted into a package called GTrackCore, which is planned to be integrated into a standalone command-line analysis toolset. The data produced by GTrackCore is currently stored in an ad hoc format. Because this storage method has some problems, replacing it would be beneficial. The proposed solution is to use PyTables, a package for managing hierarchical data sets that is built on the HDF5 library. This thesis presents an implementation of a PyTables-based preprocessor in the GTrackCore package. The implementation shows that PyTables can be incorporated into GTrackCore without completely restructuring the package, although further adaptation would be beneficial. Measurements of performance and storage efficiency show that the PyTables-based preprocessor performs as well as or better than the old preprocessor in most cases. Furthermore, the PyTables implementation solves the problems of the current ad hoc format.