Retroactively Parallelizing a Large Python System

Lillesæter, Jonathan Lunde

Master thesis

Year

2011

Abstract

Computers today become more powerful by adding processors rather than by raising clock speeds, as they did in the past. Exploiting this parallelism requires software design strategies different from those used for sequential programs.

The immense increase in the generation of genome-scale data poses an unmet analytical challenge, because established methodology with the required flexibility and power is lacking. The Hyperbrowser, a framework for comparative analysis of sequence-level genomic data, aims to solve this problem. It is currently a single-threaded system, so parallelization is desirable both to scale better and to reduce analysis time.

A flexible framework for distributing compute-intensive, independent tasks over many computers is presented. The framework allows for many concurrent users, and exploits both local and remote computing resources. This framework is used to achieve significant speedups for analyses in the Hyperbrowser, both due to parallelizing the workload and due to exploiting the Titan compute cluster.

Performance tests show that the framework is fairly efficient for both large and small jobs and scales well. A number of possible future improvements are suggested.
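
The general idea of distributing independent tasks, as described in the abstract, can be illustrated with a minimal Python sketch. This is not the thesis's framework: the names `analyse_chunk`, `split`, and `run_tasks` are hypothetical, the workload is a toy stand-in, and the sketch only uses the standard `concurrent.futures` interface. Dispatch to remote machines or a cluster such as Titan is not shown.

```python
from concurrent.futures import as_completed


def analyse_chunk(chunk):
    """Hypothetical stand-in for one compute-intensive, independent task."""
    return sum(x * x for x in chunk)


def split(data, n_chunks):
    """Cut data into at most n_chunks contiguous pieces of near-equal size."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]


def run_tasks(task, chunks, executor):
    """Submit each chunk as its own task; return results in input order.

    `executor` is any concurrent.futures.Executor, so the caller decides
    where the work runs (threads, local processes, ...).
    """
    futures = {executor.submit(task, chunk): i for i, chunk in enumerate(chunks)}
    results = [None] * len(chunks)
    for future in as_completed(futures):
        # Tasks may finish in any order; the stored index restores input order.
        results[futures[future]] = future.result()
    return results
```

Because the tasks share no state, the caller can pass a `ThreadPoolExecutor` for a quick check or a `ProcessPoolExecutor` to use several local cores, without changing `run_tasks`.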