Lean MapReduce: A B-tree Inspired MapReduce Framework

Akubue, Arinze George

Master thesis

Year

2016

Abstract

A deluge of unstructured data flows from numerous sources, including the devices that make up the Internet of Things. This data flow is characterized by sheer volume, variety and velocity, and is expected to double every two years. Organizations perceive hidden value in unstructured data, but their efforts to extract it are usually constrained by budget and by access to the right kind of technology. MapReduce has been widely adopted in the big data community for large-scale processing of workloads. Current implementations of MapReduce run on persistent compute clusters that feature an underlying distributed file system. These clusters typically process numerous jobs during their lifetime, and during periods of low or no activity their resources go unused. This thesis investigates how resources can be optimally and efficiently utilized through MapReduce clusters that are provisioned ad hoc: each cluster is grown into place for a single job, sized according to the workload dimensions, while still meeting result deadlines. To achieve this, two designs are developed, based on two distinct adaptations of the B-tree data structure: a flat tree structure, which grows horizontally, and a chain structure with hanging leaves, which grows vertically. The project results show that resources are optimally and efficiently utilized, and that each design implementation has its own advantages and disadvantages for different workload dimensions.