Data mining techniques, candidate measures and evaluation methods for building practically useful fault-proneness prediction models

dc.date.accessioned	2013-03-12T07:58:32Z
dc.date.available	2013-03-12T07:58:32Z
dc.date.issued	2008	en_US
dc.date.submitted	2008-08-01	en_US
dc.identifier.citation	Johannessen, Eivind Berg. Data mining techniques, candidate measures and evaluation methods for building practically useful fault-proneness prediction models. Master's thesis, University of Oslo, 2008	en_US
dc.identifier.uri	http://hdl.handle.net/10852/9958
dc.description.abstract	This thesis describes a study performed in an industrial setting that attempts to build predictive models to identify parts of a Java system with a high fault probability. The system under consideration is constantly evolving, as several releases a year are shipped to customers. Developers usually have limited resources for testing, so our aim was to build optimal and practically useful fault-proneness prediction models to help focus verification and validation activities on the most fault-prone components of this system. The thesis begins with a literature review that discusses in detail the state of the art in research on fault-proneness prediction models. The review revealed that a vast number of modeling techniques have been used to build such prediction models. However, there has been little systematic effort to assess the impact of selecting a particular modeling technique. Furthermore, there has been no systematic study of the impact of including certain alternative types of measures as predictors. Finally, many studies apply evaluation methods and model assessment criteria that, depending on the intended use of the prediction model, might be insufficient or even inappropriate. 
Consequently, the main research focus of this thesis is to systematically assess three aspects of how to build and evaluate fault-proneness models in the context of a large Java legacy system development project: (1) compare many data mining and machine learning techniques for building fault-proneness models, (2) assess the impact of using different metric sets, such as source code structural measures and historic change/fault (process) measures, and (3) compare several alternative ways of assessing the performance of the models, in terms of (i) confusion matrix criteria such as accuracy and precision/recall, (ii) ranking ability, using the area under the receiver operating characteristic curve (ROC area), and (iii) our proposed cost-effectiveness measure (CE). The results of the study indicate that the choice of modeling technique has limited impact on the resulting classification accuracy or cost-effectiveness. There are, however, large differences between the individual metric sets in terms of cost-effectiveness. Although the process measures are among the most expensive to collect, including them as candidate measures significantly improves the prediction models compared with models that include only structural measures and/or their deltas, both in terms of ROC area and in terms of cost-effectiveness. Furthermore, we observe that which model is considered best depends heavily on the criteria used to evaluate and compare the models. The regular confusion matrix criteria, although popular, are not clearly related to the problem at hand, namely the cost-effectiveness of using fault-proneness prediction models to focus verification efforts so as to deliver software with fewer faults at lower cost. Consequently, to assess the usefulness of prediction models, we consider the regular confusion matrix criteria to be of less importance, and recommend instead using the ROC area and our proposed measure of cost-effectiveness. 
Another contribution of this thesis is the provision of a statistically based method for the systematic comparison of fault-proneness prediction models. The method can be reused in future studies to guide the selection of optimal prediction models.	eng
dc.language.iso	eng	en_US
dc.title	Data mining techniques, candidate measures and evaluation methods for building practically useful fault-proneness prediction models	en_US
dc.type	Master thesis	en_US
dc.date.updated	2009-04-06	en_US
dc.creator.author	Johannessen, Eivind Berg	en_US
dc.subject.nsi	VDP::420	en_US
dc.identifier.bibliographiccitation	info:ofi/fmt:kev:mtx:ctx&ctx_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation&rft.au=Johannessen, Eivind Berg&rft.title=Data mining techniques, candidate measures and evaluation methods for building practically useful fault-proneness prediction models&rft.inst=University of Oslo&rft.date=2008&rft.degree=Masteroppgave	en_US
dc.identifier.urn	URN:NBN:no-19845	en_US
dc.type.document	Masteroppgave	en_US
dc.identifier.duo	82051	en_US
dc.contributor.supervisor	Erik Arisholm, Lionel Claude Briand	en_US
dc.identifier.bibsys	091963818	en_US
dc.identifier.fulltext	Fulltext https://www.duo.uio.no/bitstream/handle/10852/9958/1/johannessen.pdf

Files in this item

Name:: johannessen.pdf
Size:: 792.2Kb
Format:: application/

Appears in the following Collection

Institutt for informatikk [4956]
