
dc.date.accessioned: 2013-03-12T08:22:26Z
dc.date.available: 2013-03-12T08:22:26Z
dc.date.issued: 2009 [en_US]
dc.date.submitted: 2009-03-12 [en_US]
dc.identifier.citation: Nilsen, Gro. A comparative study of existing and novel methods for estimating the number of clusters in a data set. Masteroppgave, University of Oslo, 2009 [en_US]
dc.identifier.uri: http://hdl.handle.net/10852/10820
dc.description.abstract: Cluster analysis is a field of study where the aim is to discover distinct groups or clusters in a data set. Objects in the same group should be similar to each other in some respect and dissimilar from objects in other groups. In 2- or 3-dimensional data sets, this task is simplified by plotting the data. In high-dimensional data, on the other hand, the challenge is much greater. One particular discipline in which cluster analysis is commonly used is genomics. In cancer research, for example, the expression of thousands of genes is measured simultaneously, and one may seek to find groups of co-regulated genes, or groups of patients that have similar genetic expression profiles and clinical outcomes. An intrinsic part of cluster analysis is to determine how many clusters are present in the data set. In this thesis, several methods for estimating the number of clusters are presented. These include Gap, Recursive Gap, Silhouette, Prediction strength and In-group proportion. In addition, two novel approaches are introduced, namely Reference Gap and ERA. The methods are first applied to two breast tumour data sets for which earlier studies have indicated the presence of five, possibly six, distinct clusters. ERA and Recursive Gap give the results most consistent with the previous findings. The methods are then applied to data sets simulated under various scenarios. The advantage of using simulated data sets is that the true number of clusters is known beforehand, so the methods' effectiveness can be compared directly. The conclusion of these trials is that ERA stands out as the most versatile and successful method, with Recursive Gap not too far behind. The other methods show more variable performance and are overall less successful than ERA.
The results found for both real and simulated data sets in this thesis hence indicate that the novel method ERA provides a valuable approach to the challenging task of estimating the number of clusters in a data set. [eng]
dc.language.iso: eng [en_US]
dc.subject: modelling, data analysis, cluster analysis, Gap method, microarray data, Silhouette [en_US]
dc.title: A comparative study of existing and novel methods for estimating the number of clusters in a data set [en_US]
dc.type: Master thesis [en_US]
dc.date.updated: 2009-10-13 [en_US]
dc.creator.author: Nilsen, Gro [en_US]
dc.subject.nsi: VDP::412 [en_US]
dc.identifier.bibliographiccitation: info:ofi/fmt:kev:mtx:ctx&ctx_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation&rft.au=Nilsen, Gro&rft.title=A comparative study of existing and novel methods for estimating the number of clusters in a data set&rft.inst=University of Oslo&rft.date=2009&rft.degree=Masteroppgave [en_US]
dc.identifier.urn: URN:NBN:no-23169 [en_US]
dc.type.document: Masteroppgave [en_US]
dc.identifier.duo: 89891 [en_US]
dc.contributor.supervisor: Ole Christian Lingjærde and Ørnulf Borgan [en_US]
dc.identifier.bibsys: 093498535 [en_US]
dc.identifier.fulltext: Fulltext https://www.duo.uio.no/bitstream/handle/10852/10820/2/Masteroppgave.pdf


