dc.contributor.author: Langerød, Kristoffer
dc.date.accessioned: 2017-02-22T22:28:51Z
dc.date.available: 2017-02-22T22:28:51Z
dc.date.issued: 2016
dc.identifier.citation: Langerød, Kristoffer. Category Clustering: Exploring feature creation and similarity in clustering of cancer patients. Master thesis, University of Oslo, 2016
dc.identifier.uri: http://hdl.handle.net/10852/54081
dc.description.abstract: Human DNA is a string of 3.1 billion organic molecules, represented by four unique letters, one for each type of molecule in the chain. This string is physically divided into 23 separate pairs, folded to save space and protect against damage. When a cell divides through mitosis, this folded structure changes to make division possible, but this also exposes it to alterations or damage. Changes are in most cases dealt with by defence mechanisms, but sometimes they are more severe and can lead to cancer, an uncontrollable growth of cells. Extensive research has gone into finding better treatments, detecting cancer earlier and identifying the cell of origin. The latter is within the scope of this thesis, in which I introduce different approaches for finding subgroups of cancer in affected patients. This was done using machine learning, a subset of artificial intelligence, where the goal is to find patterns or build models to identify objects. Since the goal is to find something that does not yet have a definitive answer, I used clustering, the group of machine learning algorithms that try to identify patterns in unlabelled data. In essence, a clustering algorithm takes in a representation of each object and returns a grouping. For this to be possible, objects usually have to be represented by numeric values that make it possible to distinguish them by their shared attributes. For several reasons, this representation is stored as a vector, which can be used with distance measures such as Euclidean and Manhattan distance to calculate the similarity between objects. Traditional distance measures have three rules attached to them. One of the rules is that identical vectors should have zero distance. I argue that this does not always make sense.
In the case of two people living 0 km from Oslo and two other people living 1,000 km from Oslo, any distance measure obeying this rule would mark both pairs of people as identical to each other. But not having a property does not make for as strong a connection as actually having it. I have therefore implemented distance measures that accommodate the idea that sharing a property is a stronger indication of similarity. This also stems from how the numeric properties of objects are calculated: by using a reference set of information about DNase I hypersensitive sites, which relate to active sites. The objects are thus not compared directly, but by how they relate to certain parts of the reference set. Another rule for traditional distance measures is that the distance from object A to C is always smaller than or equal to the distance from A to B plus the distance from B to C. I argue that this does not always make sense either, as objects can be similar in different ways. A and B can be similar in one way, B and C in a different way, and A and C can be completely dissimilar. To make this effect possible, the reference set is divided into subsets, and similarity between objects can occur in one or more of these subsets. When a similarity is established within a part of the reference set, locking mechanics stop the transitive effect from occurring. The methods developed were used to cluster 1889 donors, each with a number of mutations ranging from 1,000 to 10,000, with 163 reference files of DNase I hypersensitive site data used to help create the representations. Results show that clustering with the implemented methods yields a significantly higher probability than chance of grouping donors by their cancer type. In addition, the probability of donors being grouped so that their cancer type matches the tissue type of the reference subsets, blindly chosen by the clustering algorithm, was shown to be significantly higher than chance for certain distance measures and arguments. [eng]
dc.language.iso: eng
dc.subject: category clustering
dc.subject: feature creation
dc.subject: clustering
dc.subject: cancer
dc.subject: DNase
dc.subject: machine learning
dc.subject: DNase 1 hypersensitive sites
dc.subject: cell of origin
dc.title: Category Clustering: Exploring feature creation and similarity in clustering of cancer patients [eng]
dc.type: Master thesis
dc.date.updated: 2017-02-22T22:28:51Z
dc.creator.author: Langerød, Kristoffer
dc.identifier.urn: URN:NBN:no-57207
dc.type.document: Masteroppgave
dc.identifier.fulltext: Fulltext https://www.duo.uio.no/bitstream/handle/10852/54081/1/master.pdf
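
The abstract's point about the zero-distance rule can be illustrated with a small sketch. This is not the thesis's implementation: the function `shared_presence_similarity` and the feature vectors are hypothetical, chosen only to show that Euclidean and Manhattan distance cannot tell a pair that shares no properties from a pair that shares several, while a measure counting shared properties can.

```python
import math


def euclidean(a, b):
    """Standard Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Standard Manhattan distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def shared_presence_similarity(a, b):
    """Hypothetical measure (an assumption, not from the thesis):
    a feature contributes only as much as BOTH objects have of it,
    so features that are absent in both add nothing."""
    return sum(min(x, y) for x, y in zip(a, b))


# Hypothetical feature vectors: each entry counts how strongly an object
# relates to one part of a reference set.
pair_without = ([0, 0, 0], [0, 0, 0])  # identical, but share no property
pair_with = ([3, 0, 2], [3, 0, 2])     # identical, and share two properties

for name, (a, b) in [("without", pair_without), ("with", pair_with)]:
    print(name, euclidean(a, b), manhattan(a, b),
          shared_presence_similarity(a, b))
```

Both pairs get distance zero under the traditional measures, so they look equally "identical"; only the shared-presence score separates the pair that actually has properties in common.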

