
dc.contributor.author    Vinvand, Kristina Haarr
dc.date.accessioned    2015-03-03T23:00:16Z
dc.date.available    2015-03-03T23:00:16Z
dc.date.issued    2014
dc.identifier.citation    Vinvand, Kristina Haarr. Preselection bias in high dimensional regression. Master thesis, University of Oslo, 2014
dc.identifier.uri    http://hdl.handle.net/10852/42689
dc.description.abstract    In this thesis, we have studied the preselection bias that can occur when the number of covariates in a high dimensional regression problem is reduced prior to a penalized regression analysis such as the lasso. Datasets in genomics often include tens or hundreds of thousands, or even millions, of covariates and a few hundred patients or fewer. To reduce computation or to make the problem tractable, practitioners often rank the covariates according to their univariate importance for the response and preselect a few thousand covariates from the top of the list for multivariate analysis via penalized regression. If the preselection of covariates is not done in a controlled way, this leads to preselection bias. We have studied the effect of preselection on estimation and prediction and the bias it might induce. With a small preselected dataset, the lasso in combination with cross validation tends to select many covariates, which together explain the data at hand very well. However, on a new independent dataset, these covariates predict rather poorly. This is preselection bias. We have visualized the preselection bias through boxplots in several different genomic datasets and in simulated data. We have also demonstrated that preselection bias is most evident in datasets with a lot of noise and heavy dependencies between covariates, since univariate ranking cannot capture the structure of such complex relations. To be able to trust predictions made from penalized regression on preselected covariates, the preselection should be coupled with an algorithm that controls how many covariates should be included in order to avoid the bias. We have studied methods such as "SAFE", "strong" and "freezing" that all make preselection safer, in the sense that the lasso analysis on the preselected set of covariates reaches the same result as if all covariates had been included.    eng
dc.language.iso    eng
dc.subject    High dimensional regression
dc.subject    genomics
dc.subject    penalized regression
dc.subject    regularization
dc.subject    Cox regression
dc.subject    linear regression
dc.subject    lasso
dc.subject    L1 penalty
dc.subject    variable selection
dc.subject    cross validation
dc.subject    dimension reduction
dc.subject    preselection
dc.subject    preselection bias
dc.subject    safe
dc.subject    strong
dc.subject    freezing
dc.title    Preselection bias in high dimensional regression    eng
dc.type    Master thesis
dc.date.updated    2015-03-03T23:04:14Z
dc.creator.author    Vinvand, Kristina Haarr
dc.identifier.urn    URN:NBN:no-47065
dc.type.document    Masteroppgave
dc.identifier.fulltext    Fulltext https://www.duo.uio.no/bitstream/handle/10852/42689/11/Vinvand_thesis.pdf
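
The abstract above describes ranking covariates by univariate importance, preselecting the top-ranked ones, and then fitting the lasso with cross validation. The following is a minimal Python/scikit-learn sketch of that pipeline, not the thesis code: the simulated data, sample sizes, noise level, and the use of absolute correlation as the univariate ranking are assumptions chosen only to illustrate how the fit on the preselection data can look optimistic compared with an independent dataset.

    # Illustrative sketch (assumed setup, not the thesis code): preselect top-ranked
    # covariates by univariate correlation, fit lasso with cross validation, and
    # compare fit on the preselection data with fit on independent data.
    import numpy as np
    from sklearn.linear_model import LassoCV
    from sklearn.metrics import r2_score

    rng = np.random.default_rng(0)
    n, p, k = 100, 5000, 200          # few samples, many covariates, preselect top k

    # Simulated noisy data; only a handful of covariates are truly relevant.
    X = rng.standard_normal((n, p))
    X_new = rng.standard_normal((n, p))
    beta = np.zeros(p)
    beta[:10] = 1.0
    y = X @ beta + 5.0 * rng.standard_normal(n)
    y_new = X_new @ beta + 5.0 * rng.standard_normal(n)

    # Univariate ranking: absolute correlation of each covariate with the response.
    scores = np.abs(X.T @ (y - y.mean())) / n
    top = np.argsort(scores)[::-1][:k]        # indices of the k top-ranked covariates

    # Lasso with cross validation on the preselected covariates only.
    model = LassoCV(cv=5).fit(X[:, top], y)

    print("R^2 on the data used for preselection:", r2_score(y, model.predict(X[:, top])))
    print("R^2 on an independent dataset:        ", r2_score(y_new, model.predict(X_new[:, top])))

Under this kind of setup, the fit on the data used for preselection typically looks much better than the fit on the independent dataset, which is the gap the thesis refers to as preselection bias.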

