Preselection bias in high dimensional regression

Vinvand, Kristina Haarr

Master thesis

Open

Vinvand_thesis.pdf (2.988Mb)

Year

2014

Abstract

In this thesis, we have studied the preselection bias that can occur when the number of covariates in a high-dimensional regression problem is reduced before a high-dimensional regression analysis such as the lasso. Datasets in genomics often include tens or hundreds of thousands, or even millions, of covariates but only a few hundred patients or fewer. To reduce computation or to make the problem tractable, practitioners often rank the covariates according to their univariate importance for the response and preselect a few thousand covariates from the top of the list for multivariate analysis via penalized regression. If this preselection is not done in a controlled way, it leads to preselection bias. We have studied the effect of preselection on estimation and prediction, and the bias it might induce. With a small preselected dataset, the lasso in combination with cross-validation tends to select many covariates, which together explain the data at hand very well. On a new, independent dataset, however, these covariates predict rather poorly. This is preselection bias. We have visualized the preselection bias through boxplots for several genomic datasets and for simulated data. We have also demonstrated that preselection bias is most evident in datasets with a lot of noise and heavy dependencies between covariates, because the univariate ranking cannot capture the structure of such complex relations. To be able to trust predictions made from penalized regression on preselected covariates, the preselection should be coupled with an algorithm that controls how many covariates should be included, in order to avoid the bias.
We have studied methods such as "SAFE", "strong" and "freezing", all of which make preselection safer, where safe means that the lasso analysis of the preselected set of covariates should give the same result as if all covariates had been included.
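
The selection effect behind the univariate ranking step can be illustrated with a minimal sketch. It is not taken from the thesis: the function names (`pearson`, `simulate`), the sample sizes and the pure-noise setup are all assumptions made for illustration, and the lasso and cross-validation steps are left out. The sketch ranks covariates that are independent of the response by absolute correlation, keeps the top few, and compares their apparent association on the data used for ranking with their association on fresh data.

```python
import random
import statistics


def pearson(a, b):
    """Sample Pearson correlation between two equal-length lists."""
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def simulate(n=40, p=2000, k=10, seed=0):
    """Rank p pure-noise covariates by |correlation| with the response on a
    training set of n samples, keep the top k, and return the mean
    |correlation| of those k covariates on the training data and on an
    independent test set. All sizes are illustrative assumptions."""
    rng = random.Random(seed)

    # The response is independent of every covariate: there is no signal.
    y_train = [rng.gauss(0, 1) for _ in range(n)]
    y_test = [rng.gauss(0, 1) for _ in range(n)]

    # Covariates stored column-wise: one list of n values per covariate.
    x_train = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]
    x_test = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]

    # Univariate ranking on the training data only.
    scores = [abs(pearson(col, y_train)) for col in x_train]
    top = sorted(range(p), key=lambda j: scores[j], reverse=True)[:k]

    train_abs = statistics.fmean(scores[j] for j in top)
    test_abs = statistics.fmean(abs(pearson(x_test[j], y_test)) for j in top)
    return train_abs, test_abs


if __name__ == "__main__":
    train_abs, test_abs = simulate()
    print(f"mean |corr| of preselected covariates, ranking data: {train_abs:.2f}")
    print(f"mean |corr| of preselected covariates, new data:     {test_abs:.2f}")
```

Under these assumptions the preselected covariates look clearly associated with the response on the data used for ranking, although no true association exists, and that association largely vanishes on new data. Any model fitted afterwards on the same data inherits this optimism unless the preselection is controlled or repeated inside the validation procedure.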