Monotone regression in high (and lower) dimensions

Engebretsen, Solveig

Master thesis

Open

oppginnlev-69b7 ... ebretsen_masteroppgave.pdf (106.6Mb)

Year

2016

Abstract

In this thesis, we first present an overview of monotone regression, both in the classical setting and in the high-dimensional setting. High-dimensional data means that the number of covariates, p, exceeds the number of observations, n. It is often reasonable to assume a monotone relationship between a predictor variable and the response, especially in medicine and biology. The monotone regression methods considered for the high-dimensional setting are liso regression and monotone splines lasso regression (to our knowledge, the only two such methods). Both are special forms of penalised regression. We study the performance of these two high-dimensional methods in the classical setting and compare it with that of existing methods for monotone regression developed for the classical setting, known as MonBoost, scam and scar. The two high-dimensional methods also work well in the classical setting, but they do not outperform the existing methods. They can still be useful in the classical setting because, in contrast to the existing methods, they can be used when the monotonicity directions of the effects are not known, and they also perform automatic variable selection. Furthermore, we investigate the robustness of the monotone splines lasso method to the number of interior knots used to fit the monotone splines and find that it is very robust. In addition, we develop two new methods for fitting a partially linear model in which the non-linear covariates are assumed to have a monotone effect on the response. These two methods can be used in the setting where p > n as well as in the classical setting. To our knowledge, no such methods have been developed before. The first method, PLAMM-1, is a straightforward extension of the monotone splines lasso method to the partially linear setting.
The second method, PLAMM-2, uses two penalties, one on the linear parameters and one on the non-linear parameters. In this case, estimation has to be performed iteratively, and we prove convergence of the iterative scheme. The estimation, selection and prediction performance of the methods is investigated by simulation experiments in different settings. The simulation experiments show that the methods work well both in the classical setting and in the high-dimensional setting, provided the number of observations is not too small. We also apply the partially linear monotone model to a medical dataset where clinical covariates enter the linear part and genomic covariates are assumed to have a monotone effect on the outcome.
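As background to the classical setting mentioned in the abstract, the simplest form of monotone regression (one covariate, known increasing direction, no penalty) can be sketched with the pool-adjacent-violators algorithm. This is only an illustration of the general idea of a monotone least-squares fit; it is not liso, the monotone splines lasso, PLAMM-1 or PLAMM-2, and the function name `pava_increasing` is a hypothetical one chosen here. The input is assumed to be the responses already sorted by the covariate.

```python
from typing import List, Sequence


def pava_increasing(y: Sequence[float]) -> List[float]:
    """Equal-weight least-squares fit of a non-decreasing sequence to y.

    Assumes y is ordered by the (single) covariate. Illustrative only;
    not one of the methods studied in the thesis.
    """
    # Each block stores [sum of responses, number of responses].
    blocks: List[List[float]] = []
    for value in y:
        blocks.append([float(value), 1.0])
        # Pool adjacent blocks while their means violate monotonicity.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    # Expand each block back to one fitted value per observation.
    fitted: List[float] = []
    for total, count in blocks:
        fitted.extend([total / count] * int(count))
    return fitted


if __name__ == "__main__":
    print(pava_increasing([1, 3, 2, 4]))
```

With the responses 1, 3, 2, 4 the two middle values violate monotonicity and are pooled to their mean, so the fit is 1, 2.5, 2.5, 4. The high-dimensional methods in the thesis differ in that they handle many covariates at once, add a penalty, and do not require the monotonicity direction to be specified in advance.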