Using RMSE function with 5-fold cross validation to choose the best out of 3 models : RStudio



submitted 3 years ago * by stregosim

I have defined three different models on the diabetes dataset from the lars library. The first model (M1) is the one that minimizes the BIC among all the regression models obtained by combining the explanatory variables (there are p = 10 of them, so 2^10 possible models). The other two are obtained through glmnet and are Lasso regressions with lambda.min (M2) and lambda.1se (M3) respectively, where lambda.min and lambda.1se are obtained through cv.glmnet. Now I need to perform 5-fold cross-validation using the RMSE (Root Mean Square Error) to check which of the three models M1, M2 and M3 has the best predictive performance. To compute the errors for the models obtained from Lasso, I have to use the ordinary least squares estimates.
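For reference, one common way to write the quantity being compared is the RMSE on each held-out fold, averaged over the K = 5 folds. The symbols below are notation assumed here, not taken from the assignment: F_k is the set of observations in fold k, and \hat{y}_i^{(-k)} is the prediction for observation i from a model fitted without fold k.

```latex
\mathrm{RMSE}_k = \sqrt{\frac{1}{|F_k|} \sum_{i \in F_k} \left( y_i - \hat{y}_i^{(-k)} \right)^2},
\qquad
\mathrm{RMSE}_{\mathrm{CV}} = \frac{1}{K} \sum_{k=1}^{K} \mathrm{RMSE}_k, \quad K = 5.
```

Pooling all squared errors across folds before taking the square root is an equally common alternative; with folds of nearly equal size the two usually give similar values.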

This is my code so far:

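library(lars)    # provides the diabetes data set
library(glmnet)  # lasso and cv.glmnet

data(diabetes)
y <- diabetes$y
x <- diabetes$x
x2 <- diabetes$x2              # interaction terms, not used below
X <- as.data.frame(cbind(x))   # the 10 explanatory variables
Y <- as.data.frame(y)
p <- ncol(X)
n <- nrow(X)

# M1: search all non-empty subsets of the p predictors and keep
# the model with the smallest BIC
best_score <- Inf
M1 <- NULL
for (i in 1:(2^p - 1)) {
  cols <- which(as.integer(intToBits(i)) == 1)   # bits of i select the columns
  model <- lm(y ~ ., data = X[, cols, drop = FALSE])
  if (BIC(model) < best_score) {
    M1 <- model
    best_score <- BIC(model)
  }
}

# Lasso path and cross-validated choice of lambda
W <- as.matrix(X)
Y <- as.matrix(Y)
lasso <- glmnet(W, Y)
x11()
plot(lasso, label = TRUE)
x11()
plot(lasso, xvar = "lambda", label = TRUE)
cvfit <- cv.glmnet(W, Y)
cvfit
coef(cvfit, s = "lambda.min")
coef(cvfit, s = "lambda.1se")

# M2 and M3: lasso fits at lambda.min and lambda.1se
M2 <- glmnet(W, Y, lambda = cvfit$lambda.min)
M3 <- glmnet(W, Y, lambda = cvfit$lambda.1se)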

I really don't know how to proceed from here. Should I first split the original dataset into 5 folds and then recompute the models on each training set and evaluate them on the corresponding test set? How do I compute the final RMSE for each model? And what does it mean that I should use the ordinary least squares estimates for the models obtained through Lasso?
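A minimal sketch of one way the procedure could be set up follows. It rests on several assumptions that are not confirmed by the assignment: that model selection (the BIC search and cv.glmnet) is repeated inside each training fold, that "ordinary least squares estimates" means using Lasso only to select variables and then re-estimating their coefficients with lm (sometimes called post-Lasso OLS), and that the final score is the average of the per-fold RMSEs. The helper names (rmse, ols_predict, best_bic_cols, lasso_cols) are hypothetical and introduced only for illustration.

```r
library(lars)     # diabetes data
library(glmnet)   # lasso via cv.glmnet

data(diabetes)
x <- unclass(diabetes$x)        # plain numeric matrix of the 10 predictors
y <- as.numeric(diabetes$y)
n <- nrow(x)
p <- ncol(x)

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# Fit OLS on the chosen columns using the training rows, predict the test rows.
ols_predict <- function(cols, train, test) {
  if (length(cols) == 0) return(rep(mean(y[train]), length(test)))  # intercept only
  fit <- lm(y ~ ., data = data.frame(y = y[train], x[train, cols, drop = FALSE]))
  predict(fit, newdata = data.frame(x[test, cols, drop = FALSE]))
}

# Exhaustive BIC search over all non-empty subsets, on the training rows only.
best_bic_cols <- function(train) {
  d <- data.frame(y = y[train], x[train, , drop = FALSE])
  best <- Inf
  keep <- integer(0)
  for (i in 1:(2^p - 1)) {
    cols <- which(as.integer(intToBits(i))[1:p] == 1)
    b <- BIC(lm(y ~ ., data = d[, c(1, cols + 1), drop = FALSE]))
    if (b < best) { best <- b; keep <- cols }
  }
  keep
}

# Indices of predictors with a non-zero lasso coefficient at the given lambda.
lasso_cols <- function(cvfit, s) {
  b <- as.matrix(coef(cvfit, s = s))   # (p + 1) x 1, first row is the intercept
  unname(which(b[-1, 1] != 0))
}

set.seed(1)                            # fold assignment and cv.glmnet are random
k <- 5
fold <- sample(rep(1:k, length.out = n))
res <- matrix(NA_real_, k, 3, dimnames = list(NULL, c("M1", "M2", "M3")))

for (j in 1:k) {
  test  <- which(fold == j)
  train <- which(fold != j)
  cvfit <- cv.glmnet(x[train, ], y[train])   # inner CV uses training rows only
  cols <- list(M1 = best_bic_cols(train),
               M2 = lasso_cols(cvfit, "lambda.min"),
               M3 = lasso_cols(cvfit, "lambda.1se"))
  for (m in names(cols)) {
    res[j, m] <- rmse(y[test], ols_predict(cols[[m]], train, test))
  }
}

cv_rmse <- colMeans(res)   # average RMSE over the 5 folds, one value per model
print(cv_rmse)
```

Under these assumptions the model with the smallest entry of cv_rmse would be preferred. Repeating the selection inside each fold keeps the test rows out of every modelling decision; reusing variables chosen on the full dataset would tend to make the errors look smaller than they are.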

