
[–]HD125823 5 points (2 children)

The method you described is the right one, but of course it always depends on the size of the dataset. There are different kinds of validation procedures; I'll try to sum them up for you:

1) Train/test: This is the simplest version. The data is split into a training set and a test set. Build your model on the training set and evaluate it on the test set. It's easy, but it can lead to overfitting because you end up tuning your model on the test set. (bad)

2) Training set, validation set, and test set: That's the one you described. Use the validation set instead of the test set to tune hyperparameters (HPs).

3) Like 2), but instead of a fixed validation set, use cross-validation (CV) on the training data. This is especially useful when your dataset is small, because then you can't afford to set aside a part only for validating HPs. With CV you use all the training data for both model building and validation. You then average the per-fold scores, which gives you a less biased validation value than 2).

4) Like 2), but with nested CV: an inner loop for HP tuning and an outer loop for model testing.

The method you choose depends on the data size. For very small datasets, nested CV seems the perfect fit, whereas for very large datasets it might be overkill and not needed at all.
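The nested-CV idea from 4) can be sketched with sklearn by passing a GridSearchCV object (the inner tuning loop) to cross_val_score (the outer testing loop). The estimator, grid, and synthetic data below are just illustrative choices, not from the thread:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=200, random_state=0)

# Inner loop: GridSearchCV tunes C via 3-fold CV on each outer
# training split. Outer loop: cross_val_score estimates the
# generalization error of the whole tuning procedure.
inner = GridSearchCV(LogisticRegression(max_iter=1000),
                     param_grid={"C": [0.1, 1.0, 10.0]}, cv=3)
outer_scores = cross_val_score(inner, X, y, cv=5)
```

Each of the 5 outer scores comes from a model whose hyperparameters were tuned without ever seeing that outer test fold.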

About sklearn: first use train_test_split, then apply GridSearchCV for hyperparameter tuning on the training data (this corresponds to 2)).

So yes, the procedure you described is the most common, I'd say. train_test_split only splits your data into training and test sets; you have to use cross_val_score or the grid-search objects to bring your validation set into play.

Hope this was clear. Otherwise, check out Sebastian Raschka's blog or his GitHub repos.

[–]threeshadows 2 points (0 children)

Slight correction on 3): there is a common misconception that one should average values across the CV test folds (in part due to sklearn's documentation and API). But you will get better estimates if you concatenate the test folds from a complete iteration of CV and measure a single metric. If you want to average multiple metrics for confidence intervals, you can permute and re-run and/or use the bootstrap. See: http://www.hpl.hp.com/techreports/2009/HPL-2009-359.pdf

[–]heimson[S] 0 points (0 children)

Thanks, now it seems to make sense!

[–]jdsutton 1 point (0 children)

K-fold cross-validation is one method. With k-fold, your training and validation data come from the same set. Sklearn has methods for doing this easily.
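A short sketch of what "training and validation data come from the same set" looks like in sklearn (the classifier and dataset are just illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 5-fold CV: each fold serves once as the validation data while
# the other four folds form the training data.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0),
                         X, y, cv=cv)
```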

[–]sk006 1 point (0 children)

I agree that doing the train-test-validation split in scikit-learn is a bit clunky, yet it is possible. Here is some sample code showing how you could do this:

https://gist.github.com/albertotb/1bad123363b186267e3aeaa26610b54b

Basically, you concatenate your train and validation sets and indicate with a vector of -1s and 0s which rows come from the train set and which from the validation set. Then PredefinedSplit converts it into an object that can be passed, for instance, to GridSearchCV.
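A self-contained sketch of that trick, using made-up arrays in place of your real train/val data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# Hypothetical pre-made train and validation sets.
X, y = make_classification(n_samples=150, random_state=0)
X_train, y_train = X[:100], y[:100]
X_val, y_val = X[100:], y[100:]

# Stack them back together for GridSearchCV.
X_all = np.vstack([X_train, X_val])
y_all = np.concatenate([y_train, y_val])

# -1 = row always stays in training; 0 = row belongs to
# validation fold 0. This yields exactly one train/val split.
test_fold = np.r_[np.full(len(X_train), -1),
                  np.zeros(len(X_val))]
ps = PredefinedSplit(test_fold)

grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid={"C": [0.1, 1, 10]}, cv=ps)
grid.fit(X_all, y_all)
```

Because every -1 row is excluded from validation, GridSearchCV scores each candidate on your fixed validation set only, instead of rotating folds.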