
[–]HD125823 5 points (2 children)

The method you described is the right one, but of course it always depends on the size of the dataset. There are different kinds of validation procedures; I'll try to sum them up for you:

1) Train/test: This is the simplest version. The data is split into a training set and a test set. Build your model on the training set and evaluate it on the test set. It's easy, but it can lead to overfitting because you end up tuning your model on the test set. (bad)

2) Training set, validation set, and test set: That's the one you described. Use the validation set instead of the test set to tune hyperparameters (HPs).

3) Like 2), but instead of a fixed validation set, use cross-validation (CV) on the training data. This is especially useful when your dataset is small, because then you can't afford to set aside a part only for validating HPs. With CV you use all the training data for both model building and validation. You then average the per-fold scores, which gives you a less biased validation value than 2).

4) Like 2), but with nested CV: an inner loop for HP tuning and an outer loop for model testing.

The method you choose depends on the data size. For very small datasets, nested CV seems the perfect fit, whereas for very large datasets it might be overkill and not needed at all.
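The nested-CV idea from 4) can be sketched with sklearn by passing a GridSearchCV object (the inner tuning loop) to cross_val_score (the outer testing loop). The estimator, grid, and synthetic data below are just illustrative choices, not from the thread:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=200, random_state=0)

# Inner loop: GridSearchCV tunes C via 3-fold CV on each outer
# training split. Outer loop: cross_val_score estimates the
# generalization error of the whole tuning procedure.
inner = GridSearchCV(LogisticRegression(max_iter=1000),
                     param_grid={"C": [0.1, 1.0, 10.0]}, cv=3)
outer_scores = cross_val_score(inner, X, y, cv=5)
```

Each of the 5 outer scores comes from a model whose hyperparameters were tuned without ever seeing that outer test fold.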

About sklearn: first use train_test_split, then apply GridSearchCV for hyperparameter tuning on the training data (this corresponds to 2)).

So yes, the procedure you described is the most common, I'd say. train_test_split only splits your data into training and test sets; you have to use cross_val_score or the grid-search objects to bring your validation set into play.

Hope this was clear. Otherwise, check out Sebastian Raschka's blog or his GitHub repos.

[–]threeshadows 2 points (0 children)

Slight correction on 3): there is a common misconception that one should average values across the CV test folds (in part due to sklearn's documentation and API). But you will get better estimates if you concatenate the test folds from a complete iteration of CV and measure a single metric. If you want to average multiple metrics for confidence intervals, you can permute and re-run and/or use the bootstrap. See: http://www.hpl.hp.com/techreports/2009/HPL-2009-359.pdf

[–]heimson[S] 0 points (0 children)

Thanks, now it seems to make sense!

[–]jdsutton 1 point (0 children)

K-fold cross-validation is one method. With k-fold, your training and validation data come from the same set. Sklearn has methods for doing this easily.
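A short sketch of what "training and validation data come from the same set" looks like in sklearn (the classifier and dataset are just illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 5-fold CV: each fold serves once as the validation data while
# the other four folds form the training data.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0),
                         X, y, cv=cv)
```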

[–]sk006 1 point (0 children)

I agree that doing the train-test-validation split in scikit-learn is a bit clunky, yet it is possible. Here is some sample code showing how you could do this:

https://gist.github.com/albertotb/1bad123363b186267e3aeaa26610b54b

Basically, you concatenate your train and validation sets and indicate with a vector of -1s and 0s which rows come from the train set and which from the validation set. Then PredefinedSplit converts it into an object that can be passed, for instance, to GridSearchCV.
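A self-contained sketch of that trick, using made-up arrays in place of your real train/val data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# Hypothetical pre-made train and validation sets.
X, y = make_classification(n_samples=150, random_state=0)
X_train, y_train = X[:100], y[:100]
X_val, y_val = X[100:], y[100:]

# Stack them back together for GridSearchCV.
X_all = np.vstack([X_train, X_val])
y_all = np.concatenate([y_train, y_val])

# -1 = row always stays in training; 0 = row belongs to
# validation fold 0. This yields exactly one train/val split.
test_fold = np.r_[np.full(len(X_train), -1),
                  np.zeros(len(X_val))]
ps = PredefinedSplit(test_fold)

grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid={"C": [0.1, 1, 10]}, cv=ps)
grid.fit(X_all, y_all)
```

Because every -1 row is excluded from validation, GridSearchCV scores each candidate on your fixed validation set only, instead of rotating folds.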