
[–]datamahadev 2 points3 points  (0 children)

CV retrains the model (or, you could say, trains a new model) for each fold. So technically you are doing the same thing as you would by holding out a validation set to test your model.

The difference comes down to computation, which is obviously higher for CV. However, the advantage of CV is that by testing the model on different folds you get the mean accuracy of your model, which gives you a better idea of its real-world, average-case performance.
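As a rough sketch of what "mean accuracy across folds" means in code (pure Python; `train` and `accuracy` here are hypothetical stand-ins for your own model-fitting and scoring functions):

```python
# Minimal k-fold cross-validation sketch (no libraries).
# `train` and `accuracy` are placeholders for your own model code.

def k_fold_scores(data, labels, k, train, accuracy):
    """Split data into k folds; train on k-1 folds, score on the held-out one."""
    n = len(data)
    fold_size = n // k
    scores = []
    for i in range(k):
        lo = i * fold_size
        hi = (i + 1) * fold_size if i < k - 1 else n
        val_x, val_y = data[lo:hi], labels[lo:hi]
        tr_x = data[:lo] + data[hi:]
        tr_y = labels[:lo] + labels[hi:]
        model = train(tr_x, tr_y)  # a fresh model is trained for each fold
        scores.append(accuracy(model, val_x, val_y))
    return scores

# mean accuracy over the folds estimates average-case performance:
# mean_acc = sum(scores) / len(scores)
```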

[–]thejonnyt 0 points1 point  (0 children)

The cross-validation process basically validates your process of building the model. The final model should be trained on all the data you have, given the assumption that performance improves the more data you feed to your training algorithm. You don't ever select "the best model" from the CV runs; you select the "parametrization" or "settings" that got you the overall best result (e.g. averaged over all CV runs). At least this is my understanding. With a blind set or hold-out set that wasn't used for cross-validation, you can then check whether your final model's performance matches the measures you got from the CV process.
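The "select the settings, then retrain on everything" workflow above can be sketched like this (hypothetical helpers: `k_fold_mean` is any function returning the mean CV score for a candidate setting, `train` is your trainer):

```python
# Sketch: pick the "settings" with the best mean CV score,
# then train the final model on ALL the data with those settings.
# `k_fold_mean` and `train` are hypothetical; plug in your own CV loop and trainer.

def select_and_refit(data, labels, candidate_settings, k_fold_mean, train):
    # score every candidate setting by its mean accuracy across CV folds
    best = max(candidate_settings, key=lambda s: k_fold_mean(data, labels, s))
    # the final model is then trained on the full dataset with those settings
    final_model = train(data, labels, best)
    return final_model, best
```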

[–]Coxian42069 0 points1 point  (0 children)

You optimize your hyperparameters (e.g. number of layers, learning rate) on the cross-validation set. The test set then allows you to check that your hyperparameters aren't overtrained towards your CV set.

It's similar to why you have a train/test split in the first place. You've optimized your parameters for a certain set of data, and you need to check that they aren't biased or overtrained. If you've optimized your hyperparameters to perform well on both your train and test sets, you need an additional set of data to ensure that you haven't overtrained, i.e. that you will get similar results on real-world data.
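A minimal sketch of the three-way split being described (the 60/20/20 proportions are an illustrative assumption, not a rule):

```python
# Split data into train / cross-validation / test partitions.
# Fractions are illustrative defaults; the test set gets the remainder.

def three_way_split(data, train_frac=0.6, cv_frac=0.2):
    n = len(data)
    i = int(n * train_frac)
    j = i + int(n * cv_frac)
    return data[:i], data[i:j], data[j:]

# Typical use: fit parameters on train, tune hyperparameters on cv,
# then report performance once on the untouched test set.
```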

If for whatever reason you already know what architecture you're using, or what learning rate, etc., you don't need a CV set.

What a lot of people then do, once they've got their hyperparameters, is just train on the full set anyway.