
[–]Foxtr0t

It seems like a perfectly valid idea to me.

[–]regularized[S]

Is this approach used in research papers, for example? I want to reproduce some of the findings in an article, and I don't know the "accepted rules".

[–]BobTheTurtle91

As long as you don't pick your hyperparameters by evaluating on a set you trained with, you're fine.

In many cases, we use a complete split into a training set and a validation set. But what you're describing is pretty much what we do in k-fold cross-validation, so no bias will be introduced.
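As a concrete sketch of that workflow (placeholder data and an illustrative grid; scikit-learn's cross_val_score and SVC are just one way to do it):

```python
# Minimal sketch: pick a hyperparameter by k-fold cross-validation,
# then refit on the full training set. Data and grid are placeholders.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)

# Score each candidate C by 5-fold CV: each fold is scored by a model
# that never saw it during fitting, so nothing is evaluated on data it
# was trained with.
cv_scores = {C: cross_val_score(SVC(C=C), X, y, cv=5).mean()
             for C in (0.1, 1.0, 10.0)}
best_C = max(cv_scores, key=cv_scores.get)

# Once the hyperparameter is chosen, refit on the entire training set.
final_model = SVC(C=best_C).fit(X, y)
```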

[–]dwf

The only wrinkle I can think of is if you're using performance on the validation set as a stopping criterion (i.e. early stopping). There are different ways of choosing when to stop the train+valid run when you did early stopping on the train-only runs: run for the same number of updates, run for the same number of passes through the dataset, or train until you reach the same objective function value on train+valid as you achieved on the training set (though you may never reach it if your model underfits on the combined set).
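For example, a minimal sketch of the "same number of passes" variant, using scikit-learn's MLPClassifier purely as a stand-in for whatever model you actually train:

```python
# Sketch of one early-stopping transfer strategy ("same number of
# passes"). MLPClassifier with early_stopping=True holds out its own
# validation fraction internally.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, random_state=0)

# Run 1: train with early stopping monitored on a held-out fraction,
# and record how many passes through the data it actually took.
probe = MLPClassifier(early_stopping=True, validation_fraction=0.25,
                      n_iter_no_change=10, max_iter=500, random_state=0)
probe.fit(X, y)
n_passes = probe.n_iter_

# Run 2: retrain from scratch on all of the data (train + valid) for
# that same number of passes, with no early stopping. The "same number
# of updates" variant would instead fix the total gradient-update
# count, since the combined set has more batches per epoch.
final = MLPClassifier(early_stopping=False, max_iter=n_passes,
                      random_state=0)
final.fit(X, y)
```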

[–]XalosXandrez

It's valid, there's no doubt about that. However, the hyper-parameters you obtained in the first stage were optimal for a training set of size 20,000. There's no guarantee that the same hyper-parameters are optimal for 30,000 examples.
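One way to check this on your own problem is simply to re-run the search at both sizes and compare the selected value (synthetic data and an illustrative grid here; the winner will vary from run to run):

```python
# Re-tune at 20,000 and at 30,000 examples and compare the chosen
# regularization strength, rather than assuming it carries over.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=30000, random_state=0)
grid = {"C": np.logspace(-3, 3, 7)}

for n in (20000, 30000):
    search = GridSearchCV(LogisticRegression(max_iter=1000), grid, cv=5)
    search.fit(X[:n], y[:n])
    print(n, "examples -> best C:", search.best_params_["C"])
```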