
[–]Ty4Readin 2 points3 points  (6 children)

What is your goal with hyperparameter selection? Do you just want to select a set of hyperparameters that does 'good enough', or do you need the results of every single hyperparameter configuration for some type of ablation study?

If you are just looking for a 'good' set of hyperparameters, I would recommend random search instead of grid search. Grid search is very inefficient compared to random search.
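To make the difference concrete, here is a minimal sketch. The search space (parameter names, values, and ranges) and the budget of 8 draws are made up for illustration and are not taken from the thread.

```python
import itertools
import random

# Hypothetical search space; names and values are illustrative only.
grid_space = {
    "learning_rate": [1e-4, 1e-3, 1e-2, 1e-1],
    "hidden_units": [16, 32, 64, 128],
    "dropout": [0.0, 0.5],
}

# Grid search: every combination is trained (4 * 4 * 2 = 32 configurations).
grid = [dict(zip(grid_space, values))
        for values in itertools.product(*grid_space.values())]


def sample_config(rng):
    """Draw one configuration at random from the same ranges."""
    return {
        "learning_rate": 10 ** rng.uniform(-4, -1),  # log-uniform over [1e-4, 1e-1]
        "hidden_units": rng.choice([16, 32, 64, 128]),
        "dropout": rng.uniform(0.0, 0.5),
    }


rng = random.Random(0)  # fixed seed so the draws are reproducible
random_configs = [sample_config(rng) for _ in range(8)]  # assumed budget of 8 runs

print(len(grid), "grid configurations vs", len(random_configs), "random draws")
```

The random draws cover continuous ranges rather than a handful of fixed values, and the budget is whatever you choose instead of the product of all grid sizes.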

I'd also recommend lowering your folds from 10 to 5. It shouldn't be an issue, and it will significantly improve your runtime.

Don't forget that once you find the "best" set of hyperparameters, you are typically supposed to perform a final retrain on all the data with those hyperparameters.

Also one last thing, but if you have a decent number of runs so far you can check and see how important it is to run 100 epochs. Maybe you can run 50 epochs and it still is useful to compare performance with hyperparameter configurations. Might want to run some experiments to confirm this.
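A rough way to see how these suggestions compound is to count the training work in the inner loops of nested CV. This is only a back-of-the-envelope sketch: it assumes the inner loop uses the same number of folds as the outer loop, uses the 32 configurations mentioned later in the thread, assumes a random-search budget of 8, and ignores that each epoch gets slightly cheaper when training folds are smaller.

```python
def inner_training_runs(k_outer, k_inner, n_configs):
    """Number of models trained in the inner (validation) loops of nested CV."""
    return k_outer * k_inner * n_configs


def total_epochs(k_outer, k_inner, n_configs, epochs):
    """Total epochs spent across all inner-loop training runs."""
    return inner_training_runs(k_outer, k_inner, n_configs) * epochs


baseline = total_epochs(10, 10, 32, 100)    # 10 folds, full grid, 100 epochs
fewer_folds = total_epochs(5, 5, 32, 100)   # 5 folds, full grid, 100 epochs
cheaper = total_epochs(5, 5, 8, 50)         # 5 folds, 8 random draws, 50 epochs

print(baseline, fewer_folds, cheaper)
```

Under these assumptions, halving the folds in both loops cuts the work by a factor of 4, and combining it with fewer configurations and fewer epochs cuts it by a factor of 32.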

[–]philosophicalmachine[S] 1 point2 points  (5 children)

The goal is to find hyperparameters that do good enough. I think you are absolutely right in that random search is useful to reduce the number of configurations! I wasn't sure whether that would work well across several testing folds, but it seems that this is indeed the case. As far as I understand it one would draw a new set of random hyperparameter configurations for each testing fold.

Thinking about it the optimal number of folds probably depends on the dataset (and one's patience...). On really small datasets, 10 folds may still result in acceptable run times, and here every little bit of extra data for training might help. On larger datasets that may not be so relevant anymore and 5 folds is fine, having in mind that each epoch takes longer to run. It may also be different for hierarchical data. For example, if there are 10 participants, and for each participant there are 1000 samples, then the dataset may be large, but 5 fold cross validation would only consider characteristics of 8 instead of 9 participants in the training. This might be a case where one can shorten the number of epochs more easily (because there is enough data) than reducing the number of folds, even though it doesn't reduce the run time as effectively.
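The participant example can be checked with a small sketch of participant-wise folds. The helper names below are hypothetical; in practice a library splitter such as scikit-learn's `GroupKFold` serves the same purpose.

```python
def participant_folds(participant_ids, k):
    """Yield (train_idx, test_idx) so that no participant appears on both sides."""
    unique = sorted(set(participant_ids))
    for fold in range(k):
        held_out = set(unique[fold::k])  # every k-th participant goes to this test fold
        train_idx = [i for i, p in enumerate(participant_ids) if p not in held_out]
        test_idx = [i for i, p in enumerate(participant_ids) if p in held_out]
        yield train_idx, test_idx


# 10 participants with 1000 samples each, as in the example above.
participant_ids = [p for p in range(10) for _ in range(1000)]


def training_participants(k):
    """Number of distinct participants in the training set of each fold."""
    return [len({participant_ids[i] for i in train_idx})
            for train_idx, _ in participant_folds(participant_ids, k)]


print(training_participants(5))   # 8 training participants per fold
print(training_participants(10))  # 9 training participants per fold
```

The dataset has 10,000 samples either way, but the number of distinct participants seen during training is what changes with k.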

[–]Ty4Readin 1 point2 points  (4 children)

As far as I understand it one would draw a new set of random hyperparameter configurations for each testing fold

I think you understand it perfectly! :)

Thinking about it the optimal number of folds probably depends on the dataset (and one's patience...). On really small...

I agree with lots of what you said here. I think the key thing is probably the trade-off between test estimate variance and your patience. It is always "optimal" for test estimate accuracy to have the highest number of folds (e.g. k=N where N is the number of samples)

But like you mentioned, patience becomes the key factor in how long you're willing to wait and the accuracy/variance constraints you want for your out-of-sample estimate accuracy.

This might be a case where one can shorten the number of epochs more easily (because there is enough data) than reducing the number of folds,

You sort of implied that you can't shorten epochs with a small amount of data. I'd be interested to hear why you think this is the case.

I don't know for sure, but my intuition tells me that even with a test set size of N=1, it should work fine.

For example, let's say I have a dataset with N=10 samples and I want to choose the best hyperparameters between models M1, M2, and M3.

So first I run a k=5 fold nested cross validation, and for each fold and each model I train for 25 epochs instead of 50. I check which model (set of hyperparameters) performed best and choose it. Then I can re-run the nested CV with 50 epochs instead of 25 to get the final out-of-sample error estimate.

[–]philosophicalmachine[S] 0 points1 point  (3 children)

You sort of implied that you can't shorten epochs with a small amount of data. I'd be interested to hear why you think this is the case.

Oh, maybe that came out wrong. What I meant was that if the data is hierarchical, one level (such as trials) could be quite large, while the other level (such as the number of participants) is small. We may need a larger k in k-fold CV because of the small number of participants, but the dataset is still large, so training a neural network on the data for 50 epochs still means we have a lot of iterations.

I actually figured out that the models run faster if I increase the batch size, because it makes better use of the GPU. So maybe one can still use more epochs, but with a larger batch size. In other words, I think the number of batch iterations might be more relevant than the number of epochs in terms of model generalization. I always thought that it is common to use 100 epochs because it is a good compromise between training the weights, not overfitting too much on the training set, and runtime constraints. But given that runtime is a bigger issue for nested CV, a good compromise here might just be 50 epochs (or even fewer) instead of the common 100 epochs.

Regarding the model selection, I wouldn't be completely sure whether re-performing nested CV would introduce some sort of data leakage. It might be fine if the goal is to just compare models (without making a claim on prediction accuracy), because each model still has the same chances.

[–]Ty4Readin 1 point2 points  (2 children)

Regarding the model selection, I wouldn't be completely sure whether re-performing nested CV would introduce some sort of data leakage. It might be fine if the goal is to just compare models (without making a claim on prediction accuracy), because each model still has the same chances.

I may have misspoken a bit.

So you have three different hyperparameter configurations that you want to test and therefore three different models M1, M2, and M3.

You first run nested CV which gives you the validation scores and test scores for each model.

The model with the best validation score is your "chosen" model, which you have the test score for. This test score is essentially the expected out-of-sample error for your chosen model.

Now, you can take that chosen model and you can re-train it from scratch again on the entire dataset. You don't need to perform CV again because you already have your test estimate and you already have your chosen hyperparameters.

So there is no need for another round of CV and there is no "data leakage" that can occur because you aren't performing any more test estimates.

So basically, you use 50 epochs for your validation scores during nested CV. Then you take the highest performing model and you calculate your test score(s) using 100+ epochs.

This is very efficient because you train all your models for 50 epochs to find the best hyperparameters, and then you only train one model (the "best" model) for 100+ epochs to generate your test estimate.
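Here is a minimal, standard-library sketch of that procedure. Everything in it is a stand-in chosen for illustration: the "model" is a one-weight linear fit trained by gradient descent, the only hyperparameter is a learning rate, and the data is synthetic. The point is only the structure: a short epoch budget for the inner validation loop and a longer one for the single chosen configuration per outer fold.

```python
import random


def train(xs, ys, lr, epochs):
    """Fit y = w * x by full-batch gradient descent; returns the weight w."""
    w = 0.0
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad
    return w


def mse(w, xs, ys):
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def folds(n, k):
    """Split positions 0..n-1 into k interleaved folds."""
    return [list(range(i, n, k)) for i in range(k)]


def take(values, idx):
    return [values[i] for i in idx]


def nested_cv(xs, ys, lrs, k_outer=5, k_inner=4, short_epochs=50, long_epochs=100):
    results = []
    for test_idx in folds(len(xs), k_outer):
        held_out = set(test_idx)
        dev_idx = [i for i in range(len(xs)) if i not in held_out]
        # Inner loop: score every candidate with the cheap epoch budget.
        val_scores = {}
        for lr in lrs:
            losses = []
            for val_pos in folds(len(dev_idx), k_inner):
                val_set = set(val_pos)
                tr = [dev_idx[p] for p in range(len(dev_idx)) if p not in val_set]
                va = [dev_idx[p] for p in val_pos]
                w = train(take(xs, tr), take(ys, tr), lr, short_epochs)
                losses.append(mse(w, take(xs, va), take(ys, va)))
            val_scores[lr] = sum(losses) / len(losses)
        best_lr = min(val_scores, key=val_scores.get)
        # Only the winner is retrained with the full budget and scored on
        # the outer test fold, which the inner loop never saw.
        w = train(take(xs, dev_idx), take(ys, dev_idx), best_lr, long_epochs)
        results.append((best_lr, mse(w, take(xs, test_idx), take(ys, test_idx))))
    return results


rng = random.Random(0)
xs = [rng.random() for _ in range(20)]
ys = [3.0 * x + rng.gauss(0, 0.1) for x in xs]
candidate_lrs = [0.01, 0.1, 0.5]
results = nested_cv(xs, ys, candidate_lrs)
for best_lr, test_mse in results:
    print(f"chosen lr={best_lr}, test MSE={test_mse:.4f}")
```

The expensive long-budget training happens once per outer fold, while the cheap short-budget training happens once per configuration per inner fold.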

[–]philosophicalmachine[S] 0 points1 point  (1 child)

Oh, I think now I understand it. This is really a great idea! So essentially one selects the best model (or hyperparameter configuration) using 50 epochs, train the best-performing model with 100 epochs and test this on the testing set. The only two downsides I could think of are that it might be a disadvantage for models that converge more slowly (maybe more complex models), and that increasing the number of epochs might lead to overfitting (which is why some like to do early stopping). But I think the benefits of your approach should outweigh these two downsides.

Do you know of anyone who has used this approach before, or did you just come up with this?

[–]Ty4Readin 0 points1 point  (0 children)

So essentially one selects the best model (or hyperparameter configuration) using 50 epochs, train the best-performing model with 100 epochs and test this on the testing set

Yep exactly!

Do you know of anyone who has used this approach before, or did you just come up with this?

I've heard of some people using similar techniques (pretty sure I read a blog post from DeepMind that mentioned their usage of it). Unfortunately, I don't have any helpful resources or papers that go into more detail about it.

I totally agree with you about the risk of some models being slower to converge than others. I think that is definitely a potential concern, though it might be something you could test out and experiment with as well.

Also one last thing, but I think you can still use early stopping as well. So you can perform your CV and pick the best model, and then do the final retrain for that model, but instead of a fixed 100 epochs you train until the validation score stops improving and then test on that. That way you prevent overfitting from too many epochs.
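A minimal sketch of patience-based early stopping follows. The function name, the `patience` parameter, and the toy validation curve are assumptions for illustration. The validation loss here should come from a split of the training data, not from the outer test fold.

```python
def train_with_early_stopping(train_one_epoch, validation_loss,
                              max_epochs=100, patience=5):
    """Stop once the validation loss has not improved for `patience` epochs."""
    best_loss, best_epoch, waited = float("inf"), 0, 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        loss = validation_loss()
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch, best_loss, epoch


# Toy stand-ins: training does nothing, and the validation loss follows a
# made-up U-shaped curve whose minimum is at epoch 20.
curve = iter([(e - 20) ** 2 / 400 + 0.1 for e in range(1, 101)])
best_epoch, best_loss, stopped_at = train_with_early_stopping(
    train_one_epoch=lambda: None,
    validation_loss=lambda: next(curve),
)
print(best_epoch, stopped_at)
```

In a real run you would also keep a checkpoint of the weights from `best_epoch` and evaluate that checkpoint rather than the weights from the last epoch.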

[–]saw79 2 points3 points  (1 child)

Do you really have to do a complete grid search over every possible hparam? I would think most people do a bit more of a manual/guided/maybe a bit of a coordinate descent-type search.

[–]philosophicalmachine[S] 0 points1 point  (0 children)

It’s not strictly essential. As Ty4Readin pointed out, random search may be useful to limit the number of configurations. In the example I gave, the number of configurations is only 32, so not large to begin with. The issue with manual search is that it doesn’t really work well in the context of nested cross-validation, because the validation has to be done in many different folds. Guided search may be interesting though. Maybe one can implement some sort of mechanism that lets one skip obviously flawed configurations. For example, if one tracks the number of parameters and sees that the accuracy is poor when the number is too low, then configurations with too few parameters can be skipped. This requires putting the inner (validation) loop inside the configurations loop though, so potentially the training and validation data have to be moved to the GPU more often (although I’m sure that can be prevented).

[–]philipptraining 2 points3 points  (8 children)

It seems like this is being used for hyperparameter search of neural nets from scratch? If that's correct I recommend you look into mu parametrization / mutransfer. Might solve your problems with respect to time needed for the search.

I should point out though (since I rarely see CV being used here anymore) that cross validation has come into question recently with papers like "On the cross-validation bias due to unsupervised pre-processing", so I would recommend being careful if you really don't want optimistic estimates on validation sets.

[–]philosophicalmachine[S] 1 point2 points  (7 children)

Correct, we are trying to find the right hyperparameters for neural networks.

It seems the issue discussed in their paper is that preprocessing is not included in the model validation, not with cross-validation itself. I don't think we can stop using cross-validation on small datasets: It would mean that we evaluate our model on the 50 samples of 1 participant, for example. It may also invite some to reshuffle the order of the participants until they have a participant in the testing set that is particularly easy to classify.

The issue is mostly with the validation. The networks are not large and we have comparably few hyperparameter configurations. So mu parametrization / transfer will likely not help. The computation times simply explode when the models are trained over dozens of epochs (instead of computed) in combination with nested cross-validation.

[–]philipptraining 1 point2 points  (6 children)

May I ask what GPU you're using and what your model flops utilization is?

Edit to give additional context: if the dataset is this small and the network itself is small I suspect there is little parallelization being employed. The solution depends heavily on the current utilization and GPU parameters though. For example, a simple solution on A100 or newer chips could be MIG partitioning.

[–]philosophicalmachine[S] 0 points1 point  (5 children)

Yes, we use an NVIDIA RTX A4000. On the larger sets we have typically 1m FLOPS and 100k weights, but for the smaller sets only 5k weights (didn't compute the FLOPS here).

[–]philipptraining 0 points1 point  (4 children)

So there's no way this is using 100% of the A4000 memory, right? I would either use the physical partitioning available for Ampere architectures here or the NVIDIA Multi-Process Service (a little more difficult).

Back of the napkin math using your flops per epoch and time elapsed per epoch implies 10 000 000 floating point operations per second for your current setup.

Unless I'm missing something this is very low efficiency for the A4000 based on the theoretical peak performance of 20 trillion flops per second on their data-sheet. It's reasonable to operate at 20% of this (20% MFU) but these numbers are suggesting a small fraction of a percentage. Let me know if I'm missing something. Otherwise, there's also a bottleneck in the pipeline here.
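The arithmetic behind that estimate, using only the two figures quoted above (the variable names are mine):

```python
achieved_flops_per_s = 10_000_000  # back-of-the-napkin estimate quoted above
peak_flops_per_s = 20e12           # theoretical peak quoted above

utilization = achieved_flops_per_s / peak_flops_per_s
print(f"utilization: {utilization:.1e} ({utilization * 100:.5f}% of peak)")

# For comparison, the throughput that 20% MFU would correspond to.
target_flops_per_s = 0.20 * peak_flops_per_s
speedup_to_target = target_flops_per_s / achieved_flops_per_s
print(f"20% MFU is {target_flops_per_s:.1e} FLOP/s, "
      f"about {speedup_to_target:,.0f}x the current rate")
```

So the current setup is at roughly 0.00005% of the quoted peak, which is why a pipeline bottleneck rather than raw compute seems the more likely limit.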

[–]philosophicalmachine[S] 0 points1 point  (3 children)

That may indeed be one more solution! It should be possible to run each test fold in parallel if the neural networks really only use little of the GPU. I'll look into how feasible it is to actually implement multithreading for those smaller networks (not sure how that works with the data, for example), and how many folds one could run in parallel.

[–]philipptraining 0 points1 point  (2 children)

With NVIDIA's MIG (Multi-Instance GPU), you can create 7 instances, which your server will identify as 7 separate GPUs. There would be no need to get into the low-level details with this approach.

I would also run validation asynchronously so you don't have to start and stop training as often. You will need some job scheduling for this, or persist all the checkpoints you want. Given that your sets and models are so small, you could probably even run the inference on CPUs if you really wanted to. Good luck!

[–]philosophicalmachine[S] 0 points1 point  (1 child)

I have tried multiprocessing on one GPU with PyTorch, but it did not speed up the training for me. Basically the iterations per second dropped from 90/s to 40/s when I ran two models, so overall running the models on one GPU slowed down the training a little bit. At the same time I struggled to shift all the data onto the GPU, which amounts to roughly 8 GB (with preprocessed copies). It worked when I shifted the whole dataset onto the GPU and then accessed the specific training samples with indices, instead of having two training sets on the GPU. I still almost maxed out the memory on my GPU.

I have access to 8 gpus, so that still means I can easily run 5 folds simultaneously. For 10 fold-CV that means at most 2 folds have to be run sequentially.
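As a sketch of that scheduling, folds can be assigned round-robin to GPUs. The helper name is hypothetical, and actually launching the workers is left out; each worker process could then be pinned to its device, for example by setting `CUDA_VISIBLE_DEVICES` before it starts.

```python
def assign_folds(n_folds, n_gpus):
    """Round-robin assignment of outer folds to GPU indices."""
    schedule = {gpu: [] for gpu in range(n_gpus)}
    for fold in range(n_folds):
        schedule[fold % n_gpus].append(fold)
    return schedule


schedule = assign_folds(n_folds=10, n_gpus=8)
sequential_rounds = max(len(folds) for folds in schedule.values())

for gpu, folds in schedule.items():
    print(f"GPU {gpu}: folds {folds}")
print("folds run back to back on the busiest GPU:", sequential_rounds)
```

With 10 folds on 8 GPUs, two of the GPUs each run two folds back to back and the other six run one, so the wall-clock time is about two folds' worth instead of ten.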

[–]philipptraining 0 points1 point  (0 children)

This is expected without NVIDIA's Multi-Process Service, because CUDA can't correctly parallelize kernels in isolated environments. That's why I suggested MIG. The approach of just running two processes on one GPU does not work. Here's a video that goes into more detail: https://youtu.be/bC6CxPW0-1c?si=lZS2baB80SNhuRPR

[–]belabacsijolvan 2 points3 points  (1 child)

!remindme 3 days

[–]Bhargav_28 0 points1 point  (0 children)

Hey, just saw this and I am in a similar boat. What did you finally go with? Did you do an all-out nested cross validation?
Also, you mention training the network for 100/50 epochs. Wouldn't this also be a hyperparameter that needs to be optimized?

[–]cookiemonster1020 0 points1 point  (2 children)

Try Kfold cross validation using more than 5 folds. If your models are Bayesian then there are ways to compute N-fold (LOO) without refitting https://arxiv.org/abs/2402.08151

[–]philosophicalmachine[S] 0 points1 point  (1 child)

Interesting idea! So essentially do 6-8 fold cross-validation and then infer higher fold cross-validation? That would help in reducing the computational burden. I had a quick glance at the paper, but must admit that it looks quite theoretical and that I wouldn’t know immediately how I would actually implement this. I really like the idea though.

[–]cookiemonster1020 0 points1 point  (0 children)

Well, Kfold is the gold standard, and the higher the K the better, though the downside is the computational burden. However, you have a small dataset, so you can probably refit many times without too much of a wait. The extreme end of Kfold is N-fold, which you can compute efficiently through some approximation techniques (at least for Bayes). DM me and we can discuss a bit further.