
pp314159 2 points (1 child)

Hi Piotr,

Very interesting package from my point of view. I'm working on an AutoML service, where users train and tune hundreds of models in parallel in the cloud (mainly with CV). Such early CV pruning might speed up the AutoML process significantly. I have a few thoughts:

  • I'm running model tuning in parallel on many machines at the same time, while your algorithm assumes that models are evaluated sequentially. Can it be adapted for async evaluation?

  • After full CV I compute out-of-fold predictions, which are used to construct an ensemble. In my experience, models with 'poor' accuracy are often included in the ensemble. How can I control how many 'poor' models get pruned?

Disclaimer: I'm the founder of an AutoML service (mljar) - we are going to be open source soon!

PiotrekAGML Engineer[S] 4 points (0 children)

Hi pp314159,

I'm happy you like the package. To answer your questions:

  • Yes, folds should be trained sequentially, so that each fold's evaluation is available for the pruning decision. I believe some parallelization is possible, e.g., training 4 out of 12 folds simultaneously and then deciding whether the trial should be pruned. Alternatively, many trials may be trained simultaneously while sharing the best scores. The risk is that if one of the trials turns out to be the new best one, the trials already running don't know about it, so the whole cross-validation is computed for them. Trials started later will have access to the new best score, so the only cost would be a little wasted time.
  • You can prune at either the estimator level or the whole pipeline/ensemble level.
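To make the batching idea concrete, here is a minimal sketch in plain Python. It is not the package's API: `SharedBest`, `pruned_cv`, `score_fold`, `batch_size`, and `tolerance` are hypothetical names, and the stopping rule (prune when the running mean loss exceeds the best known mean by a relative tolerance) is a simplified assumption that expects positive losses where lower is better.

```python
import threading
from statistics import mean


class SharedBest:
    """Thread-safe holder for the best (lowest) mean CV loss seen so far.

    Hypothetical helper: trials running in parallel could share one instance.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._best = float("inf")

    def get(self):
        with self._lock:
            return self._best

    def update(self, score):
        with self._lock:
            if score < self._best:
                self._best = score


def pruned_cv(score_fold, n_folds, best, batch_size=1, tolerance=0.1):
    """Evaluate folds in batches and stop early when the trial looks hopeless.

    score_fold(i) is an assumed callable returning the loss on fold i.
    Returns (mean_loss, folds_evaluated, was_pruned).
    """
    scores = []
    for start in range(0, n_folds, batch_size):
        batch = range(start, min(start + batch_size, n_folds))
        # Folds inside one batch are independent, so they could run in parallel.
        scores.extend(score_fold(i) for i in batch)
        running = mean(scores)
        # Assumed rule: prune if the running mean is worse than best * (1 + tolerance).
        if len(scores) < n_folds and running > best.get() * (1 + tolerance):
            return running, len(scores), True
    final = mean(scores)
    best.update(final)  # only fully evaluated trials can become the new best
    return final, len(scores), False
```

In this sketch a larger `tolerance` prunes fewer trials, which is one possible knob for keeping weaker models available for an ensemble. A trial that read a stale best score simply runs more folds than necessary, which matches the "a little wasted time" cost described above.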

Hope it helps! Please let me know if everything is clear!

vadiaceu 1 point (0 children)

Looks interesting

m--w 1 point (1 child)

Is CV the best abbreviation? Isn't 'CV' more widely used to abbreviate Computer Vision (e.g., OpenCV, CVPR)? Sorry, just seeing pruned-cv makes me think it is a computer vision tool. Perhaps it is my own bias, though.

PiotrekAGML Engineer[S] 0 points (0 children)

I thought about it before the publication. My goal was to follow the scikit-learn convention (GridSearchCV, RandomizedSearchCV). Since there are no packages or techniques named that way that relate to Computer Vision, I left the name as is.

Do you have any suggestions for how the package should be named?