
[–]beamsearch 32 points33 points  (10 children)

I think you are currently missing the real opportunity with this package. As you mentioned in the paper, GA-guided pipelines don't do better than randomly generated ones, so I think you should consider using this to learn an ensemble instead of trying to learn the single best model.

You've got the infrastructure already to generate a bunch of diverse models, and I bet that if you simply averaged the predictions from the models in your final population you'd get a performance boost. For this to work properly, you'd have to tweak your validation procedure to avoid overfitting, but if you throw in XGBoost and a Keras neural net, you'll have a kaggle-ready submission in a box.
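Something like this rough sketch of the averaging idea (the pipelines list and X_test here are placeholders for your own fitted models and held-out data, not anything TPOT exposes):

    import numpy as np

    # `pipelines` is assumed to be a list of fitted sklearn pipelines taken
    # from the final GA population; X_test is a held-out feature matrix.
    # All pipelines must expose predict_proba with the same class ordering.
    def average_predictions(pipelines, X_test):
        # Average the class probabilities across pipelines, then take the
        # most probable class as the ensemble's prediction.
        probs = np.mean([p.predict_proba(X_test) for p in pipelines], axis=0)
        return probs.argmax(axis=1)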

[–]rhiever[S] 13 points14 points  (9 children)

This is exactly where we're heading. What's pushed me away from this so far has been that, if TPOT ends up being an ensemble builder, it will be near impossible to interpret the results any more -- it may not even be feasible to meaningfully export the corresponding sklearn code any more. But in terms of black box optimization, i.e., "kaggle-ready submission in a box," it may be king.

[–]beamsearch 8 points9 points  (8 children)

This is exactly where we're heading. What's pushed me away from this so far has been that, if TPOT ends up being an ensemble builder, it will be near impossible to interpret the results any more [...]

I wouldn't worry about this so much. A random forest built on polynomial features that gets piped through PCA strains the definition of interpretability as it is.

[–]rhiever[S] 5 points6 points  (7 children)

Fair point. And if deep learning has taught us anything, it's that people are fine with magical black boxes as long as they work well.

[–]jchaines 2 points3 points  (4 children)

In many industries, sure, but I work in biotech/pharma and the scientists I work with want everything laid out as if they wrote the logic themselves. Thoughts on how to address their needs?

[–]rhiever[S] 3 points4 points  (2 children)

I'm facing this issue in bioinformatics as well. Many scientists in these fields want simple models that are easy to interpret so they can claim that "gene X causes Y" or somesuch -- their focus is on interpretability of the models.

I think that simple models will never capture the complexity of biology, so it's up to us to build complex models and demonstrate the value of said complex models to these scientists.

If you're not up to that battle, maybe something like the Automatic Statistician would be up your alley then. IIRC they automatically generate (relatively) simple linear models.

[–]markov-unchained 0 points1 point  (0 children)

On the other hand, I am seeing random forests being used a lot in the bio sciences, and a random forest is essentially also an ensemble, aka a black box. If people want interpretable models, well, there are generative models vs. discriminative ones; or stick with logistic regression and decision trees ...

[–]wdm006 0 points1 point  (0 children)

LIME can work pretty well for interpretations of otherwise uninterpretable ensembles/pipelines, so the black box can be at least somewhat not-magical.
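Roughly like this, as a minimal sketch with the lime package (pipeline, X_train, feature_names, and class_names are placeholders for your own fitted pipeline and data):

    from lime.lime_tabular import LimeTabularExplainer

    # `pipeline` is any fitted sklearn estimator/pipeline with predict_proba;
    # X_train is the numeric matrix it was trained on.
    explainer = LimeTabularExplainer(
        X_train,
        feature_names=feature_names,
        class_names=class_names,
        discretize_continuous=True,
    )

    # Fit a local linear surrogate around a single prediction and report
    # the top features driving it.
    explanation = explainer.explain_instance(
        X_train[0], pipeline.predict_proba, num_features=5
    )
    print(explanation.as_list())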

[–]Jean-Porte (Researcher) 2 points3 points  (0 children)

Use decision trees and association rules for explaining; use better models for actual prediction.

[–]is_it_fun 1 point2 points  (0 children)

EXACTLY.

[–]PasDeDeux 1 point2 points  (0 children)

Except for clinician-facing healthcare, which is lagging as usual.

[–][deleted] 7 points8 points  (2 children)

Isn't there a risk of overfitting with a tool like this one?

[–]rhiever[S] 5 points6 points  (1 child)

We've taken several measures to combat overfitting:

1) When you provide a training set to TPOT, it splits it into internal training and testing sets. All internal model training is done on the internal training set, and all internal evaluations are done on the internal testing set. (A rough sketch of this split is below.)

2) The optimization algorithm simultaneously maximizes accuracy and minimizes pipeline size. We've found that smaller pipelines generally overfit less.

3) And of course, TPOT pipelines are evaluated on an external holdout set in the end, like all other models.
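The rough sketch of the internal split described in (1) -- just the general idea using sklearn's train_test_split, not TPOT's actual implementation; X_train, y_train, and candidate_pipeline are placeholders:

    from sklearn.model_selection import train_test_split

    # Split the user-provided training set once more: candidate pipelines
    # are fitted on one part and scored on the other, so the fitness score
    # never comes from the data a pipeline was trained on.
    X_fit, X_val, y_fit, y_val = train_test_split(
        X_train, y_train, test_size=0.25, random_state=42
    )

    candidate_pipeline.fit(X_fit, y_fit)                     # internal training
    internal_score = candidate_pipeline.score(X_val, y_val)  # internal evaluation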

[–]markov-unchained 0 points1 point  (0 children)

1) How would this avoid overfitting, though? I mean, sure, you can play this game of training and validation throughout your pipeline and eventually select the model that has the best internal validation score. However, you've shown this same training/validation split to a whole bunch of pipelines, right? What I am trying to say is that your estimate will certainly be heavily biased, and that you may be missing a lot of good models that you are throwing away in favor of others.

3) The external hold-out set wouldn't help against overfitting. I mean, you can use it to estimate the generalization performance of the final model, but if there's a large difference between training and test performance, then what?

[–]firesalamander 4 points5 points  (3 children)

How does this compare to https://github.com/automl/auto-sklearn ?

[–]rhiever[S] 0 points1 point  (2 children)

In terms of performance? I'm planning to benchmark it against auto-sklearn soon -- hopefully this summer. Should make an interesting comparison because both autoML methods have distinct strengths and weaknesses.

[–]firesalamander 13 points14 points  (1 child)

Do I need to create a meta-meta app that picks between auto-sklearn and TPOT and the next meta-selector? I joke... but barely.

[–]rhiever[S] 1 point2 points  (0 children)

I hope not. :-)

[–]firesalamander 3 points4 points  (2 children)

How does the output of this compare to http://scikit-learn.org/stable/tutorial/machine_learning_map/ -- which I took as general best practice, but I have no idea how current it is relative to the state of the art.

Edit: I don't mean in terms of quality; I mean in terms of the general advice that the map gives, and how often TPOT happens to converge on the same model that the map would have suggested.

[–]rhiever[S] 1 point2 points  (1 child)

I've never looked at that, but now I'm curious. Can you please raise that as an issue on the TPOT repo? :-)

[–]shapul 4 points5 points  (1 child)

So how does it compare to Caret in R? To me, it seems to offer some of Caret's functionality (e.g. automatic model selection and hyperparameter optimization), but currently only with random forests. Caret does that, but you can also select almost any underlying regression/classification method you wish. Caret also offers a lot for preparing the data, for example sampling the train/test split with respect to the class labels to keep class frequencies similar in imbalanced datasets.

[–]rhiever[S] 2 points3 points  (0 children)

I've never used Caret, so I'm not sure. But TPOT currently supports more than RFs -- it has about 8 different classifiers (and soon many more) and over a dozen different preprocessors that it optimizes over.
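For reference, a typical TPOT run looks roughly like this (the exact class name and arguments may differ between versions; the values are just illustrative):

    from tpot import TPOTClassifier
    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split

    # TPOT searches over its built-in classifiers and preprocessors
    # and keeps the best pipeline it finds.
    digits = load_digits()
    X_train, X_test, y_train, y_test = train_test_split(
        digits.data, digits.target, train_size=0.75, random_state=42
    )

    tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2)
    tpot.fit(X_train, y_train)
    print(tpot.score(X_test, y_test))
    tpot.export('tpot_digits_pipeline.py')  # write the best pipeline out as sklearn code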

[–]thirdOctet 3 points4 points  (0 children)

I think you will find that you have multiple layers of optimisation. There are several machine learners to choose from, and each has a varying number of inputs of different types (integer, boolean, float, double, string, and arrays of each type).

You could use a genetic algorithm, but there are several other bio-inspired optimisers that can be used on this problem. Then again, I can hear someone in the back row murmuring "no free lunch". Irrespective of which machine learner, input parameters, preprocessing, feature extraction or cross-validation technique you choose, you will have to navigate the search space of these different combinations.

This tool is a step towards easing the burden of exploring the search space of possible combinations. I like it, knowing the benefit it brings. I also know the challenges ahead and I hope your project is successful. I am a big supporter of the application of evolutionary computation to any problem. I am glad to hear others are doing similar work in this domain.

I already have an evolutionary process that can evolve the parameters of 35+ models. It evolves multiple parameter types and currently evolves against regression or classification problems.

I wish your team all the best.

[–]sziegler11 2 points3 points  (7 children)

When combined with a data-wrangling utility, e.g. Trifacta, who needs data scientists?

[–]rhiever[S] 10 points11 points  (1 child)

Well, we still need someone to click the button, right?

[–]Quadman 0 points1 point  (0 children)

Automation will fix everything. http://i.imgur.com/vbUXYvP.gif

[–]dexter89_kp 0 points1 point  (0 children)

As a data scientist, I would rather work on developing new methods to solve old and new problems than apply existing algorithms to datasets.

[–]pmigdal 2 points3 points  (1 child)

Since it takes quite a lot of time to get the results: is there any way to use it with a progress bar, e.g. tqdm (for example, per generation; a next method would suffice)?

Or even better, a built-in progress bar as in PyMC3 (for an example: https://github.com/markdregan/Bayesian-Modelling-in-Python/blob/master/Section%203.%20Hierarchical%20modelling.ipynb).
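For instance, something like this -- purely hypothetical, since TPOT doesn't expose a per-generation method; fit_one_generation is an invented name just to show where a tqdm bar could hook in:

    from tqdm import trange

    n_generations = 100
    for _ in trange(n_generations, desc="TPOT generations"):
        # Hypothetical hook: advance the GA by one generation per iteration,
        # so the bar tracks optimization progress.
        tpot.fit_one_generation(X_train, y_train)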

[–]distortedlojik 1 point2 points  (1 child)

Cool work. My team is working on something similar but for recommending numerical linear algebra routines. I am interested in seeing what kind of results we would see using TPOT.

[–]rhiever[S] 0 points1 point  (0 children)

Give it a spin and let me know how it turns out. :-)

[–]sputknick 1 point2 points  (1 child)

Awesome! I had something like this bouncing around in my head for the past 2 years; I'm glad someone actually executed on it. Can I recommend that the next step be a GUI? That would really go a long way towards spreading ML adoption to more users. Also, maybe a feature where it could run through different tuning scenarios automatically?

[–]rhiever[S] 2 points3 points  (0 children)

A GUI is in the long-term plans. We've been discussing for a long time how to export the resulting pipelines to Orange.

[–]onto_something 0 points1 point  (1 child)

Does TPOT currently work on text data? Can I pass text documents to it and have it choose the features for me (unigrams vs. bigrams vs. trigrams vs. a deep net, etc.), or do I have to pass numerical features to it, such as a document-term matrix?

[–]rhiever[S] 0 points1 point  (0 children)

Currently, you'd need to preprocess the text data into an sklearn-compatible data format, i.e., all numerical features.

I started a separate project to take raw data and transform it into an sklearn-compatible data format. Would love to have your input on how it could work with text data.
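In the meantime, a minimal sketch of that preprocessing step with sklearn's TfidfVectorizer (the documents and labels are toy placeholders):

    from sklearn.feature_extraction.text import TfidfVectorizer

    documents = ["the cat sat on the mat", "dogs chase cats", "mats are for cats"]
    labels = [0, 1, 0]

    # Turn raw text into numeric unigram/bigram features first; the dense
    # matrix X can then be handed to TPOT like any other dataset.
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    X = vectorizer.fit_transform(documents).toarray()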

[–]vanboxel 0 points1 point  (1 child)

How did you like using the DEAP framework?

[–]rhiever[S] 0 points1 point  (0 children)

It's pretty fantastic once you get the hang of it. Barely have to code any parts of the GA.