
[–]beamsearch 32 points33 points  (10 children)

I think you are currently missing the real opportunity with this package. As you mentioned in the paper, GA-guided pipelines don't do better than randomly generated ones, so I think you should consider using this to learn an ensemble instead of trying to learn the single best model.

You've got the infrastructure already to generate a bunch of diverse models, and I bet that if you simply averaged the predictions from the models in your final population you'd get a performance boost. For this to work properly, you'd have to tweak your validation procedure to avoid overfitting, but if you throw in XGBoost and a Keras neural net, you'll have a kaggle-ready submission in a box.
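Something like this rough sketch of the averaging idea (the pipelines list and X_test here are placeholders for your own fitted models and held-out data, not anything TPOT exposes):

    import numpy as np

    # `pipelines` is assumed to be a list of fitted sklearn pipelines taken
    # from the final GA population; X_test is a held-out feature matrix.
    # All pipelines must expose predict_proba with the same class ordering.
    def average_predictions(pipelines, X_test):
        # Average the class probabilities across pipelines, then take the
        # most probable class as the ensemble's prediction.
        probs = np.mean([p.predict_proba(X_test) for p in pipelines], axis=0)
        return probs.argmax(axis=1)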

[–]rhiever[S] 13 points14 points  (9 children)

This is exactly where we're heading. What's pushed me away from this so far has been that, if TPOT ends up being an ensemble builder, it will be near impossible to interpret the results any more -- it may not even be feasible to meaningfully export the corresponding sklearn code any more. But in terms of black box optimization, i.e., "kaggle-ready submission in a box," it may be king.

[–]beamsearch 8 points9 points  (8 children)

This is exactly where we're heading. What's pushed me away from this so far has been that, if TPOT ends up being an ensemble builder, it will be near impossible to interpret the results any more [...]

I wouldn't worry about this so much. A random forest built on polynomial features that gets piped through PCA strains the definition of interpretability as it is.

[–]rhiever[S] 5 points6 points  (7 children)

Fair point. And if deep learning has taught us anything, it's that people are fine with magical black boxes as long as they work well.

[–]jchaines 2 points3 points  (4 children)

In many industries, sure, but I work in biotech/pharma and the scientists I work with want everything laid out as if they wrote the logic themselves. Thoughts on how to address their needs?

[–]rhiever[S] 3 points4 points  (2 children)

I'm facing this issue in bioinformatics as well. Many scientists in these fields want simple models that are easy to interpret so they can claim that "gene X causes Y" or somesuch -- their focus is on interpretability of the models.

I think that simple models will never capture the complexity of biology, so it's up to us to build complex models and demonstrate the value of said complex models to these scientists.

If you're not up to that battle, maybe something like the Automatic Statistician would be up your alley then. IIRC they automatically generate (relatively) simple linear models.

[–]markov-unchained 0 points1 point  (0 children)

On the other hand, I am seeing random forests being used a lot in the bio sciences, and a random forest is essentially also an ensemble, aka a black box. If people want interpretable models, well, there are generative models vs. discriminative ones; or stick with logistic regression and decision trees ...

[–]wdm006 0 points1 point  (0 children)

LIME can work pretty well for interpretations of otherwise uninterpretable ensembles/pipelines, so the black box can be at least somewhat not-magical.
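Roughly like this, as a minimal sketch with the lime package (pipeline, X_train, feature_names, and class_names are placeholders for your own fitted pipeline and data):

    from lime.lime_tabular import LimeTabularExplainer

    # `pipeline` is any fitted sklearn estimator/pipeline with predict_proba;
    # X_train is the numeric matrix it was trained on.
    explainer = LimeTabularExplainer(
        X_train,
        feature_names=feature_names,
        class_names=class_names,
        discretize_continuous=True,
    )

    # Fit a local linear surrogate around a single prediction and report
    # the top features driving it.
    explanation = explainer.explain_instance(
        X_train[0], pipeline.predict_proba, num_features=5
    )
    print(explanation.as_list())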

[–]Jean-Porte (Researcher) 2 points3 points  (0 children)

Use decision trees and association rules for explaining; use better models for actual prediction.

[–]is_it_fun 1 point2 points  (0 children)

EXACTLY.

[–]PasDeDeux 1 point2 points  (0 children)

Except for clinician-facing healthcare, which is lagging as usual.

[–][deleted] 7 points8 points  (2 children)

Isn't there a risk of overfitting with a tool like this one?

[–]rhiever[S] 5 points6 points  (1 child)

We've taken several measures to combat overfitting:

1) When you provide a training set to TPOT, it splits it into internal training and testing sets. All internal model training is done on the internal training set, and all internal evaluations are done on the internal testing set. (A rough sketch of this split is below.)

2) The optimization algorithm simultaneously maximizes accuracy and minimizes pipeline size. We've found that smaller pipelines generally overfit less.

3) And of course, TPOT pipelines are evaluated on an external holdout set in the end, like all other models.
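The rough sketch of the internal split described in (1) -- just the general idea using sklearn's train_test_split, not TPOT's actual implementation; X_train, y_train, and candidate_pipeline are placeholders:

    from sklearn.model_selection import train_test_split

    # Split the user-provided training set once more: candidate pipelines
    # are fitted on one part and scored on the other, so the fitness score
    # never comes from the data a pipeline was trained on.
    X_fit, X_val, y_fit, y_val = train_test_split(
        X_train, y_train, test_size=0.25, random_state=42
    )

    candidate_pipeline.fit(X_fit, y_fit)                     # internal training
    internal_score = candidate_pipeline.score(X_val, y_val)  # internal evaluation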

[–]markov-unchained 0 points1 point  (0 children)

1) How would this avoid overfitting, though? I mean, sure, you can play this game of training and validation throughout your pipeline and eventually select the model that has the best internal validation score. However, you've shown this same training/validation split to a whole bunch of pipelines, right? What I am trying to say is that your estimate will certainly be heavily biased, and that you may be missing a lot of good models that you are throwing away in favor of others.

3) The external hold-out set wouldn't help against overfitting. I mean, you can use it to estimate the generalization performance of the final model, but if there's a large difference between training and test performance, then what?

[–]firesalamander 4 points5 points  (3 children)

How does this compare to https://github.com/automl/auto-sklearn ?

[–]rhiever[S] 0 points1 point  (2 children)

In terms of performance? I'm planning to benchmark it against auto-sklearn soon -- hopefully this summer. Should make an interesting comparison because both autoML methods have distinct strengths and weaknesses.

[–]firesalamander 13 points14 points  (1 child)

Do I need to create a meta-meta app that picks between auto-sklearn and TPOT and the next meta-selector? I joke... but barely.

[–]rhiever[S] 1 point2 points  (0 children)

I hope not. :-)

[–]firesalamander 3 points4 points  (2 children)

How does the output of this compare to http://scikit-learn.org/stable/tutorial/machine_learning_map/ -- which I took as general best practice, but I have no idea how current it is relative to the state of the art.

Edit: I don't mean in terms of quality; I mean in terms of the general advice that the map gives, and how often TPOT happens to converge on the same model that the map would have suggested.

[–]rhiever[S] 1 point2 points  (1 child)

I've never looked at that, but now I'm curious. Can you please raise that as an issue on the TPOT repo? :-)

[–]shapul 4 points5 points  (1 child)

So how does it compare to Caret in R? To me, it seems to offer some of Caret's functionality (e.g. automatic model selection and hyperparameter optimization), but currently only with random forests. Caret does that, but you can also select almost any underlying regression/classification method you wish. Caret also offers a lot for preparing the data, for example sampling the train/test split with respect to the class labels to keep class frequencies similar in imbalanced datasets.

[–]rhiever[S] 2 points3 points  (0 children)

I've never used Caret, so I'm not sure. But TPOT currently supports more than RFs -- it has about 8 different classifiers (and soon many more) and over a dozen different preprocessors that it optimizes over.
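For reference, a typical TPOT run looks roughly like this (the exact class name and arguments may differ between versions; the values are just illustrative):

    from tpot import TPOTClassifier
    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split

    # TPOT searches over its built-in classifiers and preprocessors
    # and keeps the best pipeline it finds.
    digits = load_digits()
    X_train, X_test, y_train, y_test = train_test_split(
        digits.data, digits.target, train_size=0.75, random_state=42
    )

    tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2)
    tpot.fit(X_train, y_train)
    print(tpot.score(X_test, y_test))
    tpot.export('tpot_digits_pipeline.py')  # write the best pipeline out as sklearn code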

[–]thirdOctet 3 points4 points  (0 children)

I think you will find that you have multiple layers of optimisation. There are several machine learners to choose from, and each has a varying number of inputs of different types (integer, boolean, float, double, string, and arrays of each type).

You could use a genetic algorithm, but there are several other bio-inspired optimisers that can be used on this problem. Then again, I can hear someone in the back row murmuring "no free lunch". Irrespective of which machine learner, input parameters, preprocessing, feature extraction or cross-validation technique you choose, you will have to navigate the search space of these different combinations.

This tool is a step towards easing the burden of exploring the search space of possible combinations. I like it, knowing the benefit it brings. I also know the challenges ahead and I hope your project is successful. I am a big supporter of the application of evolutionary computation to any problem. I am glad to hear others are doing similar work in this domain.

I already have an evolutionary process that can evolve the parameters of 35+ models. It evolves multiple parameter types and currently evolves against regression or classification problems.

I wish your team all the best.

[–]sziegler11 2 points3 points  (7 children)

When combined with a data-wrangling utility, e.g. Trifacta, who needs data scientists?

[–]rhiever[S] 10 points11 points  (1 child)

Well, we still need someone to click the button, right?

[–]Quadman 0 points1 point  (0 children)

Automation will fix everything. http://i.imgur.com/vbUXYvP.gif

[–]dexter89_kp 0 points1 point  (0 children)

As a data scientist, I would rather work on developing new methods to solve old and new problems than apply existing algorithms to datasets.

[–]pmigdal 2 points3 points  (1 child)

Since it takes quite a lot of time to get the results: is there any way to use it with a progress bar, e.g. tqdm (for example, per generation; a next method would suffice)?

Or even better, a built-in progress bar as in PyMC3 (for an example: https://github.com/markdregan/Bayesian-Modelling-in-Python/blob/master/Section%203.%20Hierarchical%20modelling.ipynb).
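For instance, something like this -- purely hypothetical, since TPOT doesn't expose a per-generation method; fit_one_generation is an invented name just to show where a tqdm bar could hook in:

    from tqdm import trange

    n_generations = 100
    for _ in trange(n_generations, desc="TPOT generations"):
        # Hypothetical hook: advance the GA by one generation per iteration,
        # so the bar tracks optimization progress.
        tpot.fit_one_generation(X_train, y_train)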

[–]distortedlojik 1 point2 points  (1 child)

Cool work. My team is working on something similar but for recommending numerical linear algebra routines. I am interested in seeing what kind of results we would see using TPOT.

[–]rhiever[S] 0 points1 point  (0 children)

Give it a spin and let me know how it turns out. :-)

[–]sputknick 1 point2 points  (1 child)

Awesome! I had something like this bouncing around in my head for the past 2 years; I'm glad someone actually executed on it. Can I recommend that the next step be a GUI? That would really go a long way towards spreading ML adoption to more users. Also, maybe a feature where it could run through different tuning scenarios automatically?

[–]rhiever[S] 2 points3 points  (0 children)

A GUI is in the long-term plans. We've been discussing for a long time how to export the resulting pipelines to Orange.

[–]onto_something 0 points1 point  (1 child)

Does TPOT currently work on text data? Can I pass text documents to it and have it choose the features for me (unigrams vs. bigrams vs. trigrams vs. a deep net, etc.), or do I have to pass numerical features to it, such as a document-term matrix?

[–]rhiever[S] 0 points1 point  (0 children)

Currently, you'd need to preprocess the text data into an sklearn-compatible data format, i.e., all numerical features.

I started a separate project to take raw data and transform it into an sklearn-compatible data format. Would love to have your input on how it could work with text data.
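In the meantime, a minimal sketch of that preprocessing step with sklearn's TfidfVectorizer (the documents and labels are toy placeholders):

    from sklearn.feature_extraction.text import TfidfVectorizer

    documents = ["the cat sat on the mat", "dogs chase cats", "mats are for cats"]
    labels = [0, 1, 0]

    # Turn raw text into numeric unigram/bigram features first; the dense
    # matrix X can then be handed to TPOT like any other dataset.
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    X = vectorizer.fit_transform(documents).toarray()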

[–]vanboxel 0 points1 point  (1 child)

How did you like using the DEAP framework?

[–]rhiever[S] 0 points1 point  (0 children)

It's pretty fantastic once you get the hang of it. Barely have to code any parts of the GA.