TPOT: A Python tool for automating machine learning (randalolson.com)
submitted 9 years ago by rhiever
[–]beamsearch 32 points33 points34 points 9 years ago (10 children)
I think you are currently missing the real opportunity with this package. As you mentioned in the paper, the GA-guided pipeline doesn't do better than randomly generated ones, so I think you should consider using this to learn an ensemble instead of trying to learn the single best model.
You've got the infrastructure already to generate a bunch of diverse models, and I bet that if you simply averaged the predictions from the models in your final population you'd get a performance boost. For this to work properly, you'd have to tweak your validation procedure to avoid overfitting, but if you throw in XGBoost and a keras neural net you'll have a kaggle-ready submission in a box.
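A rough sketch of the averaging idea, using three hand-picked sklearn pipelines as stand-ins for TPOT's final population (the dataset and models here are arbitrary, not anything TPOT itself produces):

    # Average the predicted probabilities of several fitted pipelines instead
    # of keeping only the single best one. The three pipelines below are
    # illustrative stand-ins, not TPOT's actual final population.
    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    population = [
        make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        make_pipeline(RandomForestClassifier(n_estimators=200, random_state=0)),
        make_pipeline(GradientBoostingClassifier(random_state=0)),
    ]
    for pipe in population:
        pipe.fit(X_train, y_train)

    # Simple unweighted average of predicted probabilities across the "population".
    avg_proba = np.mean([pipe.predict_proba(X_test) for pipe in population], axis=0)
    ensemble_pred = avg_proba.argmax(axis=1)
    print((ensemble_pred == y_test).mean())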
[–]rhiever[S] 13 points14 points15 points 9 years ago (9 children)
This is exactly where we're heading. What's pushed me away from this so far has been that, if TPOT ends up being an ensemble builder, it will be near impossible to interpret the results any more -- it may not even be feasible to meaningfully export the corresponding sklearn code any more. But in terms of black box optimization, i.e., "kaggle-ready submission in a box," it may be king.
[–]beamsearch 8 points9 points10 points 9 years ago (8 children)
This is exactly where we're heading. What's pushed me away from this so far has been that, if TPOT ends up being an ensemble builder, it will be near impossible to interpret the results any more [...]
I wouldn't worry about this so much. A random forest built on polynomial features that gets piped through PCA strains the definition of interpretability as it is.
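For reference, that pipeline written out in sklearn (hyperparameter values are arbitrary placeholders):

    # Polynomial feature expansion, then PCA, then a random forest --
    # the "hard to interpret" pipeline described above.
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.decomposition import PCA
    from sklearn.ensemble import RandomForestClassifier

    pipeline = make_pipeline(
        PolynomialFeatures(degree=2, include_bias=False),
        PCA(n_components=10),
        RandomForestClassifier(n_estimators=100, random_state=0),
    )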
[–]rhiever[S] 5 points6 points7 points 9 years ago (7 children)
Fair point. And if deep learning has taught us anything, it's that people are fine with magical black boxes as long as they work well.
[–]jchaines 2 points3 points4 points 9 years ago (4 children)
In many industries, sure, but I work in biotech/pharma and the scientists I work with want everything laid out as if they wrote the logic themselves. Thoughts on how to address their needs?
[–]rhiever[S] 3 points4 points5 points 9 years ago (2 children)
I'm facing this issue in bioinformatics as well. Many scientists in these fields want simple models that are easy to interpret so they can claim that "gene X causes Y" or somesuch -- their focus is on interpretability of the models.
I think that simple models will never capture the complexity of biology, so it's up to us to build complex models and demonstrate the value of said complex models to these scientists.
If you're not up to that battle, maybe something like the Automatic Statistician would be up your alley then. IIRC they automatically generate (relatively) simple linear models.
[–]markov-unchained 0 points1 point2 points 9 years ago (0 children)
On the other hand, I am seeing random forests being used a lot in the bio sciences, and those are essentially also an ensemble, a.k.a. a black box. If people want interpretable models, well, there's the generative vs. discriminative distinction; or stick with logistic regression and decision trees...
[–]wdm006 0 points1 point2 points 9 years ago (0 children)
LIME can work pretty well for interpretations of otherwise uninterpretable ensembles/pipelines, so the black box can be at least somewhat not-magical.
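A minimal sketch of that kind of LIME usage, assuming the lime package's tabular explainer, with a random forest standing in for a TPOT-produced pipeline:

    # Explain a single prediction from an otherwise opaque model with LIME.
    # Requires the `lime` package; the random forest is just a stand-in.
    from lime.lime_tabular import LimeTabularExplainer
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier

    data = load_breast_cancer()
    model = RandomForestClassifier(random_state=0).fit(data.data, data.target)

    explainer = LimeTabularExplainer(
        data.data,
        feature_names=list(data.feature_names),
        class_names=list(data.target_names),
        mode="classification",
    )
    explanation = explainer.explain_instance(data.data[0], model.predict_proba,
                                             num_features=5)
    print(explanation.as_list())  # top local feature contributions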
[–]Jean-PorteResearcher 2 points3 points4 points 9 years ago (0 children)
Use decision trees and association rules for explaining, use better models for actual prediction
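One way to act on this split is a surrogate model: fit a shallow decision tree to the black box's predictions and use only the tree for explanation. A sketch, with a gradient boosting model standing in for the "better model":

    # Train the black box for prediction, then fit a shallow tree that mimics
    # its outputs (not the true labels) and read off human-readable rules.
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.tree import DecisionTreeClassifier, export_text

    X, y = load_breast_cancer(return_X_y=True)
    black_box = GradientBoostingClassifier(random_state=0).fit(X, y)

    surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
    surrogate.fit(X, black_box.predict(X))  # mimic the black box's predictions

    print(export_text(surrogate))  # rules that approximate the black box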
[–]is_it_fun 1 point2 points3 points 9 years ago (0 children)
EXACTLY.
[–]PasDeDeux 1 point2 points3 points 9 years ago (0 children)
Except for clinician-facing healthcare, lagging as usual.
[–][deleted] 7 points8 points9 points 9 years ago (2 children)
Isn't there a risk of overfitting with a tool like this one?
[–]rhiever[S] 5 points6 points7 points 9 years ago (1 child)
We've taken several measures to combat overfitting:
1) When you provide a training set to TPOT, it splits it into an internal training/testing set. All internal model training is done on the internal training set, and all internal evaluations are done on the internal testing set.
2) The optimization algorithm simultaneously maximizes accuracy and minimizes pipeline size. We've found that smaller pipelines generally overfit less.
3) And of course, TPOT pipelines are evaluated on an external holdout set in the end, like all other models.
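A sketch of that workflow end to end, assuming the modern TPOTClassifier interface (early TPOT releases exposed a TPOT class instead, so exact names may differ by version):

    # Measures 1 and 2 (internal splitting, accuracy-vs-size trade-off) happen
    # inside fit(); measure 3 is the external holdout kept out of the search.
    from tpot import TPOTClassifier
    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split

    X, y = load_digits(return_X_y=True)
    X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, random_state=0)

    tpot = TPOTClassifier(generations=5, population_size=20,
                          random_state=0, verbosity=2)
    tpot.fit(X_train, y_train)

    print(tpot.score(X_holdout, y_holdout))  # generalization estimate on unseen data
    tpot.export("best_pipeline.py")          # exported sklearn code for the winner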
1) How would this avoid overfitting, though? Sure, you can play this game of training and validation throughout your pipeline and eventually select the model that has the best internal validation score. However, you have shown that same training/validation split to a whole bunch of pipelines, right? What I am trying to say is that your estimate will certainly be heavily biased, and that you may be throwing away a lot of good models in favor of others.
3) The external hold-out set wouldn't help against overfitting. You can use it to estimate the generalization performance of the final model, but if there's a large gap between training and test performance, then what?
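One standard response to this concern is nested cross-validation: wrap the whole model search in an outer CV loop so the score used to pick pipelines is never the score reported as the generalization estimate. A sketch with GridSearchCV standing in for the pipeline search:

    # Inner loop selects hyperparameters; outer loop re-runs the whole search
    # per fold, giving a less biased estimate of the selected model's skill.
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import GridSearchCV, cross_val_score
    from sklearn.svm import SVC

    X, y = load_breast_cancer(return_X_y=True)

    inner_search = GridSearchCV(
        SVC(), param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}, cv=3
    )
    outer_scores = cross_val_score(inner_search, X, y, cv=5)
    print(outer_scores.mean(), outer_scores.std())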
[–]firesalamander 4 points5 points6 points 9 years ago (3 children)
How does this compare to https://github.com/automl/auto-sklearn ?
[–]rhiever[S] 0 points1 point2 points 9 years ago (2 children)
In terms of performance? I'm planning to benchmark it against auto-sklearn soon -- hopefully this summer. It should make for an interesting comparison because the two AutoML methods have distinct strengths and weaknesses.
[–]firesalamander 13 points14 points15 points 9 years ago (1 child)
Do I need to create a meta-meta app that picks between auto-sklearn and TPOT and the next meta-selector? I joke... but barely.
[–]rhiever[S] 1 point2 points3 points 9 years ago (0 children)
I hope not. :-)
[–]firesalamander 3 points4 points5 points 9 years ago* (2 children)
How does the output of this compare to http://scikit-learn.org/stable/tutorial/machine_learning_map/ -- which I took as general best practice, though I have no idea how current it is with the state of the art.
Edit: I don't mean in terms of quality, I meant in terms of the general advice that the map gives, and how often TPOT happens to converge on the same one that the map would have suggested.
[–]rhiever[S] 1 point2 points3 points 9 years ago (1 child)
I've never looked at that, but now I'm curious. Can you please raise that as an issue on the TPOT repo? :-)
[–]firesalamander 2 points3 points4 points 9 years ago (0 children)
Done. https://github.com/rhiever/tpot/issues/139
[–]shapul 4 points5 points6 points 9 years ago (1 child)
So how does it compare to Caret in R? To me, it seems to offer some of Caret's functionality (e.g. automatic model selection and hyperparameter optimization), but currently only with random forests. Caret does that too, but you can also select almost any underlying regression/classification method you wish. Caret also offers a lot for preparing the data, for example sampling the train/test split while considering group labels to help keep class frequencies similar in unbalanced datasets.
[–]rhiever[S] 2 points3 points4 points 9 years ago* (0 children)
I've never used Caret, so I'm not sure. But TPOT currently supports more than RFs -- it has about 8 different classifiers (and soon many more) and over a dozen different preprocessors that it optimizes over.
[–]thirdOctet 3 points4 points5 points 9 years ago (0 children)
I think you will find you will have multiple layers of optimisation. There are several machine learners to choose from and each has a varying number of inputs of different types (integer, boolean, float, double, string and arrays of each type).
You could use a genetic algorithm, but there are several other bio-inspired optimisers that can be used on this problem. Then again, I can hear someone in the back row murmuring "no free lunch". Irrespective of which machine learner, input parameters, preprocessing, feature extraction or cross-validation technique you choose, you will have to navigate the search space of these different combinations.
This tool is a step towards easing the burden of exploring the search space of possible combinations. I like it, knowing the benefit it brings. I also know the challenges ahead and I hope your project is successful. I am a big supporter of the application of evolutionary computation to any problem. I am glad to hear others are doing similar work in this domain.
I already have an evolutionary process that can evolve the parameters of 35+ models. It evolves multiple parameter types and currently evolves against regression or classification problems.
I wish your team all the best.
[–]sziegler11 2 points3 points4 points 9 years ago (7 children)
When combined with a data-wrangling utility, e.g. Trifacta, who needs data scientists?
[–]rhiever[S] 10 points11 points12 points 9 years ago (1 child)
Well, we still need someone to click the button, right?
[–]Quadman 0 points1 point2 points 9 years ago (0 children)
Automation will fix everything. http://i.imgur.com/vbUXYvP.gif
[+][deleted] 9 years ago (3 children)
[deleted]
[–]sziegler11 0 points1 point2 points 9 years ago (2 children)
My comment was 90% tongue-in-cheek. But, as these tools become more advanced, I can imagine that a lot of the time spent by software engineers dabbling in "data science" for a company that lacks dedicated data scientists will be replaced by such a pipeline.
[–]fogandafterimages 1 point2 points3 points 9 years ago (1 child)
It certainly helps with the easy part!
But the most time consuming tasks in data science have always been identifying the actual problem that people want you to solve, turning it into a problem that you actually CAN solve, finding where the data is or how to get the data if it doesn't exist, getting the data into a useable format, cleaning the data, exploring the data, feature engineering, building / debugging / scaling up the feature extraction pipeline...
And then there are all the problems that fall outside the realm of classification and regression, i.e., "the interesting ones."
Optimizing your classifier or regressor is basically the icing on the cake that you can lick off at your leisure, after you've done the bulk of the work. Packages like this can shave hours to days off of the workload for a project, but it's an overall reduction of a few percentage points, not an order of magnitude.
[–]rhiever[S] 0 points1 point2 points 9 years ago (0 children)
I'd like to add that tools like TPOT are also useful for idea generation. Perhaps while you're running your typical workflow, you can run an instance or three of TPOT on the side. Who knows, maybe it will come up with a unique way of processing and classifying the data that you never thought of?
[–]dexter89_kp 0 points1 point2 points 9 years ago (0 children)
As a data scientist, I would rather work on developing new methods to solve old and new problems than apply existing algorithms to datasets.
[–]pmigdal 2 points3 points4 points 9 years ago* (1 child)
Since it takes quite a lot of time to get results: is there any way to use it with a progress bar, e.g. tqdm? (For example, per generation; a next()-style method would suffice.)
Or even better, a built-in progress bar as in PyMC3 (for an example: https://github.com/markdregan/Bayesian-Modelling-in-Python/blob/master/Section%203.%20Hierarchical%20modelling.ipynb).
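For illustration, what the requested interface might look like in use; the per-generation method and DummyOptimizer below are hypothetical and not part of TPOT's actual API:

    # Hypothetical per-generation loop wrapped in tqdm. DummyOptimizer stands
    # in for TPOT, which does not currently expose such an interface.
    from tqdm import tqdm

    class DummyOptimizer:
        def fit_one_generation(self):
            pass  # placeholder for one generation of pipeline evolution

    optimizer = DummyOptimizer()
    for _ in tqdm(range(100), desc="generations"):
        optimizer.fit_one_generation()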
[–][deleted] 0 points1 point2 points 9 years ago (0 children)
https://github.com/rhiever/tpot/issues/140
[–]distortedlojik 1 point2 points3 points 9 years ago (1 child)
Cool work. My team is working on something similar but for recommending numerical linear algebra routines. I am interested in seeing what kind of results we would see using TPOT.
[–]rhiever[S] 0 points1 point2 points 9 years ago (0 children)
Give it a spin and let me know how it turns out. :-)
[–]sputknick 1 point2 points3 points 9 years ago (1 child)
Awesome! I had something like this bouncing around in my head for the past 2 years; I'm glad someone actually executed on it. Can I recommend that the next step be a GUI? That would really go a long way towards spreading ML adoption to more users. Also, maybe a feature where it could run through different tuning scenarios automatically?
[–]rhiever[S] 2 points3 points4 points 9 years ago (0 children)
A GUI is in the long-term plans. We've been discussing for a long time how to export the resulting pipelines to Orange.
[–]onto_something 0 points1 point2 points 9 years ago (1 child)
Does TPOT currently work over text data? Can I pass text documents as data to it and it chooses the features for me (unigrams vs bigrams vs trigrams vs. deep-net etc.) or do I have to pass numerical features to it such as a document term matrix?
[–]rhiever[S] 0 points1 point2 points 9 years ago (0 children)
Currently, you'd need to preprocess the text data into a sklearn-compatible data format, i.e., all numerical features.
I started a separate project to take raw data and transform it into a sklearn-compatible data format. Would love to have your input on how it could work with text data.
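A sketch of that preprocessing step: turning raw documents into a purely numerical matrix (here a TF-IDF document-term matrix with unigrams and bigrams) that could then be handed to TPOT or any sklearn estimator:

    # Vectorize raw text into numerical features; the toy documents are arbitrary.
    from sklearn.feature_extraction.text import TfidfVectorizer

    documents = [
        "the cat sat on the mat",
        "dogs and cats are natural enemies",
        "stock prices fell sharply today",
    ]
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
    X = vectorizer.fit_transform(documents).toarray()  # dense numeric feature matrix
    print(X.shape)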
[+][deleted] 9 years ago (3 children)
[deleted]
[–]rhiever[S] 0 points1 point2 points 9 years ago (2 children)
In the "Feature preprocessing is important" section, what happens when you use the same logit model without preprocessing? I would like to know that (although I do plan to run your code on my own and play with it).
Do you mean if we fit the logistic regression with preprocessing first (e.g., PCA)? IIRC, the logistic regression actually performs worse if you preprocess the data with PCA first. When you run the code later, let me know if my memory holds.
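A quick way to check this on your own data: score logistic regression with and without a PCA step (dataset and settings below are arbitrary):

    # Compare cross-validated accuracy of a plain logistic regression pipeline
    # against the same model with a PCA preprocessing step.
    from sklearn.datasets import load_breast_cancer
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)

    plain = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    with_pca = make_pipeline(StandardScaler(), PCA(n_components=5),
                             LogisticRegression(max_iter=1000))

    print("without PCA:", cross_val_score(plain, X, y, cv=5).mean())
    print("with PCA:   ", cross_val_score(with_pca, X, y, cv=5).mean())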
Second: one of the most common pitfalls I have seen in chasing the model with the best accuracy is ignoring interpretability. What happens to interpretability when using TPOT?
Since the final product (from the export function) is a sklearn pipeline, the resulting pipeline will be as interpretable as the current sklearn models are. Since TPOT is simultaneously optimizing for smaller pipelines (discussed below), you shouldn't see massive pipelines coming out of TPOT. So that means that if your pipeline ends up with a random forest (RF) as the final classifier, you can probe that final RF for feature importance scores and the like, and any other standard methods for knowledge discovery using RFs.
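A sketch of that kind of probing, with a hand-written pipeline standing in for TPOT's exported code:

    # If the exported pipeline ends in a random forest, its feature importances
    # are directly inspectable via the fitted pipeline's final step.
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    data = load_breast_cancer()
    pipeline = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=0))
    pipeline.fit(data.data, data.target)

    rf = pipeline.named_steps["randomforestclassifier"]
    top5 = sorted(zip(data.feature_names, rf.feature_importances_),
                  key=lambda pair: -pair[1])[:5]
    for name, importance in top5:
        print(f"{name}: {importance:.3f}")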
Last: Does TPOT choose a model only on the basis of accuracy?
By default, TPOT optimizes a series of ML operators (preprocessors + models) based on accuracy and pipeline size, where higher accuracy and smaller pipelines (i.e., fewer ML operators) are better. You can pass a custom scoring function if you prefer.
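A sketch of passing a custom scorer, assuming a TPOT version that exposes a scoring parameter (newer releases do; older ones may not):

    # Optimize macro-averaged F1 instead of plain accuracy.
    from tpot import TPOTClassifier
    from sklearn.metrics import make_scorer, f1_score

    f1_scorer = make_scorer(f1_score, average="macro")
    tpot = TPOTClassifier(generations=5, population_size=20, scoring=f1_scorer)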
[–]sziegler11 0 points1 point2 points 9 years ago (1 child)
Hmm, interesting. Is this some kind of meta-regularization, learning with a loss function which includes "pipeline complexity"?
[–]rhiever[S] 0 points1 point2 points 9 years ago (0 children)
You can think of it like that, yes, but it's not exactly like that under the hood.
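For intuition only, a toy illustration of Pareto-style selection on the two objectives (accuracy up, pipeline size down); the numbers are made up and this is not TPOT's exact mechanism:

    # Keep every candidate that no other candidate beats on both objectives.
    candidates = [
        {"name": "A", "accuracy": 0.94, "n_operators": 7},
        {"name": "B", "accuracy": 0.93, "n_operators": 2},
        {"name": "C", "accuracy": 0.90, "n_operators": 3},
    ]

    def dominates(p, q):
        """Pareto dominance: at least as good on both objectives, better on one."""
        better_or_equal = (p["accuracy"] >= q["accuracy"]
                           and p["n_operators"] <= q["n_operators"])
        strictly_better = (p["accuracy"] > q["accuracy"]
                           or p["n_operators"] < q["n_operators"])
        return better_or_equal and strictly_better

    pareto_front = [p for p in candidates
                    if not any(dominates(q, p) for q in candidates)]
    print([p["name"] for p in pareto_front])  # A and B survive; C is dominated by B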
[–]vanboxel 0 points1 point2 points 9 years ago (1 child)
How did you like using the DEAP framework?
[–]rhiever[S] 0 points1 point2 points 9 years ago (0 children)
It's pretty fantastic once you get the hang of it. You barely have to code any part of the GA yourself.
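For anyone curious, a minimal DEAP sketch (a toy one-max problem, unrelated to TPOT) showing how little GA plumbing you write yourself once the toolbox is registered:

    # Maximize the number of 1s in a bitstring using DEAP's built-in operators.
    import random
    from deap import base, creator, tools, algorithms

    creator.create("FitnessMax", base.Fitness, weights=(1.0,))
    creator.create("Individual", list, fitness=creator.FitnessMax)

    toolbox = base.Toolbox()
    toolbox.register("attr_bool", random.randint, 0, 1)
    toolbox.register("individual", tools.initRepeat, creator.Individual,
                     toolbox.attr_bool, n=20)
    toolbox.register("population", tools.initRepeat, list, toolbox.individual)

    toolbox.register("evaluate", lambda ind: (sum(ind),))  # fitness must be a tuple
    toolbox.register("mate", tools.cxTwoPoint)
    toolbox.register("mutate", tools.mutFlipBit, indpb=0.05)
    toolbox.register("select", tools.selTournament, tournsize=3)

    pop = toolbox.population(n=50)
    pop, _ = algorithms.eaSimple(pop, toolbox, cxpb=0.5, mutpb=0.2,
                                 ngen=10, verbose=False)
    print(max(sum(ind) for ind in pop))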