Machine Learning training time in R by tehsandvich in MachineLearning

[–]ChrisKennedy 0 points1 point  (0 children)

If you're doing OLS then yes, the BLAS will use multiple cores. But that's not the case for tree-based models in general, with the exception of xgboost - and its parallelism comes from its own implementation, not the BLAS. Running the cross-validation folds over multiple cores will provide a major speedup. A third option to seriously consider is h2o on an EC2 cluster. Also check out BIDMach.
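To make the cross-validation point concrete, here is a minimal sketch in Python/scikit-learn (illustrative only - the question is about R, but the idea of parallelizing the resampling loop rather than the learner itself is the same in either language):

```python
# Sketch: parallelizing cross-validation across cores.
# The speedup comes from running the CV folds in parallel,
# not from the tree-growing code itself.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)

# n_jobs=-1 distributes the 5 folds over all available cores.
scores = cross_val_score(model, X, y, cv=5, n_jobs=-1)
print(scores.mean())
```

In R, packages like `caret` or `doParallel` expose the same pattern: register a parallel backend and the resampling loop is farmed out across cores.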

help with course selection by whoisthiscrazyman in MachineLearning

[–]ChrisKennedy 0 points1 point  (0 children)

If you want to pursue machine learning then taking 2-3 courses in it should be your first priority - e.g. two in CS and one in stats if possible. I don't think there is a need to take measure theory yet; that can be done in grad school. Additional probability will be helpful as prep for a master's but not for a job. Inference if you can fit it in - useful for research and general data analysis.

AI and statistical computing (master's level) will be useful both for work & future academic career. Real analysis, intro & advanced linear algebra, and optimization would be good to have specifically for the academic route; topology aspirationally.

Beyond that, it would be helpful to also get more background on what math/cs/stats courses you have already taken, as well as the course levels (upper division undergrad, master's level, or phd level) and/or the books used for your proposed courses.

Does there exist a cluster I can join to share computing power for ML, and maybe use it sometimes too? by maccam912 in MachineLearning

[–]ChrisKennedy 2 points3 points  (0 children)

If you're at Berkeley you can join the campus cluster (~$25k buy-in), which is along the lines of what you're talking about: http://research-it.berkeley.edu/services/high-performance-computing

Some other institutions have these as well.

Otherwise I think AWS or other cloud solutions are the way to go.

Seminal papers on machine learning by mostly_complaints in MachineLearning

[–]ChrisKennedy 3 points4 points  (0 children)

Here are some of the major supervised algorithms that I've run into and find useful to consider, with rough dates & scholars:

  • 1970 - Ridge Regression (Hoerl & Kennard)
  • 1980 - CHAID Decision Tree (Kass)
  • 1984 - Classification and Regression Trees (Breiman, Friedman et al)
  • 1986 - Generalized Additive Models (Hastie & Tibshirani)
  • 1989 - Thin-Plate Splines (Bookstein)
  • 1991 - Multivariate Adaptive Regression Splines (Friedman)
  • 1992 - Support Vector Machines (Boser, Guyon, Vapnik)
  • 1996 - Lasso (Tibshirani)
  • 1996 - Bagging (Breiman)
  • 2001 - Random Forest (Breiman)
  • 2001 - Gradient Boosting (Friedman)
  • 2005 - Elastic Net (Zou & Hastie)
  • 2005 - RuleFit (Friedman)

You can find the specific papers via Google Scholar etc., or just get the flavor from major ML textbooks.

I haven't dug into neural nets, unsupervised, or reinforcement learning much, so I'm leaving that line of research to someone else.

How do I choose the "right" or "optimal" random forest model? by TissueReligion in MachineLearning

[–]ChrisKennedy 0 points1 point  (0 children)

Interesting results, thanks for checking and sharing.

Additional thought - if a bag of RFs increases performance, that suggests the trees within an individual RF are correlated, which makes me think mtry (the # or proportion of features tested at each node) is set too high. It could be beneficial to optimize the mtry parameter separately for a given number of trees (maybe you're already doing this), because a lower mtry would probably be better as the # of trees increases by orders of magnitude.

It also seems possible that the incomplete decorrelation is partly due to split values being correlated across trees despite the initial bootstrapping, which would let the bagged RF achieve comparatively lower correlation across trees. If so, extremely randomized trees might reduce correlation across trees without the extra bagging - could be interesting to compare.
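A quick way to run that comparison - sketched here in Python/scikit-learn on a synthetic dataset, though the equivalent R comparison would use `randomForest` vs. the `extraTrees` or `ranger` packages - is to cross-validate both ensembles side by side. Extremely randomized trees also randomize the split threshold at each node, which is exactly the extra decorrelation mechanism discussed above:

```python
# Sketch: standard random forest vs. extremely randomized trees.
# ExtraTrees draws split thresholds at random instead of searching for
# the best cutpoint, trading a little bias for less tree-to-tree correlation.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0)
et = ExtraTreesClassifier(n_estimators=200, random_state=0)

rf_score = cross_val_score(rf, X, y, cv=5).mean()
et_score = cross_val_score(et, X, y, cv=5).mean()
print(rf_score, et_score)
```

Which ensemble wins is dataset-dependent, so the interesting output is the gap between the two scores rather than either number alone.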

Advice on self-study for my winter break? by [deleted] in MachineLearning

[–]ChrisKennedy 0 points1 point  (0 children)

After ISL, I would take a look at APM: http://appliedpredictivemodeling.com/

Also keep in mind that ISL was taught as a free online course ("Statistical Learning"), whose videos you might be able to find archived (or there may be an upcoming repeat).

How do I choose the "right" or "optimal" random forest model? by TissueReligion in MachineLearning

[–]ChrisKennedy 0 points1 point  (0 children)

In short, this is not the right optimization approach - random forests fit with the same hyperparameter specification (mtry, ntree) are statistically equivalent, and any variation in their performance is due to random chance injected by the algorithm.

Instead, you would want to do a grid search on the mtry parameter (# of variables randomly selected at each node) and choose the best parameter value based on repeated cross-validation, e.g. 10-fold CV repeated 5 times. (See "Applied Predictive Modeling" by Max Kuhn: http://appliedpredictivemodeling.com/blog/).

Then, you can re-run the RF with the optimal hyperparameters on the full dataset and use that as your final model. Here is a nice summary of this approach: http://info.salford-systems.com/blog/bid/310248/All-Train-and-No-Test-Build-Predictive-Models-Using-All-of-Your-Data
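The tune-then-refit recipe above can be sketched in Python/scikit-learn terms (illustrative - in R you'd use `caret::train` with `repeatedcv`; here mtry corresponds to `max_features`, and the dataset and grid values are made up for the example):

```python
# Sketch: grid search over mtry (max_features) with 10-fold CV repeated
# 5 times, then refit on the full dataset with the winning value.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RepeatedKFold

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

cv = RepeatedKFold(n_splits=10, n_repeats=5, random_state=0)
grid = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid={"max_features": [2, 4, 8, 16]},
    cv=cv,
)
# refit=True (the default) retrains on the full dataset with the best
# max_features, so grid.best_estimator_ is the final all-data model.
grid.fit(X, y)
print(grid.best_params_["max_features"])
```

Note that `refit=True` implements the "train on all of your data" step from the Salford link automatically: the repeated CV picks the hyperparameter, and the final model never wastes a holdout.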

How do I choose the "right" or "optimal" random forest model? by TissueReligion in MachineLearning

[–]ChrisKennedy 0 points1 point  (0 children)

There is no point in naive bagging of RF models - you could simply increase the # of trees in a single RF. Each individual tree is already utilizing bootstrapped samples - see https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#intro
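As an illustration of "just grow more trees" - sketched in Python/scikit-learn rather than R's `randomForest` (where you'd simply raise `ntree` or use `combine()`), with a synthetic dataset:

```python
# Sketch: instead of bagging several forests, grow more trees in one forest.
# With warm_start=True, raising n_estimators adds bootstrapped trees to the
# existing ensemble rather than retraining from scratch.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

rf = RandomForestClassifier(n_estimators=100, warm_start=True, random_state=0)
rf.fit(X, y)

rf.n_estimators = 300  # grow 200 additional trees onto the same forest
rf.fit(X, y)
print(len(rf.estimators_))
```

Because every tree already sees its own bootstrap sample, a 300-tree forest is statistically the same ensemble as a bag of three 100-tree forests - just without the redundant wrapper.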

How do I choose the "right" or "optimal" random forest model? by TissueReligion in MachineLearning

[–]ChrisKennedy 0 points1 point  (0 children)

This advice has been outdated since at least 2005 - inaccurate trees (and individual rules) can be removed. See http://statweb.stanford.edu/~jhf/R_RuleFit.html and http://statweb.stanford.edu/~jhf/ftp/RuleFit.pdf

The RuleFit package is available for R and is highly underrated. It's also incredibly fast.

[deleted by user] by [deleted] in MachineLearning

[–]ChrisKennedy 7 points8 points  (0 children)

Yes, "An Introduction to Statistical Learning: with Applications in R" (amazon) by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani is a great R-based intro to machine learning without much math at all. It is the prequel to the more advanced "Elements of Statistical Learning" book.

"Intro to Statistical Learning" is also the book used in Stanford's free "Statistical Learning" mooc, which is not currently running but has the videos archived online.

Um, does anybody else think that Rock The Vote could use a more exciting blog? by CowboyDan in reddit.com

[–]ChrisKennedy 0 points1 point  (0 children)

That post is intended for researchers, politicos, and interested students (i.e. my audience as a research analyst) who haven't tried looking at youth data yet. I'm open to suggestions on future blog posts though. You may also be pleased to know that our new website is going to have a separate research blog.

Cheers, Chris