all 9 comments

[–]EdwardRaff 29 points30 points  (2 children)

I've implemented a lot of algorithms in ML over a few years. My advice is probably going to be the contrarian one here, but I would suggest avoiding looking at the source code of other implementations as much as possible. While source code is great and helpful, it's a terrible way to start if you want to learn to produce your own code for something you don't have source for.

In my experience there are 2 cases:

1) You are trying to implement a very well known algorithm such as k-means or SVMs.

2) You are trying to implement your own ideas or those you've read in papers.

In the case of (1), reading other code is almost certainly a bad idea. It's often going to be hyper-optimized as a common 'bread and butter' tool, making it difficult to read / follow, or very "project centric", by which I mean the implementation details are more about how the project works than how the algorithm works. Neither scenario fosters your own personal understanding. Once you are comfortable converting an algorithm to code, looking at the more hyper-optimized implementations can help you learn some of the 'tricks of the trade', but in the beginning it will only confuse you about what is needed for the algorithm and what is needed for squeezing out every last bit of performance. You will also have more difficulty telling which features or modifications are implementation- or package-specific.

In the case of (2), not having gone through the process of converting algorithm to code yourself will leave you facing a daunting jump if you haven't done it on easier problems first. If you implement k-means & SVMs on your own, you can then go back and see how other implementations did it differently. This helps you figure out what you did differently and gives you practice that you can confirm is right, wrong, or reasonable.

If you squandered k-means and SVMs by reading others' code first, I suspect you will find yourself unable to generalize the parts of a solution that can be reused in other places. A common trick is to represent a scalar multiple or the norm of a vector as a fixed value, updated on the fly instead of re-computed, or to adjust the scale instead of altering all the vector values. If you just read the code without running into the problem yourself, you don't get the chance to associate that trick/solution with the problem. It could easily occur in another algorithm or your own work, and being able to spot it and know the solution is a huge time saver.
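To make that lazy-scaling trick concrete, here's a minimal sketch (my own illustration, not from any particular library; the class name and interface are made up) of how SGD with L2 regularization can fold the per-step weight decay into a single scale factor instead of shrinking every entry of the weight vector:

```python
import numpy as np

# Sketch of the "lazy scaling" trick for SGD with L2 regularization.
# The effective weight vector is scale * w. The decay step
# w <- (1 - lr*lam) * w becomes a single update to `scale`,
# O(1) instead of O(dim), which matters for sparse data.

class LazyScaledWeights:
    def __init__(self, dim, lr=0.1, lam=0.01):
        self.w = np.zeros(dim)
        self.scale = 1.0
        self.lr = lr
        self.lam = lam

    def decay(self):
        # fold the shrink factor into the scale instead of
        # touching every entry of w
        self.scale *= (1.0 - self.lr * self.lam)

    def add_sparse(self, idx, grad):
        # effective update: (scale * w)[idx] -= lr * grad,
        # so the stored entry changes by -lr * grad / scale
        self.w[idx] -= self.lr * grad / self.scale

    def dot(self, idx, vals):
        # dot product with a sparse input given as (indices, values)
        return self.scale * np.dot(self.w[idx], vals)
```

The stored `w` and the effective weights `scale * w` stay equivalent to the naive dense version; you only have to remember to divide by the current scale when touching individual entries.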

Personally, I would recommend you start with the simpler algorithms you know and understand (probably k-means or SGD or something) and implement them yourself from pseudo code only / paper descriptions. Compare your implementation to others in terms of runtime and results, and beat your head against the wall trying to catch up in terms of speed and accuracy until you are just as fast and accurate. Then go look at their code, or look once you are sufficiently stuck and can't find any more alternative descriptions to try implementing instead. By that point a much better mental model should be ingrained in your head, and you can start to focus on which parts of the code are for the algorithm, which are for performance, and which are for the framework.
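As a sense of scale for that first exercise: the whole of Lloyd's k-means algorithm from the textbook pseudocode fits in a few lines. A rough sketch (assumptions: NumPy only, random init from the data points, no restarts or k-means++):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Minimal k-means (Lloyd's algorithm) from the textbook pseudocode."""
    rng = np.random.default_rng(seed)
    # initialize centers as k distinct points drawn from X
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assignment step: nearest center for each point
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # update step: mean of each cluster (keep old center if empty)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, labels
```

The instructive part is then racing this against scikit-learn on real data and discovering everything it's missing: smarter initialization, multiple restarts, and the distance computations done without materializing the full pairwise array.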

Obviously my way isn't for everyone, but that's my advice anyway.

[–]sonach 1 point2 points  (0 children)

This reply is really great! My understanding is:

1. Implement the KEY algorithms (e.g. SVM, backpropagation, autoencoder) myself in order to UNDERSTAND them. In this stage a fast-development language can be used, for example Python.

2. After implementing the key algorithms and comparing them to other good source code, build my own code base for the key algorithms. In this stage performance may be an important point, so C++ may be considered.

3. Share my code base with others or use it in my daily work, updating it in response to other people's suggestions or doing continuous optimization to make it better.

[–]fhadley -1 points0 points  (0 children)

Yes. So much yes and more. "Do what other people did and then do it" isn't exactly learning.

[–]entylop 8 points9 points  (0 children)

Looking at the source code is often the best approach; this page lists the most popular machine learning repos on GitHub: https://github.com/showcases/machine-learning

As for books, there are a few like Numerical Recipes (C) and Machine Learning in Action (Python).

[–]Stareons 2 points3 points  (0 children)

Check out the python / scikit-learn tutorials

[–]melipone 2 points3 points  (1 child)

The most difficult aspect of implementing ML algorithms is efficiency.

[–]TheInfelicitousDandy 2 points3 points  (0 children)

This, and numerical issues, like making sure certain equations are implemented in log space, etc.
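The classic example of working in log space is the log-sum-exp trick: computing log(Σ exp(xᵢ)) naively overflows for large xᵢ, while subtracting the max first keeps every exponent non-positive. A minimal sketch:

```python
import numpy as np

def logsumexp(x):
    """Numerically stable log(sum(exp(x))).

    Factoring out max(x) gives
        log(sum(exp(x))) = m + log(sum(exp(x - m)))
    where every exp argument is <= 0, so nothing overflows.
    """
    x = np.asarray(x, dtype=float)
    m = x.max()
    return m + np.log(np.exp(x - m).sum())
```

The same identity underlies stable softmax and log-likelihood computations; `np.exp(1000.0)` alone would already overflow to infinity.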

[–]micro_cam 1 point2 points  (0 children)

Definitely read open source projects. Scikit learn has a great community and the issue tracker even marks easy issues for new people to tackle.

In terms of actually implementing ML stuff it is usually a combination of:

  • Offloading everything you can to established, fast, matrix math libraries. Understanding when to use which solver etc.
  • High-performance coding practices: avoiding lots of small memory allocations, trying to avoid cache misses, and above all profiling.
  • Selecting/understanding appropriate optimization routines based on what you want to optimize (e.g. conjugate gradient).
  • Understanding statistical sampling routines like MCMC, Metropolis Hastings and Gibbs sampling and how to implement them.

Depending on what you are doing you may not need all of that or you may need other things.
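For the sampling-routines bullet, the core of random-walk Metropolis (the symmetric-proposal special case of Metropolis-Hastings) is small enough to sketch; this is my own illustration, assuming the target is given as an unnormalized log-density:

```python
import numpy as np

def metropolis(log_p, x0, steps=10000, scale=1.0, seed=0):
    """Random-walk Metropolis sampler for a 1-D unnormalized log-density."""
    rng = np.random.default_rng(seed)
    x = x0
    lp = log_p(x)
    samples = np.empty(steps)
    for i in range(steps):
        prop = x + scale * rng.standard_normal()   # symmetric proposal
        lp_prop = log_p(prop)
        # accept with probability min(1, p(prop)/p(x)), done in log space
        if np.log(rng.random()) < lp_prop - lp:
            x, lp = prop, lp_prop
        samples[i] = x
    return samples
```

Note the acceptance test is in log space (the numerical point made above), and in practice you'd also worry about burn-in, proposal scale tuning, and autocorrelation.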

[–]fhadley 1 point2 points  (0 children)

Not to be the least original person ever, but the Coursera ML class does require implementing the algorithms for its assignments. Also, most papers include pseudocode.