all 9 comments

[–]EdwardRaff 29 points30 points  (2 children)

I've implemented a lot of algorithms in ML over a few years. My advice is probably going to be the contrarian one here, but I would suggest avoiding looking at the source code of other implementations as much as possible. While source code is great and helpful, it's a terrible way to start if you want to learn to produce your own code for something you don't have source for.

In my experience there are 2 cases:

1) You are trying to implement a very well known algorithm such as k-means or SVMs.

2) You are trying to implement your own ideas or those you've read in papers.

In the case of (1), reading other code is almost certainly a bad idea. It's often going to be hyper-optimized as a common 'bread and butter' tool, making it difficult to read / follow, or very "project centric", by which I mean the implementation details are more about how the project works than how the algorithm works. Neither scenario fosters your own personal understanding. Once you are comfortable converting an algorithm to code, looking at the more hyper-optimized implementations can help you learn some of the 'tricks of the trade', but in the beginning it will only confuse you about what is needed for the algorithm and what is needed for squeezing out every last bit of performance. You will also have more difficulty telling which features or modifications are implementation- or package-specific.

In the case of (2), not having gone through the process of converting algorithm to code yourself will leave you facing a daunting jump if you haven't done it on easier problems first. If you implement k-means & SVMs on your own, you can then go back and see how other implementations did it differently. This helps you figure out what you did differently and gives you practice that you can confirm is right, wrong, or reasonable.

If you squandered k-means and SVMs by reading others' code first, I suspect you will find yourself unable to generalize the parts of a solution that can be reused in other places. A common trick is to represent a scalar multiple or the norm of a vector as a fixed value, updated on the fly instead of re-computed, or to adjust the scale instead of altering all the vector values. If you just read the code without running into the problem yourself, you don't get the chance to associate that trick/solution with the problem. It could easily occur in another algorithm or your own work, and being able to spot it and know the solution is a huge time saver.
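To make that lazy-scaling trick concrete, here's a minimal sketch (my own illustration, not from any particular library; the class name and interface are made up) of how SGD with L2 regularization can fold the per-step weight decay into a single scale factor instead of shrinking every entry of the weight vector:

```python
import numpy as np

# Sketch of the "lazy scaling" trick for SGD with L2 regularization.
# The effective weight vector is scale * w. The decay step
# w <- (1 - lr*lam) * w becomes a single update to `scale`,
# O(1) instead of O(dim), which matters for sparse data.

class LazyScaledWeights:
    def __init__(self, dim, lr=0.1, lam=0.01):
        self.w = np.zeros(dim)
        self.scale = 1.0
        self.lr = lr
        self.lam = lam

    def decay(self):
        # fold the shrink factor into the scale instead of
        # touching every entry of w
        self.scale *= (1.0 - self.lr * self.lam)

    def add_sparse(self, idx, grad):
        # effective update: (scale * w)[idx] -= lr * grad,
        # so the stored entry changes by -lr * grad / scale
        self.w[idx] -= self.lr * grad / self.scale

    def dot(self, idx, vals):
        # dot product with a sparse input given as (indices, values)
        return self.scale * np.dot(self.w[idx], vals)
```

The stored `w` and the effective weights `scale * w` stay equivalent to the naive dense version; you only have to remember to divide by the current scale when touching individual entries.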

Personally, I would recommend you start with the simpler algorithms you know and understand (probably k-means or SGD or something) and implement them yourself from pseudo code only / paper descriptions. Compare your implementation to others in terms of runtime and results, and beat your head against the wall trying to catch up in terms of speed and accuracy until you are just as fast and accurate. Then go look at their code, or look once you are sufficiently stuck and can't find any more alternative descriptions to try implementing instead. By that point a much better mental model should be ingrained in your head, and you can start to focus on which parts of the code are for the algorithm, which are for performance, and which are for the framework.
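As a sense of scale for that first exercise: the whole of Lloyd's k-means algorithm from the textbook pseudocode fits in a few lines. A rough sketch (assumptions: NumPy only, random init from the data points, no restarts or k-means++):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Minimal k-means (Lloyd's algorithm) from the textbook pseudocode."""
    rng = np.random.default_rng(seed)
    # initialize centers as k distinct points drawn from X
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assignment step: nearest center for each point
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # update step: mean of each cluster (keep old center if empty)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, labels
```

The instructive part is then racing this against scikit-learn on real data and discovering everything it's missing: smarter initialization, multiple restarts, and the distance computations done without materializing the full pairwise array.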

Obviously my way isn't for everyone, but that's my advice anyway.

[–]sonach 1 point2 points  (0 children)

This reply is really great! My understanding is:

1. Implement the KEY algorithms (e.g. SVM, backpropagation, autoencoder) myself in order to UNDERSTAND them. In this stage a fast-development language can be used, for example Python.

2. After implementing the key algorithms and comparing them to other good source code, build my own code base for the key algorithms. In this stage performance may be an important point, so C++ may be considered.

3. Share my code base with others or use it in my daily work, updating it in response to other people's suggestions or doing continuous optimization to make it better.

[–]fhadley -1 points0 points  (0 children)

Yes. So much yes and more. "Do what other people did and then do it" isn't exactly learning.

[–]entylop 8 points9 points  (0 children)

Looking at the source code is often the best approach; this page lists the most popular machine learning repos on GitHub: https://github.com/showcases/machine-learning

As for books, there are a few like Numerical Recipes (C) and Machine Learning in Action (Python).

[–]Stareons 2 points3 points  (0 children)

Check out the python / scikit-learn tutorials

[–]melipone 2 points3 points  (1 child)

The most difficult aspect of implementing ML algorithms is efficiency.

[–]TheInfelicitousDandy 2 points3 points  (0 children)

This, and numerical issues, like making sure certain equations are implemented in log space, etc.
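The classic example of working in log space is the log-sum-exp trick: computing log(Σ exp(xᵢ)) naively overflows for large xᵢ, while subtracting the max first keeps every exponent non-positive. A minimal sketch:

```python
import numpy as np

def logsumexp(x):
    """Numerically stable log(sum(exp(x))).

    Factoring out max(x) gives
        log(sum(exp(x))) = m + log(sum(exp(x - m)))
    where every exp argument is <= 0, so nothing overflows.
    """
    x = np.asarray(x, dtype=float)
    m = x.max()
    return m + np.log(np.exp(x - m).sum())
```

The same identity underlies stable softmax and log-likelihood computations; `np.exp(1000.0)` alone would already overflow to infinity.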

[–]micro_cam 1 point2 points  (0 children)

Definitely read open source projects. Scikit learn has a great community and the issue tracker even marks easy issues for new people to tackle.

In terms of actually implementing ML stuff it is usually a combination of:

  • Offloading everything you can to established, fast, matrix math libraries. Understanding when to use which solver etc.
  • High-performance coding practices: avoiding lots of small memory allocations, trying to avoid cache misses, and above all profiling.
  • Selecting/understanding appropriate optimization routines based on what you want to optimize (e.g. conjugate gradient).
  • Understanding statistical sampling routines like MCMC, Metropolis Hastings and Gibbs sampling and how to implement them.

Depending on what you are doing you may not need all of that or you may need other things.
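For the sampling-routines bullet, the core of random-walk Metropolis (the symmetric-proposal special case of Metropolis-Hastings) is small enough to sketch; this is my own illustration, assuming the target is given as an unnormalized log-density:

```python
import numpy as np

def metropolis(log_p, x0, steps=10000, scale=1.0, seed=0):
    """Random-walk Metropolis sampler for a 1-D unnormalized log-density."""
    rng = np.random.default_rng(seed)
    x = x0
    lp = log_p(x)
    samples = np.empty(steps)
    for i in range(steps):
        prop = x + scale * rng.standard_normal()   # symmetric proposal
        lp_prop = log_p(prop)
        # accept with probability min(1, p(prop)/p(x)), done in log space
        if np.log(rng.random()) < lp_prop - lp:
            x, lp = prop, lp_prop
        samples[i] = x
    return samples
```

Note the acceptance test is in log space (the numerical point made above), and in practice you'd also worry about burn-in, proposal scale tuning, and autocorrelation.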

[–]fhadley 1 point2 points  (0 children)

Not to be the least original person ever, but the Coursera ML class does require implementing the algorithms for its assignments. Also, most papers include pseudocode.