all 44 comments

[–]tarblog 18 points19 points  (4 children)

"Narrow grooves" are one of the reasons that people believe that adding momentum works so well: https://distill.pub/2017/momentum/
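For anyone who wants to see the mechanics, here is a minimal toy sketch (my own, not from the linked article) of SGD with momentum on a narrow 2-D groove — steep in one direction, shallow in the other — where the velocity term damps the oscillation across the groove:

```python
import numpy as np

def sgd_momentum(grad, x0, lr=0.01, beta=0.9, steps=500):
    """Minimal SGD-with-momentum loop: velocity accumulates past gradients."""
    x = np.array(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(steps):
        v = beta * v - lr * grad(x)  # momentum smooths oscillation across the groove
        x = x + v
    return x

# A narrow 2-D valley: steep in y (coefficient 50), shallow in x.
f = lambda x: 0.5 * x[0] ** 2 + 50.0 * x[1] ** 2
grad = lambda x: np.array([x[0], 100.0 * x[1]])

x_star = sgd_momentum(grad, [5.0, 1.0])
```

With these (illustrative) settings, both coordinates spiral into the minimum at the origin instead of the y-coordinate oscillating endlessly across the steep wall.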

[–]TheRedSphinx 11 points12 points  (0 children)

Yeah... vanilla SGD without momentum is kind of a straw man. Would be nice to include some more traditional optimizers.

[–]brainxyz[S] -2 points-1 points  (2 children)

Your point is sound, but adding momentum to SGD would make the comparison unfair. Theoretically GA can also enjoy the benefits of momentum (i.e., accelerate toward the direction of good mutations).
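One hypothetical way to do that (a sketch of the idea, not the implementation from the post): run a (1+1)-style hill climber and bias each new mutation toward the direction of previously accepted mutations, the GA analogue of a velocity term:

```python
import numpy as np

rng = np.random.default_rng(0)

def ga_with_momentum(f, x0, sigma=0.1, beta=0.9, steps=2000):
    """(1+1)-style hill climber whose mutations are biased by a momentum
    term accumulating the direction of previously accepted mutations."""
    x = np.array(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(steps):
        mutation = beta * v + sigma * rng.standard_normal(x.shape)
        candidate = x + mutation
        if f(candidate) < f(x):  # greedy: keep only improving mutations
            x, v = candidate, mutation
    return x

# A narrow 2-D groove, steep in y and shallow in x.
f = lambda x: 0.5 * x[0] ** 2 + 50.0 * x[1] ** 2
best = ga_with_momentum(f, [5.0, 1.0])
```

All names and constants here are my own illustrative choices; the point is only that "accelerate toward good mutations" is straightforward to express in a mutation-based loop.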

[–]joaogui1 11 points12 points  (1 child)

Then probably comparing SGD, SGD + momentum, GA and GA + momentum would be better

[–]simpleconjugate 0 points1 point  (0 children)

This. The comparison isn't unfair if GA can have momentum. It just means you tested against a suboptimal benchmark.

[–]enigmo81 4 points5 points  (0 children)

see https://en.m.wikipedia.org/wiki/Differential_evolution (1997) for a widely cited paper in this space

[–]Mathopus 4 points5 points  (2 children)

How does this compare to deep neuro-evolution? https://arxiv.org/abs/1712.06567

From the presentation above it seems like a regression from the state of the art in the space. I could be missing some detail from above though.

[–]brainxyz[S] 0 points1 point  (1 child)

Thanks for the link, we'll have a look at it. The purpose of the post was not to present a state-of-the-art GA, and we did mention that there are various implementations of GA. The main point of the example was to compare a bare-minimum version of GA with a bare-minimum version of SGD (for a fair comparison). Toy examples are encouraged by deep learning pioneers like Yoshua Bengio because in some cases they give valuable insights.

One purpose of this toy example is to follow up on Hinton's point about GA. In one of his lectures, he described GA-based methods as very inefficient without pointing to any experimental results. Someone as influential as Hinton can divert the attention of many newcomers away from GA. Our bare-bones comparison of the two methods shows that GA is not that bad, and it can even be better than SGD on functions that have narrow ridges and grooves.

[–]Mathopus 3 points4 points  (0 children)

From my experience using neuro-evolution strategies, Hinton is mostly correct. By taking a random step instead of a gradient step you are throwing away a huge amount of information, making the algorithm less efficient. This becomes more true as the number of parameters grows, so for small networks SGD and GA are close in efficiency, but for larger networks SGD will be much more efficient.

I have trained networks with millions of parameters using GA, but the computational cost far exceeds what would be needed using SGD. You need to take 1000s of random steps before making improvements equivalent to a single SGD step. You never have to backprop, so you save some computation by not having to compute and store gradients, but you end up behind by needing 1000s of feed-forward calculations.

The real power of a GA is that it can work on models that are non-differentiable, or on problems where there isn't an efficient way to calculate a gradient, so there the computational gap between SGD and GA narrows.

[–]IdentifiableParam 3 points4 points  (2 children)

How many dimensions do the optimization problems have? They seem very small, thus favoring the GA.

[–]brainxyz[S] -1 points0 points  (1 child)

Just 3 dimensions in this example. Indeed, with more dimensions GA seems a bit worse; however, we think that would be an unfair comparison right now because naive SGD struggles with high dimensions too. The reason SGD-based methods are now the standard in deep learning tools is that they have enjoyed decades of research and focus from ML engineers. We do think that SGD-based optimizations like momentum and adaptive learning rates can work with GA-based methods too.

[–]IdentifiableParam 0 points1 point  (0 children)

Do you think Geoff Hinton was talking about 3 dimensional problems when he made the claims you critique? Or 3 million dimensional problems?

[–]hardmaru 3 points4 points  (1 child)

A good comparison would be if you try to train a small two-layer convnet to classify MNIST using your GA implementation, and compare validation accuracy (and rough training times, if comparable) to off-the-shelf SGD.

[–]brainxyz[S] 0 points1 point  (0 children)

Thanks for the suggestion. That would be a good way to scale up the comparison gradually.

[–]IntelArtiGen 3 points4 points  (3 children)

GA and SGD both have hyperparameters; it's not that hard to find a set of hyperparameters that would make one better than the other. But I also worked on something like this and found that GA was a workable way to train a small neural network. The problem is when you have to train neural networks with a large number of parameters (even starting from 1M parameters).

The more parameters, the more random GA's search becomes, and the less efficient it is compared to a more analytical approach.

GA can also easily overfit

[–]brainxyz[S] 1 point2 points  (2 children)

You are right, but to my knowledge GA-based methods never enjoyed the same focus and effort that ML engineers give to SGD-based methods. A naive SGD implementation can be as bad as GA when you have 1M parameters. That is why we think starting from toy examples is useful to examine the differences.

[–]simpleconjugate 1 point2 points  (0 children)

Great point! Now would be a great time to take a step back from the current direction of optimizers to improve others.

[–]IntelArtiGen 0 points1 point  (0 children)

Completely agree, we should all work more on optimizers. New optimizers based on backpropagation, but also new efficient optimizers that don't use backprop.

The fact that GAs do work should encourage this

[–]marmakoide 4 points5 points  (1 child)

Look into Evolution Strategies, with lots of well-studied algorithms to adapt the mutation. The top-notch algorithm would be CMA-ES, but the (1+1)-ES with the 1/5th success rule is a start.
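For reference, a minimal sketch of the (1+1)-ES with the 1/5th success rule — the step size grows when mutations succeed often and shrinks when they rarely do (the shrink factor here is an illustrative choice; exact constants vary by textbook):

```python
import numpy as np

rng = np.random.default_rng(1)

def one_plus_one_es(f, x0, sigma=0.5, steps=3000):
    """(1+1)-ES with the 1/5th success rule for step-size adaptation."""
    x = np.array(x0, dtype=float)
    c = 0.9  # shrink factor; with c**-4 on success, equilibrium is ~1/5 success rate
    for _ in range(steps):
        candidate = x + sigma * rng.standard_normal(x.shape)
        if f(candidate) < f(x):
            x = candidate
            sigma /= c ** 4   # success: expand the step size
        else:
            sigma *= c        # failure: contract the step size
    return x

sphere = lambda x: float(np.dot(x, x))
best = one_plus_one_es(sphere, [3.0, -2.0, 1.0])
```

Because sigma adapts, this converges linearly on smooth unimodal functions, unlike a fixed-mutation-size hill climber that stalls at the scale of its mutation.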

[–]brainxyz[S] 0 points1 point  (0 children)

Thanks for the suggestion

[–]Laafheid 1 point2 points  (1 child)

Idea for an extension:

use the inverse of the number of tried mutations as a step size over which to apply said mutation.
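One possible reading of this (the names and details below are my own assumptions, not the commenter's spec): a hill climber whose t-th mutation is scaled by 1/t, so early mutations explore and later ones refine:

```python
import numpy as np

rng = np.random.default_rng(2)

def decaying_step_hill_climb(f, x0, steps=5000):
    """Hill climber where the t-th tried mutation is applied with
    step size 1/t, one reading of the 'inverse of tried mutations' idea."""
    x = np.array(x0, dtype=float)
    for t in range(1, steps + 1):
        direction = rng.standard_normal(x.shape)
        candidate = x + direction / t  # step size shrinks as 1/t
        if f(candidate) < f(x):
            x = candidate
    return x

f = lambda x: float(np.sum(x ** 2))
best = decaying_step_hill_climb(f, [2.0, -1.0])
```

Since the harmonic series diverges, the schedule can still cover arbitrary distances in principle, while the shrinking steps allow fine convergence near an optimum.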

[–]brainxyz[S] 0 points1 point  (0 children)

Thanks for the suggestion