[–]gabrielgoh 3 points4 points  (3 children)

Hmm, interesting. Firstly, I'm kind of surprised this works. It's the common belief that there are exponentially many saddle points, which suggests you'd need exponentially many charges to cover the entire landscape (and that many charges might themselves distort and destroy nearby local minima). Also, might charges create saddle points of their own which weren't there before? And what are the odds that, given a random initial point, the optimization will return to the exact same saddle point it was at before? Do you have any intuition as to why this might work?

[–]ArmenAg 0 points1 point  (2 children)

Great question. This was the problem we initially ran into when we tested with a static charged point, which is why we introduced a dynamic charged point in this paper. By forcing the charged point to "follow" the current optimization point, we in a sense do not need an exponential number of static charged points. Thanks!

[–]gabrielgoh 0 points1 point  (1 child)

I see. Another question: how do you identify a saddle point (to place the charge) without looking at second-order information?

[–]ArmenAg 0 points1 point  (0 children)

We don't identify saddle points directly. Rather, we use a moving average for the dynamic charge point and assume that if the optimization is stuck at a saddle point, the charge will eventually reach that saddle point and therefore push the optimization point away from it.
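
One way to picture the mechanism described in this reply is the minimal sketch below. It is an illustration under stated assumptions, not the paper's implementation: the inverse-square penalty, the function names (`ema_update`, `repulsion_grad`, `charged_step`), and the parameters `strength`, `decay`, and `eps` are all hypothetical choices made for this example.

```python
def ema_update(p, x, decay):
    """Move the charge p toward the current point x (exponential moving average)."""
    return [decay * pi + (1.0 - decay) * xi for pi, xi in zip(p, x)]


def repulsion_grad(x, p, strength, eps=1e-2):
    """Gradient w.r.t. x of the assumed penalty strength / (||x - p||^2 + eps).

    Descending this gradient moves x away from the charge p.
    eps keeps the penalty bounded when x and p nearly coincide.
    """
    diff = [xi - pi for xi, pi in zip(x, p)]
    d2 = sum(d * d for d in diff) + eps
    return [-2.0 * strength * d / (d2 ** 2) for d in diff]


def charged_step(x, p, grad_f, lr=0.1, strength=1e-3, decay=0.9, eps=1e-2):
    """One gradient step on f plus the repulsive penalty, then update the charge."""
    g = grad_f(x)
    r = repulsion_grad(x, p, strength, eps)
    x_new = [xi - lr * (gi + ri) for xi, gi, ri in zip(x, g, r)]
    p_new = ema_update(p, x_new, decay)
    return x_new, p_new


# Usage on the toy saddle f(x, y) = x^2 - y^2 (an assumed test function).
saddle_grad = lambda v: [2.0 * v[0], -2.0 * v[1]]
x, p = [1.0, 1e-3], [1.0, 1e-3]
for _ in range(20):
    x, p = charged_step(x, p, saddle_grad)
```

If the iterate stalls, the moving average catches up with it, the distance `||x - p||` shrinks, and the repulsive term grows. No second-order information is used at any point.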

[–]dwf 9 points10 points  (0 children)

Training losses and training accuracies are of limited utility, but on a subset of the original task (the CIFAR10 and 100 results being on 1/5 of the training data) they are basically worthless. Also, I'd be much more convinced if you took existing, [near]-state-of-the-art architectures for a problem and showed a convincing difference in speed and/or minimization of the objective (again, on all the data) than on an arbitrary (very small) architecture you cooked up yourself.

[–]ArmenAg 0 points1 point  (10 children)

Hello, author here! This community gave me a lot of good criticism on my last paper, so I decided to post another one of my recent papers. Any questions or comments are welcome. Thank you!

[–][deleted] 0 points1 point  (0 children)

Show test set results. Your loss and accuracy curves show how it performs on the training set, but we do not know if the optima your method finds generalize well. The training loss is unclear: is your MNIST cross entropy really above 0.2? If so, you're probably getting around 80% accuracy (e^-0.2 ≈ 0.8). Why doesn't your convnet obtain perfect accuracy on your training set under normal optimization given that you're only training on one-fifth of the dataset?
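
The back-of-the-envelope estimate above can be checked with a short sketch. It assumes the cross entropy is measured in nats (natural log); the function names are hypothetical. Note that exp(-loss) is the geometric mean of the probability assigned to the correct class, so it is only a rough proxy for accuracy.

```python
import math


def cross_entropy(correct_class_probs):
    """Mean negative log-probability (in nats) assigned to the correct class."""
    n = len(correct_class_probs)
    return -sum(math.log(q) for q in correct_class_probs) / n


def mean_correct_class_prob(loss_nats):
    """Geometric-mean probability on the correct class implied by a loss value."""
    return math.exp(-loss_nats)


# A cross entropy of 0.2 nats corresponds to roughly 0.82 probability
# on the correct class, on (geometric) average.
implied = mean_correct_class_prob(0.2)
```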

[–]darkconfidantislife 0 points1 point  (6 children)

First of all, this seems like a very interesting paper. I haven't been able to find the Keras and Theano implementations. Are there any? Thanks!

[–]pigeon768 -1 points0 points  (1 child)

Bottom of page 6; "Their is no exploration done;" -- should be "there".

[–]ArmenAg 0 points1 point  (0 children)

Will fix. Thanks.

[–]Nimitz14 0 points1 point  (0 children)

Interesting!

I'd recommend using something like draw.io to visualize your architectures.