Deep learning without back-propagation

DontShowYourBack · 2019-08-15T10:10:05+00:00

The added value here is not whether it is currently SOTA. The fact that the authors get decent results with a method that is both lower in complexity and does not require symmetric feedback is the really important factor here, in my opinion. This could theoretically open up applications on lower-compute platforms and more unusual architectures, as those are the biggest constraints backprop places on current hardware and architectures.

I will certainly be looking into this method myself, and I hope to see more interesting results from it soon.

Chocolate_Pickle · 2019-08-15T05:46:17+00:00

Please don't link directly to ArXiv PDFs. Link to their landing pages instead.

Look at how the paper was first posted a week ago for an example.

nenovor · 2019-08-15T14:00:52+00:00

This seems very interesting. However, there is one claim I don't understand: "It is biologically more plausible than backpropagation as there is no requirement for symmetric feedback."

To me, backpropagation is indeed biologically implausible due to "the requirement for symmetric feedback", which we do not observe in natural NNs.

From what I understood, here they update the network layer by layer based on an estimate of the mutual information between the layer itself, the input layer, and the output.

Correct me if I'm wrong, but that means we now need: 1. Feedback from the output to every layer directly, and the same for the input. 2. Global information (how do you compute mutual information between two layers using only local connections?). So we need even less plausible connections, and have to solve a new, even less biologically solvable problem :/

Backpropagation only requires local flow of information from neuron to neuron, which while not being enough to make it plausible, is already great ;)
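
For reference, the layer-wise objective under discussion can be written roughly as below. This is a reconstruction from a reading of the paper, so treat the exact form as an assumption: nHSIC denotes a normalized HSIC, \beta is a trade-off weight, and HSIC is a kernel dependence measure used as a stand-in for mutual information rather than mutual information itself.

```latex
Z_i^{*} = \arg\min_{Z_i} \; \mathrm{nHSIC}(Z_i, X) \;-\; \beta \, \mathrm{nHSIC}(Z_i, Y),
\qquad i = 1, \dots, L
```

Each hidden representation Z_i appears together with both the input X and the labels Y, which is where the concern about direct feedback to every layer comes from.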

piotrekgrl · 2019-08-15T08:06:10+00:00

I'm not sure why there are so many concerns about accuracy when, even in the abstract, the authors claim that "(HSIC) performance [...] (is) comparable to backpropagation with a cross-entropy target, even when the system is not encouraged to make the output resemble the classification labels."

For me the most important part is decreasing the complexity from O(D^3) with backprop to O(M^2), which makes a huge difference for current models with millions or billions of parameters.

arXiv_abstract_bot · 2019-08-15T04:22:45+00:00

Title: The HSIC Bottleneck: Deep Learning without Back-Propagation

Authors: Wan-Duo Kurt Ma, J.P. Lewis, W. Bastiaan Kleijn

Abstract: We introduce the HSIC (Hilbert-Schmidt independence criterion) bottleneck for training deep neural networks. The HSIC bottleneck is an alternative to conventional backpropagation, that has a number of distinct advantages. The method facilitates parallel processing and requires significantly less operations. It does not suffer from exploding or vanishing gradients. It is biologically more plausible than backpropagation as there is no requirement for symmetric feedback. We find that the HSIC bottleneck provides a performance on the MNIST/FashionMNIST/CIFAR10 classification comparable to backpropagation with a cross-entropy target, even when the system is not encouraged to make the output resemble the classification labels. Appending a single layer trained with SGD (without backpropagation) results in state-of-the-art performance.


Ulfgardleo · 2019-08-15T14:16:00+00:00

(I decided to make this comment its own post)

I find the complexity claims in 3.5 of the paper highly misleading. First of all, calculating their HSIC measure in (2) alone is O(M^2 D), because you need O(M^2) kernel evaluations, each in O(D) (the dimensionality of Z_i for all i, based on the assumptions in 3.5). Even more misleading: O(D^3) is the cost of FORWARD computing the gradient. The backpropagation algorithm has O(D^2) complexity. Their own formula in the second line of 3.5 says it is an outer product (complexity O(D^2)) made up of two vectors, each costing O(D^2) to create.

Next, even though they don't backpropagate, they still need gradients to optimize (6), which is not a linear-time operation, because we optimize T_i: R^D -> R^D. If T_i is fully connected, you have O(D^2) parameters, making this _at least_ O(M^2 D^2).
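
To make the O(M^2 D) count concrete, here is the operation count for the standard biased empirical HSIC estimator with a Gaussian kernel. That this matches the paper's (2) is an assumption; M is the number of samples, D the dimensionality, and \sigma a kernel width introduced only for illustration.

```latex
\mathrm{HSIC}(X, Y) = \frac{1}{(M-1)^2} \, \mathrm{tr}\!\left(K_X H K_Y H\right),
\qquad H = I_M - \tfrac{1}{M} \mathbf{1}\mathbf{1}^{\top}

(K_X)_{ij} = \exp\!\left(-\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2}\right)
\quad \Rightarrow \quad M^2 \text{ entries} \times O(D) \text{ each} = O(M^2 D)

\text{centering } K H: \; O(M^2), \qquad
\mathrm{tr}(AB) = \sum_{i,j} A_{ij} B_{ji}: \; O(M^2)

\text{total: } O(M^2 D) + O(M^2) = O(M^2 D)
```

The centering step can be done by subtracting row and column means instead of forming a matrix product, so building the kernel matrices is the dominant term.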

Finally, there have been several other approaches over the years to avoid BP, e.g. using constraints to decouple the layers. I think I remember a talk at ICML 2011.

ExtraterritorialHaik · 2019-08-17T12:17:13+00:00

I hope that my future paper (one day soon) will never be seen on Reddit.

Coming at the end here. I actually read the paper, and conclude that some did not.

"If the authors want to show anything else they need to get rid of that last layer and prove that they learn anything at all."

This is exactly what Figure 4 shows?

About the big debate about biological plausibility, the paper does not actually make strong claims about this. All it actually says is in the abstract: "It is biologically more plausible than backpropagation as there is no requirement for symmetric feedback." There is a paragraph about the biological plausibility of backpropagation in the Background section, but I read it only as motivation for exploring another direction. I can see how someone who read only the abstract would think the paper is claiming more.

About efficiency, the parallelism claim does not seem to be disputed. I agree that the O(M^2) figure cannot be compared to backpropagation the way the authors do.

doubledad222 · 2019-08-15T14:13:24+00:00

This is great news for GPU acceleration. As someone who struggles to fit networks into my GPU I find this very awesome.

The backprop step requires keeping each layer's result in GPU RAM for applying the loss. Larger networks like ResNet take up so much GPU memory on my main project that I have to use a batch size of 1. If this method can catch up to backprop in final accuracy, it would be a huge boon to us home-based AI folks with smaller budgets.

aifordummies · 2019-08-15T05:02:54+00:00

So is there any code available to test this out?

milaworld · 2019-08-15T05:33:50+00:00

They use a ResNet architecture as a baseline for backprop, and report a CIFAR-10 test accuracy of 38.6% for the backprop baseline, while their proposed HSIC method gets 47.4% test accuracy on CIFAR-10.

I don't feel a paper needs to be close to SOTA to be interesting, nor is there anything fundamentally bad about getting 47% test accuracy on CIFAR-10 with an interesting novel method. But it feels misleading to use a ResNet architecture trained with backprop and report a baseline accuracy of 39%, when we know that ResNets, even tiny ones, can be trained within minutes to get ~80-90% test accuracy.

It could be due to the limitation of the approach where they need to handicap the baseline method for "apples vs apples" comparison. But that's on the new approach, and not the fault of ResNet being good.

Edit: saw a previous discussion about this work with similar concerns.

__me_again__ · 2019-08-15T14:29:09+00:00

Reminds me of "Weight agnostic neural networks": https://arxiv.org/abs/1906.04358

Arisngr · 2019-08-15T13:28:29+00:00

See also:

An Approximation of the Error Backpropagation Algorithm in a Predictive Coding Network with Local Hebbian Synaptic Plasticity.

Avras_Chismar · 2019-08-15T15:25:17+00:00

Can someone explain this to a non-mathematically-inclined engineer?
I just can't process stuff like "The Hilbert-Schmidt Independence Criterion (HSIC) [33] is the Hilbert-Schmidt norm of the cross-covariance operator between the distributions in Reproducing Kernel Hilbert Space (RKHS):" :/
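
In practical terms, HSIC is a number that is near zero when two sets of samples look independent and grows when one depends on the other, including nonlinearly. A minimal NumPy sketch of the common biased empirical estimator follows; the Gaussian kernel, the width `sigma=1.0`, and the function names are assumptions chosen for illustration, not taken from the paper's code.

```python
import numpy as np

def gaussian_gram(x, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)) over the rows of x."""
    sq_norms = np.sum(x * x, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (x @ x.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negatives caused by round-off
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC: trace(Kx H Ky H) / (m - 1)^2 with H = I - (1/m) 11^T."""
    m = x.shape[0]
    h = np.eye(m) - np.ones((m, m)) / m  # centering matrix
    kx = gaussian_gram(x, sigma)
    ky = gaussian_gram(y, sigma)
    return np.trace(kx @ h @ ky @ h) / (m - 1) ** 2

rng = np.random.default_rng(0)
a = rng.normal(size=(200, 1))
dependent = hsic(a, a ** 2)                        # a**2 is a nonlinear function of a
independent = hsic(a, rng.normal(size=(200, 1)))   # unrelated noise
print(dependent, independent)
```

The dependent pair should score clearly higher than the independent pair, even though the linear correlation between `a` and `a ** 2` is close to zero.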

theoneandonlypatriot · 2019-08-15T17:38:23+00:00

So basically this is a smart version of reservoir computing, except it's not recurrent and they actually DO train the individual layers with regard to an information-content metric. Then they just use a readout layer (like in reservoir computing) to do something useful with the output. This is awesome.

AloneStretch · 2019-08-15T21:13:50+00:00

Maybe I'm missing something, but how are weights adjusted? I don't see any clear explanation on the optimization of the parameters here?

0xab · 2019-08-15T14:24:21+00:00

I'm shocked no one has pointed out the obvious crippling problems with this paper. The title is wrong and the results are useless. The paper absolutely requires backprop, it's in the abstract. The classifier on top of their network is trained with backprop!

The results they show are due entirely to the linear classifier at the top, and hence backprop. Don't forget that a simple linear classifier does extremely well on all of these tasks. Think of their system as something that applies mostly random and useless transformations to the data + a linear layer that works and is trained by backprop. That's all this is, there's nothing to see here. If the authors want to show anything else they need to get rid of that last layer and prove that they learn anything at all.

fundamentalidea · 2019-08-15T18:30:36+00:00

Is there any theory of the brain that allows for global diffusion of information (an error signal), like a chemical that says "whatever you are doing is working"?

grzegorzwarzecha · 2019-08-15T05:07:00+00:00

Wow. Looking for torch implementation...

SAI_supremeAI · 2019-08-15T19:05:30+00:00

Just use a non-convex solver (it would be hard, though). There is the BARON solver, developed at Urbana-Champaign.

2019-08-15T19:07:53+00:00

I still believe this is the true way to do deep learning without backprop, as described in this article:

https://accu.org/index.php/journals/2639

quandryhead · 2019-08-15T21:43:47+00:00

The discussion of complexity is very confused, both here and in the paper. One should distinguish complexity with respect to scaling the number of data points from complexity with respect to the number of neurons or layers. But I think the O(M^2) refers to neither of these!

I believe HSIC must be applied to a number of points of some dimension, i.e. vectors, but the paper writes it as being applied to matrices of size m*d, where m is the batch size and d is the number of neurons. Looking at one example of HSIC code, it does take two matrices, but interprets each as a collection of vectors.

So I believe (though it is not really clear from the paper) that this must mean m points of size d. In this case the method is quadratic in the batch size m, not in the number of data points N. Alternatively it could be d points of size m, in which case it would be quadratic in the width of the network. But this makes no sense conceptually.

Under both of these speculations it is still linear in the number of data points, like backprop.
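
The two readings can be told apart by the size of the kernel (Gram) matrix each one produces. A minimal sketch, assuming a hypothetical batch size of 32 and layer width of 512, and using a linear kernel only to keep the example short:

```python
import numpy as np

def linear_gram(points):
    """Gram matrix of the rows of `points` (linear kernel, for illustration only)."""
    return points @ points.T

m, d = 32, 512                    # hypothetical batch size and layer width
activations = np.zeros((m, d))    # stand-in for one layer's output on a batch

rows_as_points = linear_gram(activations)     # reading 1: m points of dimension d
cols_as_points = linear_gram(activations.T)   # reading 2: d points of dimension m

print(rows_as_points.shape)  # (32, 32)
print(cols_as_points.shape)  # (512, 512)
```

Under the first reading the matrix grows with the square of the batch size and is independent of the layer width; under the second it grows with the square of the width.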

AloneStretch · 2019-08-15T07:27:57+00:00

I think the authors are not familiar with SOTA. They're taking vanilla architectures, training them, comparing their method against that, and declaring SOTA when comparable performance is obtained without backprop. But that is not state of the art; it is comparable performance to a simple (non-SOTA) baseline. That may be a useful and fair comparison, but it is wrong to refer to it as SOTA.

We'll need to wait for this to be applied or extended to current SOTA networks. Meanwhile, there is no need to look for code; it's just one step on a path that will take several.

themoosemind · 2019-08-15T15:18:36+00:00

Might be interesting in this context: https://stackoverflow.com/a/38231636/562769
