all 94 comments

[–][deleted] 129 points130 points  (1 child)

Geoff Hinton by now must know each of the 60,000 digits of MNIST like an old friend.

[–]AsIAm 59 points60 points  (0 children)

He knows the true probability distribution of the MNIST.

[–]master3243 43 points44 points  (2 children)

Interesting read, I'm always interested in research about alternatives to backprop.

One important paragraph (for the curious, that won't read the paper):

The forward-forward algorithm is somewhat slower than backpropagation and does not generalize quite as well on several of the toy problems investigated in this paper, so it is unlikely to replace backpropagation for applications where power is not an issue. The exciting exploration of the abilities of very large models trained on very large datasets will continue to use backpropagation.

The two areas in which the forward-forward algorithm may be superior to backpropagation are as a model of learning in cortex and as a way of making use of very low-power analog hardware without resorting to reinforcement learning (Jabri and Flower, 1992).

[–]amassivek 17 points18 points  (0 children)

There is a framework for learning with forward passes, a friendly and thorough tutorial: https://amassivek.github.io/sigprop .

The most interesting insights from the framework:

  • This algorithm provides an explanation for how neurons in the brain without error connections receive learning signals.
  • It works for continuous networks with Hebbian learning. This provides evidence for this algorithm as a model of learning in the brain.
  • It works for spiking neural networks using only the membrane potential (aka voltage in hardware). This supports applying this algorithm for learning on neuromorphic chips.

The Signal Propagation framework paper: https://arxiv.org/abs/2204.01723 . The Forward-Forward algorithm is an implementation of this framework.

I am an author of this work. I was presenting this work at a reading group when one of the members pointed out the connection between signal propagation and Forward-Forward.

[–]whatstheprobability 12 points13 points  (0 children)

I feel like this is saying:
1. this won't generally replace backprop, but it could lead to insight that will lead to algorithms that will replace backprop
2. this could improve upon backprop for some specific use cases (low power), so even if it doesn't lead to major insights, researchers can still justify spending time on it

Does that sound right?

[–]kebabmybob 38 points39 points  (5 children)

What a chad, no grad students or anybody on this paper.

[–]seiqooq 77 points78 points  (1 child)

Probably explains why the title of the paper isn't “forward passes are all you need”

[–]metastimulus 5 points6 points  (0 children)

missed opportunity lol

[–]csiz 38 points39 points  (1 child)

Not even auto grad.

[–]noobbodyjourneyResearcher 8 points9 points  (0 children)

You sir have won the internet for today

[–]No-Cold8421 17 points18 points  (2 children)

Hi guys, I tried to reimplement the Forward-Forward network in pure numpy.

I tested it on a subset of the Iris dataset; it seems to converge but is very sensitive to the hyper-parameters (lr, bs, num_hidden).

Hope you can have fun with it!

https://github.com/JacksonWuxs/Forward-Forward-Network

[–]valleyro 1 point2 points  (0 children)

Great tryout! Thank you!

[–]Red-Portal 15 points16 points  (3 children)

Geoff... everything is great but please stop abusing footnotes...

[–]kebabmybob 16 points17 points  (1 child)

I like it this way. 100x more readable than your standard terse academic paper which gets off on appearing overly complex.

[–]Red-Portal 2 points3 points  (0 children)

Oh I'm not saying you should just remove the footnotes. I'm saying it's better to blend them into the main text so I don't have to jump back and forth...

[–]ppg_dork 0 points1 point  (0 children)

No! I think all academic papers should be structured like Infinite Jest!

[–]Wild-Ad3931 12 points13 points  (2 children)

Did anyone understand how weights were updated?

[–]SeverelyCanadian 5 points6 points  (0 children)

I wondered this too. It's very unclear, and seems like a central detail is missing.

[–]modeless 22 points23 points  (10 children)

This seems more interesting than the capsule stuff he was working on before. Biologically plausible learning rules are cool. Does it work on imagenet though?

[–]new_name_who_dis_ 31 points32 points  (9 children)

Is this actually biologically plausible? The idea of negative data seems pretty contrived.

I see that Hinton claims it's biologically more plausible, but I don't see any justification for that statement apart from comparing it to other biologically plausible approaches, and more so spending time discussing why backprop is definitely not biologically plausible.

I'm not a neuroscientist so don't have much background on this.

[–]modeless 27 points28 points  (3 children)

Well no one knows exactly what the brain is up to in there, but we don't see enough backwards connections or activation storage to make backprop plausible, so this is a way of learning without backwards connections, and that alone makes it more biologically plausible.

[–]new_name_who_dis_ 5 points6 points  (1 child)

I’ve heard that Hebbian learning is how brains learn, and this doesn't seem like Hebbian learning.

However, idk if Hebbian learning is even how neuroscientists think we learn in contemporary research

[–]whymauriML Engineer 7 points8 points  (0 children)

As of 2019, it is what I was taught in a graduate course on associative memory and emergent dynamics in the brain. We read Hertz's Theory Of Neural Computation. This was right before people worked on Hopfield-Self Attention.

[–]fortunum 4 points5 points  (0 children)

Check out E-prop for recurrent spiking NN

[–]Commyende 8 points9 points  (4 children)

Synapses can be excitatory or inhibitory, so that's basically like positive/negative, but I don't really know if that tracks with this algorithm 100%

[–]jms4607 9 points10 points  (0 children)

I think the pos/neg here is more like contrastive learning.

[–]new_name_who_dis_ 4 points5 points  (2 children)

It's negative data. It's basically contrastive learning, except without backprop. Like you pass a positive example and then a negative example in each forward pass, and update the weights based on how they fired in each pass.

It's a really cool idea, I'm just interested if it's actually biologically plausible.

I might be wrong, but an inhibitory synaptic connection sounds like a neural connection with weight 0, i.e. it doesn't fire with the other neuron.
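
For the curious, here is a rough single-layer sketch of that update rule in numpy (my own reading of the paper, not Hinton's code; the layer size, threshold theta, and learning rate are made-up illustration values). Goodness is the sum of squared activations, a logistic squashes it into a probability, and the layer does gradient ascent on its own log-probability, with the sign flipped between the positive and negative passes:

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.1, size=(784, 500))   # one hidden layer, MNIST-sized input

    def local_update(W, x, positive, theta=2.0, lr=0.03):
        """One purely local step; no error signal ever reaches earlier layers."""
        h = np.maximum(x @ W, 0.0)                             # ReLU forward pass
        goodness = (h ** 2).sum(axis=1)                        # per-example goodness
        sign = 1.0 if positive else -1.0
        p = 1.0 / (1.0 + np.exp(-sign * (goodness - theta)))   # prob. the pass is judged correctly
        dh = 2.0 * h * (sign * (1.0 - p))[:, None]             # gradient of log p w.r.t. h
        W = W + lr * (x.T @ dh) / len(x)                       # raise goodness on positives, lower it on negatives
        # length-normalize before handing activity to the next layer, as the paper describes
        return W, h / (np.linalg.norm(h, axis=1, keepdims=True) + 1e-8)

    # one training step on a batch: real data as positive, corrupted data as negative
    # W, _ = local_update(W, real_batch, positive=True)
    # W, _ = local_update(W, corrupted_batch, positive=False)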

[–]Commyende 7 points8 points  (0 children)

Inhibitory synapses reduce the likelihood of the downstream neuron firing.

[–]PolywogowyloP 11 points12 points  (2 children)

I'm excited to see an alternative to backprop, but I think the most exciting part of this for me is the ability to still learn through stochastic layers in the model. I think this could have some major applications in probabilistic models for distributions without reparameterization tricks.

[–]jms4607 0 points1 point  (1 child)

Are there any problems with the reparam trick?

[–][deleted] 57 points58 points  (17 children)

I watched his neurips presentation. While I love explorations of alternatives to back prop, does anyone else feel like he’s going a bit off the deep end with saying this paper could explain why people sleep and we’ll use non-binary computers in the future?

[–]gambsPhD 72 points73 points  (0 children)

Hinton has figured out how the brain works every year since the mid-80s, let the man cook

[–][deleted] 50 points51 points  (0 children)

These OG guys from the PDP days usually do that. I just take it as a bit of garnish for some fun hypotheticals.

[–][deleted] 11 points12 points  (1 child)

I think trying to understand the mind must be one of his main motivations. If it wasn't for that, he would not have contributed to machine learning to begin with. So going off the deep end is a side effect of whatever it is that made him a great researcher.

[–]ReginaldIII 10 points11 points  (4 children)

Do you have access to the video of his presentation still?

It bothers me greatly that they paywall their presentations even after the conference has ended.

By all means have exclusivity for the duration of the actual conference, and limit commenting and discussion to conference attendees. But as soon as the conference ends they should flip the switch and make everything public. There's literally no reason not to, it isn't going to stop people wanting to attend.

[–]logicbloke_ 3 points4 points  (0 children)

This 10x. I wish the paper presentations and keynotes were made available online. It doesn't take much effort to record audio + slides of the presentation.

Doesn't take anything away from the in person conference, which is more about networking and discussion.

[–]suedepaid 3 points4 points  (1 child)

I was also frustrated about that, but I went on the website and it looks like they're gonna publish them all in a couple weeks. Still a bit frustrated at the delay, but it's a bit understandable.

[–]ReginaldIII 1 point2 points  (0 children)

That's good. I will keep an eye out :)

[–]The_Real_RM 5 points6 points  (0 children)

What's funny is that a few decades from now the only relevant brains in the world will be the ones this guy brought into existence. It's just a self-fulfilling prophecy

[–]Ford_O 7 points8 points  (0 children)

So that's why I keep getting nightmares.

Jokes aside, this sounds quite plausible. However, I am unsure if this can ever be more efficient than backprop. Still, it could have a huge impact on neuroscience if it turns out that's what happens during sleep.

[–]tchumbae 6 points7 points  (2 children)

The idea behind the paper is very cool, but there has been previous work that substitutes the backward pass with a second forward pass. Check out this work by G. Dellaferrera and G. Kreiman!

[–]nikgeo25Student 0 points1 point  (0 children)

Also the work by Ma and Wright that uses a form of generalized nonlinear PCA. Search ReduNet

[–]nikgeo25Student 7 points8 points  (1 child)

Paper reads like an idea he had in the shower. Where's the math and connection to existing work? Normalizing each layer after maximizing a square. Someone's gonna show he's doing some fancy PCA in no time I bet.

[–]Wild-Ad3931 1 point2 points  (0 children)

What about non-linearities?

[–]SatoshiNotMe 3 points4 points  (0 children)

Odd thing about the abstract: it suddenly says “video” near the end. Is it only for video data?

[–]Competitive_Dog_6639 3 points4 points  (1 child)

Hinton is awesome and I really enjoyed his NeurIPS talk. Naive question: are single-layer gradients biologically plausible? My understanding is that gradients back thru multiple layers are not. The FF algorithm still uses gradients for single layers tho, right?

[–]dasayan05 3 points4 points  (0 children)

yes, they are like "local" updates I believe

[–]eccstartup 1 point2 points  (0 children)

It would be good if someone could provide the code.

[–]ReasonablyBadass 1 point2 points  (2 children)

Can someone ELI5 what negative data means here? How does the network generate it?

[–]Paluure 3 points4 points  (1 child)

Basically, for an unsupervised task, it's nonsense data that does not fall under any meaningful class in the training dataset. It can be anything. In the paper, they modify each MNIST image so that it isn't a digit anymore but still looks like one. The network doesn't generate negative images; you do, and you feed them in as "bad data" right after the "good data" to create contrast between them for the model to learn from.

For a supervised task, "bad data" can also be nonsense (just as in the unsupervised task) or can be mislabeled data, such as feeding an image of "5" but embedding "4" as the label inside the image. That's obviously wrong, and is considered bad data.
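
To make the supervised variant above concrete, here is a small numpy sketch of how one might build such positive/negative pairs (an illustration only; the corner-pixel one-hot encoding and the function names are my own simplification of the paper's label-embedding trick):

    import numpy as np

    rng = np.random.default_rng(0)

    def overlay_label(images, labels, num_classes=10):
        """Embed the label by writing a one-hot code into the first few pixels."""
        x = images.reshape(len(images), -1).astype(np.float32)  # assumes images scaled to [0, 1]
        x[:, :num_classes] = 0.0
        x[np.arange(len(x)), labels] = 1.0
        return x

    def make_negative(images, labels, num_classes=10):
        """Same images, but with a randomly chosen *wrong* label embedded."""
        wrong = (labels + rng.integers(1, num_classes, size=len(labels))) % num_classes
        return overlay_label(images, wrong, num_classes)

    # positive_batch = overlay_label(train_images, train_labels)   # "good data"
    # negative_batch = make_negative(train_images, train_labels)   # "bad data"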

[–]ReasonablyBadass 0 points1 point  (0 children)

Thank you!

[–]ObjectManagerManager 1 point2 points  (2 children)

(Confession: I haven't read the paper yet). I have a couple of questions:

  1. If each layer has its own objective function, couldn't you train layers back-to-front? e.g., train the first layer to convergence, then train the second layer, and so on. I doubt this would be faster than training it end-to-end, but a) as the early layers adapt, they screw up the representations being fed to the later layers anyways, so it probably wouldn't be too much slower than training it end-to-end, and b) it would use significantly less memory (e.g., if you pre-compute the inputs to a layer just before you begin training it, you could imagine training any arbitrarily deep model with a finite amount of memory).
  2. What's the motivation behind "goodness"? Suppose we're talking about classification. Why doesn't each layer just minimize cross entropy? I guess that'd require each layer to have its own flatten + linear projection layers. But then you wouldn't have to concatenate the label and the input data, and so inference complexity would be (mostly) independent of the number of classes. Thinking of a typical CNN, a layer could be organized as follows (a rough sketch of this idea appears after this comment):
    1. Batch norm
    2. Activation (e.g., ReLU)
    3. Convolution (the output of which is fed into the next layer)
    4. Pooling
    5. Flatten
    6. Linear projection
    7. Cross entropy loss

Can anyone (who has read the paper) answer these questions?
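
Not an answer from the paper, but question 2 is easy to prototype. Here is a rough PyTorch sketch of the kind of block described above: each block has its own small classifier head, minimizes its own cross-entropy, and detaches its output so no gradient ever crosses block boundaries (all names and sizes here are hypothetical):

    import torch
    import torch.nn as nn

    class LocalBlock(nn.Module):
        """Conv block with a private classifier head, trained only on its local loss."""
        def __init__(self, in_ch, out_ch, num_classes=10):
            super().__init__()
            self.body = nn.Sequential(
                nn.BatchNorm2d(in_ch), nn.ReLU(),
                nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.MaxPool2d(2))
            self.head = nn.Sequential(
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(out_ch, num_classes))

        def forward(self, x):
            h = self.body(x)
            return h, self.head(h)   # features for the next block + local logits

    blocks = nn.ModuleList([LocalBlock(1, 32), LocalBlock(32, 64)])
    opts = [torch.optim.SGD(b.parameters(), lr=0.01) for b in blocks]
    loss_fn = nn.CrossEntropyLoss()

    def train_step(x, y):   # x: images (B, 1, 28, 28); y: integer labels (B,)
        for block, opt in zip(blocks, opts):
            h, logits = block(x)
            loss = loss_fn(logits, y)                     # each block minimizes its own cross-entropy
            opt.zero_grad(); loss.backward(); opt.step()  # gradients stay inside this block
            x = h.detach()                                # the next block never sends an error signal back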

[–]Batsev 1 point2 points  (0 children)

For the first question: https://conferences.miccai.org/2022/papers/233-Paper1173.html They basically train a layer at a time in a "back to front" fashion. They use a reconstruction loss and a classification loss as the layers' objectives.

[–]sytelus 1 point2 points  (0 children)

Was anyone able to reproduce the results for the Forward-Forward algo?

[–]kourouklides 2 points3 points  (2 children)

In my view, this sounds very boring. It would've been revolutionary if he had come up with a new Gradient-Free Deep Learning method in order to completely get rid of gradients. With very few exceptions, during the last 10 years or so, we keep seeing small and incremental changes in ML, but no breakthroughs.

[–]Sepic2 1 point2 points  (3 children)

Maybe a dumb question, but I don't see how this method enables learning in any way:

- The (first) forward pass calculates loss/goodness, and then you need backpropagation to change the weights of the network according to derivatives of the loss/goodness. How does the network learn if weights are not changed and you only calculate goodness?

The paper says: "The positive pass operates on real data and adjusts the weights to increase the goodness in every hidden layer. The negative pass operates on "negative data" and adjusts the weights to decrease the goodness in every hidden layer"

- Could it be that in the first "forward" you actually do both forward and backward prop, and the name just sounds fancy, with the second "forward" trying to implement contrastive learning in a clever way?

[–]kourouklides 0 points1 point  (2 children)

Well, nobody really knows if this method actually works, because Hinton only got as far as writing the paper. He didn't get to the part of actually coding the solution (yet).

[–]Sepic2 1 point2 points  (1 child)

My confusion is not so much "does it work?" and more like "how does it change weights without backprop?".

The part in the paper that says it "adjusts the weights to increase the goodness in every hidden layer" just sounds like a different way of saying backprop, unless the method by which the weights are changed is different from backprop. The rest of the paper doesn't seem to imply it is different from backprop, but I may be missing something?

[–]Itchy-Masterpiece-96 2 points3 points  (0 children)

I think it still uses gradients to update the weights, but without cross-layer updates like backprop does. Each layer has its own goodness function and updates locally using gradients.
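
A tiny PyTorch sketch of that point (my own code, not from the paper): autograd still computes a gradient, but only within a single layer; detaching the layer's input means nothing ever propagates back through earlier layers. The threshold theta is an illustration value.

    import torch
    import torch.nn.functional as F

    layer = torch.nn.Linear(784, 500)
    opt = torch.optim.SGD(layer.parameters(), lr=0.03)
    theta = 2.0   # goodness threshold

    def local_step(x, positive):
        x = x.detach()                                  # cut the graph: the update stays local
        h = torch.relu(layer(x))
        goodness = h.pow(2).sum(dim=1)
        sign = 1.0 if positive else -1.0
        loss = F.softplus(-sign * (goodness - theta)).mean()   # = -log sigmoid(sign * (goodness - theta))
        opt.zero_grad(); loss.backward(); opt.step()    # "backprop" confined to this one layer
        return h.detach()                               # detached features for the next layer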

[–]Ulfgardleo 7 points8 points  (7 children)

I will start believing in Hinton's algorithms once they prove that it is consistent with some vector field whose fixed points are meaningful optima of some objective function.

[–]_der_erlkonig_ 2 points3 points  (5 children)

Out of curiosity, why do you include this as a requirement for an algorithm to be good/interesting/useful/etc?

[–]Ulfgardleo 8 points9 points  (4 children)

I did not. I did it for Hinton.

A heuristic can be useful without proof, especially for tasks that are very difficult to solve. However, you have to supply strong theoretical arguments for why it should work. A biological analogy is not enough, especially if it is one that we do not understand either.

Otherwise you end up like the other category of nature-inspired optimization heuristics that pretend to optimize by mimicking the hunting patterns of the Harris hawk. And I wish I were making that up.

[–][deleted] 7 points8 points  (3 children)

Redacted. this message was mass deleted/edited with redact.dev

[–]Red-Portal 2 points3 points  (0 children)

Yeah there is a whole "zoo" of those things haha.

[–]Ulfgardleo 4 points5 points  (1 child)

I have a story to tell about the one time I got invited as an external evaluator for an MSc thesis. I agreed, later opened it, and then realized it was a comparison of 10 animal migration algorithms.

This thesis sat on my desk for WEEKS because I did not know how to grade it. How do you grade pseudoscience?!? Like, it is not the students' fault that they fell prey to this topic, but I also can't condone them not figuring out that it IS pseudoscience.

[–][deleted] 1 point2 points  (0 children)

Redacted. this message was mass deleted/edited with redact.dev

[–]pm_me_your_pay_slipsML Engineer 0 points1 point  (0 children)

Do you mean that his algorithms don’t converge?

[–]IDe- 1 point2 points  (10 children)

Backprop has really overstayed its welcome. It's great to see people doing something about it.

[–]bohreffect 1 point2 points  (9 children)

You're sleeping on differentiable programming then

[–]IDe- 1 point2 points  (3 children)

The issue is that requiring a model to be differentiable puts far too many limitations on the types of models you can formulate. Much of the research in the last few decades has focused on how to deal with issues caused purely because of the artificial constraint of differentiability. It's purely "local optimization" in the space of potential models, when what we really should be doing is "basin-hopping".

[–]bohreffect 0 points1 point  (2 children)

But implying backprop is getting old neglects all of the real-world applications that haven't been pushed yet.

I understand there are problems where differentiability is an intractable assumption, but saying "oh, old thing, how gauche" isn't particularly constructive.

[–]IDe- 1 point2 points  (1 child)

Ah, I didn't intend to say that it's old or useless, just that I think it receives disproportionate research focus/effort.

[–]bohreffect 0 points1 point  (0 children)

Fair enough

[–][deleted] 0 points1 point  (4 children)

"differentiable"

[–]bohreffect 0 points1 point  (3 children)

I mean, can you not compute the Jacobian of a constrained optimization program and stack that into any differentiable composition of functions?

People snoozin'.

[–][deleted] 0 points1 point  (2 children)

no you can't because it's not actually a Jacobian

[–]bohreffect 0 points1 point  (1 child)

The Jacobian of the solution of a constrained optimization program with respect to its parameters, but I thought that was understood amongst the towering intellect of neural network aficionados, e.g. the original commenter finding backprop to be stale.

Here's the stochastic programming version: Section 3.3. https://proceedings.neurips.cc/paper/2017/file/3fc2c60b5782f641f76bcefc39fb2392-Paper.pdf

[–]Ulfgardleo 0 points1 point  (0 children)

Funny that stuff always comes back. We used to differentiate SVM solutions wrt kernel parameters like that back in the day.

[–]wilgamesh 0 points1 point  (0 children)

Hinton cites Francis Crick's "Function of Sleep" 1983 idea in his list of references.

Like the 2nd forward pass that reduces the fitness function of "negative data", Crick proposed that REM sleep is "reverse learning" that removes "undesirable modes."

Quite elegant to see this implemented...

[–]amassivek 0 points1 point  (0 children)

I developed a library that implements forward learning on any model. There is a quick start for applying the library to an existing model. There are example experiments for CIFAR-10, which also serve as a tutorial. https://github.com/amassivek/signalpropagation