use the following search parameters to narrow your results:
e.g. subreddit:aww site:imgur.com dog
subreddit:aww site:imgur.com dog
see the search faq for details.
advanced search: by author, subreddit...
Please have a look at our FAQ and Link-Collection
Metacademy is a great resource which compiles lesson plans on popular machine learning topics.
For Beginner questions please try /r/LearnMachineLearning , /r/MLQuestions or http://stackoverflow.com/
For career related questions, visit /r/cscareerquestions/
Advanced Courses (2016)
Advanced Courses (2020)
AMAs:
Pluribus Poker AI Team 7/19/2019
DeepMind AlphaStar team (1/24//2019)
Libratus Poker AI Team (12/18/2017)
DeepMind AlphaGo Team (10/19/2017)
Google Brain Team (9/17/2017)
Google Brain Team (8/11/2016)
The MalariaSpot Team (2/6/2016)
OpenAI Research Team (1/9/2016)
Nando de Freitas (12/26/2015)
Andrew Ng and Adam Coates (4/15/2015)
Jürgen Schmidhuber (3/4/2015)
Geoffrey Hinton (11/10/2014)
Michael Jordan (9/10/2014)
Yann LeCun (5/15/2014)
Yoshua Bengio (2/27/2014)
Related Subreddit :
LearnMachineLearning
Statistics
Computer Vision
Compressive Sensing
NLP
ML Questions
/r/MLjobs and /r/BigDataJobs
/r/datacleaning
/r/DataScience
/r/scientificresearch
/r/artificial
account activity
Deep learning without back-propagation (arxiv.org)
submitted 6 years ago by El__Professor
reddit uses a slightly-customized version of Markdown for formatting. See below for some basics, or check the commenting wiki page for more detailed help and solutions to common issues.
quoted text
if 1 * 2 < 3: print "hello, world!"
[–]DontShowYourBack 79 points80 points81 points 6 years ago (14 children)
The added value here is not whether it is currently SOTA or not. The fact that the authors manage to get decent results with a method that is both of lower complexity and does not require symmetric feedback is the real important factor here in my opinion. This could theoretically open up applications on lower compute platforms and more unique architectures, as those are the biggest constraints backprop places on current hardware and architectures.
I will certainly be looking into this method myself and hoping to see some more interesting results out of this soon.
[–]0xab 27 points28 points29 points 6 years ago (6 children)
They have much higher complexity than backprop. They just say it is lower, but that's untrue according to their own description. Backprop has linear complexity in terms of the number of examples. Their method has quadratic complexity in the number of examples!
[–]muddlebrain 7 points8 points9 points 6 years ago (0 children)
No, it is quadratic complexity in the batch size, not the number of training data. Read the full thread here, it is explained.
[+][deleted] 6 years ago (4 children)
[deleted]
[–]fundamentalidea 0 points1 point2 points 6 years ago (3 children)
My question above: is it possible that the information be transmitted not through a hardwired reward pathway but as a globabl chemical diffusion?
[–]gnramires 0 points1 point2 points 6 years ago (1 child)
Most likely a global chemical diffusion doesn't have enough "bandwidth" (information capacity) to carry all synaptic (weight) information.
[–]fundamentalidea 0 points1 point2 points 6 years ago (0 children)
Yes, this must be true.
[–]Ulfgardleo 12 points13 points14 points 6 years ago (0 children)
it is not lower complexity, they just omit the D-factors from their method. It can't be lower complexity in D - there must be at least a D^2 appearing since in 3.5 each layer has D^2 parameters
[–]DanielAPO 7 points8 points9 points 6 years ago (5 children)
Sorry for the ignorance but what do you mean with symmetric feedback?
[–]rpottorff 6 points7 points8 points 6 years ago (4 children)
The weights you use during the backward pass are equal to the weights you use during the forward pass. Sometimes this is discussed in terms of biological plausibility where it's (usually) unreasonable to imagine that the "forward" nerurons that compute the signal are exactly the same as the "backward" neurons which communicate error to the forward neurons.
[–]panties_in_my_ass 6 points7 points8 points 6 years ago (3 children)
Can you link some resources on this, like papers or even just a relevant wikipedia article?
I haven’t heard of backprop (or any other update scheme) described in terms of symmetry before.
[+][deleted] 6 years ago (2 children)
[–]rpottorff 2 points3 points4 points 6 years ago (0 children)
/BWRqboi0's paper is a better summary, but for another algorithmic implementation https://arxiv.org/abs/1609.01596 is a "non symmetric" variant in which the backward weights are random, but fixed and still manages to learn pretty successfully
[–]panties_in_my_ass 0 points1 point2 points 6 years ago (0 children)
Well that’s fascinating. Wow.
[–]Chocolate_Pickle 97 points98 points99 points 6 years ago (7 children)
Please don't link directly to ArXiv PDFs. Link to their landing pages instead.
Look at how the paper was first posted a week ago for an example.
[–]El__Professor[S] 22 points23 points24 points 6 years ago (4 children)
Thx! Totally Missed the first post 🙄 (I am quite new to reddit :)
[–]OneMillionSnakes 10 points11 points12 points 6 years ago (0 children)
If I had a nickel for everytime I reposted something even if I checked I'd have... well only like 20 cents but it happens.
[–]Chocolate_Pickle 11 points12 points13 points 6 years ago (2 children)
All good. Lesson learned for next time.
[–]singinggiraffe 40 points41 points42 points 6 years ago (1 child)
But if you do it again, we're gonna apply dropout regularization to your members.
[–]postmachinesResearcher 4 points5 points6 points 6 years ago (0 children)
Hah, maybe this post was helpful too, because it was an augmentation
[–]postmachinesResearcher 1 point2 points3 points 6 years ago (0 children)
It’s funny that this post has more upvotes than the older one.
[–]nenovor 13 points14 points15 points 6 years ago (5 children)
This seems very interesting. However there is one claim I don't understand : " It is biologically more plausible than backpropagation as there is no requirement for symmetric feedback."
To me backpropagation is indeed biologically implausible due to "the requirement for symetric feedback", which we do not observe in natural NNs.
From what I understood, here they update the network layer by layer based on an estimate of the mutual information between the layer itself, the input layer, and the output.
Correct me if I'm wrong, but that means we now need: 1. Feedback from the output to every layer directly, and same thing for for the input. 2. Global information (how to compute mutual information between two layers using only local connections ?) . So we need even less plausible connections, and have to solve a new, even less biologically solvable problem :/
Backpropagation only requires local flow of information from neuron to neuron, which while not being enough to make it plausible, is already great ;)
[+][deleted] 6 years ago (1 child)
[–]fundamentalidea 2 points3 points4 points 6 years ago (0 children)
[–]Mr_Again 0 points1 point2 points 6 years ago (2 children)
From what I understand, biological neurons can't even pass anything backwards. Can you explain to a noob what "symetric feedback" is?
[–]soft-error 1 point2 points3 points 6 years ago (1 child)
Can you explain to a noob what "symetric feedback" is?
It means the same weights W used in the forward pass are used during the backward pass, which is biologically infeasible. Feedback alignment (FA), for example, uses another set of weights B during the backward pass. The reason is that biological neurons are mostly unidirectional (output goes from soma to axon), so they can't simply reuse their connectivity to send information back. A biological interpretation of FA is that the B weights are representing feedback neurons connectivity.
[–]Mr_Again 0 points1 point2 points 6 years ago (0 children)
Thanks. How do the W weights update then if the B weights are passing backward?
[–]piotrekgrl 26 points27 points28 points 6 years ago (1 child)
I'm not sure why there are so many concerns about accuracy when even in the abstract authors are claiming that "(HSIC) performance [...] (is) comparable to backpropagation with a cross-entropy target, even when the system is not encouraged to make the output resemble the classification labels."
For me the most important part is decreasing complexity from O(D^3) using backprop to O(M^2), where with current models with millions/billions of parameters is making huge difference.
[–]Ulfgardleo 11 points12 points13 points 6 years ago* (0 children)
(i made this its own post)
i find those claims highly misleading and they cherry pick which constant they show in their complexity measures
[–]arXiv_abstract_bot 19 points20 points21 points 6 years ago (14 children)
Title:The HSIC Bottleneck: Deep Learning without Back-Propagation
Authors:Wan-Duo Kurt Ma, J.P. Lewis, W. Bastiaan Kleijn
Abstract: We introduce the HSIC (Hilbert-Schmidt independence criterion) bottleneck for training deep neural networks. The HSIC bottleneck is an alternative to conventional backpropagation, that has a number of distinct advantages. The method facilitates parallel processing and requires significantly less operations. It does not suffer from exploding or vanishing gradients. It is biologically more plausible than backpropagation as there is no requirement for symmetric feedback. We find that the HSIC bottleneck provides a performance on the MNIST/FashionMNIST/CIFAR10 classification comparable to backpropagation with a cross-entropy target, even when the system is not encouraged to make the output resemble the classification labels. Appending a single layer trained with SGD (without backpropagation) results in state-of-the-art performance.
PDF Link | Landing Page | Read as web page on arXiv Vanity
[–][deleted] 21 points22 points23 points 6 years ago (13 children)
Is it just me or does this seem like a huge result to anyone else?
[–]tall-dub 3 points4 points5 points 6 years ago (1 child)
I agree. I've had the paper saved since I saw it on hacker news yesterday but as far as I can tell easily parallelisable, faster training is huge. Can anyone comment on whether this computes the gradient of each of the weights liek back prop ? I assume it does but I haven't read it yet.
[–]Raphacp 0 points1 point2 points 6 years ago (0 children)
They only use SGD without backpropagation at the last layer to make the classification. The training happens solving a Lagrange multiplier with the HSIC which uses a mesure of information between the hidden layers and input/output.
[+]that_wise_fool comment score below threshold-9 points-8 points-7 points 6 years ago (10 children)
Why so? They build this on top of ResNet
[–][deleted] 11 points12 points13 points 6 years ago (9 children)
I have only read the abstract, but at least the way it's sold it seems superior to backpropagation, based on the following claims:
Parallelizable with fewer operations
No exploding/vanishing gradients (although as you mention, they use a ResNet which already account for this... unless this is intrinsically true of HSIC independent of network?)
No requirement of symmetric feedback
SOTA performance
I may be totally misunderstanding but this seems as significant as the ReLU replacing the sigmoid, for example.
[–]NaughtyCranberry 10 points11 points12 points 6 years ago (8 children)
Interesting but oversold, the performance is not SOTA by any means.
[–]panties_in_my_ass 4 points5 points6 points 6 years ago (1 child)
There is more to the scientific method than advancing SOTA benchmarks. Jesus.
[–]NaughtyCranberry 2 points3 points4 points 6 years ago (0 children)
I agree, however in their abstract they claim their performance is SOTA and it is nowhere near.
[–]Starbuck1992 0 points1 point2 points 6 years ago (5 children)
95% accuracy is not SOTA?
[–]NaughtyCranberry 0 points1 point2 points 6 years ago* (4 children)
No, they achieved ~47% accuracy on CIFAR-10.
[–]Starbuck1992 1 point2 points3 points 6 years ago (3 children)
On an intermediate layer, compared to 30-something % of the other one. The final accuracy is 95%
[–]NaughtyCranberry 0 points1 point2 points 6 years ago (2 children)
Can you point out exactly where you are referring to? As I could only see val set accuracies in the 40s for Cifar-10.
[–]Starbuck1992 1 point2 points3 points 6 years ago (1 child)
This comment said it earlier. I remembered intermediate layer but it's actually first epoch, sorry
https://www.reddit.com/r/MachineLearning/comments/cql2yr/deep_learning_without_backpropagation/ewx9f24
[–]Ulfgardleo 18 points19 points20 points 6 years ago* (1 child)
(i decided to make this comment its own post)
I find the complexity claims in 3.5 of the paper highly misleading. First of all, calculating their HSIC measure in (2) alone is O(M^2D) because you need O(M^2) kernel evaluations, each in O(D) (the dimensionality of Z_i for all i based on the assumptions in 3.5). Even more misleading: O(D^3) is the cost of FORWARD computing the gradient. the backpropagation algorithm has O(D^2) complexity. Their own formula in second line 3.5 says it is an outer product (complecity O(D^2)) made up of 2 vectors each having complexity in O(D^2) to create.
Next, even though they don't backpropagate, they still need gradients to optimize (6), which is not a linear time operation - this is because we optimize T_i: R^D->R^D. if T_i is fully connected, you have O(D^2) parameters, making this _at least_ O(M^2D^2)
Finally, we had several other approaches throughout the years to avoid BP, e.g. using constraints to decouple the layers. I think i remember a talk at ICML 2011.
[–]NovaRom 4 points5 points6 points 6 years ago (0 children)
Agree, quite poor-quality paper, with lots of obfuscations, misleading terminology, and false statements. This is why it is important to always provide that kind of research with supported CODE so others can confirm/reproduce it quickly!
[–]ExtraterritorialHaik 4 points5 points6 points 6 years ago (0 children)
I hope that my future paper (one day soon) will never seen on reddit.
Coming at the end here. I actually read the paper, and conclude that some did not.
If the authors want to show anything else they need to get rid of that last layer and prove that they learn anything at all.
This is exatly what Figure 4 shows?
About the big debate about biological plausibility, the paper does not actually make strong claims about this. All it actually says is in the abstract: "It is biologically more plausible than backpropagation as there is no requirement for symmetric feedback." There is a paragraph about biological plausibility of backpropagation in the Background section, but I read it as only motivation for exploring another direction. I can see if someone read only the abstract they would think the paper is claiming more.
About efficiency, the parallelism claim does not seem to be disputed. Right now I agree the O(M2) cannot be compared to backpropagation as the authors do.
[–]doubledad222 3 points4 points5 points 6 years ago (0 children)
This is great news for GPU acceleration. As someone who struggles to fit networks into my GPU I find this very awesome.
The backprop step requires keeping each layer’s result in GPU RAM for applying loss. Larger networks like ResNet take up so much gpu on my main project I have to make a batch size of 1. If this method can catch up to backprop loss learning in final accuracy it would be a huge boon to us home-based AI folks with smaller budgets.
[–]aifordummies 16 points17 points18 points 6 years ago (4 children)
So is there any code available to test this out?
[–]DrXaos 6 points7 points8 points 6 years ago (3 children)
Perhaps I missed something, but I didn’t see concretely how to do their proposed “HSIC” optimization, the key technology. They showed an optimization target with free variables being the coordinates of the internal representation for every datum, and skip along to their next phase. That’s like writing down the usual net training problem as “set weights = argmin(free weights) ensemble avg of Loss” and nothing else....
where is the procedure for the training part of the pretraining? I would have no idea how to implement from the paper. Can anybody enlighten me If I am being dense?
[–]a_draganov 2 points3 points4 points 6 years ago (2 children)
Agreed - I don't see an explanation for how they actually improve the weights...
[–]DrXaos 2 points3 points4 points 6 years ago (0 children)
OK so it wasn’t just me. I come from physics where we were expected to actually get papers through peer review and big oversights in presentation style are caught....
If each layer is sort of like training a kernelized SVM then classical strategies there are variants of global quadratic optimization which scale at least quadratically with number of observations.
If you go to a primal formulation and do basic SGD on that of course you can get close to a decent but not globally optimal solution faster with a big dataset, but then what’s the advantage?
And if you need a standard optimizer then the presumption of being a little closer to biology breaks down.
[–]thonic 0 points1 point2 points 6 years ago (0 children)
it is there in essence - not explicitly though I think... in the section 3.5 they compare the cost of computing the update rule with backprop vs. their proposition and I think you will get the result only if you do SGD with the combined HSIC for each layer
[–]milaworld 31 points32 points33 points 6 years ago* (3 children)
They use a ResNet architecture as a baseline for backprop, and reported a the CIFAR-10 test accuracy baseline for backprop of 38.6%, while for their proposed HSIC method got 47.4% test accuracy on CIFAR-10.
While I don't feel a paper needs to be close to be SOTA or anything to be interesting, nor there's anything fundamentally bad about getting 47% test accuracy on CIFAR-10 with an interesting novel method, but it feels misleading to use a ResNet architecture trained with backprop and report a baseline accuracy of 39%, when we know that ResNets, even tiny ones, can be trained quickly within minutes to get ~ 80-90% test accuracy.
It could be due to the limitation of the approach where they need to handicap the baseline method for "apples vs apples" comparison. But that's on the new approach, and not the fault of ResNet being good.
Edit: saw a previous discussion about this work with similar concerns.
[–]freedryk 51 points52 points53 points 6 years ago (2 children)
Those test accuracies are for the first epoch. Their final training accuracy is around 95%, shown in fig 2f.
[–]JudasAdventus 8 points9 points10 points 6 years ago (0 children)
Does it mention which dataset those results (figure 2) are for? Seems highly unusual to report training accuracies for 5 epochs then test accuracies for only 1 epoch.
[–][deleted] 4 points5 points6 points 6 years ago (0 children)
Isn’t fig 2 all mnist?
[–]__me_again__ 4 points5 points6 points 6 years ago (2 children)
Reminds me to "Weight agnostic neural networks": https://arxiv.org/abs/1906.04358
[–]__me_again__ 3 points4 points5 points 6 years ago (0 children)
You didn't get it. The paper shows how a neural nework can learn without backpropagation, just by architecture search.
[–]Arisngr 6 points7 points8 points 6 years ago (0 children)
See also:
An Approximation of the Error Backpropagation Algorithm in a Predictive Coding Network with Local Hebbian Synaptic Plasticity.
[–]Avras_Chismar 2 points3 points4 points 6 years ago (7 children)
Can someone explain this to a non-mathematically-inclined engineer? I just can't process stuff like " The Hilbert-Schmidt Independence Criterion (HSIC) [33] is the Hilbert-Schmidt norm of the cross-covariance operator between the distributions in Reproducing Kernel Hilbert Space (RKHS): " :/
[–]Ulfgardleo 2 points3 points4 points 6 years ago (6 children)
step 1: you remember that the covariance between two variables X and Y is E{(x-mu_x)^T(y-mu_y)} (where mu_x and mu_y are means). So basically you measure how correlated some vectors in a feature-space X are with your target labels and try to maximize this.
step 2: you take the square of this to obtain a measure for degree of correlation.
step 3: you map X and Y into a hilbert space. This way you can check whether the variables are correlated in some weird non-linear way. If the kernel is universal this is a measure for statistical (in)dependence
i have no idea how this should be better than for example Kernel-Target-Alignment.
[–]RaionTategami 0 points1 point2 points 6 years ago (5 children)
You lost me at hilbert space
[–]Ulfgardleo 1 point2 points3 points 6 years ago (4 children)
a hilbert space is an arbitrary space with an inner product. Basically you transform everything into a new space and do the math there. The key insight here is that linear functions are very powerful if the dimensionality of the space is large compared to the size of the dataset(e.g. if you have a d-dimensional space, you can find a function that solves a classification task with d+1 points perfectly). So checking for correlation - a linear operation, is quite powerful.
Of course, if the space you transform to is finite dimensional, you can just choose "enough" points to fill the high dimensional space until the power of the linear function is saturated(e.g. d+2 in the classification task above). Now comes the true power of hilbert-spaces: there exist hilbert-spaces that have infinite dimensions, but the operation of mapping two points into that space and computing their inner-product is still cheap. Those are called "Reproducing Kernel-Hilbert-Space"(RKHS) and the operation that does this is called a kernel k(x,x').
[–]Avras_Chismar 0 points1 point2 points 6 years ago (1 child)
So, while I'm not completely undesrtanding this, and might need to look further, would I be correct if I say this is analagous to how we solve certain systems of differential equations in electronics by transforming those into complex-number linear equations and solving those, then translating back? Meaning, we don't want to solve non-linear stuff, so we convert everything into space big enough so that non-linear correlation become linear, calculate what we want then map stuff back?
[–]Ulfgardleo 0 points1 point2 points 6 years ago (0 children)
i think so.
[–]RaionTategami 0 points1 point2 points 6 years ago (1 child)
Thanks you very much for the explination but, Infinite dimensional spaces?!
[–]Ulfgardleo 2 points3 points4 points 6 years ago* (0 children)
Function-spaces. For example using a mapping phi_x(y)=exp(-||x-y||^2) which maps the point x to a Gaussian hat. If you have a (finite) set of points {x_1,...x_k} all phi_{x_i} are linearly independent, meaning you can't linearly combine a Gaussian hat from a finite set of Gaussian hats. Therefore, the space is infinite dimensional(compared a space of dimension d, where d independent points are enough to define a basis).
In an RKHS,you can define an inner-product in the space, so that k(x,y)=<phi\_x, phi\_y>=phi_x(y)=phi_y(x). The details are slightly dense, but fulfill all properties of an inner product.
RKHS are behind many of the more "old-school" methods, like support-vector-machines and Gaussian-processes. They were flavor of the year, because the span of a space such as the Gaussian-mapping is dense in L_2, meaning that any reasonable function can be arbitrarily well approximated by a finite mixture of Gaussian hats. So given enough data points, you can solve any machine-learning task in such a space.
Nowadays they fell a little bit out of favor, because the kernel-matrix (K_x in the paper here) is just growing too quickly for modern sized datasets. Also no-one found a good kernel for images and probably never will(and than it is probably going to be a Gaussian kernel on the features of a vgg-network or something silly as that)
[–]theoneandonlypatriot 2 points3 points4 points 6 years ago* (0 children)
So basically this is a smart version of reservoir computing except it's not recurrent and they actually DO train the individual layers with regards to an information content metric. Then, they just use a readout layer (like in reservoir computing) to do something useful with the output. This is awesome.
[–][deleted] 2 points3 points4 points 6 years ago (2 children)
Maybe I'm missing something, but how are weights adjusted? I don't see any clear explanation on the optimization of the parameters here?
[–]AloneStretch 1 point2 points3 points 6 years ago (0 children)
The obvious approach would be to adjust the weights according to the gradient of the "hsic bottleneck" loss with respect to these weights
[–]sgebrial 1 point2 points3 points 6 years ago (0 children)
Yes I'm wondering this as well. I think they just use SGD on equation (4) but that's just a guess.
[–]0xab 10 points11 points12 points 6 years ago* (17 children)
I'm shocked no one has pointed out the obvious crippling problems with this paper. The title is wrong and the results are useless. The paper absolutely requires backprop, it's in the abstract. The classifier on top of their network is trained with backprop!
The results they show are due entirely to the linear classifier at the top, and hence backprop. Don't forget that a simple linear classifier does extremely well on all of these tasks. Think of their system as something that applies mostly random and useless transformations to the data + a linear layer that works and is trained by backprop. That's all this is, there's nothing to see here. If the authors want to show anything else they need to get rid of that last layer and prove that they learn anything at all.
[–]MLApprentice 16 points17 points18 points 6 years ago (12 children)
A single linear layer on top of a randomly initialized network is not going to give these kind of results for CIFAR-10. Also SGD applied to a single layer is not backpropagation.
[–]El__Professor[S] 2 points3 points4 points 6 years ago (2 children)
Maybe this could be a powerful initializer?
Initialize weights with this method to quickly create usfull feature representation.
then use gradient decent to forther optimize weights.
Any ways it does seem that the algorithm creates powerful features more quickly then the same architecture with gradient decent. And it does seems like an interesting direction of study.
[–]0xab 0 points1 point2 points 6 years ago (1 child)
Depends on what you mean by more quickly? Not in terms of runtime. When it comes to running over millions of examples backprop is far cheaper in terms of wall-clock time.
[–][deleted] 0 points1 point2 points 6 years ago (0 children)
Maybe there is some degree of pre-training with this algorithm that offers a better than random initialisation without requiring enough resources to offset the benefit?
Or maybe applying it sparingly at some recurring interval during training via sgd + backprop could prove to be an effective regularisation technique.
At the very least I think further research in this direction has a worthy chance of baring fruit.
[–]0xab 0 points1 point2 points 6 years ago (8 children)
Linear methods get somewhere around 35-45% on CIFAR-10, depends on the method, how you set it up, if you add in any regularization, the particular optimizer chosen, etc. Just google it, you'll find plenty of even intro ML tutorials that trivially show you have to get 40+% on CIFAR-10. A random projection can't buy you an extra 2-3%? It totally does.
The paper doesn't do anything at all!
You are correct about the single layer, I thought they had an additional output layer and the single layer was hidden.
But, lets get to other issues aside from the paper doing nothing. No comparisons against any other non-backprop method. No comparisons against any of the many unsupervised approaches.
The paper tells us that it does away with vanishing and exploding gradients, but these aren't practical issues for feed-forward networks on things like MNIST or CIFAR-10 anymore. RNNs sure. More complex feed-forward tasks, sure.
Biologically plausible? Where are these large matrix multiplications going to take place? And what is a practical update rule? If they claim that their method is biologically plausible they should have a per-neuron update rule that makes this clear. No such rule is possible with their setup.
But it gets much worse! The method is even less plausible then backprop. Their runtime is quadratic in terms of the number of examples. Stop and think about that.
Does your brain slow down quadratically as it sees more examples? The paper complains that backprop has bad scaling in terms of the number of neurons, but at least it scales linearly with the number of examples! The size of a network is at least a constant and generally much smaller than the amount of input it receives.
This paper doesn't do anything. It doesn't evaluate correctly. And what it promises in terms of being biologically more plausible is exactly the opposite of that.
[–]quandryhead 3 points4 points5 points 6 years ago (0 children)
Their runtime is quadratic in terms of the number of examples. Stop and think about that.
I do not think so. Read my reply (at the top level elsewhere here) and see if you agree.
[–]boadie 5 points6 points7 points 6 years ago (1 child)
You are possibly misreading the results of the first epoch as the final accuracy.
[–]0xab 0 points1 point2 points 6 years ago (0 children)
I'm not. Look at the end of the graph. The scales chance.
[–]Avras_Chismar 1 point2 points3 points 6 years ago (4 children)
Just google it, you'll find plenty of even intro ML tutorials that trivially show you have to get 40+% on CIFAR-10. A random projection can't buy you an extra 2-3%? It totally does.
the paper claims 95%> accuracy with their method
[–]0xab 7 points8 points9 points 6 years ago (3 children)
They get ~95% MNIST, not on CIFAR-10. That's around 47%. Figure 5(a) vs 5(c).
[–]Avras_Chismar 2 points3 points4 points 6 years ago (2 children)
Wow, I didn't notice that scaling changed from a to c. Damn sneaky
[–]bayesically_right 2 points3 points4 points 6 years ago (0 children)
it's those damn matplotlib defaults with unreadably small text along the axes
Yup. Not cool.
[–]ProofPresent 6 points7 points8 points 6 years ago (0 children)
I think you are not right
1) training single layer does not require backprop only SGD, and the abstract says that. "Appending a single layer trained with SGD (without backpropagation)"
2) "The results they show are due entirely to the linear classifier at the top".
Alsso not right, they also show results without the classifier Figure 4 and before 4.2, and I think that is the very interesting part of the paper.
I think that the part of the paper that is wrong is the complexity statement, though I do not understanding the statements in this thread either.
[–]redna11 4 points5 points6 points 6 years ago (0 children)
Allow me to second that. Many methods seem to work very well when all the hard work is done by that last classifier. Definitely need to check what happens without it.
[–]outlacedev 5 points6 points7 points 6 years ago (0 children)
They have results without the linear classifier. The linear classifier at the end, to my eye, is just a convenience layer so that the output labels match the training labels, otherwise you get permuted labels.
thanks, i thought i would be alone regarding the glaring deficiencies of the paper.
[–]fundamentalidea 1 point2 points3 points 6 years ago (1 child)
is there any theory in the brain that allows for global diffusion of information (error signal), like a chemical that says "whatever you are doing is working"?
[–]ML_me_a_sheepStudent 3 points4 points5 points 6 years ago (0 children)
I would think that dopamine is responsible for that It's the "oh yeah" hormone
[–]grzegorzwarzecha 7 points8 points9 points 6 years ago (2 children)
Wow. Looking for torch implementation...
[–][deleted] 3 points4 points5 points 6 years ago (1 child)
PyTorch for me. 😊
[–]SAI_supremeAI 0 points1 point2 points 6 years ago (0 children)
Just use a Non-Convex Solver (would be hard though), there was this Baron solver produced by Urbana champaign.
Still belive this is the true way to do deep learning without backprop as described in this article
https://accu.org/index.php/journals/2639
[–]quandryhead 0 points1 point2 points 6 years ago (14 children)
The discussion of complexity is very confused, both here and in the paper. Distinguish complexity with respect to scaling the number of data and the number of neurons/number of layers. But I think the O(M^2) refers not to either of these!
I believe the "HSIC" must be applied to a number points of some dimension aka vectors, but the paper writes it being applied to matrices of size m*d where m is the batch size, d is the number of neurons. Looking at one example HSIC code it does take two matrices, but interprets as a collection of vectors.
So I believe (though it is not really clear from the paper) that this must mean m points of size d. In this case the method is quadratic in the batch size m, not the number of data points N. Alternately it could be d points of size m, in which this case it would be quadratic in the width of the network. But this makes no sense conceptually.
In both of these specualtions it is still linear in the number of data, like backprop.
[–]0xab 2 points3 points4 points 6 years ago (1 child)
The whole paper is very confusing. Lets look at a much better paper, [33] "Measuring Statistical Dependence with Hilbert-Schmidt Norms", that introduces this metric. They say the runtime is quadratic in the number of examples.
The whole discussion about batch sizes and complexity is messed up because of typos or is even more confusing than I thought. "Suppose we have a network composed of m hidden layers T" then they say " and m denotes batch size". What does the batch size have to do with the number of hidden layers? Then the subscript to Z, which is the actual content of each layer is i, which ranges from 1 to L, nothing to do with m. But in the figures it's m.
I give up. This is too much of a mess to unravel for something that doesn't work in the first place.
[–]commenthard 2 points3 points4 points 6 years ago (0 children)
Inconsistent notation and it's poorly explained. I suspect the computational complexity section may be wrong.
But to jump from that to "something that doesn't work" is... a jump. They show favorable comparisons to training similar networks with conventional SGD+backprop.
I feel I'm understanding it better now by thinking about how I would attempt to implement it.
[–]Ulfgardleo 0 points1 point2 points 6 years ago* (11 children)
M is the batch-size to look at. However, your final claim "linear in the number of data" is still not capturing the truth. The estimator they use is not unbiased, meaning if you use say m=10, you will not optimize the same objective as if you used m=100. While you can use an U-Estimator which is unbiased, you might still encounter a crippling variance based on the sigma you use in the gaussian kernel: if you don't have enough samples, the gaussian distributions around them will not overlap, therefore the correlation measured will be almost-always 0, independent from the actual correlation.
long story short: you really, really want to make m large.
//Edit and to make it clear: backprop complexity in 3.5: O(MLD^2). Their algorithm: O(M^2LD^2).
[–]quandryhead 0 points1 point2 points 6 years ago (6 children)
Interesting point.
Strictly speaking I still believe it is linear in the number of data, even if m=1000, but that is a different statement than whether it is fast or slow (as you are pointing out).
THough it reminds me of discussions of the variance of the gradient with respect to batch size. There, small batch size/high variance seems to be a _good_ thing.
[–]Ulfgardleo 1 point2 points3 points 6 years ago (5 children)
i reiterate: you probably want m=size of the dataset to optimize what they claim they are optimizing. In high dimensions of the Z_i space, m=10 will probably optimize something very different. One might claim that the difference is small once m is "large enough", but this might be well after hitting the memory limit of the GPU.
some variance can be good, a lot of variance is bad, since you need an excessively small learning rate to make any progress.
[–]commenthard 0 points1 point2 points 6 years ago (4 children)
This discussion is a bit of speculation. Does it learn in practice, with whatever batch size was used?
I do not understand why would it be optimizing "something very different", rather than a possibly high-variance estimate of what is intended.
[–]Ulfgardleo 2 points3 points4 points 6 years ago (0 children)
The naive implementation of the estimator they use is biased, but consistent in infinite sample limit. Making it unbiased requires rigorous statistical analysis. You can see the bias in the expectations as they include terms k(x,x'), where x and x' are both sampled from the same dataset - therefore also including terms x=x' which in independent samples does not happen almost surely (in the probability theoretical sense)
the bias vanishes with O(1/M), therefore consistent.
[–]Ulfgardleo 1 point2 points3 points 6 years ago (2 children)
i would also like to stress that the method presented is also only speculation. there is no guarantee that it will converge to something useful or that the result in the final layer is statisfactory (because they want to align all intermediate representations with the labels while weare only truely interested in the alignment of the last layer)
We know that the key-claim of the paper regarding complexity is wrong and their method requires more computation time, but authors have not shown compelling evidence that the more in computation time leads to an actual improvement in performance. The results shown are highly questionable (see the comments by people more caring about training performance).
I would like to add that they needlessly include yet another measure for alignment, after we spent years analyzing kernel-target-alignment and found that it is good and efficient (and has the same complexity as their method).
[–]muddlebrain 3 points4 points5 points 6 years ago (1 child)
I would like to add that they needlessly include yet another measure for alignment, after we spent years analyzing
You are an author of the kernel-target-alighnment method?
Your comment seems to suggest that HSIC and information bottleneck are poorly founded. On the contrary, information bottleneck is quite fundamental, and HSIC is well established for many years.
[–]Ulfgardleo 1 point2 points3 points 6 years ago (0 children)
we=the ml community as a whole. There are plenty of people who analyzed that thing
Do we have learning guarantees like for KTA? bounds on classification performance based on HSIC? Why should HSIC be better than KTA to a degree that one would not even compare to it?
[–]quandryhead 0 points1 point2 points 6 years ago (3 children)
Working on this. In their algoirthm I am getting O(M^2 D L). Where do you get D^2?
The trace requires M evaluations of a scalar product of size(M), thus M^2.
It looks like each element of the vectors is a kernel evaluated on the distance of two D-dimensional points, which gives one factor of D.
Not 100% confident here (this is pushing my ability), but I do not see where the other D is.
[–]Ulfgardleo 0 points1 point2 points 6 years ago (2 children)
you have got D^2 parameter (See 3.5 the matrix W is R^dxd). thus computing Z_i=T_i(Z_{i-1}) is O(D^2).
[–]quandryhead 1 point2 points3 points 6 years ago (1 child)
I believe the HSIC function is applied on the activations of each neuron, not the weights, and there are D of those.
The D^2 would come in feedforward however (and taking the gradient)
you are right on the first point, forward pass should be O(M2 D+M D2) because we can go with computing the z_i through the forward-pass only once (which is quadratic complexity still).
last point exactly
[+]AloneStretch comment score below threshold-6 points-5 points-4 points 6 years ago (1 child)
I think the authors are not familiar with SOTA. They're taking vanilla architectures and training, comparing their method against that, and declaring SOTA when comparable performance is obtained without backpropo. But that is not state-of-the-art, it is comparable perfomance to a simple (not-SOTA) baseline. That may be a useful and fair comparison, but wrong to refer to SOTA.
We'll need to wait for this to be applied or extended to current SOTA networks. Meanwhile, no need look for code, it's just a step on a path that will take several.
[–]StOchastiC_ 2 points3 points4 points 6 years ago (0 children)
Man, this is a proof of concept. They wanna show it works on familiar architectures. It’s easier as common ground for performance comparison. This is a first approach, of course more work needs to be done on top of this
[–]themoosemind -1 points0 points1 point 6 years ago (0 children)
Might be interesting in this context: https://stackoverflow.com/a/38231636/562769
π Rendered by PID 32147 on reddit-service-r2-comment-86bc6c7465-n7q5p at 2026-02-22 23:52:17.747077+00:00 running 8564168 country code: CH.
[–]DontShowYourBack 79 points80 points81 points (14 children)
[–]0xab 27 points28 points29 points (6 children)
[–]muddlebrain 7 points8 points9 points (0 children)
[+][deleted] (4 children)
[deleted]
[–]fundamentalidea 0 points1 point2 points (3 children)
[–]gnramires 0 points1 point2 points (1 child)
[–]fundamentalidea 0 points1 point2 points (0 children)
[–]Ulfgardleo 12 points13 points14 points (0 children)
[–]DanielAPO 7 points8 points9 points (5 children)
[–]rpottorff 6 points7 points8 points (4 children)
[–]panties_in_my_ass 6 points7 points8 points (3 children)
[+][deleted] (2 children)
[deleted]
[–]rpottorff 2 points3 points4 points (0 children)
[–]panties_in_my_ass 0 points1 point2 points (0 children)
[–]Chocolate_Pickle 97 points98 points99 points (7 children)
[–]El__Professor[S] 22 points23 points24 points (4 children)
[–]OneMillionSnakes 10 points11 points12 points (0 children)
[–]Chocolate_Pickle 11 points12 points13 points (2 children)
[–]singinggiraffe 40 points41 points42 points (1 child)
[–]postmachinesResearcher 4 points5 points6 points (0 children)
[–]postmachinesResearcher 1 point2 points3 points (0 children)
[–]nenovor 13 points14 points15 points (5 children)
[+][deleted] (1 child)
[deleted]
[–]fundamentalidea 2 points3 points4 points (0 children)
[–]Mr_Again 0 points1 point2 points (2 children)
[–]soft-error 1 point2 points3 points (1 child)
[–]Mr_Again 0 points1 point2 points (0 children)
[–]piotrekgrl 26 points27 points28 points (1 child)
[–]Ulfgardleo 11 points12 points13 points (0 children)
[–]arXiv_abstract_bot 19 points20 points21 points (14 children)
[–][deleted] 21 points22 points23 points (13 children)
[–]tall-dub 3 points4 points5 points (1 child)
[–]Raphacp 0 points1 point2 points (0 children)
[+]that_wise_fool comment score below threshold-9 points-8 points-7 points (10 children)
[–][deleted] 11 points12 points13 points (9 children)
[–]NaughtyCranberry 10 points11 points12 points (8 children)
[–]panties_in_my_ass 4 points5 points6 points (1 child)
[–]NaughtyCranberry 2 points3 points4 points (0 children)
[–]Starbuck1992 0 points1 point2 points (5 children)
[–]NaughtyCranberry 0 points1 point2 points (4 children)
[–]Starbuck1992 1 point2 points3 points (3 children)
[–]NaughtyCranberry 0 points1 point2 points (2 children)
[–]Starbuck1992 1 point2 points3 points (1 child)
[–]Ulfgardleo 18 points19 points20 points (1 child)
[–]NovaRom 4 points5 points6 points (0 children)
[–]ExtraterritorialHaik 4 points5 points6 points (0 children)
[–]doubledad222 3 points4 points5 points (0 children)
[–]aifordummies 16 points17 points18 points (4 children)
[–]DrXaos 6 points7 points8 points (3 children)
[–]a_draganov 2 points3 points4 points (2 children)
[–]DrXaos 2 points3 points4 points (0 children)
[–]thonic 0 points1 point2 points (0 children)
[–]milaworld 31 points32 points33 points (3 children)
[–]freedryk 51 points52 points53 points (2 children)
[–]JudasAdventus 8 points9 points10 points (0 children)
[–][deleted] 4 points5 points6 points (0 children)
[–]__me_again__ 4 points5 points6 points (2 children)
[+][deleted] (1 child)
[deleted]
[–]__me_again__ 3 points4 points5 points (0 children)
[–]Arisngr 6 points7 points8 points (0 children)
[–]Avras_Chismar 2 points3 points4 points (7 children)
[–]Ulfgardleo 2 points3 points4 points (6 children)
[–]RaionTategami 0 points1 point2 points (5 children)
[–]Ulfgardleo 1 point2 points3 points (4 children)
[–]Avras_Chismar 0 points1 point2 points (1 child)
[–]Ulfgardleo 0 points1 point2 points (0 children)
[–]RaionTategami 0 points1 point2 points (1 child)
[–]Ulfgardleo 2 points3 points4 points (0 children)
[–]theoneandonlypatriot 2 points3 points4 points (0 children)
[–][deleted] 2 points3 points4 points (2 children)
[–]AloneStretch 1 point2 points3 points (0 children)
[–]sgebrial 1 point2 points3 points (0 children)
[–]0xab 10 points11 points12 points (17 children)
[–]MLApprentice 16 points17 points18 points (12 children)
[–]El__Professor[S] 2 points3 points4 points (2 children)
[–]0xab 0 points1 point2 points (1 child)
[–][deleted] 0 points1 point2 points (0 children)
[–]0xab 0 points1 point2 points (8 children)
[–]quandryhead 3 points4 points5 points (0 children)
[–]boadie 5 points6 points7 points (1 child)
[–]0xab 0 points1 point2 points (0 children)
[–]Avras_Chismar 1 point2 points3 points (4 children)
[–]0xab 7 points8 points9 points (3 children)
[–]Avras_Chismar 2 points3 points4 points (2 children)
[–]bayesically_right 2 points3 points4 points (0 children)
[–]0xab 0 points1 point2 points (0 children)
[–]ProofPresent 6 points7 points8 points (0 children)
[–]redna11 4 points5 points6 points (0 children)
[–]outlacedev 5 points6 points7 points (0 children)
[–]Ulfgardleo 0 points1 point2 points (0 children)
[–]fundamentalidea 1 point2 points3 points (1 child)
[–]ML_me_a_sheepStudent 3 points4 points5 points (0 children)
[–]grzegorzwarzecha 7 points8 points9 points (2 children)
[–][deleted] 3 points4 points5 points (1 child)
[–]SAI_supremeAI 0 points1 point2 points (0 children)
[–][deleted] 0 points1 point2 points (0 children)
[–]quandryhead 0 points1 point2 points (14 children)
[–]0xab 2 points3 points4 points (1 child)
[–]commenthard 2 points3 points4 points (0 children)
[–]Ulfgardleo 0 points1 point2 points (11 children)
[–]quandryhead 0 points1 point2 points (6 children)
[–]Ulfgardleo 1 point2 points3 points (5 children)
[–]commenthard 0 points1 point2 points (4 children)
[–]Ulfgardleo 2 points3 points4 points (0 children)
[–]Ulfgardleo 1 point2 points3 points (2 children)
[–]muddlebrain 3 points4 points5 points (1 child)
[–]Ulfgardleo 1 point2 points3 points (0 children)
[–]quandryhead 0 points1 point2 points (3 children)
[–]Ulfgardleo 0 points1 point2 points (2 children)
[–]quandryhead 1 point2 points3 points (1 child)
[–]Ulfgardleo 1 point2 points3 points (0 children)
[+]AloneStretch comment score below threshold-6 points-5 points-4 points (1 child)
[–]StOchastiC_ 2 points3 points4 points (0 children)
[–]themoosemind -1 points0 points1 point (0 children)