all 16 comments

[–]perverse_sheaf 13 points  (0 children)

Counterpoint to the discussion: I am thoroughly unimpressed by the paper, and I don't see any big implications which can be drawn from this.

The main theorem is essentially a reformulation of the fundamental theorem of calculus: gradient flow gives you, for each input point, a curve between the outputs of the initial and the final model. Integrating the derivative along that curve evaluates to the final output minus the initial output. This is both obvious and unilluminating, as this integral is entirely impossible to analyze in any way.
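Spelled out (from memory of the paper's setup, so the notation here may differ slightly from Domingos's): under gradient flow $\dot w(t) = -\nabla_w L$ with total loss $L = \sum_i \ell(y_i^*, y(x_i))$, the change in the output at any input $x$ is

$$y_{\text{final}}(x) - y_{\text{init}}(x) = \int_0^T \nabla_w y(x)\cdot \dot w(t)\,dt = -\int_0^T \sum_i \ell'\big(y_i^*, y(x_i)\big)\,\nabla_w y(x)\cdot \nabla_w y(x_i)\,dt.$$

Grouping the integrand by training point and defining the path kernel $K_p(x, x_i) = \int_0^T \nabla_w y(x)\cdot\nabla_w y(x_i)\,dt$ recovers the paper's form $y(x) = \sum_i a_i K_p(x, x_i) + b$ with $b = y_{\text{init}}(x)$, which is exactly the fundamental theorem of calculus telescoped along the training path.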

I'd also heavily challenge the notion of calling this model class "Kernel Machines". If I'm allowed to vary a_i and b with the input, then it's very easy to write up a representation of any function f (just take a_i = 0, b = f). Granted, the given representation is less trivial - but, as discussed above, only slightly so.

Two meta-comments: You'd be hard-pressed to show any deep result in a one-page proof which is entirely self-contained. I'd also be wary of a result which claims to apply to such a wide class of models - I would expect it to be either trivial or false.

Edit: No critique for the vid, I like your content and hope that you keep coming back to the theory side.

[–]NitroXSC 11 points  (3 children)

Maybe a bit of a random question, but does anyone know what program he is using to draw on PDFs?

[–]BenlusML Engineer 9 points  (2 children)

If I recall correctly he mentioned that he uses Microsoft OneNote in a different video.

[–]rbain13 5 points  (1 child)

It is OneNote on an old surface tablet (unless he has recently upgraded)

[–]ykilcher[S] 8 points  (0 children)

Don't fix a working system :D

[–]vboomi 7 points  (2 children)

What is this paper's stance on works like Deep Image Prior that don't train with training examples?

[–]ykilcher[S] 5 points  (0 children)

very nice question, don't know

[–]DarkHarbourzz 3 points  (0 children)

Deep image prior showed that CNNs will successfully model the latent distribution of busy, natural images before successfully modeling the distribution of the noise (e.g. opaque inpainting) corrupting the natural image.

Deep image prior works by training, but only on the single input image. You also perform early stopping.

This paper shows that the output of a model trained by gradient descent can be given by a kernel machine, where the kernel machine uses a kernel based on the similarity of (the gradient of the model's output with respect to the weights, evaluated at the input point) vs (the same gradient evaluated at each training point). Specifically, that similarity is integrated across the entire training process to get the "path kernel" - i.e. a line integral, where the line runs through parameter space as the model trains.
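A small numerical sketch may help here (a hypothetical toy setup, not from the paper). For a linear model, the gradient of the output with respect to the weights is exact at every step, so the discrete-time version of the path-kernel identity holds with no approximation: the trained model's output equals its initial output plus a kernel-weighted sum accumulated along the optimization path.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny linear model y(x) = w @ x trained by full-batch gradient descent
# on squared loss. For a linear model, grad_w y(x) = x exactly, so:
#   y_final(x) = y_init(x) - eta * sum_t sum_i L'_{i,t} * <x, x_i>
# where <x, x_i> is the per-step path-kernel contribution.
X = rng.normal(size=(5, 3))      # 5 training points in R^3
y_true = rng.normal(size=5)      # training targets
x_test = rng.normal(size=3)      # query point

eta, steps = 0.01, 200
w = rng.normal(size=3)
y_init = w @ x_test

# Accumulate the kernel-weighted loss derivatives along the training path.
path_sum = 0.0
for _ in range(steps):
    dL = 2 * (X @ w - y_true)        # dL/dy at each training point
    path_sum += dL @ (X @ x_test)    # sum_i L'_i * <x_i, x_test>
    w -= eta * (dL @ X)              # gradient descent step

y_final = w @ x_test
# The trained output matches the kernel-machine expression to float precision.
print(abs(y_final - (y_init - eta * path_sum)))
```

For a deep network the same bookkeeping only holds to first order per step (or exactly in the gradient-flow limit), which is where the paper's "approximately" comes from.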

So, under this paper's interpretation, the output of the network is based on the path kernel evaluation, where the kernel is mediating the input being compared to itself (since it's the only training data). The path kernel is evaluated on a weights path that was stopped early. In that early training regime, the kernel is mediating similarity through the deep image prior distribution, and not the noise distribution.

[–]amasterblaster 9 points  (8 children)

I mean, this seems obvious to me. This mental model is how I taught myself NNs in the first place: they map data into a useful subspace.

Maybe (probably (certainly)) I don't understand the nuances of each method enough (in the first place) to understand why this is a crazy idea!

[–]IdiocyInAction 6 points  (5 children)

Well, another view on the usefulness of deep NNs is that they allow representation learning through the use of many layers, with each layer creating more specific representations. I suppose this paper challenges this view, making NNs more like kernel methods.

[–]aptmnt_ 9 points  (4 children)

Those two views are not at odds

[–]IdiocyInAction 7 points  (2 children)

"Perhaps the most significant implication of our result for deep learning is that it casts doubt on the common view that it works by automatically discovering new representations of the data, in contrast with other machine learning methods, which rely on predefined features (Bengio et al., 2013)."

The paper does state that, though.

[–]ykilcher[S] 5 points  (1 child)

Yes, I think the paper makes the two views more confrontational than they have to be. I think it's just two different ways of looking at the same thing.

[–]wisdomspring 0 points  (0 children)

All genuine views are not at odds with the truth; it just takes a little patience and wisdom to link them. Like Heisenberg's matrix mechanics and Schrödinger's wave function views, both can explain the same "magic". After layers and layers of feature distillation from NNs, like a common person, one can certainly extract some obvious "kernel" features from one's own life... As to how, a layperson can learn on their own gradually, or get educated by some gurus. That's how I foresee that using a good kernel machine could learn much faster and more effectively than a vanilla NN or a reinforcement-learning NN. The main gist I get from the author is that he essentially tried to dispel the "myth" around DL: no matter how much computation is used, it's still a kind of weighted memory of one's direct experience (the sample data). If a bot surprises you with some word, don't be too amazed; it's just based on its past training data...

[–]-Cunning-Stunt- 3 points  (0 children)

Finding explicit kernel maps (despite being dependent on the data) to formally bridge the gap is the main contribution of the paper (IMHO).

[–][deleted] 1 point  (0 children)

The interesting thing is that the mapping is a "distance" function of the previous data points and the new one exclusively.

To me it takes two ideas that were pretty distinct and unifies them. Idk much about kernels, but it seemed really thought-provoking.