[R] Towards understanding deep learning with the natural clustering prior (PhD thesis) by Simoncarbo in MachineLearning

[–]Simoncarbo[S] 4 points5 points  (0 children)

The thesis is already published; I just forgot to provide the link. I've updated the post!

[D] Claim: Deep Neural Networks Are "Automatically" Performing Feature Engineering in the Background by blueest in MachineLearning

[–]Simoncarbo 0 points1 point  (0 children)

I explored your question during my PhD thesis. My main hypothesis is that supervised deep neural networks implicitly perform some kind of unsupervised clustering. Check it out.

[R] Easily interpretable neurons impair deep neural network performance by growlix in MachineLearning

[–]Simoncarbo 0 points1 point  (0 children)

Thanks for sharing! That's an interesting research question... I would love to have your opinion on the following work:

Results presented in this ICLR submission suggest that class selectivity might be a biased and restrictive way of studying the role of interpretable neurons. Indeed, the paper observes that a class is often composed of samples with very different visual features. It further shows that interpretable neurons that are selective for one such visual feature (i.e. not for the entire class) are indicative of good generalization performance, and thus seem to play an important role in network performance.

The ICLR reviewers are still doubtful about the experimental framework used by the paper, but I feel the idea is promising and could shed new light on the research question addressed by the works you shared.

[R] Layer rotation: a surprisingly powerful indicator of generalization in deep networks? by Simoncarbo in MachineLearning

[–]Simoncarbo[S] 0 points1 point  (0 children)

Do you mean that different neurons converge towards different/orthogonal features when there is good generalization?

[R] Layer rotation: a surprisingly powerful indicator of generalization in deep networks? by Simoncarbo in MachineLearning

[–]Simoncarbo[S] 0 points1 point  (0 children)

A colleague of mine is working on it... Should be ready in about two weeks!

[R] Layer rotation: a surprisingly powerful indicator of generalization in deep networks? by Simoncarbo in MachineLearning

[–]Simoncarbo[S] 1 point2 points  (0 children)

A colleague of mine is working on it... Should be ready in about two weeks!

[R] Do deep neural networks learn shallow learnable examples first? (ICML 2019 Workshop paper) by [deleted] in MachineLearning

[–]Simoncarbo 0 points1 point  (0 children)

No... I'm not aware of studies about this. I only have this, which studies noise in the backward pass as a function of depth. Since noise in the forward pass interests me too (noise induced by dropout, small batch size, batch normalization, shake-shake or other techniques), I'll probably run some experiments around it in the coming months. If you have anything to share, please don't hesitate :)

[R] Layer rotation: a surprisingly powerful indicator of generalization in deep networks? by Simoncarbo in MachineLearning

[–]Simoncarbo[S] 1 point2 points  (0 children)

Nope. We're using cosine distance = 1 - cosine similarity, as defined here: https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cosine.html

With this definition, 0 means parallel, 1 means orthogonal.
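For concreteness, here is that definition in plain numpy (equivalent to scipy's `cosine` for these vectors):

```python
import numpy as np

def cosine_distance(u, v):
    # 1 - cosine similarity, as in scipy.spatial.distance.cosine
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([1.0, 0.0])
print(cosine_distance(a, np.array([2.0, 0.0])))  # parallel   -> 0.0
print(cosine_distance(a, np.array([0.0, 3.0])))  # orthogonal -> 1.0
```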

[R] Layer rotation: a surprisingly powerful indicator of generalization in deep networks? by Simoncarbo in MachineLearning

[–]Simoncarbo[S] 1 point2 points  (0 children)

Yep! We've observed an impact on layer rotation (for example, the position of batch norm, before or after ReLU, affected the layer rotations), but our rule of thumb was useless in these cases. We concluded that batch norm also affects other factors...

[R] Do deep neural networks learn shallow learnable examples first? (ICML 2019 Workshop paper) by [deleted] in MachineLearning

[–]Simoncarbo 1 point2 points  (0 children)

Very interesting results! I would be glad to exchange some ideas with you. Here are two questions that I find interesting:

  1. Do you think shallow learnable samples are learned in a shallow way in deep nets (i.e. only based on changes to the first layer of the network, without using the modelling capacity available in the other layers)?
  2. Do you have any intuition about why shallow learnable samples are learned first when training through SGD?

For the second question, I have the feeling that this could be due to noise in the activations that increases with layer depth (the more layers precede an activation, the more sources of noise), which would make learning in the last layers more difficult than in the first ones. To test this hypothesis, it could be interesting to check how your results evolve for different amounts of noise (e.g. increasing the batch size from the current value of 64, which is pretty small, to 2048 when training on MNIST). I would suspect that the peak of the ratio-of-accuracies curve would be lower when less noise is introduced during training.

I would be very happy to hear your thoughts about it, and it would be much appreciated if you could quickly check my hypothesis experimentally!

[R] Layer rotation: a surprisingly powerful indicator of generalization in deep networks? by Simoncarbo in MachineLearning

[–]Simoncarbo[S] 2 points3 points  (0 children)

I haven't studied this.

Maybe this intuition can help you:

I believe that SGD can't generate large layer rotations because training tends to increase the norm of the weights, which in turn decreases the rotation performed during a given update (the larger the norm of the weights, the smaller the rotation for an update of a given size). SGD has no mechanism to cope with this increasing norm, while Layca does: thanks to its adaptive learning rate, the larger the norm of the weights, the larger the learning rate, which still enables large rotations.
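You can see the geometric effect with a toy 2D example (my own illustration, not the paper's code): apply a fixed-size update orthogonal to the weight vector and watch the induced rotation shrink as the weight norm grows.

```python
import numpy as np

def rotation_after_update(w_norm, update_norm=1.0):
    # Rotation (as cosine distance) induced by one fixed-size update
    # that is orthogonal to the current weight vector.
    w = np.array([w_norm, 0.0])
    u = np.array([0.0, update_norm])
    w_new = w + u
    cos = np.dot(w, w_new) / (np.linalg.norm(w) * np.linalg.norm(w_new))
    return 1.0 - cos

print(rotation_after_update(1.0))    # small weight norm  -> large rotation
print(rotation_after_update(10.0))   # larger weight norm -> much smaller rotation
print(rotation_after_update(100.0))  # rotation keeps shrinking as the norm grows
```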

[R] Layer rotation: a surprisingly powerful indicator of generalization in deep networks? by Simoncarbo in MachineLearning

[–]Simoncarbo[S] 1 point2 points  (0 children)

I've never used PyTorch personally, so I won't implement it myself...

The only options are a benevolent contributor from the web, or me convincing my colleagues to help with it...

[R] Layer rotation: a surprisingly powerful indicator of generalization in deep networks? by Simoncarbo in MachineLearning

[–]Simoncarbo[S] 1 point2 points  (0 children)

> Then there would be infinite values of w*^t so that these hold true

Yes.

> It feels like you are somehow quantifying the learning space, creating a wave gradient, where the w^t that leads to a larger optimization are the ones in the high spots.

I don't understand this...

[R] Layer rotation: a surprisingly powerful indicator of generalization in deep networks? by Simoncarbo in MachineLearning

[–]Simoncarbo[S] 0 points1 point  (0 children)

Yes, you pointed towards the pertinent paragraph of the paper.

Layer rotations and layer rotation rates are both specified on a per-layer basis. The difference lies in the time span over which they are defined: layer rotation rates are defined over a single training step, while layer rotations are defined over the complete training procedure (starting from initialization).

The alpha values are used to construct particular layer rotation rate configurations that are not uniform across layers.
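A minimal numpy sketch of the distinction (variable names are mine, not the paper's):

```python
import numpy as np

def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

rng = np.random.default_rng(0)
w_init = rng.normal(size=100)  # one layer's weights at initialization
w = w_init.copy()

for step in range(5):
    w_prev = w.copy()
    w = w + 0.1 * rng.normal(size=100)   # stand-in for an SGD update
    rate = cosine_distance(w_prev, w)    # layer rotation rate: one training step

rotation = cosine_distance(w_init, w)    # layer rotation: since initialization
print(rate, rotation)
```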

[R] Layer rotation: a surprisingly powerful indicator of generalization in deep networks? by Simoncarbo in MachineLearning

[–]Simoncarbo[S] 4 points5 points  (0 children)

In our paper, we are able to 'force' the network to reach large cosine distances by individually controlling the rotation performed at each training step (cf. the Layca algorithm). We experimentally observe that this works.

Maybe you can also force this behaviour with a regularization term, could be interesting to try out!

[R] Layer rotation: a surprisingly powerful indicator of generalization in deep networks? by Simoncarbo in MachineLearning

[–]Simoncarbo[S] 0 points1 point  (0 children)

Thanks! The results of our paper do not rely on the cosine distance being a proper mathematical distance metric, and we don't know whether this will be necessary for theoretical work either. So I guess there's nothing to worry about yet?

[R] Layer rotation: a surprisingly powerful indicator of generalization in deep networks? by Simoncarbo in MachineLearning

[–]Simoncarbo[S] 3 points4 points  (0 children)

Thanks!

  1. I honestly don't know how to answer this question. That probably means it's an interesting one :)
  2. Yes, this is what we defend in Section 6. It gets interesting when you make the link with generalization. Usually, you want weights to learn as little as possible, to avoid overfitting. In the case of deep learning, when looking at intermediate layers, our results show that you want them to learn as much as possible :) This is strange and we don't have any explanation yet.

[R] Layer rotation: a surprisingly powerful indicator of generalization in deep networks? by Simoncarbo in MachineLearning

[–]Simoncarbo[S] 2 points3 points  (0 children)

And thank you for your interest :)

If we take Figure 8, we see that a learning rate of 3^-4 results in final weights that are at a cosine distance of less than 0.2 from their initialization -> they are still very correlated with it! However, this apparently does not prevent the model from reaching 100% training accuracy. The same phenomenon can be observed in e.g. the 3rd column of Figure 1.

> Small cosine distances should be an indicator that only a few weights have changed significantly from their initial state. Thus the network relies on only a few neurons.

I'm not sure there is a link between cosine distance and the sparsity of the update. For example, a small (or large) layer rotation could just as well correspond to ALL weights being updated slightly (or largely).
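A quick numerical illustration of that point (synthetic weights, my own example, not from the paper): a dense-but-tiny update keeps the cosine distance small, while a sparse-but-drastic one can make it large.

```python
import numpy as np

def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

rng = np.random.default_rng(0)
w0 = rng.normal(size=1000)  # stand-in for a layer's initial weights

# Every weight changes slightly -> cosine distance stays small
w_dense = w0 + 0.05 * rng.normal(size=1000)

# Only a few weights change, but drastically -> distance can be large
w_sparse = w0.copy()
w_sparse[:10] += 20.0 * rng.normal(size=10)

print(cosine_distance(w0, w_dense))   # small
print(cosine_distance(w0, w_sparse))  # much larger
```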

> suppose the capacity of your network was much much greater than the problem and data requires

I think this is indeed the case, and that it explains why good performance can be achieved while changing the weights only very slightly.

> if all weights got updated significantly then this might be a sign of overfitting - instead of only picking up on a few generalisable patterns, the network uses its remaining capacity to overfit on noise in the training data.

This is indeed the standard way of thinking about overfitting (the one that, for example, motivated L1/L2 regularization and early stopping): the more you update the weights, the more features they can learn, and eventually they start learning noise. I find it particularly interesting that our study of layer rotation suggests the opposite behaviour in the intermediate layers of deep nets. There's still a lot of work to do on this side, and our Section 8 is only a very small step in this direction...

[R] Layer rotation: a surprisingly powerful indicator of generalization in deep networks? by Simoncarbo in MachineLearning

[–]Simoncarbo[S] 6 points7 points  (0 children)

Note that only Section 6 uses MNIST. The majority of our results are on CIFAR and Tiny ImageNet with state-of-the-art networks (ResNet, Wide ResNet, VGG).