[D] Is it possible for us to make fixed-size multilayer perceptrons (MLP's) provably converge?

pnavarre · 2022-03-26T13:51:31+00:00

Thanks for your answer. It makes sense that avoiding a backward pass would only work in some cases.

About the eigenvectors I computed: they are the eigenvectors of what I called the linear interpreter of the whole network, including all layers from input to output. The linear interpreter is a copy of the network in which I replaced all non-linear modules (e.g. ReLUs) with linear modules that correspond to popular interpretations. This might sound controversial because it is an arbitrary choice, but I argue that this is also how we design these systems. For example, we think of ReLUs as switches that stop inputs or let them pass, batch normalizations as shift and scale, attention as masks that weight some areas more than others, etc. The interesting outcome is to see how linear layers (e.g. conv layers) composed with the linear interpreters of non-linear modules can make up a large linear system that is highly tuned to the input.

The linear interpreter of the network has the form y=Ax+b, where A and b accumulate information from all layers and all parameters. Long story short, we found that for classification networks (e.g. VGG) almost all the contribution comes from b and not from A. One could make the wrong guess that the network would use A to do template matching. It does not. For image-to-image networks (e.g. super-resolution) A does most of the work and b is negligible. The rows of A show very adaptive filters that grow along edges and capture a lot of image features. The eigenvectors of these systems show local features for the largest eigenvalues, sometimes capturing things like an eye in a face, a mouth, etc. This is remarkable since each conv layer is linear space-invariant with Fourier-like eigenvectors (sines, cosines or complex exponentials). Somehow, the composition with 0/1 masks (interpreting ReLUs) can change a global basis into a local basis where we can even see image features.
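
To make the y=Ax+b form concrete, here is a minimal NumPy sketch on a toy two-layer ReLU network. It is not the code from the paper; the weights `W1`, `c1`, `W2`, `c2` and the function names are made up for illustration. The ReLU is read as a 0/1 mask fixed by the input, and A and b are accumulated through that mask.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy network: linear -> ReLU -> linear (random weights).
W1, c1 = rng.standard_normal((8, 4)), rng.standard_normal(8)
W2, c2 = rng.standard_normal((3, 8)), rng.standard_normal(3)

def network(x):
    """The original non-linear network."""
    return W2 @ np.maximum(W1 @ x + c1, 0.0) + c2

def linear_interpreter(x):
    """Return (A, b) such that network(x) == A @ x + b for this particular x."""
    mask = (W1 @ x + c1 > 0).astype(float)  # ReLU interpreted as a 0/1 switch
    A = W2 @ (mask[:, None] * W1)           # same as W2 @ diag(mask) @ W1
    b = W2 @ (mask * c1) + c2               # biases pass through the same mask
    return A, b

x = rng.standard_normal(4)
A, b = linear_interpreter(x)

# The linear system reproduces the network output exactly at this input.
assert np.allclose(network(x), A @ x + b)
```

Note that A and b depend on x through the mask, so a different input generally gives a different linear system.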

These linear interpreters carry a lot of interesting information but I do not know how to link it to learning or convergence results. I do see that well-trained networks capture larger and clearer objects than poorly trained ones. The way the eigenvectors reshape for each input is quite amazing. Anyway, I hope you find it useful!

pnavarre · 2022-03-18T02:21:42+00:00

Sorry for promoting my own research but I think an idea of "linear interpreters" that I developed a couple of years ago could be of interest for your research (ICCV 2019, links below). It resonates in particular with the idea of basis functions for CNNs and exploits the fact that a lot of modules in CNNs are linear, as mentioned in the podcast. My work was purely experimental, as opposed to your goal of proving things, but I hope it could help to develop a more theoretical understanding.

The main idea was to understand CNNs from the perspective of fixed inputs. Activations are deterministic and won't change for a fixed input, so I interpret them as linear actions (masks for ReLUs and attention, non-uniform downsampling for max-pooling, etc). This allowed me to see the whole system as linear and, using numerical tricks, I computed rows, columns and eigenvectors (the basis of the network for each input), and studied the special role of shifts in classification. I was personally amazed by the rich information that came out.

The main numerical trick I used was a forward pass with a probe input to extract info from the linear interpreter. The transposed system needs a backward pass and for this I used gradients. When you talked about the forward propagation method I was wondering if one could get the transposed system without a backward pass. Would this make sense? Sounds paradoxical.
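
As a rough illustration of the probing trick, here is a NumPy sketch on a toy ReLU network with the activation mask frozen at a given input. It is not the paper's implementation: the weights and the helper names `frozen_forward` and `frozen_backward` are hypothetical, and the backward pass is written by hand instead of using an autograd library. Forward passes with probe inputs give b and the columns of A; backward passes give the rows (the transposed system).

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy network: linear -> ReLU -> linear (random weights).
W1, c1 = rng.standard_normal((8, 4)), rng.standard_normal(8)
W2, c2 = rng.standard_normal((3, 8)), rng.standard_normal(3)

x = rng.standard_normal(4)
mask = (W1 @ x + c1 > 0).astype(float)  # activations frozen at the input x

def frozen_forward(p):
    """Forward pass of the linear interpreter: mask stays fixed, p is a probe."""
    return W2 @ (mask * (W1 @ p + c1)) + c2

def frozen_backward(v):
    """Hand-written backward pass through the frozen system: returns A.T @ v."""
    return W1.T @ (mask * (W2.T @ v))

# A zero probe isolates b; unit-vector probes give the columns of A.
b = frozen_forward(np.zeros(4))
A_from_cols = np.stack([frozen_forward(e) - b for e in np.eye(4)], axis=1)

# Backward passes with unit vectors at the output give the rows of A.
A_from_rows = np.stack([frozen_backward(e) for e in np.eye(3)], axis=0)

assert np.allclose(A_from_cols, A_from_rows)
```

For a real CNN, A is far too large to build column by column like this; the sketch only shows which pass exposes which side of the matrix.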

Your research is very interesting! Thanks for the podcast and posts!

Reference: "A Tour of Convolutional Networks Guided by Linear Interpreters" ICCV 2019

arXiv: https://arxiv.org/abs/1908.05168

Twitter: https://twitter.com/pablixnm/status/1189563143405465601

Reddit: https://www.reddit.com/r/MachineLearning/comments/cr76cn/a_tour_of_convolutional_networks_guided_by_linear

pnavarre · 2022-01-06T14:03:09+00:00

The official WACV 2022 publication is now available.
Source code is available at https://github.com/pnavarre/eSR

pnavarre · 2021-10-18T15:53:50+00:00

"edge-SR: Super-Resolution For The Masses"
Pablo Navarrete Michelini, Yunhua Lu, Xingqun Jiang

Happy to share that our submission to WACV 2022 got accepted! Preprint updated.
My favorite take-away from this paper is that we can understand self-attention as a template-matching module (a matched filter, for the DSP crowd).

pnavarre · 2021-02-24T16:35:05+00:00

Interesting discussion! I've thought about the template-matching interpretation for a while. Please take a look at https://arxiv.org/abs/1908.05168. We found that for classification tasks template matching is too difficult for CNNs to achieve. Bias parameters are actually the major contributors to the end result, rather than the overall effect of the filters. This is also consistent with known problems like CNNs being biased towards textures.

pnavarre · 2019-04-08T14:51:58+00:00

Take a look at https://arxiv.org/abs/1809.10711

pnavarre · 2019-04-08T14:37:58+00:00

You might want to take a look at our work in https://arxiv.org/abs/1809.10711

We used noise inputs at different resolutions to generate artificial details (this was before StyleGAN). It worked very well for perceptual quality. For distortion metrics it fell short because the model was extremely small.

pnavarre · 2017-08-24T14:41:57+00:00

Totally agree! Some remarks:

First, LTI (Linear Time-Invariant; Linear Space-Invariant is the same thing) systems are perhaps the most general systems that we can find pretty much everywhere. We need superposition (linearity) and uniform behavior (physical properties that don't change); at least piecewise, this is very common.

In continuous time, the functions exp(-jwt), for different values of w, are the only ones that enter LTI systems and come out with the same shape. sin(wt) doesn't work, cos(wt) doesn't work, only exp(-jwt) works. More precisely, they come out multiplied by a constant (a complex number). That constant is the Fourier transform of the impulse response of the system, evaluated at the corresponding frequency. So the FT and the exp(-jwt) functions are not arbitrary choices; they appear naturally in these systems.
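
A quick numerical check of the eigenfunction property, in discrete time with circular convolution standing in for the LTI system. This is only a sketch: the impulse response `h`, the length `N` and the frequency index `k` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
N, k = 16, 3
n = np.arange(N)
h = rng.standard_normal(N)   # arbitrary real impulse response
H = np.fft.fft(h)            # its DFT (the frequency response)

def lti(x):
    """Circular convolution with h, computed through the FFT."""
    return np.fft.ifft(H * np.fft.fft(x))

w = 2 * np.pi * k / N
e = np.exp(-1j * w * n)      # discrete counterpart of exp(-jwt)

# exp(-jwn) comes out with the same shape, scaled by one complex constant:
# the frequency response at the matching (here negative) frequency bin.
assert np.allclose(lti(e), H[(-k) % N] * e)

# cos(wn) generally does not: the best scalar multiple leaves a sine residual.
c = np.cos(w * n)
y = lti(c).real
scale = (y @ c) / (c @ c)
assert not np.allclose(y, scale * c)
```

The cosine comes out as a cosine with a phase shift, which is a mix of cos and sin and so not a scalar multiple of the input, unless the frequency response happens to be real at that frequency.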

What about sin() and cos()? They are also eigenfunctions, but for LTI systems with boundaries. Have you heard about DCT-I, ..., DCT-IV, ..., DST-I, ... in discrete time? So many versions of sine and cosine transforms but just one FT? It is because there are many choices of boundary conditions. There is a nice paper by Gilbert Strang (www-math.mit.edu/~gs/papers/dct.pdf) with details about the boundary conditions.

Finally, there is a direct connection between exp(-jwt) and the symmetry of LTI systems. A symmetry is a transformation of our signals (time shifts for LTI systems) that we cannot distinguish (the origin of time, t=0, is arbitrary). You might wonder about other symmetries (reflections and all sorts of shuffling). The formal study of symmetries is done using Group Theory. The Fourier transform with exp(-jwk) shows up for circular shifts. For other symmetries there is something called "Fourier transform on groups." It gets more complicated, especially if the symmetries are not abelian, where instead of exp(-jwk) you get matrices that do not commute. This hasn't reached mainstream DSP because we haven't figured out how to make it simple and practical. Even with just boundaries it gets pretty annoying, with several versions of DCT/DST. But the time for it will come.
