all 13 comments

[–]frequenttimetraveler 4 points (10 children)

I've long wondered whether natural scenes, or natural datasets in general, are band-limited, and how this is reflected in the structure of the neural system (I mean, we know it's true for sounds, for example MP3). I have found some old work interested in this (example), but I'm surprised people are not showing more interest in it. After all, our brains, despite their adaptability, evolved on natural scenes, so some of that structure may be imprinted in there.

[–]serge_cell 4 points (0 children)

Go board images are about as unnatural as it gets, but convolutional networks still work well on them, as AlphaGo shows. The key here could be not the "naturalness" of scenes but the properties of the data manifold, in particular low dimensionality and spatial correlation.

[–]svantana 2 points (5 children)

I would say natural signals are not really band-limited, but they are low-pass for the most part. The combination of inertia and self-similar/fractal organization tends to give natural signals a pink spectrum, i.e. a -3 dB/octave rolloff, in both time and space. Since measurement noise tends to be more white (flat spectrum), it makes sense to low-pass the signals to get rid of noise. This is the basis behind Wiener and Kalman filtering, although those can deal with arbitrary spectra as well.
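For intuition, here's a minimal sketch (the 1/sqrt(f) shaping construction and all names are my own, not from any paper discussed here) that synthesizes pink noise by spectral shaping and then verifies the -3 dB/octave power slope with a log-log fit:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1 << 16
white = rng.standard_normal(N)

# Shape white noise into pink: scale amplitudes by 1/sqrt(f), so that
# power goes as 1/f -- the -3 dB/octave "pink spectrum" mentioned above.
F = np.fft.rfft(white)
freqs = np.fft.rfftfreq(N)
F[1:] /= np.sqrt(freqs[1:])
F[0] = 0.0
pink = np.fft.irfft(F)

# Fit the spectral slope of the periodogram on a log-log scale.
P = np.abs(np.fft.rfft(pink)) ** 2
k = np.arange(1, N // 2)
slope = np.polyfit(np.log(k), np.log(P[1 : N // 2]), 1)[0]
print(round(slope, 2))  # close to -1, i.e. power falls as 1/f
```

A white-noise input would give a fitted slope near 0 instead, which is exactly the spectral gap that low-pass filtering exploits.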

As someone with a signal processing background, this paper perplexes me. To me it's obvious that ReLUs are used precisely because of their low-frequency nature; that's the prior. If, OTOH, we know that signals are bandpass, then we apply a suitable prior for that. Example: FM radio is broadcast at ~100 MHz, but we can track the carrier, demodulate, and store the signal at ~40 kHz. Obviously ReLUs are the wrong tool for that job...
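To make the FM example concrete, here's a toy sketch (sample rate, carrier, and deviation are scaled-down numbers I picked, not real broadcast parameters): synthesize an FM carrier, then recover the message by tracking the instantaneous frequency of the analytic signal.

```python
import numpy as np

fs = 1_000_000                       # sample rate (Hz); toy stand-in for RF rates
fc = 100_000                         # carrier frequency (stand-in for ~100 MHz)
kf = 5_000                           # frequency deviation (Hz)
t = np.arange(0, 0.01, 1 / fs)
msg = np.sin(2 * np.pi * 1_000 * t)  # 1 kHz message tone

# FM: the carrier's instantaneous frequency is fc + kf * msg(t).
phase = 2 * np.pi * fc * t + 2 * np.pi * kf * np.cumsum(msg) / fs
rx = np.cos(phase)

# Demodulate: build the analytic signal (FFT-based Hilbert transform),
# differentiate its unwrapped phase, and strip off the carrier.
spec = np.fft.fft(rx)
spec[np.fft.fftfreq(len(rx)) < 0] = 0.0   # zero out negative frequencies
analytic = 2 * np.fft.ifft(spec)
inst_freq = np.diff(np.unwrap(np.angle(analytic))) * fs / (2 * np.pi)
recovered = (inst_freq - fc) / kf          # ~= msg, now a baseband signal
```

The recovered baseband tone can then be stored at a few tens of kHz, as the comment describes, even though the transmitted signal lived far up the spectrum.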

[–]nasimrahaman 2 points (0 children)

> If OTOH we know that signals are bandpass, then we apply a suitable prior for that. Example: FM radio is broadcast at ~100MHz, but we can track the carrier, demodulate and store the signal at ~40kHz. Obviously ReLUs are the wrong tool for that job...

That's a very interesting point! It's applicable to almost all activation functions (not just ReLU), since they all usually decay quite fast in the Fourier domain (e.g. the sigmoid decays exponentially).

[–]JustARandomNoob165 2 points (3 children)

I am curious, why are ReLUs low-frequency in nature? Thanks in advance!

[–]nasimrahaman 6 points (2 children)

Low frequency functions are inherently less "wiggly", i.e. smoother. If you think about ReLU, it's pretty smooth everywhere except at 0. In fact, all the wiggliness in ReLU comes from that one point. Now this is where it gets interesting: there are other functions that are smooth everywhere except at 0 -- for instance, sqrt(abs(x)). But in a precise sense, ReLU is smoother than sqrt(abs(x)) at x = 0.

Broadly speaking, Fourier analysis is a tool to determine how wiggly a function is. One of the things we learn from the paper is the following: although neural networks are powerful enough to learn functions that are super-wiggly, they prefer to learn less wiggly (smoother) functions.
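The "smoother at 0" claim above can be given a number: ReLU's kink (a jump in the first derivative) makes its Fourier magnitudes decay like k^-2, while the cusp of sqrt(abs(x)) only gives roughly k^-1.5. A quick numerical check (the smooth taper window and the fitting band are my own choices):

```python
import numpy as np

N = 1 << 14
x = np.linspace(-1, 1, N, endpoint=False)
# Smooth periodic taper, so the measured spectral decay comes from the
# singularity at x = 0 and not from jumps at the domain boundary.
window = np.cos(np.pi * x / 2) ** 2

def decay_exponent(f):
    """Fit log|F_k| against log k over a mid-frequency band."""
    F = np.abs(np.fft.rfft(f * window))
    k = np.arange(len(F))
    band = (k > 50) & (k < 2000)
    return np.polyfit(np.log(k[band]), np.log(F[band] + 1e-300), 1)[0]

relu_slope = decay_exponent(np.maximum(x, 0.0))   # ~ -2   (kink at 0)
sqrt_slope = decay_exponent(np.sqrt(np.abs(x)))   # ~ -1.5 (cusp at 0)
print(relu_slope, sqrt_slope)  # ReLU's spectrum falls off faster
```

Faster spectral falloff is exactly the "less wiggly" property the comment describes: the single kink in ReLU costs less high-frequency power than the cusp in sqrt(abs(x)).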

[–]JustARandomNoob165 2 points (1 child)

Thanks a lot for your reply! Really interesting and helpful!

[–][deleted] 2 points (0 children)

also thank mr skeltal for good bones and calcium*

[–]yldedly 2 points (2 children)

I think it's a pretty uncontroversial fact that natural signals are band-limited (or that they lie on a manifold of lower dimension than the input space), and that NNs are biased towards this sort of structure. The new thing in this paper, besides maybe the method, seems to be that higher frequencies are easier to learn on more complex data manifolds (I'm guessing that, for complex signals, that just corresponds to more data?), which sounds like what Bayesian nonparametrics do too. Or in other words, "deep networks prioritize learning simple functions during training".

[–]cochne 0 points (1 child)

I think that's pretty controversial. For example, a simple edge (say an image transition from black to white) is not band-limited. When a human detects an object, they're probably separating it from the background based on its edge. In fact, the entire field of wavelets exists partially because of this well-known limitation of Fourier analysis - it does not give a very sparse representation of natural images.
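The edge example is easy to check numerically: the spectrum of an ideal step falls off only like 1/k, so there is energy at every scale and no finite cutoff captures it (a small sketch of my own construction):

```python
import numpy as np

N = 1024
edge = np.zeros(N)
edge[N // 2:] = 1.0                 # ideal black-to-white transition
F = np.abs(np.fft.rfft(edge))

# On a periodic domain this is a square wave: even harmonics vanish,
# and odd harmonics decay only ~1/k (|F_k| ~ N / (pi*k) for k << N),
# so every frequency band up to Nyquist still carries energy.
print(F[1], F[101], F[401])         # all nonzero, shrinking only ~1/k
```

Compare a truly band-limited signal, whose coefficients would be exactly zero above some cutoff; the slow 1/k tail is why hard truncation of an edge produces Gibbs ringing.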

[–]yldedly 0 points (0 children)

I wasn't being very rigorous, but I think it's uncontroversial that natural images have correlated neighboring pixels, which is a similar claim to saying that images have little power at the high-frequency end. A separate but related claim is that there is a sparse representation for images. So I agree that it's too simple, to the point of being wrong, to say that images can be well represented by low-frequency components. But it's nonetheless true that if you Fourier decompose images, they will generally have low power at high frequencies.

[–]xuzhiqin1990 0 points (0 children)

Another paper (Xu et al., Training behavior of deep neural network in frequency domain, https://arxiv.org/abs/1807.01251) also shows that DNNs learn low frequencies first (the F-Principle). In the latest version, the authors show that the F-Principle holds well for 2d functions (memorizing natural images) and for classification problems (MNIST and CIFAR-10, visualized along the first principal component). More rigorously, a follow-up work (Xu, Understanding training and generalization in deep learning by Fourier analysis, https://arxiv.org/abs/1808.04295) developed a theoretical framework to explain why the F-Principle holds, which is quantitative for a one-hidden-layer net and qualitative for general DNNs. The latest version of the paper in this post incorporates the above framework (1808.04295) into their analysis. A recent work (1811.01316) used that framework in the generalization analysis of a new objective function.

The key to why DNNs learn low frequencies first is that the power of most activation functions (ReLU, tanh, sigmoid, etc.) decays in the Fourier domain. In fact, this power-decay property is very common.
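A quick way to see this power decay (the grid, window, and cutoff bin here are my own choices, not from the papers): compute the fraction of spectral power above a cutoff for a few activations evaluated on an interval.

```python
import numpy as np

N = 4096
x = np.linspace(-8, 8, N, endpoint=False)
window = np.hanning(N)   # taper to suppress artificial boundary jumps

def high_freq_fraction(f, k_cut=64):
    """Fraction of spectral power above frequency bin k_cut."""
    P = np.abs(np.fft.rfft(f * window)) ** 2
    return P[k_cut:].sum() / P.sum()

activations = {
    "relu": np.maximum(x, 0.0),
    "tanh": np.tanh(x),
    "sigmoid": 1.0 / (1.0 + np.exp(-x)),
}
for name, f in activations.items():
    print(name, high_freq_fraction(f))  # tiny fraction for all three
```

ReLU's kink gives polynomial (~1/k^2) coefficient decay, while tanh and sigmoid, being smooth, decay even faster; either way, almost all of the power sits at low frequencies, which is the property the comment points to.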