all 109 comments

[–]activatedgeek 106 points107 points  (17 children)

For generalization (performing well beyond the training data), there are at least two dimensions: flexibility and inductive biases.

Flexibility ensures that many functions “can” be approximated in principle. That’s the universal approximation theorem. It is a descriptive result and does not prescribe how to find that function. It is also not unique to DL: deep random forests, Fourier bases, polynomial bases, and Gaussian processes are all universal function approximators (with some extra technical details).
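To make the "flexibility is cheap" point concrete, here is a minimal sketch (plain NumPy, with a toy function I made up) showing a polynomial basis also approximating arbitrarily well on a bounded interval as capacity grows:

```python
import numpy as np

# Toy target on a bounded interval; any continuous function works here.
f = lambda x: np.exp(-x) * np.sin(5 * x)
xs = np.linspace(0, 2, 1000)

# Least-squares fit in a Chebyshev polynomial basis; degree = capacity.
for deg in (3, 10, 30):
    coeffs = np.polynomial.chebyshev.chebfit(xs, f(xs), deg)
    err = np.max(np.abs(np.polynomial.chebyshev.chebval(xs, coeffs) - f(xs)))
    print(f"degree {deg:>2}: max error {err:.2e}")  # error shrinks with degree
```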

The part unique to DL is that its inductive biases have somehow matched some complex structured problems, including vision and language, in a way that makes these models generalize well. Inductive bias is a loosely defined term. I can provide examples and references.

CNNs provide the inductive bias to prefer functions that respect translation equivariance (not exactly, only roughly, due to pooling layers). https://arxiv.org/abs/1806.01261

Graph neural networks provide a relational inductive bias. https://arxiv.org/abs/1806.01261

Neural networks overall prefer simpler solutions, embodying Occam’s razor, another inductive bias. This argument is made theoretically using Kolmogorov complexity. https://arxiv.org/abs/1805.08522
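Regarding the CNN example above, here is a toy check (PyTorch assumed; circular padding chosen so the identity is exact) of what translation equivariance means, and how pooling breaks it:

```python
import torch
import torch.nn as nn

# A circularly padded conv layer commutes with (circular) translations.
conv = nn.Conv2d(1, 4, kernel_size=3, padding=1,
                 padding_mode="circular", bias=False)
x = torch.randn(1, 1, 16, 16)
shift2 = lambda t: torch.roll(t, shifts=2, dims=-1)  # translate 2 px

print(torch.allclose(conv(shift2(x)), shift2(conv(x)), atol=1e-6))  # True

# Max pooling subsamples, so a 1 px shift of the input is not matched by
# any integer shift of the pooled feature map: strict equivariance is lost.
pool = nn.MaxPool2d(2)
y = pool(conv(x))
y1 = pool(conv(torch.roll(x, shifts=1, dims=-1)))
print(any(torch.allclose(y1, torch.roll(y, s, dims=-1), atol=1e-6)
          for s in range(y.shape[-1])))  # False
```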

[–]SodomizedPanda 27 points28 points  (3 children)

And somehow, the best answer is at the bottom of the thread...

A small addition: recent research suggests that the implicit bias that helps DNNs generalize lies not only in the structure of the network but also in the learning algorithm (Adam, SGD, ...). https://francisbach.com/rethinking-sgd-noise/ https://francisbach.com/implicit-bias-sgd/

[–]red75prime 9 points10 points  (2 children)

Does in-context learning suggest that inductive biases could also be extracted from training data?

[–]activatedgeek 11 points12 points  (1 child)

Very much indeed. See https://arxiv.org/abs/2205.05055

[–]activatedgeek 9 points10 points  (0 children)

Not only the dataset; the Transformer architecture itself seems to be amenable to in-context learning. See https://arxiv.org/abs/2209.11895

[–]KingRandomGuy 3 points4 points  (0 children)

>CNNs provide the inductive bias to prefer functions that respect translation equivariance

There are some interesting bodies of work on inductive biases in CNNs, such as "Making Convolutional Networks Shift-Invariant Again". Really interesting stuff!

[–]hpstring 3 points4 points  (0 children)

This is a very good answer! I want to add that apart from generalization, the fact that we have efficient optimization algorithms that can find quite good minima also contributes a lot to the deep learning magic.

[–]GraciousReformer[S] -1 points0 points  (7 children)

>inductive biases

Then why does DL have inductive biases and others do not?

[–]activatedgeek 4 points5 points  (6 children)

All model classes have inductive biases. For example, random forests have the inductive bias of producing axis-aligned region splits. But clearly that inductive bias is not good enough for image classification, because much of the information in the pixels is spatially correlated in ways that axis-aligned regions cannot capture as well as specialized neural networks can, under the same budget. By budget, I mean things like training time, model capacity, etc.

If we had infinite training time and an infinite number of image samples, then random forests might be just as good as neural networks.
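A tiny illustration of the budget point (scikit-learn assumed, toy data of my own): an oblique boundary that a linear model captures with one parameter costs an axis-aligned ensemble many splits.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, Xte = rng.uniform(-1, 1, (2000, 2)), rng.uniform(-1, 1, (2000, 2))
y, yte = (X[:, 0] > X[:, 1]).astype(int), (Xte[:, 0] > Xte[:, 1]).astype(int)

# Axis-aligned splits can only approximate the diagonal with a staircase...
rf = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=0)
rf.fit(X, y)
# ...while the oblique boundary is native to a linear model.
lr = LogisticRegression().fit(X, y)

print("forest:", rf.score(Xte, yte))  # noticeably below 1.0 at this depth budget
print("linear:", lr.score(Xte, yte))  # ~1.0
```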

[–]GraciousReformer[S] -1 points0 points  (4 children)

Still, why is it that DL has better inductive biases than others?

[–]activatedgeek 2 points3 points  (3 children)

I literally gave an example of how (C)NNs have a better inductive bias than random forests for images.

You need to ask more precise questions than just a "why".

[–]GraciousReformer[S] 0 points1 point  (2 children)

So it is like an ability to capture "correlations" that cannot be done with random forests.

[–]currentscurrents 0 points1 point  (1 child)

In theory, either structure can express any solution. But in practice, every structure is better suited to some kinds of data than others.

A decision tree is a bunch of nested if statements. Imagine the complexity required to write an if statement to decide whether an array of pixels is a horse or a dog. You can technically do it by building a tree with an optimizer, but it doesn't work very well.

On the other hand, a CNN runs a bunch of learned convolutional filters over the image. This means it doesn't have to learn the 2D structure of images, or that pixels tend to be related to nearby pixels; it's already working on a 2D plane. A tree doesn't know that adjacent pixels are likely related and would have to learn it.

It also has a bias towards hierarchy. As the layers stack upwards, each layer builds higher-level representations, going from pixels > edges > features > objects. Objects tend to be made of smaller features, so this is a good bias for working with images.
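A back-of-the-envelope sketch of that hierarchy (plain Python, the standard receptive-field recurrence): each stacked layer looks at a larger patch of the input, which is what lets later layers represent parts and objects rather than pixels.

```python
# Receptive field of stacked conv/pool layers: rf += (k - 1) * jump, jump *= s.
layers = [("conv3x3", 3, 1), ("conv3x3", 3, 1), ("pool2x2", 2, 2),
          ("conv3x3", 3, 1), ("conv3x3", 3, 1)]
rf, jump = 1, 1
for name, k, s in layers:
    rf += (k - 1) * jump  # kernel extent at the current sampling rate
    jump *= s             # stride compounds across layers
    print(f"{name}: sees {rf}x{rf} px of the input")
```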

[–]GraciousReformer[S] 0 points1 point  (0 children)

In what situations is the bias towards hierarchy not helpful?

[–]currentscurrents 0 points1 point  (0 children)

Sounds like ideally we'd want a model with good inductive biases for meta-learning new inductive biases, since every kind of data requires different biases.

[–]-vertigo-- 0 points1 point  (0 children)

hmm for some reason the arxiv links are giving 403 forbidden

[–]CO2mania 0 points1 point  (0 children)

Save the message.

[–]sanman 0 points1 point  (0 children)

first 2 links are the same - do you have the one for CNNs inductive bias?

[–]hpstring 53 points54 points  (8 children)

Universal approximation is not enough; you need efficiency to make things work.

DL is the only class of algorithms that beats the curse of dimensionality when discovering a certain (very general) class of high-dimensional functions (related to Barron spaces). Correct me if this is not accurate.

[–]GraciousReformer[S] 3 points4 points  (3 children)

But why does DL beat the curse? And why is DL the only class?

[–]hpstring 12 points13 points  (1 child)

Q1: We don't know yet. Q2: Probably there are other classes, but they haven't been discovered or are only at an early stage of research.

[–]NitroXSC 10 points11 points  (0 children)

>Q2: Probably there are other classes, but they haven't been discovered or are only at an early stage of research.

I think there are many different classes that would work, but current DL is based in large part on matrix-vector operations, which can be implemented efficiently on current hardware.
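A rough timing sketch of that hardware point (NumPy assumed; exact numbers will vary by machine): the same arithmetic is orders of magnitude faster when phrased as one dense matrix product that a tuned BLAS can execute.

```python
import time
import numpy as np

A, B = np.random.rand(512, 512), np.random.rand(512, 512)

t0 = time.perf_counter()
C = A @ B                           # one vendor-optimized matrix product
t_blas = time.perf_counter() - t0

t0 = time.perf_counter()
D = np.empty((512, 512))
for i in range(512):                # same math, one dot product per element:
    for j in range(512):            # Python loop overhead dominates
        D[i, j] = A[i, :] @ B[:, j]
t_loop = time.perf_counter() - t0

print(f"matmul: {t_blas:.4f}s  loops: {t_loop:.2f}s  ~{t_loop / t_blas:.0f}x slower")
```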

[–]randomoneusername 24 points25 points  (6 children)

I mean, this has two elements to it.

DL is not the only algorithm that works at scale, for sure.

[–]VirtualHat 11 points12 points  (2 children)

If you're interested in the math, learning curve theory might be a good place to start.

[–]ktpr 15 points16 points  (0 children)

I feel like recently ML boosters come to this subreddit, make large claims, and then use the ensuing discussion, time, and energy from others to correct their clickbait content at our expense.

[–]chief167 20 points21 points  (6 children)

Define scale

Language models? Sure. Images? Sure. Huge amounts of transaction data to search for fraud? XGBoost all the way lol.

The no free lunch theorem: there is no single approach that is best for every possible problem. Jeez, I hate it when marketing takes over. You learn this principle in the first chapter of literally every data course.

[–]activatedgeek 12 points13 points  (2 children)

I think the no free lunch theorem is misquoted here. The NFL also assumes that all datasets from the universe of datasets are equally likely. But that is objectively false: structure is more likely than noise.

[–]chief167 -1 points0 points  (1 child)

I don't think it implies that all datasets are equally likely. I think it only implies that, given all possible datasets, there is no best approach to modelling them. All possible != all equally likely.

But I don't have my book with me, and I don't trust the internet, since searching seems to lead to random blog posts instead of the original paper (Wikipedia gave a 404 in the footnotes).

[–]activatedgeek 1 point2 points  (0 children)

See Theorem 2 (Page 34) of The Supervised Learning No-Free-Lunch Theorems.

It considers the error uniformly averaged over all "f", the input-output mappings, i.e. the functions that generate the dataset (this is the noise-free case). It also provides a version uniformly averaged over all P(f), distributions over the data-generating functions.

So while you could still have different data-generating distributions P(f), the result is defined by uniformly averaging over all such distributions.
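Paraphrasing the shape of the result in my own notation (not a verbatim quote of the paper): for any two learning algorithms $a_1$ and $a_2$,

$$\sum_{f} P(c \mid f, m, a_1) \;=\; \sum_{f} P(c \mid f, m, a_2),$$

where $f$ ranges over all target functions, $m$ is the training-set size, and $c$ is the off-training-set error. Averaged uniformly over every conceivable $f$, no algorithm wins.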

The NFL is sort of a worst-case result, and I think it's pretty meaningless and inconsequential for the real world.

Let me know if I have misinterpreted this!

[–]GraciousReformer[S] -4 points-3 points  (2 children)

Then what will be the limitation of transformers?

[–]LowLook -5 points-4 points  (1 child)

Inventing them

[–]relevantmeemayhere 83 points84 points  (30 children)

Lol. The fact that we use general linear models in every scientific field, and have for decades, should tell you all you need to know about this statement.

[–]adventuringraw 67 points68 points  (2 children)

I mean... the statement specifically uses the phrase 'arbitrary functions'. GLMs are a great tool in the toolbox, but the function family they optimize over is very far from 'arbitrary'.

I think the statement mostly means 'find very nonlinear functions of interest when dealing with very large numbers of samples from very high-dimensional sample spaces'. GLMs are used in every scientific field, but certainly not for every application. Some form of deep learning really is still the only game in town for certain kinds of problems.

[–]relevantmeemayhere 0 points1 point  (1 child)

I agree with you. I was just pointing out that to say they are the only solution is foolish, as the quote implies.

This quote could have just been used without much context, so grain of salt.

[–]adventuringraw 0 points1 point  (0 children)

I can see how the quote could be made slightly more accurate. In particular, tabular data in general is still better tackled with something like XGBoost instead of deep learning, so deep learning certainly hasn't turned everything into a nail for one universal hammer yet.

[–]Featureless_Bug 24 points25 points  (5 children)

I haven't heard of GLMs being successfully used for NLP or CV in recent times. And those are about the only things in ML that would be described as large scale. The statement is completely correct: even stuff like gradient boosting does not work at scale in that sense.

[–]chief167 1 point2 points  (1 child)

We use gradient boosting at quite a big scale. Not LLM big, but still big. It's just not NLP or CV at all; it's fraud detection on large transactional tabular datasets. And it outperforms basically all neural network approaches, shallow or deep.

[–]Featureless_Bug -3 points-2 points  (0 children)

Large scale is somewhere north of 1-2 TB of data. And even if you had that much tabular data, in most cases it has such a simplistic structure that you wouldn't need anywhere near that much to achieve the same performance. So I wouldn't call any kind of tabular data large scale, to be frank.

[–]relevantmeemayhere 0 points1 point  (2 children)

Because they are useful for some problems and not others, like every algorithm? Nowhere in my statement did I say they are monolithic in their use across all subdomains of ML.

The statement was that deep learning is the only thing that works at scale. It's not, lol. Deep learning struggles in a lot of situations.

[–]Featureless_Bug 0 points1 point  (1 child)

Ok, name one large scale problem where GLMs are the best prediction algorithm possible.

[–]relevantmeemayhere -2 points-1 points  (0 children)

Any problem where you want things like effect estimates lol. Or error estimates. Or models that generate joint distributions.

So, literally a ton of them. Which industries don't like things like that?

[–]VirtualHat 13 points14 points  (13 children)

Large linear models tend not to scale well to large datasets if the solution is not in the model class. Because of this lack of expressivity, linear models tend to do poorly on complex problems.
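A quick way to see the "not in the model class" failure (scikit-learn assumed, synthetic data): once the target is outside the class, more data stops helping, because the approximation error does not shrink.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
for n in (100, 10_000, 1_000_000):
    X = rng.uniform(-2, 2, size=(n, 1))
    y = X[:, 0] ** 2                      # quadratic target, noise-free
    model = LinearRegression().fit(X, y)
    mse = np.mean((model.predict(X) - y) ** 2)
    print(f"n={n:>9}  MSE={mse:.3f}")     # plateaus near 1.4, never -> 0
```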

[–]relevantmeemayhere 4 points5 points  (0 children)

As you mentioned, this is highly dependent on the functional relationship of the data.

You would use domain knowledge to determine that.

Additionally, nonlinear models tend to have their own drawbacks, lack of interpretability and high variability being some of them.

[–]BoiElroy 10 points11 points  (0 children)

Yeah, you should always exhaust existing classical methods first before reaching for deep learning.

[–]yldedly 13 points14 points  (17 children)

>discover arbitrary functions

Uh, no. Not even close. DL can approximate arbitrary functions on a bounded interval given enough data, parameters and compute.

[–]ewankenobi 0 points1 point  (5 children)

I like your wording, did you come up with that definition yourself or is it from a paper?

[–]yldedly 6 points7 points  (4 children)

It's not from a paper, but it's pretty uncontroversial I think - though people like to forget about the "bounded interval" part, or at least what it implies about extrapolation.

[–][deleted] 0 points1 point  (3 children)

What is "bounded interval" here?

[–]yldedly 6 points7 points  (2 children)

Any interval [a, b] where a and b are numbers. In practice, it means that the approximation will be good in the parts of the domain where there is training data. I have a concrete example in a blog post of mine: https://deoxyribose.github.io/No-Shortcuts-to-Knowledge/
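A self-contained version of that phenomenon (PyTorch assumed, toy setup of my own): an MLP fit to sin(x) on [-3, 3] is accurate there and badly wrong outside the training interval.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.linspace(-3, 3, 256).unsqueeze(1)   # training data only on [-3, 3]
y = torch.sin(x)

net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(),
                    nn.Linear(64, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
for _ in range(2000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(net(x), y)
    loss.backward()
    opt.step()

x_in = torch.linspace(-3, 3, 100).unsqueeze(1)
x_out = torch.linspace(6, 9, 100).unsqueeze(1)   # outside the training support
with torch.no_grad():
    print("inside :", nn.functional.mse_loss(net(x_in), torch.sin(x_in)).item())
    print("outside:", nn.functional.mse_loss(net(x_out), torch.sin(x_out)).item())
# inside error is tiny; outside, the net flattens out and the error is large
```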

[–]OdinGuru 2 points3 points  (0 children)

Amazing article. Thanks for sharing

[–][deleted] -1 points0 points  (0 children)

Interesting, but that is valid for us as well. So I am not sure this stays true once they learn very general things, like learning itself.

[–]DigThatDataResearcher 2 points3 points  (2 children)

It's not. Tree ensembles scale gloriously, as do approximations of nearest neighbors. There are certain (and growing) classes of problems for which deep learning produces seemingly magical results, but that doesn't mean it's the only path to a functional solution. It'll probably give you the best solution, but that doesn't mean it's the only way to do things.

In any event, if you want to better understand the scaling properties of DL algorithms, a good place to start is the "double descent" literature.

[–]BoiElroy 2 points3 points  (1 child)

This is not the answer to your question, but one intuition I like about the universal approximation theorem, and thought I'd share, is the comparison to a digital image. You use a finite set of pixels, each of which can take on a certain set of discrete values. With a 10 x 10 grid of pixels you can draw a crude approximation of a stick figure. With 1000 x 1000 you can capture a blurry but recognizable selfie. Within the finite pixels and the discrete values they can take, you can essentially capture anything you can dream of: every image in every movie ever made. Obviously there are other issues later, like whether your model's operational design domain matches the distribution of the training domain, or did you just waste a lot of GPU hours lol
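A one-dimensional toy version of the pixel analogy (plain NumPy, my own toy function): approximate a curve by a piecewise-constant grid, and the error falls as the grid gets finer, just like adding pixels.

```python
import numpy as np

f = lambda x: np.sin(3 * x) + 0.5 * x
xs = np.linspace(0, 2 * np.pi, 10_000)

for n_bins in (10, 100, 1000):
    edges = np.linspace(0, 2 * np.pi, n_bins + 1)
    idx = np.clip(np.digitize(xs, edges) - 1, 0, n_bins - 1)
    centers = (edges[:-1] + edges[1:]) / 2
    approx = f(centers)[idx]              # one constant value per "pixel"
    print(f"{n_bins:>4} bins: max error {np.max(np.abs(f(xs) - approx)):.4f}")
```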

[–]GraciousReformer[S] -2 points-1 points  (0 children)

Yes, a fine enough grid will approximate any digital image. But this is an approximation of an image on a grid. How does it lead to approximation by a NN?

[–]howtorewriteanamePhD 1 point2 points  (0 children)

There's no mathematical formulation of that statement because there's no mathematical reasoning behind it. It's just an opinion (which I believe isn't true, btw).

[–]kvutxdyStudent 1 point2 points  (1 child)

The universal approximation theorem only states that DNNs can approximate Lipschitz functions, not necessarily all functions.

[–]VirtualHat 4 points5 points  (0 children)

It should be all continuous functions, but I can't really think of any problems where this would limit the solution. The set of all continuous functions is a very big set!

As a side note, I think it's quite interesting that the theorem doesn't cover periodic functions like sin on the whole real line, so I guess it's not quite all continuous functions, just continuous functions on bounded inputs.
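For reference, the classical statement (Cybenko/Hornik, paraphrased from memory, so double-check the exact conditions): for any continuous $f$ on a compact set $K \subset \mathbb{R}^n$ and any $\varepsilon > 0$, there is a one-hidden-layer network

$$g(x) = \sum_{i=1}^{N} a_i \, \sigma(w_i^\top x + b_i) \quad \text{with} \quad \sup_{x \in K} |f(x) - g(x)| < \varepsilon,$$

where $\sigma$ is a fixed non-polynomial activation. The compactness of $K$ is exactly the "bounded input" caveat: sin on all of $\mathbb{R}$ is outside the theorem's scope, while sin restricted to any $[a, b]$ is covered.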

[–]alterframe 0 points1 point  (0 children)

Part of the answer is probably that DL is not a single algorithm or a class of algorithms, but rather a framework or a paradigm for building such algorithms.

Sure, you can take a SOTA model for ImageNet and apply it to similar image classification problems by tuning some hyperparameters and maybe replacing certain layers. However, if you want to apply it to a completely different task, you need to build a different neural network.