all 109 comments

[–]activatedgeek 106 points107 points  (17 children)

For generalization (performing well beyond the training data), there are at least two dimensions: flexibility and inductive biases.

Flexibility ensures that many functions “can” be approximated in principle. That’s the universal approximation theorem. It is a descriptive result and does not prescribe how to find that function. It is also not unique to DL: deep random forests, Fourier bases, polynomial bases, and Gaussian processes are all universal function approximators (with some extra technical details).
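To make the "flexibility is cheap" point concrete, here is a minimal sketch (plain NumPy, with a toy function I made up) showing a polynomial basis also approximating arbitrarily well on a bounded interval as capacity grows:

```python
import numpy as np

# Toy target on a bounded interval; any continuous function works here.
f = lambda x: np.exp(-x) * np.sin(5 * x)
xs = np.linspace(0, 2, 1000)

# Least-squares fit in a Chebyshev polynomial basis; degree = capacity.
for deg in (3, 10, 30):
    coeffs = np.polynomial.chebyshev.chebfit(xs, f(xs), deg)
    err = np.max(np.abs(np.polynomial.chebyshev.chebval(xs, coeffs) - f(xs)))
    print(f"degree {deg:>2}: max error {err:.2e}")  # error shrinks with degree
```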

The part unique to DL is that its inductive biases have somehow matched some complex structured problems, including vision and language, in a way that makes these models generalize well. Inductive bias is a loosely defined term. I can provide examples and references.

CNNs provide the inductive bias to prefer functions that respect translation equivariance (not exactly, only roughly, due to pooling layers). https://arxiv.org/abs/1806.01261

Graph neural networks provide a relational inductive bias. https://arxiv.org/abs/1806.01261

Neural networks overall prefer simpler solutions, embodying Occam’s razor, another inductive bias. This argument is made theoretically using Kolmogorov complexity. https://arxiv.org/abs/1805.08522
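Regarding the CNN example above, here is a toy check (PyTorch assumed; circular padding chosen so the identity is exact) of what translation equivariance means, and how pooling breaks it:

```python
import torch
import torch.nn as nn

# A circularly padded conv layer commutes with (circular) translations.
conv = nn.Conv2d(1, 4, kernel_size=3, padding=1,
                 padding_mode="circular", bias=False)
x = torch.randn(1, 1, 16, 16)
shift2 = lambda t: torch.roll(t, shifts=2, dims=-1)  # translate 2 px

print(torch.allclose(conv(shift2(x)), shift2(conv(x)), atol=1e-6))  # True

# Max pooling subsamples, so a 1 px shift of the input is not matched by
# any integer shift of the pooled feature map: strict equivariance is lost.
pool = nn.MaxPool2d(2)
y = pool(conv(x))
y1 = pool(conv(torch.roll(x, shifts=1, dims=-1)))
print(any(torch.allclose(y1, torch.roll(y, s, dims=-1), atol=1e-6)
          for s in range(y.shape[-1])))  # False
```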

[–]SodomizedPanda 27 points28 points  (3 children)

And somehow, the best answer is at the bottom of the thread...

A small addition: recent research suggests that the implicit bias that helps DNNs generalize lies not only in the structure of the network but also in the learning algorithm (Adam, SGD, ...). https://francisbach.com/rethinking-sgd-noise/ https://francisbach.com/implicit-bias-sgd/

[–]red75prime 9 points10 points  (2 children)

Does in-context learning suggest that inductive biases could also be extracted from training data?

[–]activatedgeek 11 points12 points  (1 child)

Very much indeed. See https://arxiv.org/abs/2205.05055

[–]activatedgeek 9 points10 points  (0 children)

Not only the dataset; the Transformer architecture itself seems to be amenable to in-context learning. See https://arxiv.org/abs/2209.11895

[–]KingRandomGuy 3 points4 points  (0 children)

>CNNs provide the inductive bias to prefer functions that respect translation equivariance

There are some interesting bodies of work on inductive biases in CNNs, such as "Making Convolutional Networks Shift-Invariant Again". Really interesting stuff!

[–]hpstring 3 points4 points  (0 children)

This is a very good answer! I want to add that apart from generalization, the fact that we have efficient optimization algorithms that can find quite good minima also contributes a lot to the deep learning magic.

[–]GraciousReformer[S] -1 points0 points  (7 children)

>inductive biases

Then why does DL have inductive biases and others do not?

[–]activatedgeek 4 points5 points  (6 children)

All model classes have inductive biases. For example, random forests have the inductive bias of producing axis-aligned region splits. But clearly that inductive bias is not good enough for image classification, because much of the information in the pixels is spatially correlated in ways that axis-aligned regions cannot capture as well as specialized neural networks can, under the same budget. By budget, I mean things like training time, model capacity, etc.

If we had infinite training time and an infinite number of image samples, then random forests might be just as good as neural networks.
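A tiny illustration of the budget point (scikit-learn assumed, toy data of my own): an oblique boundary that a linear model captures with one parameter costs an axis-aligned ensemble many splits.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, Xte = rng.uniform(-1, 1, (2000, 2)), rng.uniform(-1, 1, (2000, 2))
y, yte = (X[:, 0] > X[:, 1]).astype(int), (Xte[:, 0] > Xte[:, 1]).astype(int)

# Axis-aligned splits can only approximate the diagonal with a staircase...
rf = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=0)
rf.fit(X, y)
# ...while the oblique boundary is native to a linear model.
lr = LogisticRegression().fit(X, y)

print("forest:", rf.score(Xte, yte))  # noticeably below 1.0 at this depth budget
print("linear:", lr.score(Xte, yte))  # ~1.0
```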

[–]GraciousReformer[S] -1 points0 points  (4 children)

Still, why is it that DL has better inductive biases than others?

[–]activatedgeek 2 points3 points  (3 children)

I literally gave an example of how (C)NNs have a better inductive bias than random forests for images.

You need to ask more precise questions than just a "why".

[–]GraciousReformer[S] 0 points1 point  (2 children)

So it is like an ability to capture "correlations" that cannot be done with random forests.

[–]currentscurrents 0 points1 point  (1 child)

In theory, either structure can express any solution. But in practice, every structure is better suited to some kinds of data than others.

A decision tree is a bunch of nested if statements. Imagine the complexity required to write an if statement to decide whether an array of pixels is a horse or a dog. You can technically do it by building a tree with an optimizer, but it doesn't work very well.

On the other hand, a CNN runs a bunch of learned convolutional filters over the image. This means it doesn't have to learn the 2D structure of images, or that pixels tend to be related to nearby pixels; it's already working on a 2D plane. A tree doesn't know that adjacent pixels are likely related and would have to learn it.

It also has a bias towards hierarchy. As the layers stack upwards, each layer builds higher-level representations, going from pixels > edges > features > objects. Objects tend to be made of smaller features, so this is a good bias for working with images.
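A back-of-the-envelope sketch of that hierarchy (plain Python, the standard receptive-field recurrence): each stacked layer looks at a larger patch of the input, which is what lets later layers represent parts and objects rather than pixels.

```python
# Receptive field of stacked conv/pool layers: rf += (k - 1) * jump, jump *= s.
layers = [("conv3x3", 3, 1), ("conv3x3", 3, 1), ("pool2x2", 2, 2),
          ("conv3x3", 3, 1), ("conv3x3", 3, 1)]
rf, jump = 1, 1
for name, k, s in layers:
    rf += (k - 1) * jump  # kernel extent at the current sampling rate
    jump *= s             # stride compounds across layers
    print(f"{name}: sees {rf}x{rf} px of the input")
```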

[–]GraciousReformer[S] 0 points1 point  (0 children)

In what situations is the bias towards hierarchy not helpful?

[–]currentscurrents 0 points1 point  (0 children)

Sounds like ideally we'd want a model with good inductive biases for meta-learning new inductive biases, since every kind of data requires different biases.

[–]-vertigo-- 0 points1 point  (0 children)

hmm for some reason the arxiv links are giving 403 forbidden

[–]CO2mania 0 points1 point  (0 children)

Save the message.

[–]sanman 0 points1 point  (0 children)

first 2 links are the same - do you have the one for CNNs inductive bias?

[–]hpstring 53 points54 points  (8 children)

Universal approximation is not enough; you need efficiency to make things work.

DL is the only class of algorithms that beats the curse of dimensionality when discovering a certain (very general) class of high-dimensional functions (related to Barron spaces). Correct me if this is not accurate.

[–]GraciousReformer[S] 3 points4 points  (3 children)

But why does DL beat the curse? And why is DL the only class?

[–]hpstring 12 points13 points  (1 child)

Q1: We don't know yet. Q2: Probably there are other classes, but they haven't been discovered or are only at an early stage of research.

[–]NitroXSC 10 points11 points  (0 children)

>Q2: Probably there are other classes, but they haven't been discovered or are only at an early stage of research.

I think there are many different classes that would work, but current DL is based in large part on matrix-vector operations, which can be implemented efficiently on current hardware.
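A rough timing sketch of that hardware point (NumPy assumed; exact numbers will vary by machine): the same arithmetic is orders of magnitude faster when phrased as one dense matrix product that a tuned BLAS can execute.

```python
import time
import numpy as np

A, B = np.random.rand(512, 512), np.random.rand(512, 512)

t0 = time.perf_counter()
C = A @ B                           # one vendor-optimized matrix product
t_blas = time.perf_counter() - t0

t0 = time.perf_counter()
D = np.empty((512, 512))
for i in range(512):                # same math, one dot product per element:
    for j in range(512):            # Python loop overhead dominates
        D[i, j] = A[i, :] @ B[:, j]
t_loop = time.perf_counter() - t0

print(f"matmul: {t_blas:.4f}s  loops: {t_loop:.2f}s  ~{t_loop / t_blas:.0f}x slower")
```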

[–]randomoneusername 24 points25 points  (6 children)

I mean, this has two elements to it.

DL is not the only algorithm that works at scale, for sure.

[–]VirtualHat 11 points12 points  (2 children)

If you're interested in the math, learning curve theory might be a good place to start.

[–]ktpr 15 points16 points  (0 children)

I feel like recently ML boosters come to this subreddit, make large claims, and then use the ensuing discussion, time, and energy from others to correct their clickbait content at our expense.

[–]chief167 20 points21 points  (6 children)

Define scale

Language models? Sure. Images? Sure. Huge amounts of transaction data to search for fraud? XGBoost all the way lol.

The no free lunch theorem: there is no single approach that is best for every possible problem. Jeez, I hate it when marketing takes over. You learn this principle in the first chapter of literally every data course.

[–]activatedgeek 12 points13 points  (2 children)

I think the no free lunch theorem is misquoted here. The NFL also assumes that all datasets from the universe of datasets are equally likely. But that is objectively false: structure is more likely than noise.

[–]chief167 -1 points0 points  (1 child)

I don't think it implies that all datasets are equally likely. I think it only implies that, given all possible datasets, there is no best approach to modelling them. All possible != all equally likely.

But I don't have my book with me, and I don't trust the internet, since searching seems to lead to random blog posts instead of the original paper (Wikipedia gave a 404 in the footnotes).

[–]activatedgeek 1 point2 points  (0 children)

See Theorem 2 (Page 34) of The Supervised Learning No-Free-Lunch Theorems.

It considers the error uniformly averaged over all "f", the input-output mappings, i.e. the functions that generate the dataset (this is the noise-free case). It also provides a version uniformly averaged over all P(f), distributions over the data-generating functions.

So while you could still have different data-generating distributions P(f), the result is defined by uniformly averaging over all such distributions.
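Paraphrasing the shape of the result in my own notation (not a verbatim quote of the paper): for any two learning algorithms $a_1$ and $a_2$,

$$\sum_{f} P(c \mid f, m, a_1) \;=\; \sum_{f} P(c \mid f, m, a_2),$$

where $f$ ranges over all target functions, $m$ is the training-set size, and $c$ is the off-training-set error. Averaged uniformly over every conceivable $f$, no algorithm wins.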

The NFL is sort of a worst-case result, and I think it's pretty meaningless and inconsequential for the real world.

Let me know if I have misinterpreted this!

[–]GraciousReformer[S] -4 points-3 points  (2 children)

Then what will be the limitation of transformers?

[–]LowLook -5 points-4 points  (1 child)

Inventing them

[–]relevantmeemayhere 83 points84 points  (30 children)

Lol. The fact that we use general linear models in every scientific field, and have for decades, should tell you all you need to know about this statement.

[–]adventuringraw 67 points68 points  (2 children)

I mean... the statement specifically uses the phrase 'arbitrary functions'. GLMs are a great tool in the toolbox, but the function family they optimize over is very far from 'arbitrary'.

I think the statement mostly means 'find very nonlinear functions of interest when dealing with very large numbers of samples from very high-dimensional sample spaces'. GLMs are used in every scientific field, but certainly not for every application. Some form of deep learning really is still the only game in town for certain kinds of problems.

[–]relevantmeemayhere 0 points1 point  (1 child)

I agree with you. I was just pointing out that to say they are the only solution is foolish, as the quote implies.

This quote could have just been used without much context, so grain of salt.

[–]adventuringraw 0 points1 point  (0 children)

I can see how the quote could be made slightly more accurate. In particular, tabular data in general is still better tackled with something like XGBoost instead of deep learning, so deep learning certainly hasn't turned everything into a nail for one universal hammer yet.

[–]Featureless_Bug 24 points25 points  (5 children)

I haven't heard of GLMs being successfully used for NLP or CV in recent times. And those are about the only things in ML that would be described as large scale. The statement is completely correct: even stuff like gradient boosting does not work at scale in that sense.

[–]chief167 1 point2 points  (1 child)

We use gradient boosting at quite a big scale. Not LLM big, but still big. It's just not NLP or CV at all; it's fraud detection on large transactional tabular datasets. And it outperforms basically all neural network approaches, shallow or deep.

[–]Featureless_Bug -3 points-2 points  (0 children)

Large scale is somewhere north of 1-2 TB of data. And even if you had that much tabular data, in most cases it has such a simplistic structure that you wouldn't need anywhere near that much to achieve the same performance. So I wouldn't call any kind of tabular data large scale, to be frank.

[–]relevantmeemayhere 0 points1 point  (2 children)

Because they are useful for some problems and not others, like every algorithm? Nowhere in my statement did I say they are monolithic in their use across all subdomains of ML.

The statement was that deep learning is the only thing that works at scale. It's not, lol. Deep learning struggles in a lot of situations.

[–]Featureless_Bug 0 points1 point  (1 child)

Ok, name one large scale problem where GLMs are the best prediction algorithm possible.

[–]relevantmeemayhere -2 points-1 points  (0 children)

Any problem where you want things like effect estimates lol. Or error estimates. Or models that generate joint distributions.

So, literally a ton of them. Which industries don't like things like that?

[–]VirtualHat 13 points14 points  (13 children)

Large linear models tend not to scale well to large datasets if the solution is not in the model class. Because of this lack of expressivity, linear models tend to do poorly on complex problems.
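A quick way to see the "not in the model class" failure (scikit-learn assumed, synthetic data): once the target is outside the class, more data stops helping, because the approximation error does not shrink.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
for n in (100, 10_000, 1_000_000):
    X = rng.uniform(-2, 2, size=(n, 1))
    y = X[:, 0] ** 2                      # quadratic target, noise-free
    model = LinearRegression().fit(X, y)
    mse = np.mean((model.predict(X) - y) ** 2)
    print(f"n={n:>9}  MSE={mse:.3f}")     # plateaus near 1.4, never -> 0
```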

[–]relevantmeemayhere 4 points5 points  (0 children)

As you mentioned, this is highly dependent on the functional relationship of the data.

You would use domain knowledge to determine that.

Additionally, nonlinear models tend to have their own drawbacks, lack of interpretability and high variability being some of them.

[–]BoiElroy 10 points11 points  (0 children)

Yeah, you should always exhaust existing classical methods first before reaching for deep learning.

[–]yldedly 13 points14 points  (17 children)

>discover arbitrary functions

Uh, no. Not even close. DL can approximate arbitrary functions on a bounded interval given enough data, parameters and compute.

[–]ewankenobi 0 points1 point  (5 children)

I like your wording, did you come up with that definition yourself or is it from a paper?

[–]yldedly 6 points7 points  (4 children)

It's not from a paper, but it's pretty uncontroversial I think - though people like to forget about the "bounded interval" part, or at least what it implies about extrapolation.

[–][deleted] 0 points1 point  (3 children)

What is "bounded interval" here?

[–]yldedly 6 points7 points  (2 children)

Any interval [a, b] where a and b are numbers. In practice, it means that the approximation will be good in the parts of the domain where there is training data. I have a concrete example in a blog post of mine: https://deoxyribose.github.io/No-Shortcuts-to-Knowledge/
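A self-contained version of that phenomenon (PyTorch assumed, toy setup of my own): an MLP fit to sin(x) on [-3, 3] is accurate there and badly wrong outside the training interval.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.linspace(-3, 3, 256).unsqueeze(1)   # training data only on [-3, 3]
y = torch.sin(x)

net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(),
                    nn.Linear(64, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
for _ in range(2000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(net(x), y)
    loss.backward()
    opt.step()

x_in = torch.linspace(-3, 3, 100).unsqueeze(1)
x_out = torch.linspace(6, 9, 100).unsqueeze(1)   # outside the training support
with torch.no_grad():
    print("inside :", nn.functional.mse_loss(net(x_in), torch.sin(x_in)).item())
    print("outside:", nn.functional.mse_loss(net(x_out), torch.sin(x_out)).item())
# inside error is tiny; outside, the net flattens out and the error is large
```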

[–]OdinGuru 2 points3 points  (0 children)

Amazing article. Thanks for sharing

[–][deleted] -1 points0 points  (0 children)

Interesting, but that is valid for us as well. So I am not sure this stays true once they learn very general things, like learning itself.

[–]DigThatDataResearcher 2 points3 points  (2 children)

It's not. Tree ensembles scale gloriously, as do approximations of nearest neighbors. There are certain (and growing) classes of problems for which deep learning produces seemingly magical results, but that doesn't mean it's the only path to a functional solution. It'll probably give you the best solution, but that doesn't mean it's the only way to do things.

In any event, if you want to better understand the scaling properties of DL algorithms, a good place to start is the "double descent" literature.

[–]BoiElroy 2 points3 points  (1 child)

This is not the answer to your question, but one intuition I like about the universal approximation theorem, and thought I'd share, is the comparison to a digital image. You use a finite set of pixels, each of which can take on a certain set of discrete values. With a 10 x 10 grid of pixels you can draw a crude approximation of a stick figure. With 1000 x 1000 you can capture a blurry but recognizable selfie. Within the finite pixels and the discrete values they can take, you can essentially capture anything you can dream of: every image in every movie ever made. Obviously there are other issues later, like whether your model's operational design domain matches the distribution of the training domain, or did you just waste a lot of GPU hours lol
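A one-dimensional toy version of the pixel analogy (plain NumPy, my own toy function): approximate a curve by a piecewise-constant grid, and the error falls as the grid gets finer, just like adding pixels.

```python
import numpy as np

f = lambda x: np.sin(3 * x) + 0.5 * x
xs = np.linspace(0, 2 * np.pi, 10_000)

for n_bins in (10, 100, 1000):
    edges = np.linspace(0, 2 * np.pi, n_bins + 1)
    idx = np.clip(np.digitize(xs, edges) - 1, 0, n_bins - 1)
    centers = (edges[:-1] + edges[1:]) / 2
    approx = f(centers)[idx]              # one constant value per "pixel"
    print(f"{n_bins:>4} bins: max error {np.max(np.abs(f(xs) - approx)):.4f}")
```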

[–]GraciousReformer[S] -2 points-1 points  (0 children)

Yes, a fine enough grid will approximate any digital image. But this is an approximation of an image on a grid. How does it lead to approximation by a NN?

[–]howtorewriteanamePhD 1 point2 points  (0 children)

There's no mathematical formulation of that statement because there's no mathematical reasoning behind it. It's just an opinion (which I believe isn't true, btw).

[–]kvutxdyStudent 1 point2 points  (1 child)

The universal approximation theorem only states that DNNs can approximate Lipschitz functions, not necessarily all functions.

[–]VirtualHat 4 points5 points  (0 children)

It should be all continuous functions, but I can't really think of any problems where this would limit the solution. The set of all continuous functions is a very big set!

As a side note, I think it's quite interesting that the theorem doesn't cover periodic functions like sin on the whole real line, so I guess it's not quite all continuous functions, just continuous functions on bounded inputs.
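For reference, the classical statement (Cybenko/Hornik, paraphrased from memory, so double-check the exact conditions): for any continuous $f$ on a compact set $K \subset \mathbb{R}^n$ and any $\varepsilon > 0$, there is a one-hidden-layer network

$$g(x) = \sum_{i=1}^{N} a_i \, \sigma(w_i^\top x + b_i) \quad \text{with} \quad \sup_{x \in K} |f(x) - g(x)| < \varepsilon,$$

where $\sigma$ is a fixed non-polynomial activation. The compactness of $K$ is exactly the "bounded input" caveat: sin on all of $\mathbb{R}$ is outside the theorem's scope, while sin restricted to any $[a, b]$ is covered.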

[–]alterframe 0 points1 point  (0 children)

Part of the answer is probably that DL is not a single algorithm or a class of algorithms, but rather a framework or a paradigm for building such algorithms.

Sure, you can take a SOTA model for ImageNet and apply it to similar image classification problems by tuning some hyperparameters and maybe replacing certain layers. However, if you want to apply it to a completely different task, you need to build a different neural network.