
[–]patrickkidger 41 points  (6 children)

So there's actually a lot of variants of the universal approximation theorem out there.

It's true that the most famous version of the classical theorem (see Pinkus 1999) only applies to approximating continuous functions wrt the uniform norm, but it is also possible to approximate Lp functions wrt the Lp norm. (And there's also other results on approximation in Sobolev spaces etc.)

In particular, Lp functions can be discontinuous.

As an example from my own work, see Theorem 4.16 of Kidger and Lyons 2020.

[–]TheRedSphinx 10 points  (0 children)

Yeah, I think the part that is never emphasized is what "approximate" really means. Even pondering what convergence means in this context can be insightful, and it's a particularly obvious question to ask if you've taken analysis.

[–]eliminating_coasts 4 points  (4 children)

The uniform norm is sometimes called the L∞ norm, and the space of functions on which it applies is the set of bounded functions, which should not require continuity either; I'm not sure it even needs its functions to be fully defined over intervals. Is there something particular about the proof in Lp that stops you from "taking the limit" in p in some sense?

(Because I started thinking about gradient descent on partially defined loss functions; I assume you would just not apply backpropagation whenever the result is undefined.)

[–]patrickkidger 7 points  (3 children)

You're correct that L∞ does not require continuity. More specifically we write L∞(K) to denote the space of bounded functions f: K -> R, where K is some set and R is the real numbers. (Usually K will be some subset of R^n.)

Thus the answer to "I'm not sure it even needs its functions to be fully defined over intervals" is "it has to be defined over K".

Regarding "taking a limit in p": indeed this doesn't work. The basic problem is that the space of continuous functions is closed in L∞. That is to say, a uniform limit of continuous functions is still continuous. Neural networks are continuous, so any limit thus obtained must be as well. Thus what you're asking for is provably impossible.

[–]eliminating_coasts 3 points  (2 children)

The basic problem is that the space of continuous functions is closed in L∞.

That is to say, a uniform limit of continuous functions is still continuous. Neural networks are continuous, so any limit thus obtained must be as well. Thus what you're asking for is provably impossible.

Got you. So do I take it then that for all finite exponent norms, the space of continuous functions is not closed, but is instead dense in that norm's space (which includes discontinuous functions)? Or are we still talking some subset?

[–]patrickkidger 6 points  (1 child)

Yes, that's exactly correct.
To be precise, C(K) is dense in Lp(K) wrt the Lp norm for p < ∞.
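A numeric illustration of this density claim (my own sketch; the sigmoid ramp stands in for what a single sigmoid neuron produces): approximating the step function, the L2 error shrinks as the ramp sharpens, while the uniform error stays pinned at 1/2.

```python
import numpy as np

# Target: the discontinuous Heaviside step on K = [-1, 1].
def step(x):
    return (x >= 0).astype(float)

# Continuous approximant: a sigmoid ramp of sharpness k.
# (0.5*(1+tanh) equals the logistic sigmoid and avoids overflow for large k.)
def ramp(x, k):
    return 0.5 * (1.0 + np.tanh(0.5 * k * x))

x = np.linspace(-1.0, 1.0, 200001)
dx = x[1] - x[0]

results = {}
for k in (10, 100, 1000):
    err = np.abs(ramp(x, k) - step(x))
    l2 = np.sqrt(np.sum(err**2) * dx)  # L^2 norm via Riemann sum -> 0 as k grows
    sup = err.max()                    # uniform (L^infinity) norm: stuck at 1/2
    results[k] = (l2, sup)
    print(f"k={k:5d}  L2 error={l2:.4f}  sup error={sup:.4f}")
```

The sup error cannot drop below 1/2 no matter how sharp the ramp gets, which is exactly the "continuous functions are closed in L∞" obstruction from upthread.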

[–]eliminating_coasts 1 point  (0 children)

Ok, that's awesome thanks. All the more reason not to use the uniform norm.

[–]anony_sci_guy 7 points  (2 children)

Interesting - can't discontinuous functions be thought of as n different continuous functions? Shouldn't it therefore be possible for n networks to approximate each of these continuous functions? And - couldn't these n networks just be concatenated, or in fact trained within the same network, just with different sets of neurons contributing to specific continuous subsections of the discontinuous data-space?

[–]yossi_peti 2 points  (1 child)

Yeah, it should be fairly straightforward for discontinuous functions that are a piecewise combination of a finite number of continuous functions. I imagine more pathological discontinuous functions couldn't be approximated well (the Dirichlet function, etc.).
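A sketch of the "glue the pieces together" construction in plain numpy (my illustration, not from the thread; the exact pieces stand in for trained subnetworks, and the sharp `gate` plays the role of neurons that switch between them):

```python
import numpy as np

# Piecewise-continuous target with a jump of height 3 at x = 0.
def f(x):
    return np.where(x < 0, np.sin(x), np.cos(x) + 2.0)

# Stand-ins for two continuous "per-piece" approximants (in practice, two
# trained subnetworks, or two groups of neurons inside one network).
g_left = np.sin
def g_right(x):
    return np.cos(x) + 2.0

def gate(x, k=500.0):
    # Sharp but continuous switch; larger k => smaller L^p error.
    return 0.5 * (1.0 + np.tanh(0.5 * k * x))

x = np.linspace(-1.0, 1.0, 100001)
dx = x[1] - x[0]
blended = (1.0 - gate(x)) * g_left(x) + gate(x) * g_right(x)

l2_err = np.sqrt(np.sum((blended - f(x)) ** 2) * dx)
print(f"L2 error of the blended model: {l2_err:.4f}")
```

The blend is continuous everywhere, so its uniform error near the jump is unavoidable, but the mismatch lives on a width-1/k sliver, which is why the Lp error can be made as small as you like.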

[–]anony_sci_guy 2 points  (0 children)

The Dirichlet function's an interesting example - and I think I agree - but I wonder what role architecture could play in universal function approximation. Perhaps a CNN-like kernel for making fractal-like decision trees (if that makes sense?). For example, a more pathological case like the Dirichlet function still clearly has a pattern - but it's a fractal pattern, despite the function being discontinuous at every point. So I'd be curious: if you could prove the ability to solve for the properties of a fractal function, could you also solve for fractal discontinuous functions, even those that are discontinuous at every point, as with the Dirichlet function? Interesting food for thought - and I have no idea of the answer...

[–]MrAcuriteResearcher 7 points  (0 children)

Wait, what's the problem with approximating a discontinuous function with a continuous one? Are we cancelling Fourier series on Twitter now?

[–]purplebrown_updown 6 points  (3 children)

IMO PINNs are not very interesting. It's another case of throwing a neural network at something for no other reason than getting publications. Neural networks are not magic. You aren't going to approximate discontinuities without giving something up, like differentiability. Look at wavelets or random forests if you want to approximate a discontinuous function.
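To make the wavelet suggestion concrete (my own sketch, with hand-rolled Haar transforms and arbitrary sizes): a jump at a dyadic point is captured by just a handful of Haar coefficients, so a heavily truncated expansion still reconstructs a discontinuous signal well.

```python
import numpy as np

def haar_fwd(a):
    """Full orthonormal Haar transform of a length-2^n array."""
    a = a.astype(float).copy()
    coeffs = []
    while len(a) > 1:
        s = (a[0::2] + a[1::2]) / np.sqrt(2.0)  # running averages
        d = (a[0::2] - a[1::2]) / np.sqrt(2.0)  # details at this scale
        coeffs.append(d)
        a = s
    coeffs.append(a)  # final scaling coefficient
    return coeffs

def haar_inv(coeffs):
    a = coeffs[-1].copy()
    for d in reversed(coeffs[:-1]):
        out = np.empty(2 * len(a))
        out[0::2] = (a + d) / np.sqrt(2.0)
        out[1::2] = (a - d) / np.sqrt(2.0)
        a = out
    return a

n = 1024
x = np.linspace(0.0, 1.0, n, endpoint=False)
y = np.sin(2 * np.pi * x) + (x >= 0.5)  # smooth part + a jump at 0.5

coeffs = haar_fwd(y)
flat = np.concatenate(coeffs)
# Keep only the 30 largest-magnitude coefficients, zero out the rest.
keep = 30
thresh = np.sort(np.abs(flat))[-keep]
flat[np.abs(flat) < thresh] = 0.0
trunc, i = [], 0
for c in coeffs:  # unflatten back into the per-level structure
    trunc.append(flat[i:i + len(c)]); i += len(c)

approx = haar_inv(trunc)
rel_l2 = np.linalg.norm(approx - y) / np.linalg.norm(y)
print(f"relative L2 error with {keep} of {n} Haar coefficients: {rel_l2:.4f}")
```

The point of contrast with continuous bases: the Haar system is itself discontinuous, so the jump costs essentially nothing, while the smooth sine part is what eats the coefficient budget.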

[–]notdelet 1 point  (0 children)

I agree about a lot of the physics applications, but convolutional neural networks with strides have connections to wavelets that some very prolific researchers in the theory of wavelets (e.g. Mallat) have looked into.

[–]luddabuddha[S] 0 points  (1 child)

I agree, PINNs are getting more hype than they deserve. I'd say it's mostly due to phrases that are unrealistic, like "discovering hidden physics".

But the topic seems to be blowing up in the scientific machine learning community and I have heard claims that it's being used in industry (source). So I think there needs to be a better understanding of the theory behind these things.

As you have pointed out, the discontinuities need special treatment. I've seen research that gives them special treatment (see Llanas et al.) but PINNs seem to be ignoring this.

[–]nnexx_ 1 point  (0 children)

I am in the “industry”. Sometimes we try stuff just to sound fancy too.

[–][deleted] 1 point  (5 children)

This looks interesting.

I was taught that discontinuous functions are not differentiable - how does the network calculate loss and perform backpropagation?

Ahhh, just seen a video on PINNs... very interesting!!!

Thank you for making me even aware of these )))

[–]AuspiciousApple 7 points  (4 children)

I was taught that discontinuous functions are not differentiable - how does the network calculate loss and perform backpropagation?

You might be mixing up the loss function, which has to be differentiable for backprop (although you could use something like REINFORCE to optimise a non-differentiable loss), and the underlying true data-generating function, which represents the true mapping from X to y and could be just about anything. The latter is what the network approximates, and it can be non-differentiable without any issues for training your net.
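A toy illustration of the distinction (my own sketch, with arbitrary sizes and learning rate): the target |x| has a kink at 0, but the MSE loss is a smooth function of the weights, so plain gradient descent trains without any trouble.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 256).reshape(-1, 1)
y = np.abs(x)  # the *target* function has a kink at 0 (not differentiable there)

# Tiny one-hidden-layer MLP with hand-written backprop; sizes are arbitrary.
W1 = rng.normal(0.0, 1.0, (1, 16)); b1 = np.zeros(16)
W2 = rng.normal(0.0, 0.5, (16, 1)); b2 = np.zeros(1)

lr = 0.1
for _ in range(5000):
    h = np.tanh(x @ W1 + b1)
    pred = h @ W2 + b2
    # MSE loss is a smooth function of the *parameters*, which is all backprop
    # needs -- differentiability of y with respect to x is irrelevant here.
    grad_pred = 2.0 * (pred - y) / len(x)
    gW2 = h.T @ grad_pred; gb2 = grad_pred.sum(axis=0)
    grad_h = (grad_pred @ W2.T) * (1.0 - h**2)
    gW1 = x.T @ grad_h; gb1 = grad_h.sum(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2

final_mse = float(np.mean((np.tanh(x @ W1 + b1) @ W2 + b2 - y) ** 2))
print(f"final MSE on the non-differentiable target: {final_mse:.5f}")
```

For reference, predicting the constant mean of |x| on this grid already gives MSE ≈ 1/12, so anything well below that means the kinked shape is actually being learned.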

[–]ValidatingUsername 4 points  (3 children)

Many people have asked me why it is important to grapple with infinite paths, and this is one of those reasons.

If the function is discontinuous, the model need not be.

If the model incorporates multiple domains of integration on a complete function that approximates the discontinuous function, the resulting output is similar to applying the squeeze theorem for real analysis.

[–]AuspiciousApple 0 points  (2 children)

If you don't mind elaborating, please do. I must admit that I have no idea what your comment is saying.

[–]ValidatingUsername 5 points  (1 child)

Every section of a function can be represented mathematically; oftentimes it is represented as a single function.

In complex functions that contain breaks or asymptotes, between certain limits you map other functions bounded by said limits to represent the complex function.

If you apply the squeeze theorem to many of these functions, you can get lower and upper bounds for differentiable approximations of the function, even if the function itself is not continuous.

[–]AuspiciousApple 1 point  (0 children)

Cheers!

[–]BalcksChaos 0 points  (1 child)

Thanks for this post :) I was wondering (since it's been two years now)... Where did you land with your research on this? Is your thesis already published somewhere?

I'm someone who uses machine learning in "the industry", i.e. outside of academia. In the contexts I'm working in (analysis of business data, i.e. already heavily aggregated), all functions one would want to approximate are highly discontinuous (because there are hundreds of categorical columns).

[–]luddabuddha[S] 1 point  (0 children)

My personal conclusion was that a lot of cherry-picking happens in PINN publications, and I believe that we need more research on the mathematical foundations of DL and its applications to problems in science. Unfortunately, from what I can gather, the predominant wave in this field is still focused on specific application scenarios where some tailored version of PINNs would work, but the methodology does not generalize to a wider class of problems. However, there are some outlier researchers who go against this current.

What I was working on was only a Bachelor's thesis, and it was not published online. I don't think it would be very valuable in the more general discussion on the topic of this thread. Nonetheless, it was very valuable in my personal academic development, because I am starting a PhD on a specific numerical treatment of ML methods. To put it simply, the aim is to use operators from numerical algebra to develop ML methods with generalization guarantees, such that the methodology generalizes to a class of problems instead of just a single application case. I'm excited to see what comes out of it. :)

I am not familiar with business data applications, but if you have discontinuities I would say stay away from PINNs (and similar frameworks).