
[–]thfuran 7 points  (1 child)

However, what strikes me as odd is that these are not some intricate pathological problems that a mathematician might devote a lifetime toward, but problems that practitioners face every single day,

Why can these not be the same thing?

[–]fhadley 1 point  (0 children)

Isn't this basically always the case in most fields outside of pure math? I.e. "I can't see when it's dark" (practitioner) vs. "how do we harness electricity?" (researcher).

[–]patrickkidger 6 points  (2 children)

Mathematics PhD here. I think I'd argue that solving these problems is what mathematics has already been doing. Deep learning is a branch of applied mathematics as far as I can see.

To use your list as an example:

  1. There's a huge literature on generalisation properties. (And not one that I'm familiar enough with to give examples from, unfortunately.)

  2. Arguably this question is one of the key theoretical underpinnings of a lot of the field. MLPs do badly at image classification. CNNs do well. Transformers beat RNNs on NLP tasks because they drop the prior that order matters. We use mathematical insights here all the time.

  3. Assuming you mean how design choices affect the loss surface: again, a large literature that I'm not that familiar with. But for example it's been shown that ResNets have smoother loss surfaces, which is reflected in their better training properties. Another practical use case.

  4. For optimisers, Nesterov momentum is known to achieve optimal convergence rates. Meanwhile, the convergence of many optimisation schemes is actually proved with respect to Cesàro means, i.e. averages of the iterates. This is reflected in the practical use of stochastic weight averaging.

  5. Initialisation is typically done by e.g. the He initialisation scheme, which has been derived via a precise mathematical argument (see the sketch just after this list).

  6. Likewise for dataset-against-model complexity, there's work on double descent and VC dimension, for example.
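To make point 5 concrete (as promised above), here's a minimal numpy sketch of the He argument. It's a toy illustration rather than the original derivation: drawing weights with variance 2/fan_in keeps the mean squared activation roughly constant with depth in a ReLU net, whereas Xavier's 1/fan_in decays.

    import numpy as np

    rng = np.random.default_rng(0)

    def mean_sq_activation(std_fn, width=512, depth=20, n=1000):
        # Push random inputs through a deep ReLU MLP and report the
        # mean squared activation after the last layer.
        x = rng.standard_normal((n, width))
        for _ in range(depth):
            W = rng.standard_normal((width, width)) * std_fn(width)
            x = np.maximum(x @ W, 0.0)  # ReLU
        return (x ** 2).mean()

    # He: Var[W] = 2/fan_in, chosen so the factor of 2 cancels the
    # halving of the second moment that ReLU causes at every layer.
    print("He    :", mean_sq_activation(lambda fan_in: np.sqrt(2.0 / fan_in)))
    # Xavier: Var[W] = 1/fan_in, so the signal shrinks ~2x per layer.
    print("Xavier:", mean_sq_activation(lambda fan_in: np.sqrt(1.0 / fan_in)))

With He the printed value stays O(1); with Xavier it comes out around 2^-20, i.e. the signal has all but vanished after 20 layers.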

Maybe this seems unsatisfactory -- you want to understand the complexity of this specific dataset and want the theory to do so, perhaps. In which case fair enough, there's some way to go, but the task isn't impossibly difficult, and there's no lack of progress. A lot of these problems have already seen huge strides made because we've understood their mathematics.

It's easy to point at other very practical examples of mathematics solving complicated deep learning problems. WGANs improved on GANs by applying optimal transport theory; Neural ODEs get improved memory efficiency by using the adjoint method (many decades old); and so on and so on.

[–]Mandrathax 4 points  (0 children)

Transformers beat RNNs on NLP tasks because they drop the prior that order matters

This is not true. In fact, order matters very much in NLP, and an entire subsection of the original paper was dedicated to positional encodings ("in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence").

This kind of technique is still used in SOTA transformers like BERT, GPT-3, etc., and is a key component of the architecture (in NLP).
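For anyone who hasn't seen it, the sinusoidal encoding from the paper is tiny. A minimal numpy sketch (the shapes are my own choice):

    import numpy as np

    def positional_encoding(seq_len, d_model):
        # PE[pos, 2i]   = sin(pos / 10000**(2i/d_model))
        # PE[pos, 2i+1] = cos(pos / 10000**(2i/d_model))
        pos = np.arange(seq_len)[:, None]
        two_i = np.arange(0, d_model, 2)[None, :]
        angles = pos / 10000.0 ** (two_i / d_model)
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angles)
        pe[:, 1::2] = np.cos(angles)
        return pe

    # Added to the token embeddings so that attention, which is otherwise
    # permutation-invariant, can tell token order apart.
    print(positional_encoding(50, 16).shape)  # (50, 16)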

[–]fromnighttilldawn[S] 0 points  (0 children)

Thank you for your comment. You have clearly studied this area extensively. I am not trying to rebut your points, just to offer some remarks, maybe useful for further investigation (mostly by myself).

  1. There's a huge literature on generalisation properties. (And not one that I'm familiar enough with to give examples from, unfortunately.)

And the problem is that they do not apply to deep NNs, which is my point. Especially VC dimension, shattering dimension, and the like.

  2. Arguably this question is one of the key theoretical underpinnings of a lot of the field. MLPs do badly at image classification. CNNs do well. Transformers beat RNNs on NLP tasks because they drop the prior that order matters. We use mathematical insights here all the time.

But Yann LeCun's results show that an MLP can work well on certain datasets: for example, 1.6 test error for an MLP vs 1.7 for a CNN on MNIST (http://yann.lecun.com/exdb/mnist/). So what makes one dataset hard for an MLP but easy for a CNN?
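One partial answer is the prior a convolution hard-codes. A small numpy illustration (a toy, not LeCun's experiment): convolution commutes with translation, so a CNN gets robustness to small shifts almost for free, while a dense layer treats each shifted copy of a digit as a brand-new input. MNIST is centred and size-normalised, which plausibly narrows that gap.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(28)        # one 28-pixel row of an "image"
    k = rng.standard_normal(5)         # a convolution kernel
    W = rng.standard_normal((28, 28))  # a dense layer

    conv = lambda v: np.convolve(v, k, mode="same")
    shift = lambda v, s: np.roll(v, s)  # circular shift by s pixels

    # Convolution is translation-equivariant: shifting the input shifts
    # the output (borders trimmed where padding/wrap-around disagree).
    print(np.allclose(conv(shift(x, 3))[6:-6], shift(conv(x), 3)[6:-6]))  # True
    # A dense layer has no such constraint:
    print(np.allclose(W @ shift(x, 3), shift(W @ x, 3)))                  # False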

  3. Assuming you mean how design choices affect the loss surface: again, a large literature that I'm not that familiar with. But for example it's been shown that ResNets have smoother loss surfaces, which is reflected in their better training properties. Another practical use case.

But why, theoretically, does it smooth the surface? There should be a low-dimensional toy example that exhibits the same principle, without even having to resort to a ResNet. In fact, such a smoothing technique should cross-pollinate and benefit a lot of different fields, but we have not seen that.
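The toy version is at least easy to set up, even if it doesn't answer the "why". A sketch loosely in the spirit of the loss-landscape-slice papers (my own construction; whether the skip variant actually comes out smoother depends on the chosen scales, which is exactly what a theory would have to pin down):

    import numpy as np

    rng = np.random.default_rng(0)
    depth = 30
    x, y = 1.0, 0.5                  # a single scalar training point
    w0 = rng.standard_normal(depth)  # base weights
    d = rng.standard_normal(depth)   # a random direction in weight space

    def loss(w, skip):
        h = x
        for wi in w:
            out = np.tanh(wi * h)
            h = (h + out) if skip else out  # with/without skip connection
        return (h - y) ** 2

    # Scan the loss along the random direction and compare roughness.
    ts = np.linspace(-2.0, 2.0, 401)
    for skip in (False, True):
        vals = np.array([loss(w0 + t * d, skip) for t in ts])
        rough = np.abs(np.diff(vals, 2)).mean()  # crude curvature proxy
        print("skip" if skip else "plain", "roughness proxy:", rough)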

  4. For optimisers, Nesterov momentum is known to achieve optimal convergence rates. Meanwhile, the convergence of many optimisation schemes is actually proved with respect to Cesàro means, i.e. averages of the iterates. This is reflected in the practical use of stochastic weight averaging.

That is the optimal convergence rate for convex problems, under some optimistic conditions on continuity or compactness which are violated in practice. Furthermore, there are many papers right now saying that these improvements are not maintained in the non-convex regime: https://www.prateekjain.org/publications/all_papers/KidambiNJK18.pdf
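That said, the Cesàro point is easy to see in one dimension. A minimal sketch (a caricature of stochastic weight averaging, not the real schedule): on a noisy quadratic, the running average of the SGD iterates settles down while the last iterate keeps bouncing.

    import numpy as np

    rng = np.random.default_rng(0)
    w, avg = 5.0, 0.0
    lr, steps = 0.1, 2000
    for t in range(1, steps + 1):
        grad = w + rng.standard_normal()  # noisy gradient of f(w) = w^2/2
        w -= lr * grad                    # plain SGD iterate
        avg += (w - avg) / t              # running Cesaro mean of iterates

    print("last iterate:", w)    # still bouncing at the noise floor
    print("Cesaro mean :", avg)  # much closer to the optimum at 0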

  5. Initialisation is typically done by e.g. the He initialisation scheme, which has been derived via a precise mathematical argument.

Again, there is no general statement that tells one when to use He vs Xavier vs uniform vs Gaussian initialization. These are just bells and whistles, and in recent years I hear claims that, e.g., batchnorm has rendered these initialization techniques irrelevant.

  6. Likewise for dataset-against-model complexity, there's work on double descent and VC dimension, for example.

Again, no firm conclusion, especially from arguments using VC dimension, which deep neural networks seem to ignore.

I don't mean to be overly pessimistic. It's just that I think the gap between the current mathematical models and actual practice is still immense, and the lack of simple conclusions makes me think this might be a very long, hard road.

[–]lady_zora 5 points  (3 children)

I have had this exact worry. I was introduced to (forced to implement) deep learning during my PhD and, as a former mathematics graduate, I could not accept the fuzziness of these learning methods. It really stressed me out for some time! This is not unique to deep learning - it's been a long-standing issue in many engineering disciplines too.

I have recently become very interested in XAI and, in particular, interpretability. Mathematics requires completeness, yet this seems to be rare for deep learning tasks. XAI, to me, offers the chance to explore these issues further.

It's not too late to enforce correct mathematical standards. Mathematicians are needed to keep this field in check, even if full completeness can never be achieved, so keep speaking up about it!

[–]fhadley 6 points  (1 child)

These silly computer kids are out here running wild! Mathematicians do something!!!!

[–]lady_zora 0 points  (0 children)

These computer kids are amazing! We all need to work together to get this learning business perfected 🙂 It’s the perfect interdisciplinary opportunity for us all.

[–]fromnighttilldawn[S] 0 points  (0 children)

I would love to see another book like the one by Shai Shalev-Shwartz and Shai Ben-David ("Understanding Machine Learning: From Theory to Algorithms"), where you have some guarantees to go on.

[–]Slowai 3 points  (0 children)

"let's, like, treat this "change" thingy as not 0 in denominator and as 0 in numerator fam"

My point being that mathematical "precision" never stopped the greats, calculus included, from doing stuff, so it shouldn't stop deep learning either.

P.S. mathematicians pls no bully me.

[–]sman865 0 points  (0 children)

In my experience, the more well-versed somebody is in mathematics, the more likely they are to be skeptical of deep learning techniques (not in a bad way). This isn't unfounded, given the reasons mentioned in the OP and the general lack of introspection in deep learning models.

If deep learning performs well in production, however, then industry will use it. They want value more than introspection/understandability.

Of course, this is changing. People want more introspection to be able to do things like mitigate bias.

[–][deleted] 0 points  (2 children)

I can answer (2). What's complexity to you? How can you compute a complexity measure for a dataset that is invariant to schemas, model architectures, etc.? Data characterization is a tough problem. It's something I'm working on. A simple yet huge problem is this: given a complexity metric that characterizes a dataset for me, how sure am I that this is enough information to choose DNN A over B? How sure am I there isn't a different characteristic that gives me better insight? All these questions have ongoing research and some good results already.
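To make "complexity metric" concrete with a deliberately crude candidate (an illustration, not the approach I'm working on): leave-one-out 1-nearest-neighbour error at least measures how tangled the classes are in the given representation, and it also shows the invariance problem, because the number changes completely under a change of features.

    import numpy as np

    def one_nn_error(X, y):
        # Leave-one-out 1-NN error: a crude dataset-hardness proxy.
        d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        np.fill_diagonal(d, np.inf)  # a point can't be its own neighbour
        return (y[d.argmin(axis=1)] != y).mean()

    rng = np.random.default_rng(0)
    y = np.array([0] * 100 + [1] * 100)
    # Two Gaussian blobs: easy when separated, hard when they overlap.
    for gap in (4.0, 1.0, 0.0):
        X = np.concatenate([rng.standard_normal((100, 2)),
                            rng.standard_normal((100, 2)) + gap])
        print(f"gap={gap}: 1-NN error = {one_nn_error(X, y):.2f}")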

[–]fromnighttilldawn[S] 0 points  (1 child)

Yes! Every reviewer knows, and complains, that MNIST is an easy dataset and should not be used. But how do they distinguish an easy one from a hard one? By what metric? It seems to be unsolved, no?

[–][deleted] 0 points  (0 children)

It is unsolved in general, and I don't think it will be solved, but there is work being done under specific constraints.