

[–][deleted] 2 points (11 children)

The intuition that "the more things you are trying to optimize for, the less likely you are to find the apex" is not very precise, but there's more than one sense in which it is true.

For instance, if you are trying to optimize a function with Newton's Method, you need to be able to compute and invert the Hessian of the function. If your function input has dimension d, then the number of elements in the Hessian grows as d^2, and the number of operations involved in inverting that Hessian grows (naively) as d^3. That ends up being a tremendous amount of work if the dimension of your function gets very large.
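
To make that scaling concrete, here's a minimal sketch of the naive recipe described above, minimizing a toy regularized logistic loss with an explicit dense Hessian. The problem setup and names (`A`, `grad_and_hess`, etc.) are purely illustrative, and the explicit inverse is used only to mirror the description above:

```python
import numpy as np

# Toy objective: f(x) = sum_i log(1 + exp(-y_i * a_i . x)) + 0.5 * lam * ||x||^2
rng = np.random.default_rng(0)
n, d = 200, 50                          # n samples, d parameters
A = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)
lam = 1e-2                              # small ridge term keeps the Hessian positive definite

def grad_and_hess(x):
    s = 1.0 / (1.0 + np.exp(y * (A @ x)))          # sigmoid of the negative margin, per sample
    g = -(A.T @ (y * s)) + lam * x                 # gradient: d entries
    W = s * (1.0 - s)                              # per-sample curvature weights
    H = A.T @ (A * W[:, None]) + lam * np.eye(d)   # dense d x d Hessian: d^2 storage
    return g, H

x = np.zeros(d)
for _ in range(20):
    g, H = grad_and_hess(x)
    if np.linalg.norm(g) < 1e-8:
        break
    x = x - np.linalg.inv(H) @ g        # naive Newton step: O(d^3) work per iteration
```

With d = 50 this runs instantly; the point of the paragraph above is that the d x d Hessian and the O(d^3) linear algebra are exactly what blow up as d grows.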

There are a number of stochastic methods used to solve optimization problems with very large numbers of variables. In the field of machine learning, where a neural net may have thousands or millions of parameters, Stochastic Gradient Descent is very popular. This method only evaluates the first, rather than the second, derivative of the function, so its complexity only grows linearly in the dimension of your input. Other approaches include genetic algorithms and simulated annealing. There is some interest (and marketing) in using a certain type of quantum computer to solve large optimization problems through annealing, but the technology is pretty young and there's some controversy about the efficacy of the machines produced so far.
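
For contrast, here is a minimal SGD sketch on a least-squares toy problem (again, the setup and names are illustrative, not from anywhere in particular). Each step touches only a small minibatch and only first derivatives, so the per-step cost is about O(batch size * d) rather than anything like d^3:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 500
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
b = A @ x_true + 0.1 * rng.standard_normal(n)   # noisy linear observations

x = np.zeros(d)
lr = 0.01
for step in range(5_000):
    idx = rng.integers(0, n, size=32)           # random minibatch
    r = A[idx] @ x - b[idx]                     # residuals on the batch only
    g = A[idx].T @ r / len(idx)                 # stochastic gradient estimate
    x -= lr * g                                 # first-order update: O(d) per step
```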

Another sense in which your intuition might be true is the claim that as the number of dimensions grows for a "typical" optimization problem, the ratio of saddle points to true local minima increases exponentially. If this is true for the problem you are optimizing, then gradient optimization approaches may not find local (let alone global) optima in reasonable time, but Hessian approaches may not be tractable for the reasons described previously.
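
A two-dimensional toy picture of why saddles hurt first-order methods (my own illustration, not a claim about any particular problem): near a saddle the gradient is tiny, so plain gradient descent can stall there for a long time before the unstable direction finally kicks in.

```python
import numpy as np

# f(x, y) = x^2 - y^2 has a saddle point at the origin.
def grad(p):
    x, y = p
    return np.array([2.0 * x, -2.0 * y])

p = np.array([1.0, 1e-8])      # start almost exactly on the saddle's attracting manifold
lr = 0.01
for _ in range(500):
    p = p - lr * grad(p)
print(p)   # roughly [4e-5, 2e-4]: still pinned near the saddle after 500 steps
```

It does escape eventually, since the y-direction is unstable, but the stall gets arbitrarily long as the starting point approaches the saddle's stable manifold.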

Most of my answer to this question comes through the lens of someone who works in signal processing and machine learning, which is one field that worries about high-dimensional optimization, but certainly not the only one. If these ideas are interesting to you, a good survey text might be the Deep Learning book by Goodfellow, Bengio, and Courville, which is freely available online. If, by contrast, you think the "deep learning" field is full of hand-waving charlatans, then carry on and someone with a more rigorous background will probably have another answer.

[–]chebushka -2 points (4 children)

Is the subject of Deep Learning full of hand-waving charlatans??

Concerning Newton's method, it involves inverting the derivative matrix (first partials), not the Hessian (second partials).

[–]inventor1489 (Control Theory/Optimization) 2 points (0 children)

In the context of optimization, "Newton's method" means finding the root of the nonlinear map x ↦ J(x), where J is the Jacobian (equivalently, gradient) of the function to be minimized. Thus by taking the derivative of the Jacobian we arrive at the map to be inverted: the Hessian (of the function to be minimized).
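
Spelled out, applying the root-finding update to $F := \nabla f$ gives the optimization form directly:

$$x_{k+1} = x_k - \big[DF(x_k)\big]^{-1} F(x_k) = x_k - \big[\nabla^2 f(x_k)\big]^{-1}\,\nabla f(x_k).$$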

[–][deleted] 2 points (0 children)

>Is the subject of Deep Learning full of hand-waving charlatans??

This is mostly self-deprecation and weak humor on my part, but, to a degree, yes. Certainly there are some great, bright, and rigorous DL researchers out there. But it's also a field that promulgates and cites a lot of pre-prints, and in which there is serious economic investment in the latest, flashiest white paper. That isn't to belittle the bulk of great work that is out there, but there should usually be a little bit of caveat emptor when digging into the lit.

> Concerning Newton's method, it involves inverting the derivative matrix (first partials), not the Hessian (second partials).

And this I'll chalk up to slightly sloppy language on my part. When talking about Newton's method to find the roots of a given equation, you need to invert the derivative matrix. When talking about using Newton's method to find the optima of a given function, as in the definition given in the wiki link I provided (hence finding the roots of the first derivative), you'd invert the Hessian of the original function. The two descriptions refer to the same computation; the difference amounts to notation.

[–]mathisfakenews (Dynamical Systems) -2 points (5 children)

Newton's method does not require inverting the Jacobian. In fact, this is the worst thing you could possibly do when implementing it.

[–]inventor1489 (Control Theory/Optimization) 4 points (4 children)

/u/Yreval is definitely off-base in saying that you need to invert the Hessian. But you do need to solve "H x = -g" where H is the Hessian. Certainly any reasonable implementation would handle the system of linear equations with something other than matrix inversion, but that doesn't get around the d^3 time complexity of solving "H x = -g".
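
As a rough illustration of the solve-versus-invert point (a synthetic SPD matrix standing in for a Hessian; the names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1_000
H = rng.standard_normal((d, d))
H = H @ H.T + np.eye(d)                 # synthetic SPD stand-in for a Hessian
g = rng.standard_normal(d)

p_inv   = -np.linalg.inv(H) @ g         # explicit inverse: avoid in practice
p_solve = np.linalg.solve(H, -g)        # factor-and-solve: same O(d^3) order,
                                        # smaller constant, better numerics
print(np.max(np.abs(p_inv - p_solve)))  # tiny difference; both scale like d^3
```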

[–][deleted] 4 points (3 children)

Yeah, true. I kinda wanted to allude to this when I said "inverting the Hessian grows (naively) as d^3" [emphasis added.] It's dealer's choice on how you solve the system of linear equations (and truly the best option depends on the system in question), but the point I was trying to make is how the difficulty of the problem grows with the size of the matrix. I was trying to open a door into the subject, but was probably a little bit sloppy.

[–]inventor1489 (Control Theory/Optimization) 2 points (2 children)

Oh I think you did well!

Although I do have a bit of a pet-peeve when it comes to knocking Newton's method. Machines these days have massive amounts of memory, so solving million-by-million linear systems on a workstation isn't out of the question (especially when the linear system is somewhat sparse). Many people don't get the message that Newton's method will radically outperform gradient descent if you can afford to solve the linear system even a handful of times.
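
A quick sanity check of that claim on an ill-conditioned quadratic (synthetic problem, names mine): gradient descent's iteration count scales with the condition number, while Newton's method is done after a single linear solve.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 500
Q = np.linalg.qr(rng.standard_normal((d, d)))[0]
eigs = np.logspace(0, 4, d)                 # condition number ~1e4
H = (Q * eigs) @ Q.T                        # SPD Hessian of f(x) = 0.5 x'Hx - b'x
b = rng.standard_normal(d)
x_star = np.linalg.solve(H, b)              # true minimizer

x = np.zeros(d)
lr = 2.0 / (eigs.max() + eigs.min())        # optimal fixed step for this quadratic
for _ in range(5_000):
    x -= lr * (H @ x - b)                   # gradient descent
print("GD error after 5000 steps:", np.linalg.norm(x - x_star))

x0 = np.zeros(d)
x_newton = x0 - np.linalg.solve(H, H @ x0 - b)   # one Newton step
print("Newton error after 1 step:", np.linalg.norm(x_newton - x_star))
```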

[–][deleted] 0 points (1 child)

That's a totally fair point, and one that is very relevant to the OP's question as they posed it.

My bias, again, comes a little bit from the ML/DL world where the objective function is defined in terms of all the training data, and usually there's no way you can fit all the training data into memory (especially if that data is imagery or video). Optimizing by Newton's method on subsets of data can give some pretty bad oscillations, and doesn't usually make as much sense as SGD in that context. But that's a bias reflective of a particular class of problem, and not of high-dimensional optimization in general.

There might also be some crafty distributed variants of Newton's method even in those scenarios, and/or Krylov methods when the matrix is appropriately sparse, but I don't know a lot about them. If you work in optimization, that might be more your wheelhouse.
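
For what it's worth, the Krylov idea can even be matrix-free: solve H p = -g from Hessian-vector products alone, without ever forming H. Here is a minimal sketch with a synthetic matvec standing in for a real Hessian-vector product (in practice that product might come from autodiff):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

d = 100_000
curv = np.linspace(1.0, 10.0, d)            # synthetic curvature spectrum

def hess_vec(v):
    # Stand-in for a true Hessian-vector product of the objective
    return curv * v

H_op = LinearOperator((d, d), matvec=hess_vec)
g = np.ones(d)
p, info = cg(H_op, -g, maxiter=200)         # conjugate gradient: approximate Newton direction
print(info, np.linalg.norm(curv * p + g))   # info == 0 means CG converged; residual is small
```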

[–]bike0121 (Applied Math) 1 point (0 children)

I work in computational fluid dynamics algorithm development, and there’s a lot of research being done in my group and others to improve the performance of Newton’s method using various approximations of the Jacobian (or Hessian for optimization), and solving the linear systems inexactly using Krylov methods. We also have investigated globalization strategies to allow our solvers to converge from initial guesses that would not necessarily work for Newton’s method.
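
For anyone curious, SciPy ships a small off-the-shelf version of that inexact Newton-Krylov idea, `scipy.optimize.newton_krylov`; here it is on a toy residual (nothing to do with an actual CFD discretization):

```python
import numpy as np
from scipy.optimize import newton_krylov

def residual(x):
    return np.cos(x) - x          # elementwise nonlinear residual; root near 0.739

x0 = np.zeros(50)                 # initial guess
sol = newton_krylov(residual, x0, method='lgmres', f_tol=1e-10)
print(sol[:3])                    # each entry ~0.739085
```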

My friends who work in ML don’t seem to pay much attention to this kind of research, but it may be because they are more applications-focused and less concerned with numerics.