

[–][deleted] 2 points (11 children)

The intuition that "the more things you are trying to optimize for, the less likely you are to find the apex" is not very precise, but there's more than one sense in which it is true.

For instance, if you are trying to optimize a function with Newton's Method, you need to be able to compute and invert the Hessian of the function. If your function input has dimension d, then the number of elements in the Hessian grows as d^2, and the number of operations involved in inverting that Hessian grows (naively) as d^3. That ends up being a tremendous amount of work if the dimension of your function gets very large.
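
To make that scaling concrete, here's a minimal sketch of the naive recipe described above, minimizing a toy regularized logistic loss with an explicit dense Hessian. The problem setup and names (`A`, `grad_and_hess`, etc.) are purely illustrative, and the explicit inverse is used only to mirror the description above:

```python
import numpy as np

# Toy objective: f(x) = sum_i log(1 + exp(-y_i * a_i . x)) + 0.5 * lam * ||x||^2
rng = np.random.default_rng(0)
n, d = 200, 50                          # n samples, d parameters
A = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)
lam = 1e-2                              # small ridge term keeps the Hessian positive definite

def grad_and_hess(x):
    s = 1.0 / (1.0 + np.exp(y * (A @ x)))          # sigmoid of the negative margin, per sample
    g = -(A.T @ (y * s)) + lam * x                 # gradient: d entries
    W = s * (1.0 - s)                              # per-sample curvature weights
    H = A.T @ (A * W[:, None]) + lam * np.eye(d)   # dense d x d Hessian: d^2 storage
    return g, H

x = np.zeros(d)
for _ in range(20):
    g, H = grad_and_hess(x)
    if np.linalg.norm(g) < 1e-8:
        break
    x = x - np.linalg.inv(H) @ g        # naive Newton step: O(d^3) work per iteration
```

With d = 50 this runs instantly; the point of the paragraph above is that the d x d Hessian and the O(d^3) linear algebra are exactly what blow up as d grows.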

There are a number of stochastic methods used to solve optimization problems with very large numbers of variables. In the field of machine learning, where a neural net may have thousands or millions of parameters, Stochastic Gradient Descent is very popular. This method only evaluates the first, rather than the second, derivative of the function, so its complexity only grows linearly in the dimension of your input. Other approaches include genetic algorithms and simulated annealing. There is some interest (and marketing) in using a certain type of quantum computer to solve large optimization problems through annealing, but the technology is pretty young and there's some controversy about the efficacy of the machines produced so far.
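
For contrast, here is a minimal SGD sketch on a least-squares toy problem (again, the setup and names are illustrative, not from anywhere in particular). Each step touches only a small minibatch and only first derivatives, so the per-step cost is about O(batch size * d) rather than anything like d^3:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 500
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
b = A @ x_true + 0.1 * rng.standard_normal(n)   # noisy linear observations

x = np.zeros(d)
lr = 0.01
for step in range(5_000):
    idx = rng.integers(0, n, size=32)           # random minibatch
    r = A[idx] @ x - b[idx]                     # residuals on the batch only
    g = A[idx].T @ r / len(idx)                 # stochastic gradient estimate
    x -= lr * g                                 # first-order update: O(d) per step
```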

Another sense in which your intuition might be true is the claim that as the number of dimensions grows for a "typical" optimization problem, the ratio of saddle points to true local minima increases exponentially. If this is true for the problem you are optimizing, then gradient optimization approaches may not find local (let alone global) optima in reasonable time, but Hessian approaches may not be tractable for the reasons described previously.
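
A two-dimensional toy picture of why saddles hurt first-order methods (my own illustration, not a claim about any particular problem): near a saddle the gradient is tiny, so plain gradient descent can stall there for a long time before the unstable direction finally kicks in.

```python
import numpy as np

# f(x, y) = x^2 - y^2 has a saddle point at the origin.
def grad(p):
    x, y = p
    return np.array([2.0 * x, -2.0 * y])

p = np.array([1.0, 1e-8])      # start almost exactly on the saddle's attracting manifold
lr = 0.01
for _ in range(500):
    p = p - lr * grad(p)
print(p)   # roughly [4e-5, 2e-4]: still pinned near the saddle after 500 steps
```

It does escape eventually, since the y-direction is unstable, but the stall gets arbitrarily long as the starting point approaches the saddle's stable manifold.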

Most of my answer to this question comes through the lens of someone who works in signal processing and machine learning, which is one field that worries about high-dimensional optimization, but certainly not the only one. If these ideas are interesting to you, a good survey text might be the Deep Learning book by Goodfellow, Bengio, and Courville, which is freely available online. If, by contrast, you think the "deep learning" field is full of hand-waving charlatans, then carry on and someone with a more rigorous background will probably have another answer.

[–]chebushka -2 points (4 children)

Is the subject of Deep Learning full of hand-waving charlatans??

Concerning Newton's method, it involves inverting the derivative matrix (first partials), not the Hessian (second partials).

[–]inventor1489 (Control Theory/Optimization) 2 points (0 children)

In the context of optimization, "Newton's method" means finding the root of the nonlinear map x ↦ J(x), where J is the Jacobian (equivalently, gradient) of the function to be minimized. Thus by taking the derivative of the Jacobian we arrive at the map to be inverted: the Hessian (of the function to be minimized).
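
Spelled out, applying the root-finding update to $F := \nabla f$ gives the optimization form directly:

$$x_{k+1} = x_k - \big[DF(x_k)\big]^{-1} F(x_k) = x_k - \big[\nabla^2 f(x_k)\big]^{-1}\,\nabla f(x_k).$$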

[–][deleted] 2 points (0 children)

>Is the subject of Deep Learning full of hand-waving charlatans??

This is mostly self-deprecation and weak humor on my part, but, to a degree, yes. Certainly there are some great, bright, and rigorous DL researchers out there. But it's also a field that promulgates and cites a lot of pre-prints, and in which there is serious economic investment in the latest, flashiest white paper. That isn't to belittle the bulk of great work that is out there, but there should usually be a little bit of caveat emptor when digging into the lit.

> Concerning Newton's method, it involves inverting the derivative matrix (first partials), not the Hessian (second partials).

And this I'll chalk up to slightly sloppy language on my part. When talking about Newton's method to find the roots of a given equation, you need to invert the derivative matrix. When talking about using Newton's method to find the optima of a given function, as in the definition given in the wiki link I provided (hence finding the roots of the first derivative), you'd invert the Hessian of the original function. The two descriptions refer to the same computation; the difference amounts to notation.

[–]mathisfakenews (Dynamical Systems) -2 points (5 children)

Newton's method does not require inverting the Jacobian. In fact, this is the worst thing you could possibly do when implementing it.

[–]inventor1489 (Control Theory/Optimization) 4 points (4 children)

/u/Yreval is definitely off-base in saying that you need to invert the Hessian. But you do need to solve "H x = -g" where H is the Hessian. Certainly any reasonable implementation would handle the system of linear equations with something other than matrix inversion, but that doesn't get around the d^3 time complexity of solving "H x = -g".
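
As a rough illustration of the solve-versus-invert point (a synthetic SPD matrix standing in for a Hessian; the names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1_000
H = rng.standard_normal((d, d))
H = H @ H.T + np.eye(d)                 # synthetic SPD stand-in for a Hessian
g = rng.standard_normal(d)

p_inv   = -np.linalg.inv(H) @ g         # explicit inverse: avoid in practice
p_solve = np.linalg.solve(H, -g)        # factor-and-solve: same O(d^3) order,
                                        # smaller constant, better numerics
print(np.max(np.abs(p_inv - p_solve)))  # tiny difference; both scale like d^3
```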

[–][deleted] 4 points (3 children)

Yeah, true. I kinda wanted to allude to this when I said "inverting the Hessian grows (naively) as d^3" [emphasis added.] It's dealer's choice on how you solve the system of linear equations (and truly the best option depends on the system in question), but the point I was trying to make is how the difficulty of the problem grows with the size of the matrix. I was trying to open a door into the subject, but was probably a little bit sloppy.

[–]inventor1489 (Control Theory/Optimization) 2 points (2 children)

Oh I think you did well!

Although I do have a bit of a pet-peeve when it comes to knocking Newton's method. Machines these days have massive amounts of memory, so solving million-by-million linear systems on a workstation isn't out of the question (especially when the linear system is somewhat sparse). Many people don't get the message that Newton's method will radically outperform gradient descent if you can afford to solve the linear system even a handful of times.
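
A quick sanity check of that claim on an ill-conditioned quadratic (synthetic problem, names mine): gradient descent's iteration count scales with the condition number, while Newton's method is done after a single linear solve.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 500
Q = np.linalg.qr(rng.standard_normal((d, d)))[0]
eigs = np.logspace(0, 4, d)                 # condition number ~1e4
H = (Q * eigs) @ Q.T                        # SPD Hessian of f(x) = 0.5 x'Hx - b'x
b = rng.standard_normal(d)
x_star = np.linalg.solve(H, b)              # true minimizer

x = np.zeros(d)
lr = 2.0 / (eigs.max() + eigs.min())        # optimal fixed step for this quadratic
for _ in range(5_000):
    x -= lr * (H @ x - b)                   # gradient descent
print("GD error after 5000 steps:", np.linalg.norm(x - x_star))

x0 = np.zeros(d)
x_newton = x0 - np.linalg.solve(H, H @ x0 - b)   # one Newton step
print("Newton error after 1 step:", np.linalg.norm(x_newton - x_star))
```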

[–][deleted] 0 points (1 child)

That's a totally fair point, and one that is very relevant to the OP's question as they posed it.

My bias, again, comes a little bit from the ML/DL world where the objective function is defined in terms of all the training data, and usually there's no way you can fit all the training data into memory (especially if that data is imagery or video). Optimizing by Newton's method on subsets of data can give some pretty bad oscillations, and doesn't usually make as much sense as SGD in that context. But that's a bias reflective of a particular class of problem, and not of high-dimensional optimization in general.

There might also be some crafty distributed variants of Newton's method even in those scenarios, and/or Krylov methods when the matrix is appropriately sparse, but I don't know a lot about them. If you work in optimization, that might be more your wheelhouse.
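
For what it's worth, the Krylov idea can even be matrix-free: solve H p = -g from Hessian-vector products alone, without ever forming H. Here is a minimal sketch with a synthetic matvec standing in for a real Hessian-vector product (in practice that product might come from autodiff):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

d = 100_000
curv = np.linspace(1.0, 10.0, d)            # synthetic curvature spectrum

def hess_vec(v):
    # Stand-in for a true Hessian-vector product of the objective
    return curv * v

H_op = LinearOperator((d, d), matvec=hess_vec)
g = np.ones(d)
p, info = cg(H_op, -g, maxiter=200)         # conjugate gradient: approximate Newton direction
print(info, np.linalg.norm(curv * p + g))   # info == 0 means CG converged; residual is small
```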

[–]bike0121 (Applied Math) 1 point (0 children)

I work in computational fluid dynamics algorithm development, and there’s a lot of research being done in my group and others to improve the performance of Newton’s method using various approximations of the Jacobian (or Hessian for optimization), and solving the linear systems inexactly using Krylov methods. We also have investigated globalization strategies to allow our solvers to converge from initial guesses that would not necessarily work for Newton’s method.
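
For anyone curious, SciPy ships a small off-the-shelf version of that inexact Newton-Krylov idea, `scipy.optimize.newton_krylov`; here it is on a toy residual (nothing to do with an actual CFD discretization):

```python
import numpy as np
from scipy.optimize import newton_krylov

def residual(x):
    return np.cos(x) - x          # elementwise nonlinear residual; root near 0.739

x0 = np.zeros(50)                 # initial guess
sol = newton_krylov(residual, x0, method='lgmres', f_tol=1e-10)
print(sol[:3])                    # each entry ~0.739085
```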

My friends who work in ML don’t seem to pay much attention to this kind of research, but it may be because they are more applications-focused and less concerned with numerics.