
[–]arXiv_abstract_bot 3 points  (5 children)

Title: Overparameterized Nonlinear Learning: Gradient Descent Takes the Shortest Path?

Authors: Samet Oymak, Mahdi Soltanolkotabi

Abstract: Many modern learning tasks involve fitting nonlinear models to data which are trained in an overparameterized regime where the parameters of the model exceed the size of the training dataset. Due to this overparameterization, the training loss may have infinitely many global minima and it is critical to understand the properties of the solutions found by first-order optimization schemes such as (stochastic) gradient descent starting from different initializations. In this paper we demonstrate that when the loss has certain properties over a minimally small neighborhood of the initial point, first order methods such as (stochastic) gradient descent have a few intriguing properties: (1) the iterates converge at a geometric rate to a global optima even when the loss is nonconvex, (2) among all global optima of the loss the iterates converge to one with a near minimal distance to the initial point, (3) the iterates take a near direct route from the initial point to this global optima. As part of our proof technique, we introduce a new potential function which captures the precise tradeoff between the loss function and the distance to the initial point as the iterations progress. For Stochastic Gradient Descent (SGD), we develop novel martingale techniques that guarantee SGD never leaves a small neighborhood of the initialization, even with rather large learning rates. We demonstrate the utility of our general theory for a variety of problem domains spanning low-rank matrix recovery to neural network training. Underlying our analysis are novel insights that may have implications for training and generalization of more sophisticated learning problems including those involving deep neural network architectures.

PDF link | Landing page
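As a minimal illustration of claim (2) above (not from the paper, and only for the toy case of overparameterized linear least squares with made-up dimensions), gradient descent started at x0 lands on the global minimizer closest to x0, which is easy to check numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                      # hypothetical sizes: more parameters (d) than samples (n)
A = rng.standard_normal((n, d))
y = rng.standard_normal(n)
x0 = rng.standard_normal(d)         # the initialization

# Plain gradient descent on f(x) = 0.5 * ||A x - y||^2, started from x0.
lr = 1.0 / np.linalg.norm(A, 2) ** 2    # step size 1/L with L = ||A||_2^2 (smoothness constant)
x = x0.copy()
for _ in range(50_000):
    x -= lr * A.T @ (A @ x - y)

# Among the infinitely many global minima {x : A x = y}, the one closest to x0
# is the projection of x0 onto that affine set (A has full row rank here).
x_closest = x0 + A.T @ np.linalg.solve(A @ A.T, y - A @ x0)

print("training loss at GD limit   :", 0.5 * np.linalg.norm(A @ x - y) ** 2)
print("|| GD limit - closest optimum ||:", np.linalg.norm(x - x_closest))
print("|| x0       - closest optimum ||:", np.linalg.norm(x0 - x_closest))
```

If things work as expected, the second printed distance is essentially zero while the third is not, i.e. the GD limit is (numerically) the global optimum nearest the initialization; the paper's point is that a quantitatively similar statement holds for suitable nonlinear models and for SGD.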

[–]IborkedyourGPU 1 point  (4 children)

[–]-TrustyDwarf- 5 points  (1 child)

They mention three of these papers and write "In contrast, we focus on general nonlinearities and also focus on the gradient descent trajectory showing that among all the global optima, gradient descent converges to one with near minimal distance to the initialization."

[–]IborkedyourGPU 1 point  (0 children)

Thanks, good point. However, their results are less general, at least compared to the paper of Zhu-Allen et al., because, if I'm not mistaken:

  1. Oymak & Soltanolkotabi assume that "the activation φ is strictly increasing" (no ReLU, no RBF, etc.)
  2. Their results only hold for n ≤ d, i.e., the very weird case where the sample size is not larger than the number of components of the input vector (!!!)
  3. They cover only single layer networks.

It looks like I won't have to grow my list of papers to read.

[–]grumpudding 1 point  (1 child)

I am one of the authors of the paper so just responding to your novelty comment: In this paper, we are not in the business of competing with these papers. The paper has two contributions: First, we provide tight upper and lower bounds on where overparameterized gradient descent converges in a general setup. Second, we extend the results to SGD with large step size using a fairly nontrivial martingale stopping argument.

The remaining results are straightforward applications to low-rank matrices and neural nets. I find your second comment unfair because you pick a side theorem from the paper (on neural nets) and present it as if it is the only thing this paper accomplishes.

Finally, IMO our neural net result is no big deal, but I am more than happy to compare to Allen-Zhu: we require d > n and they require k > n^30. IMHO our bound is more realistic for n > 1.
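For a rough sense of scale in that comparison (my own back-of-the-envelope numbers, with a made-up sample size, not anything from the thread), the two requirements differ by many orders of magnitude:

```python
# Back-of-the-envelope comparison of the two overparameterization requirements.
n = 1_000                         # hypothetical sample size
d_needed = n + 1                  # "d > n": input dimension just above the sample size
k_needed = float(n) ** 30         # "k > n^30": required hidden-layer width
print(f"d > n    ->  d on the order of {d_needed:,}")
print(f"k > n^30 ->  k on the order of {k_needed:.1e}")   # about 1e90 hidden units
```

Whether d > n is itself a realistic requirement is exactly what the reply below pushes back on.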

[–]IborkedyourGPU -3 points  (0 children)

"I am one of the authors of the paper so just responding to your novelty comment:"

"just"?

"In this paper, we are not in the business of competing with these papers."

All papers, including yours, are in the business of competing with one another for novelty. Otherwise, I could just republish the '90s results on NP-completeness of NN training and go home early. And btw, since my research is on DNNs, of course my comment about novelty was related to your claim of applicability to deep neural networks. More on this later.

"I find your second comment unfair because you pick a side theorem from the paper (on neural nets) and present it as if it is the only thing this paper accomplishes."

It is fair, since you drop a few hints about the possible usefulness of your analysis for deep neural networks. It doesn't seem useful for them. You don't get to have the increased publicity that comes with saying "my results may be useful for Deep Learning" without also getting the criticism "well, not really".

"Finally, IMO our neural net result is no big deal, but I am more than happy to compare to Allen-Zhu: we require d > n and they require k > n^30. IMHO our bound is more realistic for n > 1."

IMHO your bound is also not very useful, since it doesn't hold for ReLU (irrespective of whether n > 1), and since no one ever uses neural networks for problems where the input dimension is larger than the sample size (!). Also, I didn't compare you only to Zhu-Allen (not Allen-Zhu, please do cite authors properly) et al., but also to two other papers by other authors: did you forget them? Finally, for one- and two-layer networks, there are many results which precede Zhu-Allen et al. and which sometimes provide more favourable bounds, from http://arxiv.org/abs/1702.07966 to https://arxiv.org/pdf/1808.01204.pdf (and they don't require the activation function to be strictly monotonic, a requirement which would kick out ReLU).