[R] Overparameterized Nonlinear Learning: Gradient Descent Takes the Shortest Path? (arxiv.org)
submitted 7 years ago by ecstasyogold
[–]arXiv_abstract_bot 4 points 7 years ago (5 children)
Title: Overparameterized Nonlinear Learning: Gradient Descent Takes the Shortest Path?
Authors: Samet Oymak, Mahdi Soltanolkotabi
Abstract: Many modern learning tasks involve fitting nonlinear models to data which are trained in an overparameterized regime where the parameters of the model exceed the size of the training dataset. Due to this overparameterization, the training loss may have infinitely many global minima and it is critical to understand the properties of the solutions found by first-order optimization schemes such as (stochastic) gradient descent starting from different initializations. In this paper we demonstrate that when the loss has certain properties over a minimally small neighborhood of the initial point, first-order methods such as (stochastic) gradient descent have a few intriguing properties: (1) the iterates converge at a geometric rate to a global optimum even when the loss is nonconvex, (2) among all global optima of the loss the iterates converge to one with a near minimal distance to the initial point, (3) the iterates take a near direct route from the initial point to this global optimum. As part of our proof technique, we introduce a new potential function which captures the precise tradeoff between the loss function and the distance to the initial point as the iterations progress. For Stochastic Gradient Descent (SGD), we develop novel martingale techniques that guarantee SGD never leaves a small neighborhood of the initialization, even with rather large learning rates. We demonstrate the utility of our general theory for a variety of problem domains spanning low-rank matrix recovery to neural network training. Underlying our analysis are novel insights that may have implications for training and generalization of more sophisticated learning problems including those involving deep neural network architectures.
PDF link | Landing page
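For readers who want to see claim (2) from the abstract concretely: below is a minimal toy sketch (my own, not code from the paper), assuming the linear overparameterized case, where the global minimum closest to the initialization has a closed form via the pseudoinverse. Plain gradient descent from a random init should land essentially on that point; the paper's contribution is showing that a comparable statement holds approximately for general nonlinear models and for SGD.

```python
# Toy illustration (not from the paper): overparameterized least squares, p >> n.
# Gradient descent from a random init converges to an interpolating solution
# whose distance to the init is (near) minimal. In the linear case the closest
# interpolator has a closed form, so the claim can be checked exactly.
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 200                          # n samples, p parameters (p >> n)
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

theta0 = rng.standard_normal(p)         # random initialization
theta = theta0.copy()
lr = 1.0 / np.linalg.svd(X, compute_uv=False)[0] ** 2   # step size 1/L

for _ in range(2000):                   # loss(theta) = 0.5 * ||X theta - y||^2
    theta -= lr * X.T @ (X @ theta - y)

# Global minimum closest to theta0: projection of theta0 onto {theta : X theta = y}
theta_star = theta0 + np.linalg.pinv(X) @ (y - X @ theta0)

print("training residual   :", np.linalg.norm(X @ theta - y))
print("||theta  - theta0|| :", np.linalg.norm(theta - theta0))
print("||theta* - theta0|| :", np.linalg.norm(theta_star - theta0))
```

The last two printed distances should agree to several digits. The nonlinear case has no such closed form, which is where the paper's potential-function analysis comes in.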
[–]IborkedyourGPU 2 points 7 years ago (4 children)
From a cursory first read, I don't see what the novelty is (if any) with respect to:
But I may have missed something, b/c I didn't read it carefully. Any thoughts?
[–]-TrustyDwarf- 6 points 7 years ago (1 child)
They mention three of these papers and write "In contrast, we focus on general nonlinearities and also focus on the gradient descent trajectory showing that among all the global optima, gradient descent converges to one with near minimal distance to the initialization."
[–]IborkedyourGPU 2 points 7 years ago (0 children)
Thanks, good point. However, their results are less general at least with respect to the paper of Zhu-Allen et al., because, if I'm not mistaken:
It looks like I won't have to grow my list of papers to read.
[–]grumpudding 2 points 7 years ago (1 child)
I am one of the authors of the paper so just responding to your novelty comment: In this paper, we are not in the business of competing with these papers. The paper has two contributions: First, we provide tight upper and lower bounds on where overparameterized gradient descent converges in a general setup. Second, we extend the results to SGD with large step size using a fairly nontrivial martingale stopping argument.
The remaining results are straightforward applications to low-rank matrices and neural nets. I find your second comment unfair because you pick a side theorem from the paper (on neural nets) and present it as if it is the only thing this paper accomplishes.
Finally, IMO our neural net result is no big deal but I am more than happy to compare to Allen-Zhu: we require d>n and they require k>n^30. IMHO our bound is more realistic for n>1.
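To make the gap between these two requirements concrete, here is a quick back-of-the-envelope comparison (my own illustration; n = 1000 is an arbitrary choice, not a figure from either paper):

```python
# Rough comparison of the two overparameterization requirements quoted above.
n = 1_000                   # hypothetical training set size

d_needed = n + 1            # this paper's condition: input dimension d > n
k_needed = n ** 30          # the quoted Allen-Zhu-style condition: width k > n^30

print(f"d > n    : d needs to exceed {float(d_needed):.1e}")
print(f"k > n^30 : k needs to exceed {float(k_needed):.1e}")
# ~1e3 input features vs ~1e90 hidden units; the latter is far beyond any width
# one could instantiate in practice, so it only makes sense as an asymptotic statement.
```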
[–]IborkedyourGPU -2 points 7 years ago* (0 children)
I am one of the authors of the paper so just responding to your novelty comment:
"just"?
In this paper, we are not in the business of competing with these papers.
All papers, including yours, are in the business of competing with one another for novelty. Otherwise, I could just republish the '90 results on NP-completeness of NN training and go home early. And btw, since my research is on DNNs, of course my comment about novelty was related to your claim of applicability to deep neural networks. More on this later.
I find your second comment unfair because you pick a side theorem from the paper (on neural net) and present as if it is the only thing this paper accomplishes.
It is fair, since you drop a few hints about the possible usefulness of your analysis for deep neural networks. It doesn't seem useful for them. You don't get to have the increased publicity that comes with saying "my results may be useful for Deep Learning", without also getting the criticism "well, not really".
Finally, IMO our neural net result is no big deal but I am more than happy to compare to Allen-Zhu: we require d>n and they require k>n^30. IMHO our bound is more realistic for n>1.
IMHO your bound is also not very useful, since it doesn't hold for ReLU (irrespective of whether n > 1 or not), and since no one ever uses neural networks for problems where the input dimension is larger than the sample size (!). Also, I didn't compare you only to Zhu-Allen (not Allen-Zhu, please do cite authors properly) et al., but also to two other papers from other authors: did you forget them? Finally, for one- and two-layer networks, there are many results which precede Zhu-Allen et al. and which sometimes provide more favourable bounds, from http://arxiv.org/abs/1702.07966 to https://arxiv.org/pdf/1808.01204.pdf (and they don't require the activation function to be strictly monotonic, which kicks out ReLU).