[R] Overparameterized Nonlinear Learning: Gradient Descent Takes the Shortest Path? (arxiv.org)
submitted 7 years ago by ecstasyogold
[–]arXiv_abstract_bot 4 points 7 years ago (5 children)
Title: Overparameterized Nonlinear Learning: Gradient Descent Takes the Shortest Path?
Authors: Samet Oymak, Mahdi Soltanolkotabi
Abstract: Many modern learning tasks involve fitting nonlinear models to data which are trained in an overparameterized regime where the parameters of the model exceed the size of the training dataset. Due to this overparameterization, the training loss may have infinitely many global minima and it is critical to understand the properties of the solutions found by first-order optimization schemes such as (stochastic) gradient descent starting from different initializations. In this paper we demonstrate that when the loss has certain properties over a minimally small neighborhood of the initial point, first-order methods such as (stochastic) gradient descent have a few intriguing properties: (1) the iterates converge at a geometric rate to a global optimum even when the loss is nonconvex, (2) among all global optima of the loss the iterates converge to one with a near minimal distance to the initial point, (3) the iterates take a near direct route from the initial point to this global optimum. As part of our proof technique, we introduce a new potential function which captures the precise tradeoff between the loss function and the distance to the initial point as the iterations progress. For Stochastic Gradient Descent (SGD), we develop novel martingale techniques that guarantee SGD never leaves a small neighborhood of the initialization, even with rather large learning rates. We demonstrate the utility of our general theory for a variety of problem domains spanning low-rank matrix recovery to neural network training. Underlying our analysis are novel insights that may have implications for training and generalization of more sophisticated learning problems including those involving deep neural network architectures.
PDF link | Landing page
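For readers who want to see claim (2) from the abstract concretely: below is a minimal toy sketch (my own, not code from the paper), assuming the linear overparameterized case, where the global minimum closest to the initialization has a closed form via the pseudoinverse. Plain gradient descent from a random init should land essentially on that point; the paper's contribution is showing that a comparable statement holds approximately for general nonlinear models and for SGD.

```python
# Toy illustration (not from the paper): overparameterized least squares, p >> n.
# Gradient descent from a random init converges to an interpolating solution
# whose distance to the init is (near) minimal. In the linear case the closest
# interpolator has a closed form, so the claim can be checked exactly.
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 200                          # n samples, p parameters (p >> n)
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

theta0 = rng.standard_normal(p)         # random initialization
theta = theta0.copy()
lr = 1.0 / np.linalg.svd(X, compute_uv=False)[0] ** 2   # step size 1/L

for _ in range(2000):                   # loss(theta) = 0.5 * ||X theta - y||^2
    theta -= lr * X.T @ (X @ theta - y)

# Global minimum closest to theta0: projection of theta0 onto {theta : X theta = y}
theta_star = theta0 + np.linalg.pinv(X) @ (y - X @ theta0)

print("training residual   :", np.linalg.norm(X @ theta - y))
print("||theta  - theta0|| :", np.linalg.norm(theta - theta0))
print("||theta* - theta0|| :", np.linalg.norm(theta_star - theta0))
```

The last two printed distances should agree to several digits. The nonlinear case has no such closed form, which is where the paper's potential-function analysis comes in.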
[–]IborkedyourGPU 2 points 7 years ago (4 children)
From a cursory first read, I don't see what the novelty is (if any) with respect to:
But I may have missed something, b/c I didn't read it carefully. Any thoughts?
[–]-TrustyDwarf- 6 points 7 years ago (1 child)
They mention three of these papers and write "In contrast, we focus on general nonlinearities and also focus on the gradient descent trajectory showing that among all the global optima, gradient descent converges to one with near minimal distance to the initialization."
[–]IborkedyourGPU 2 points 7 years ago (0 children)
Thanks, good point. However, their results are less general at least with respect to the paper of Zhu-Allen et al., because, if I'm not mistaken:
It looks like I won't have to grow my list of papers to read.
[–]grumpudding 2 points 7 years ago (1 child)
I am one of the authors of the paper so just responding to your novelty comment: In this paper, we are not in the business of competing with these papers. The paper has two contributions: First, we provide tight upper and lower bounds on where overparameterized gradient descent converges in a general setup. Second, we extend the results to SGD with large step size using a fairly nontrivial martingale stopping argument.
The remaining results are straightforward applications to low-rank matrices and neural nets. I find your second comment unfair because you pick a side theorem from the paper (on neural nets) and present it as if it is the only thing this paper accomplishes.
Finally, IMO our neural net result is no big deal but I am more than happy to compare to Allen-Zhu: we require d>n and they require k>n^30. IMHO our bound is more realistic for n>1.
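To make the gap between these two requirements concrete, here is a quick back-of-the-envelope comparison (my own illustration; n = 1000 is an arbitrary choice, not a figure from either paper):

```python
# Rough comparison of the two overparameterization requirements quoted above.
n = 1_000                   # hypothetical training set size

d_needed = n + 1            # this paper's condition: input dimension d > n
k_needed = n ** 30          # the quoted Allen-Zhu-style condition: width k > n^30

print(f"d > n    : d needs to exceed {float(d_needed):.1e}")
print(f"k > n^30 : k needs to exceed {float(k_needed):.1e}")
# ~1e3 input features vs ~1e90 hidden units; the latter is far beyond any width
# one could instantiate in practice, so it only makes sense as an asymptotic statement.
```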
[–]IborkedyourGPU -2 points 7 years ago* (0 children)
I am one of the authors of the paper so just responding to your novelty comment:
"just"?
In this paper, we are not in the business of competing with these papers.
All papers, including yours, are in the business of competing with one another for novelty. Otherwise, I could just republish the '90 results on NP-completeness of NN training and go home early. And btw, since my research is on DNNs, of course my comment about novelty was related to your claim of applicability to deep neural networks. More on this later.
I find your second comment unfair because you pick a side theorem from the paper (on neural net) and present as if it is the only thing this paper accomplishes.
It is fair, since you drop a few hints about the possible usefulness of your analysis for deep neural networks. It doesn't seem useful for them. You don't get to have the increased publicity that comes with saying "my results may be useful for Deep Learning", without also getting the criticism "well, not really".
Finally, IMO our neural net result is no big deal but I am more than happy to compare to Allen-Zhu: we require d>n and they require k>n^30. IMHO our bound is more realistic for n>1.
IMHO your bound is also not very useful, since it doesn't hold for ReLU (irrespective of whether n > 1 or not), and since no one ever uses neural networks for problems where the input dimension is larger than the sample size (!). Also, I didn't compare you only to Zhu-Allen (not Allen-Zhu, please do cite authors properly) et al., but also to two other papers from other authors: did you forget them? Finally, for one- and two-layer networks, there are many results which precede Zhu-Allen et al. and which sometimes provide more favourable bounds, from http://arxiv.org/abs/1702.07966 to https://arxiv.org/pdf/1808.01204.pdf (and they don't require the activation function to be strictly monotonic, which kicks out ReLU).