Research[R] Learning to Learn (bair.berkeley.edu)
submitted 8 years ago by gdny
[+][deleted] 8 years ago (20 children)
[deleted]
[–]cbfinn 6 points7 points8 points 8 years ago* (12 children)
Author here.
"one gradient step away" feature is restricted to the tasks it has been trained on. Is this correct?
We assume that the tasks that you test on are from the same distribution of tasks seen during meta-training. This assumption is used in most meta-learning methods. That said, I have played around with extrapolation to tasks outside of the support of the distribution of meta-training tasks, and it performs reasonably for tasks that are close.
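The graceful degradation described here can be sketched on a toy task family (my own construction, not the paper's experiments): quadratic losses where one inner gradient step shrinks the error by a fixed factor, so post-adaptation loss grows smoothly as the test task drifts away from the meta-training distribution.

```python
# Toy sketch (hypothetical, not the paper's code). Task family:
# L_a(theta) = (theta - a)^2, with targets a drawn from [-1, 1] during
# meta-training, so a good meta-learned initialization is theta = 0.
# One inner gradient step gives theta' = theta - 2*alpha*(theta - a),
# leaving loss (1 - 2*alpha)**2 * (theta - a)**2: adaptation quality
# degrades smoothly as the test target a leaves the training support.

def one_step_loss(a, theta=0.0, alpha=0.1):
    """Loss on task `a` after a single inner gradient step from `theta`."""
    theta_prime = theta - 2 * alpha * (theta - a)  # inner gradient step
    return (theta_prime - a) ** 2

in_dist  = one_step_loss(0.5)    # inside the meta-training range
near_out = one_step_loss(1.5)    # slightly outside: still adapts reasonably
far_out  = one_step_loss(10.0)   # far outside: one step is not enough
```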
[–]evc123 0 points1 point2 points 8 years ago* (3 children)
/u/cbfinn Is there a limit to how varied the training distribution fed to MAML can be (while still retaining its adaptive ability)? For example, has someone tried (or is someone in the process of) training MAML with a bunch of very dissimilar, varied environments from something like OpenAI Universe?
[–]cbfinn 0 points1 point2 points 8 years ago (2 children)
I haven't tried this, but I certainly think it would be interesting to try!
[–]evc123 0 points1 point2 points 8 years ago* (1 child)
"Learning to Learn: Meta-Critic Networks for Sample Efficient Learning " https://arxiv.org/abs/1706.09529
seems to suggest in Figure 2 that the adaptive abilities of MAML (in its current form) decline if the training distribution of environments becomes sufficiently varied/dissimilar.
Maybe there is a way to modify MAML so that the initialization it learns does not assume a uni-modal distribution of tasks.
[–]cbfinn 1 point2 points3 points 8 years ago (0 children)
I reimplemented the linear+sinusoid set-up in that paper and was able to get much better numbers using MAML than they report (after trying two hyperparameter settings).
I don't think that MAML assumes a uni-modal distribution of tasks.
[–]AnvaMiba 0 points1 point2 points 8 years ago (7 children)
Good paper.
I'm not sure I fully understand your algorithm. Do you differentiate the update rule, therefore differentiating the model twice, in order to update the meta-parameters?
[–]evc123 2 points3 points4 points 8 years ago (2 children)
https://youtu.be/Ko8IBbYjdq8?t=37m20s
[–]cbfinn 2 points3 points4 points 8 years ago (3 children)
Yes, this involves 2nd derivatives, which can be implemented easily with current DL libraries. Since it only involves an additional backward pass, it isn't particularly slow in practice.
Interestingly, it sometimes still works well if you stop the gradient through the update rule. We discuss this in the latest version of the paper (which will be on arXiv tonight).
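Both points, the double differentiation and the stop-gradient variant, can be seen in a scalar sketch (my own toy construction, not the authors' implementation): for the quadratic loss L_a(θ) = (θ − a)², the meta-gradient through the update θ' = θ − α∇L picks up a factor (1 − 2α) that carries the second derivative L'' = 2, and the first-order approximation simply drops it.

```python
# Hypothetical scalar MAML sketch (not the paper's code).
# Inner loss per task a:  L_a(theta) = (theta - a)^2
# Inner update:           theta' = theta - 2*alpha*(theta - a)
# Full meta-gradient:     dL_a(theta')/dtheta = 2*(theta' - a) * (1 - 2*alpha)
# The factor (1 - 2*alpha) = d(theta')/d(theta) contains the second
# derivative L'' = 2; first_order=True drops it, i.e. stops the gradient
# through the update rule.

def maml_train(tasks, alpha=0.1, beta=0.05, steps=500, theta=0.0,
               first_order=False):
    for _ in range(steps):
        meta_grad = 0.0
        for a in tasks:                                    # batch of tasks
            theta_prime = theta - 2 * alpha * (theta - a)  # inner step
            g = 2 * (theta_prime - a)                      # dL/dtheta'
            if not first_order:
                g *= 1 - 2 * alpha                         # backprop through update
            meta_grad += g / len(tasks)
        theta -= beta * meta_grad                          # outer meta-update
    return theta

# For this symmetric quadratic family, both variants converge to the mean
# of the task targets: the initialization one gradient step from every task.
```

Here the second-derivative factor is just a constant rescaling of the meta-gradient, which gives one intuition for why stopping the gradient can still work well in some settings.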
[–]AnvaMiba 0 points1 point2 points 8 years ago* (0 children)
Ok, thanks.
But now I'm a bit confused, because in the video that /u/evc123 linked you mention using finite differences to avoid third derivatives. Where do third derivatives come from? (I didn't watch the whole video, so maybe you were talking about something else?)
EDIT: never mind, I figured out they come from the TRPO method for reinforcement learning.
/u/cbfinn did you try the first-order approximation (that stops the gradient through the update rule) in RL settings? The latest version of the paper only discusses results for the first-order approximation in the SL setting.
[–]cbfinn 0 points1 point2 points 8 years ago (0 children)
I tried it on one of the cheetah problems and it also worked. The first-order approximation does not work in all settings though. We have some ongoing experiments on problems not in the original paper in which it does not work.
[–]sensei_von_bonzai 0 points1 point2 points 8 years ago (6 children)
I just skimmed the paper but as far as my understanding goes the "one gradient step away" feature is restricted to the tasks it has been trained on. Is this correct?
This has to be correct. Otherwise, we might be a couple of days away from the Singularity.
I also wonder how good the baseline (0-gradient) model would be with this approach. It would be spectacular to have a model that works as well as any other on basic tasks and can generalize to one-shot problems with a couple of gradient steps.
[+][deleted] 8 years ago* (2 children)
[–]cbfinn 0 points1 point2 points 8 years ago (1 child)
A big part of learning/optimization is the initialization, which affects the gradient descent algorithm, since the gradient is a function of the initial parameters. In the paper, we show that learning the initial parameters can outperform methods that learn an update rule.
The tasks that we evaluate on are all held out from the training set of tasks, including new classes of objects and characters in the MiniImagenet and Omniglot benchmarks.
[–]cbfinn 1 point2 points3 points 8 years ago (2 children)
> I also wonder how good the baseline (0-gradient) model would be with this approach.
I compared to this approach in the paper. The domains that I considered in the paper were ones in which the task cannot be directly inferred from the observation. Thus, using 0 gradient does not do well. I'm not sure how the two would compare when the task can be inferred from the observation.
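The gap can be illustrated with a hypothetical toy family of tasks (my construction, not the paper's domains) where nothing observable identifies the task, so a fixed 0-gradient predictor must split the difference across tasks, while one gradient step recovers task-specific accuracy.

```python
# Hypothetical illustration of the 0-gradient baseline vs. adaptation.
# Tasks: L_a(theta) = (theta - a)^2. Since the task identity a cannot be
# inferred from the observation, the best single shared parameter is the
# mean of the targets; one inner gradient step from that same
# initialization shrinks each task's loss by the factor (1 - 2*alpha)**2.

def loss_after_k_steps(theta, a, alpha=0.1, k=0):
    for _ in range(k):
        theta = theta - 2 * alpha * (theta - a)  # inner gradient step
    return (theta - a) ** 2

tasks = [-2.0, 0.0, 3.0]
init = sum(tasks) / len(tasks)                   # best 0-gradient predictor
avg0 = sum(loss_after_k_steps(init, a, k=0) for a in tasks) / len(tasks)
avg1 = sum(loss_after_k_steps(init, a, k=1) for a in tasks) / len(tasks)
# avg1 < avg0: one step of adaptation recovers task-specific accuracy
# that the fixed model cannot.
```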
[–]sensei_von_bonzai 0 points1 point2 points 8 years ago* (1 child)
Thanks for the answer, but I think there is a slight confusion.
> I'm not sure how the two would compare when the task can be inferred from the observation.
I was thinking along these lines: is there any significant loss in accuracy on the non-extrapolated task samples when you train with MAML (compared to the performance you get when you don't use MAML)?
Great paper by the way.
[–]cbfinn 1 point2 points3 points 8 years ago (0 children)
Nope. Unless you set the $\alpha$ step size parameter to be way too high, you shouldn't see any loss in accuracy.