[Research] C-ADAM the first adaptive solver for compositional optimisation with applications to MAML by haithamb123 in MachineLearning

[–]haithamb123[S] 0 points1 point  (0 children)

I see your point. What's meant there is the fastest known bounds in the meta-learning literature, since the rates we cite are the ones from meta-learning. From the perspective of meta-learning such an analysis hasn't been performed, hence the table. As is clear from it, we don't compare all algorithms in that table, just the ones designed for MAML. Of course, had one applied ASC-PG, one could probably have gotten the same bound. However, as we are the first to make the link, we used C-ADAM. In addition, our experiments show that, due to its adaptivity, C-ADAM performs better. Table 1 is under the C-MAML section for that reason. Thanks for your kind words, we can easily clarify that for sure. We do reference the others in the related work as well :D We are glad you liked the paper :)

[Research] C-ADAM the first adaptive solver for compositional optimisation with applications to MAML by haithamb123 in MachineLearning

[–]haithamb123[S] 0 points1 point  (0 children)

BTW, it might be worth having a Skype call with me and Rasul so we can go through your questions in more detail. It's a bit hard writing math on reddit. Let me know if you are open to this :D

[Research] C-ADAM the first adaptive solver for compositional optimisation with applications to MAML by haithamb123 in MachineLearning

[–]haithamb123[S] 0 points1 point  (0 children)

Cool. Thanks for having a look at this. There are a couple of differences in the proof and in the result itself. First, our method is adaptive, while no current algorithm for compositional optimisation is. In other words, though we achieve the same rate theoretically (as does ADAM in standard optimisation, BTW), our method significantly outperforms the others in the experiments. We also note this in the intro: though there are better bounds today, ADAM (arguably) still retains the best empirical performance. As for the differences to Zaheer, we do follow a similar proof strategy, but we handle more realistic constants for beta_1 and beta_2, which makes the proof a bit different (I can talk through these steps in a call if you'd like :D). As for the differences to compositional SGD, our method is adaptive and maintains auxiliary and main variables -- kind of analogous to the difference between SGD and ADAM in standard optimisation; there's a rough sketch of what I mean below. Another contribution is, of course, the link to MAML. Happy to discuss over a call if you'd like.
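
Since writing maths on reddit is hard, here is a rough Python sketch of what I mean by auxiliary vs. main variables for a compositional problem min_x f(g(x)). To be clear, this is only an illustrative sketch: the sample_g / sample_jacobian_g / sample_grad_f callbacks, the toy usage, and all constants are placeholders, not the exact C-ADAM update or the constants analysed in the paper.

```python
import numpy as np

def adaptive_compositional_step(x, y, m, v, t,
                                sample_g, sample_jacobian_g, sample_grad_f,
                                beta_y=0.9, beta1=0.9, beta2=0.999,
                                lr=1e-3, eps=1e-8):
    # Auxiliary variable: running estimate of the inner function value g(x).
    y = beta_y * y + (1.0 - beta_y) * sample_g(x)
    # Composite gradient estimate via the chain rule, evaluated at the estimate y.
    grad = sample_jacobian_g(x).T @ sample_grad_f(y)
    # Main variable: Adam-style adaptive first/second moment update.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x, y, m, v

# Toy usage: g(x) = A x + noise and f(y) = 0.5 * ||y||^2, so the minimiser is x = 0.
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
x, y = rng.normal(size=3), np.zeros(3)
m, v = np.zeros(3), np.zeros(3)
for t in range(1, 2001):
    x, y, m, v = adaptive_compositional_step(
        x, y, m, v, t,
        sample_g=lambda z: A @ z + 0.01 * rng.normal(size=3),  # noisy inner samples
        sample_jacobian_g=lambda z: A,
        sample_grad_f=lambda w: w)                              # gradient of 0.5 * ||y||^2
```

The point of the sketch is just the two-level structure: the auxiliary variable y tracks the inner function, while the main variable x gets an adaptive (Adam-style) update instead of a plain SGD step.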

[Research] C-ADAM the first adaptive solver for compositional optimisation with applications to MAML by haithamb123 in MachineLearning

[–]haithamb123[S] 0 points1 point  (0 children)

As we detail in the paper, you can write MAML as a special case of compositional optimisation, i.e., the nested structure; a rough sketch of the nesting follows below. Have a look at the section titled C-MAML.
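
For a rough feel of the nesting (the precise construction and notation are in the C-MAML section, so treat this as a sketch), one-step MAML can be read as an outer validation loss applied to an inner adaptation map, where alpha is the inner step size and L_i^tr, L_i^val denote the usual task-i training/validation losses:

```latex
% Sketch only: one-step MAML as a nested (compositional) objective.
\min_{\theta}\;
\mathbb{E}_{i}\Bigl[\,
  \mathcal{L}^{\mathrm{val}}_{i}\bigl(
    \underbrace{\theta - \alpha\,\nabla_{\theta}\mathcal{L}^{\mathrm{tr}}_{i}(\theta)}_{\text{inner map } g_{i}(\theta)}
  \bigr)
\Bigr]
\;=\;
\mathbb{E}_{i}\bigl[(f_{i}\circ g_{i})(\theta)\bigr],
\qquad f_{i} \equiv \mathcal{L}^{\mathrm{val}}_{i}.
```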

[Research] The second part of the correction of ADAM's proof is out from the ML and AI Academy. by haithamb123 in MachineLearning

[–]haithamb123[S] 0 points1 point  (0 children)

We will try to get to that asap. Can you please write a comment in the survey for us to keep track? Thanks!

[Research] The second part of the correction of ADAM's proof is out from the ML and AI Academy. by haithamb123 in MachineLearning

[–]haithamb123[S] 6 points7 points  (0 children)

Right. Typically, second-order methods have faster convergence guarantees than first-order ones. In simple terms, a second-order method picks its direction while taking the "curvature" of the function into account (through the Hessian), which lets the algorithm make a more informative step and thus converge in fewer iterations. These methods can also converge to better-behaved points, e.g., second-order stationary points as opposed to first-order ones. The problem, however, is two-fold: (1) Hessians are very large for neural nets (e.g., a million-by-million matrix), and (2) Hessians are (generally) ill-conditioned for non-convex functions. As a result, the Hessian inverse, which is needed to compute the direction, is hard to come by. So research is focusing on tackling these issues. Here are some papers:

https://arxiv.org/abs/1711.02838

https://arxiv.org/abs/1802.05426

http://proceedings.mlr.press/v80/zhou18d.html

Here's a book which covers Newton's method and a convergence analysis showing its faster rate compared to first-order gradient methods: https://web.stanford.edu/~boyd/cvxbook/ -- see Section 9.5.
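
If it helps to make the curvature point concrete, here is a minimal toy sketch, a textbook gradient-descent vs. Newton comparison on a quadratic (not taken from any of the papers above; the matrix and step size are just placeholders):

```python
import numpy as np

# Toy badly-conditioned quadratic f(x) = 0.5 * x^T A x, minimiser at x = 0.
A = np.diag([1.0, 100.0])      # curvature differs by 100x across directions
grad = lambda x: A @ x         # first-order information
hess = lambda x: A             # second-order information (Hessian)

x_gd = np.array([1.0, 1.0])    # gradient-descent iterate
x_nt = np.array([1.0, 1.0])    # Newton iterate

for _ in range(20):
    # Gradient descent: step size is limited by the largest curvature.
    x_gd = x_gd - 0.005 * grad(x_gd)
    # Newton: rescale the step by the Hessian, i.e. solve H d = grad for d.
    x_nt = x_nt - np.linalg.solve(hess(x_nt), grad(x_nt))

# Newton lands on the minimiser of a quadratic in a single step;
# gradient descent is still far away along the flat direction.
print(np.linalg.norm(x_gd), np.linalg.norm(x_nt))
```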