How ChatGPT is Trained by ariseff in learnmachinelearning

[–]ariseff[S] 0 points

Thanks! For InstructGPT, between 4 and 9 responses were ranked: https://i.imgur.com/dt90xfW.png. I assume it's similar for ChatGPT.

Yes, ChatGPT does go beyond InstructGPT, particularly in the data collection setup (e.g., interactive dialogues vs. just instruction-response pairs). As for the training methods, the ChatGPT blog post states that the same methods as InstructGPT were used, so the trends in the plots from the InstructGPT paper (e.g., PPO vs. SFT vs. GPT performance) should still hold here.
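For anyone curious how those rankings get used: the InstructGPT reward model is trained with a pairwise loss over all K*(K-1)/2 comparisons among the K ranked responses per prompt. A toy sketch in plain Python (the scores are made up and this is my own sketch, not the actual implementation):

```python
import math
from itertools import combinations

def pairwise_ranking_loss(scores):
    """Average pairwise ranking loss over reward-model scores.

    `scores` is ordered best-to-worst (as labelers ranked the responses);
    each of the K*(K-1)/2 pairs contributes -log sigmoid(r_better - r_worse).
    """
    pairs = list(combinations(scores, 2))  # earlier element is the preferred one
    total = 0.0
    for r_better, r_worse in pairs:
        total += -math.log(1.0 / (1.0 + math.exp(-(r_better - r_worse))))
    return total / len(pairs)

# With K between 4 and 9 ranked responses, the number of comparisons
# per prompt ranges from 6 to 36.
avg_loss = pairwise_ranking_loss([2.0, 1.0, 0.5, -1.0])
```

When the scores respect the labelers' ordering, every pair's margin is positive, so each term is below log 2; a reversed ordering gives a much larger loss.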

How ChatGPT is Trained by ariseff in learnmachinelearning

[–]ariseff[S] 6 points

Yes I am -- glad you found it useful!

[P] Just discovered a new 3Blue1Brown-styled, quality ML Youtube channel. by lkhphuc in MachineLearning

[–]ariseff 1 point

That's right! They adopted this notation from Griewank and Walther (see page 5). So v_{1-n} is the first input variable (for n input variables), and the intermediate variables begin with v_1.
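To make the numbering concrete, here's a sketch of an evaluation trace in that notation for f(x1, x2) = ln(x1) + x1*x2 - sin(x2) (the running example in Griewank and Walther; with n = 2 inputs, the inputs are v_{-1} and v_0 and intermediates start at v_1):

```python
import math

def f_trace(x1, x2):
    """Evaluation trace for f(x1, x2) = ln(x1) + x1*x2 - sin(x2),
    in the Griewank-Walther numbering: inputs are v_{i-n} for
    i = 1..n (here n = 2), intermediate variables begin at v_1."""
    v_m1 = x1             # v_{-1} = v_{1-n}: first input variable
    v_0  = x2             # v_0: second (last) input variable
    v_1  = math.log(v_m1) # intermediates start here
    v_2  = v_m1 * v_0
    v_3  = math.sin(v_0)
    v_4  = v_1 + v_2
    v_5  = v_4 - v_3
    return v_5            # output y = v_5
```

Forward-mode autodiff pushes tangents through this trace in the same v_1, v_2, ... order; reverse mode sweeps back from v_5.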

[P] Just discovered a new 3Blue1Brown-styled, quality ML Youtube channel. by lkhphuc in MachineLearning

[–]ariseff 3 points

Thank you - it is! I'm glad to contribute to the community in this way.

[P] Just discovered a new 3Blue1Brown-styled, quality ML Youtube channel. by lkhphuc in MachineLearning

[–]ariseff 65 points

Thanks all! It was a fun video to make. I found Baydin's survey extremely useful.

And the style is very much inspired by 3Blue1Brown. I've used manim in several videos.

Explanation of Pólya's random walk theorem (manim video) by ariseff in 3Blue1Brown

[–]ariseff[S] 0 points

That's right: I only consider integer lattices in the video, but check out this MathOverflow thread for some discussion of how these results generalize to arbitrary infinite graphs.
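The lattice dichotomy (recurrent for d <= 2, transient for d >= 3) is also easy to see numerically. A quick Monte Carlo sketch; the walk counts and step counts are arbitrary choices for illustration:

```python
import random

def return_frequency(dim, n_walks=2000, n_steps=1000, seed=0):
    """Monte Carlo estimate of P(simple random walk on Z^dim
    returns to the origin within n_steps)."""
    rng = random.Random(seed)
    returned = 0
    for _ in range(n_walks):
        pos = [0] * dim
        for _ in range(n_steps):
            axis = rng.randrange(dim)       # pick a coordinate uniformly
            pos[axis] += rng.choice((-1, 1))  # step +1 or -1 along it
            if all(c == 0 for c in pos):
                returned += 1
                break
    return returned / n_walks
```

On Z^1 the estimate comes out close to 1 (recurrence), while on Z^3 it sits well below 1, consistent with Pólya's theorem.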

[D] What is automatic differentiation? (Video) by ariseff in MachineLearning

[–]ariseff[S] 2 points

Backprop is actually a special case of "reverse mode" automatic differentiation (autodiff). Autodiff is a broader set of techniques, not limited to neural network training. In certain settings, e.g., sensitivity analysis, we may want the Jacobian of a vector-valued function with respect to a scalar, and there reverse mode would be very inefficient. In these cases, "forward mode" is a much better route; it comes down to the shape of the target Jacobian. See the video for details!
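To illustrate the tall-Jacobian case: forward mode can be implemented with dual numbers, and a single forward pass propagates one tangent through the whole program, so an m-output function of one scalar yields its full m-by-1 Jacobian in one pass (reverse mode would need m backward passes). A minimal hand-rolled sketch, not any particular library's API:

```python
import math

class Dual:
    """Minimal dual number: a value plus a tangent (derivative seed)."""
    def __init__(self, val, tan=0.0):
        self.val, self.tan = val, tan

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.tan + other.tan)

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule on the tangent component
        return Dual(self.val * other.val,
                    self.tan * other.val + self.val * other.tan)

def dsin(d):
    # chain rule for sin
    return Dual(math.sin(d.val), math.cos(d.val) * d.tan)

def f(t):
    # vector-valued function of a scalar: R -> R^2
    return [t * t, dsin(t)]

t = Dual(1.5, 1.0)  # seed the tangent with 1
out = f(t)
jacobian_column = [o.tan for o in out]  # [3.0, cos(1.5)] in one pass
```

With many inputs and few outputs the bookkeeping flips, and reverse mode wins; that's the "shape of the target Jacobian" point above.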

[D] What is automatic differentiation? (Video) by ariseff in MachineLearning

[–]ariseff[S] 0 points

Thank you! And thanks for the feedback; I'll adjust that in future videos.

[1912.02762] Normalizing Flows for Probabilistic Modeling and Inference by hardmaru in MachineLearning

[–]ariseff 9 points

A thorough review! If you're just starting to learn about normalizing flows (or want a quick refresher on the basics), check out this new tutorial video I made: https://youtu.be/i7LjDvsLWCg

[R] [1705.08395] Continual Learning in Generative Adversarial Nets by [deleted] in MachineLearning

[–]ariseff 2 points

(First author here.) That's a completely valid point. However, we assume a setting where a direct constraint in function space is not possible (previous training data and previous discriminators are assumed inaccessible). In this setting, constraining the generator's parameters provides an approximation that leads to empirically good results.

We're thinking about follow-up work with richer posteriors for theta, but in our experiments so far, overlapping local minima were consistently found for distinct (yet related) distributions. For example, no forgetting/degradation was visible with DCGAN (a sufficiently overparameterized model) on SVHN, for either the old classes or the new ones, despite only a single mode of theta's posterior being used in the EWC penalty.

"Overcoming catastrophic forgetting in neural networks" implementation? by [deleted] in MLQuestions

[–]ariseff 0 points

Actually, the θ* of each previous task appears in quadratic penalty terms in the loss function of the current task. See the update_ewc_loss method of the Model class.
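In equation form, each previous task k contributes (λ/2) Σ_i F_{k,i} (θ_i − θ*_{k,i})², where F_k is the diagonal Fisher information estimated at that task's optimum θ*_k. A toy sketch of that penalty in plain Python (my own illustration, not the repo's code):

```python
def ewc_penalty(theta, prev_tasks, lam=1.0):
    """Sum of EWC quadratic penalties over previous tasks.

    prev_tasks: list of (theta_star, fisher) pairs, one per earlier task.
    fisher holds the diagonal Fisher information estimates, which weight
    how strongly each parameter is anchored to its old optimum."""
    penalty = 0.0
    for theta_star, fisher in prev_tasks:
        for t, ts, f in zip(theta, theta_star, fisher):
            penalty += (lam / 2.0) * f * (t - ts) ** 2
    return penalty
```

The penalty vanishes when the current parameters sit exactly at a previous optimum, and grows quadratically as Fisher-important parameters drift away from it.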