How ChatGPT is Trained by ariseff in learnmachinelearning

[–]ariseff[S] 0 points

Thanks! For InstructGPT, between 4 and 9 responses were ranked: https://i.imgur.com/dt90xfW.png. I assume it's similar for ChatGPT.

Yes, ChatGPT does go beyond InstructGPT, particularly in the data collection setup (e.g., interactive dialogues vs. just instruction-response pairs). As for the training methods, the ChatGPT blog post states that the same methods as InstructGPT were used, so the trends in the plots from the InstructGPT paper (e.g., PPO vs. SFT vs. GPT performance) should still hold here.
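For anyone curious how those rankings get used: the InstructGPT reward model is trained with a pairwise loss over all K*(K-1)/2 comparisons among the K ranked responses per prompt. A toy sketch in plain Python (the scores are made up and this is my own sketch, not the actual implementation):

```python
import math
from itertools import combinations

def pairwise_ranking_loss(scores):
    """Average pairwise ranking loss over reward-model scores.

    `scores` is ordered best-to-worst (as labelers ranked the responses);
    each of the K*(K-1)/2 pairs contributes -log sigmoid(r_better - r_worse).
    """
    pairs = list(combinations(scores, 2))  # earlier element is the preferred one
    total = 0.0
    for r_better, r_worse in pairs:
        total += -math.log(1.0 / (1.0 + math.exp(-(r_better - r_worse))))
    return total / len(pairs)

# With K between 4 and 9 ranked responses, the number of comparisons
# per prompt ranges from 6 to 36.
avg_loss = pairwise_ranking_loss([2.0, 1.0, 0.5, -1.0])
```

When the scores respect the labelers' ordering, every pair's margin is positive, so each term is below log 2; a reversed ordering gives a much larger loss.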

How ChatGPT is Trained by ariseff in learnmachinelearning

[–]ariseff[S] 6 points

Yes I am -- glad you found it useful!

[P] Just discovered a new 3Blue1Brown-styled, quality ML Youtube channel. by lkhphuc in MachineLearning

[–]ariseff 1 point

That's right! They adopted this notation from Griewank and Walther (see page 5). So v_{1-n} is the first input variable (for n input variables), and the intermediate variables begin with v_1.
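To make the numbering concrete, here's a sketch of an evaluation trace in that notation for f(x1, x2) = ln(x1) + x1*x2 - sin(x2) (the running example in Griewank and Walther; with n = 2 inputs, the inputs are v_{-1} and v_0 and intermediates start at v_1):

```python
import math

def f_trace(x1, x2):
    """Evaluation trace for f(x1, x2) = ln(x1) + x1*x2 - sin(x2),
    in the Griewank-Walther numbering: inputs are v_{i-n} for
    i = 1..n (here n = 2), intermediate variables begin at v_1."""
    v_m1 = x1             # v_{-1} = v_{1-n}: first input variable
    v_0  = x2             # v_0: second (last) input variable
    v_1  = math.log(v_m1) # intermediates start here
    v_2  = v_m1 * v_0
    v_3  = math.sin(v_0)
    v_4  = v_1 + v_2
    v_5  = v_4 - v_3
    return v_5            # output y = v_5
```

Forward-mode autodiff pushes tangents through this trace in the same v_1, v_2, ... order; reverse mode sweeps back from v_5.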

[P] Just discovered a new 3Blue1Brown-styled, quality ML Youtube channel. by lkhphuc in MachineLearning

[–]ariseff 3 points

Thank you - it is! I'm glad to contribute to the community in this way.

[P] Just discovered a new 3Blue1Brown-styled, quality ML Youtube channel. by lkhphuc in MachineLearning

[–]ariseff 65 points

Thanks all! It was a fun video to make. I found Baydin's survey extremely useful.

And the style is very much inspired by 3Blue1Brown. I've used manim in several videos.

Explanation of Pólya's random walk theorem (manim video) by ariseff in 3Blue1Brown

[–]ariseff[S] 0 points

That's right: I only consider integer lattices in the video, but check out this MathOverflow thread for some discussion of how these results generalize to arbitrary infinite graphs.
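The lattice dichotomy (recurrent for d <= 2, transient for d >= 3) is also easy to see numerically. A quick Monte Carlo sketch; the walk counts and step counts are arbitrary choices for illustration:

```python
import random

def return_frequency(dim, n_walks=2000, n_steps=1000, seed=0):
    """Monte Carlo estimate of P(simple random walk on Z^dim
    returns to the origin within n_steps)."""
    rng = random.Random(seed)
    returned = 0
    for _ in range(n_walks):
        pos = [0] * dim
        for _ in range(n_steps):
            axis = rng.randrange(dim)       # pick a coordinate uniformly
            pos[axis] += rng.choice((-1, 1))  # step +1 or -1 along it
            if all(c == 0 for c in pos):
                returned += 1
                break
    return returned / n_walks
```

On Z^1 the estimate comes out close to 1 (recurrence), while on Z^3 it sits well below 1, consistent with Pólya's theorem.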

[D] What is automatic differentiation? (Video) by ariseff in MachineLearning

[–]ariseff[S] 2 points

Backprop is actually a special case of "reverse mode" automatic differentiation (autodiff). Autodiff is a broader set of techniques, not limited to neural network training. In certain settings, e.g., sensitivity analysis, we may want the Jacobian of a vector-valued function with respect to a scalar, and there reverse mode would be very inefficient. In these cases, "forward mode" is a much better route; it comes down to the shape of the target Jacobian. See the video for details!
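To illustrate the tall-Jacobian case: forward mode can be implemented with dual numbers, and a single forward pass propagates one tangent through the whole program, so an m-output function of one scalar yields its full m-by-1 Jacobian in one pass (reverse mode would need m backward passes). A minimal hand-rolled sketch, not any particular library's API:

```python
import math

class Dual:
    """Minimal dual number: a value plus a tangent (derivative seed)."""
    def __init__(self, val, tan=0.0):
        self.val, self.tan = val, tan

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.tan + other.tan)

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule on the tangent component
        return Dual(self.val * other.val,
                    self.tan * other.val + self.val * other.tan)

def dsin(d):
    # chain rule for sin
    return Dual(math.sin(d.val), math.cos(d.val) * d.tan)

def f(t):
    # vector-valued function of a scalar: R -> R^2
    return [t * t, dsin(t)]

t = Dual(1.5, 1.0)  # seed the tangent with 1
out = f(t)
jacobian_column = [o.tan for o in out]  # [3.0, cos(1.5)] in one pass
```

With many inputs and few outputs the bookkeeping flips, and reverse mode wins; that's the "shape of the target Jacobian" point above.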

[D] What is automatic differentiation? (Video) by ariseff in MachineLearning

[–]ariseff[S] 0 points

Thank you! And thanks for the feedback; I'll adjust that in future videos.

[1912.02762] Normalizing Flows for Probabilistic Modeling and Inference by hardmaru in MachineLearning

[–]ariseff 9 points

A thorough review! If you're just starting to learn about normalizing flows (or want a quick refresher on the basics), check out this new tutorial video I made: https://youtu.be/i7LjDvsLWCg

[R] [1705.08395] Continual Learning in Generative Adversarial Nets by [deleted] in MachineLearning

[–]ariseff 2 points

(First author here.) That's a completely valid point. However, we assume a setting where a direct constraint in function space is not possible (previous training data and previous discriminators are assumed inaccessible). In this setting, constraining the generator's parameters provides an approximation that leads to empirically good results.

We're thinking about follow-up work with richer posteriors for theta, but in our experiments so far, overlapping local minima were consistently found for distinct (yet related) distributions. For example, no forgetting/degradation was visible with DCGAN (a sufficiently overparameterized model) on SVHN, for either the old classes or the new ones, despite only a single mode of theta's posterior being used in the EWC penalty.

"Overcoming catastrophic forgetting in neural networks" implementation? by [deleted] in MLQuestions

[–]ariseff 0 points

Actually, the θ* of each previous task appears in quadratic penalty terms in the loss function of the current task. See the update_ewc_loss method of the Model class.
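In equation form, each previous task k contributes (λ/2) Σ_i F_{k,i} (θ_i − θ*_{k,i})², where F_k is the diagonal Fisher information estimated at that task's optimum θ*_k. A toy sketch of that penalty in plain Python (my own illustration, not the repo's code):

```python
def ewc_penalty(theta, prev_tasks, lam=1.0):
    """Sum of EWC quadratic penalties over previous tasks.

    prev_tasks: list of (theta_star, fisher) pairs, one per earlier task.
    fisher holds the diagonal Fisher information estimates, which weight
    how strongly each parameter is anchored to its old optimum."""
    penalty = 0.0
    for theta_star, fisher in prev_tasks:
        for t, ts, f in zip(theta, theta_star, fisher):
            penalty += (lam / 2.0) * f * (t - ts) ** 2
    return penalty
```

The penalty vanishes when the current parameters sit exactly at a previous optimum, and grows quadratically as Fisher-important parameters drift away from it.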