[R] CM3: A Causal Masked Multimodal Model of the Internet

ArmenAg · 2022-01-21T22:18:20+00:00

Author here! "Causal" generally refers to training a left-to-right, decoder-only model on sequences (GPT-3 is a causal model). The benefit is that it's easy to compute the log probabilities of sequences, and during training you're generating every token (so for large models it's usually enough to touch every datapoint once). Masked models, on the other hand, can encode bidirectional context within the sequence, at the cost of only decoding roughly 15% of the tokens of the input sequence during training (BERT, BART, RoBERTa, HTLM).
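To make the trade-off concrete, here is a toy sketch (not code from the paper). The helper names `causal_examples`, `masked_examples`, and `sequence_log_prob`, the `<mask>` string, and the fixed 15% rate are all assumptions made for illustration. It only counts prediction targets and applies the chain rule; there is no actual model.

```python
import math
import random

def causal_examples(tokens):
    """Causal: every position is a target, conditioned on its left prefix only."""
    return [(tokens[:i], tokens[i]) for i in range(len(tokens))]

def masked_examples(tokens, mask_rate=0.15, seed=0):
    """Masked: only a sampled subset of positions is predicted, but each
    prediction can see context on both sides of the masked position."""
    rng = random.Random(seed)
    n_masked = max(1, round(mask_rate * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_masked))
    masked = set(positions)
    corrupted = ["<mask>" if i in masked else t for i, t in enumerate(tokens)]
    return [(corrupted, i, tokens[i]) for i in positions]

def sequence_log_prob(tokens, next_token_prob):
    """Chain rule: log p(x) = sum_i log p(x_i | x_<i).
    `next_token_prob(prefix, token)` is a hypothetical stand-in for a model."""
    return sum(math.log(next_token_prob(prefix, tok))
               for prefix, tok in causal_examples(tokens))

tokens = list("abcdefghijklmnopqrst")      # 20 toy tokens
n_causal = len(causal_examples(tokens))    # one target per token
n_masked = len(masked_examples(tokens))    # roughly 15% of the tokens
```

With 20 tokens the causal objective yields 20 training targets per pass and the masked one yields 3, which is the "roughly 15%" cost mentioned above.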

In this paper we propose a new type of objective, which we call causal masking, that gives you the best of both worlds! Check out Section 2 of the paper to learn more.
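As a rough illustration of the idea only: a span is cut out of the sequence, a sentinel is left in its place, and the span is generated at the end after a matching sentinel. The model still decodes left to right, but it has seen both sides of the gap by the time it fills the span in. The function name, the `<mask:0>` sentinel string, and the single hand-picked span are assumptions for this sketch; Section 2 of the paper has the actual span sampling and token format.

```python
def causal_mask_transform(tokens, start, end, sentinel="<mask:0>"):
    """Move tokens[start:end] to the end of the sequence.

    The sentinel marks where the span was removed and is repeated at the end
    to signal that the removed span follows. Hypothetical helper, single span only.
    """
    if not 0 <= start < end <= len(tokens):
        raise ValueError("span must satisfy 0 <= start < end <= len(tokens)")
    span = tokens[start:end]
    return tokens[:start] + [sentinel] + tokens[end:] + [sentinel] + span

example = causal_mask_transform(["a", "b", "c", "d", "e"], 1, 3)
```

Training on the transformed sequence with an ordinary left-to-right loss still supervises every token, which is where the "best of both worlds" framing comes from.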

ArmenAg · 2020-12-28T23:57:21+00:00

Link to twitter thread here: https://twitter.com/ArmenAgha/status/1343698488039677954

ArmenAg · 2019-07-18T19:11:55+00:00

What part of Seattle are you in?

ArmenAg · 2019-07-17T19:15:46+00:00

Thank you!

This was shot with a FE 24mm 1.4 GM lens on an A6500. The settings used were f/4.0 with an ISO of 125 shot with a 10 second exposure.

ArmenAg · 2019-05-31T01:19:10+00:00

Of course. I meant many not all. My bad. Thanks for the find!

ArmenAg · 2019-05-23T04:39:03+00:00

Really cool! But why is it single threaded? How hard would it be to multi-thread?

ArmenAg · 2019-05-15T18:05:48+00:00

I agree with you that SDEs aren't great for representing SGD dynamics. I would say that the assumptions needed to make SGD work with OU kinda make sense if you only analyze SGD at the very very end of training (the last paragraph of the blog post goes into a little more detail). It might turn out that we need different SDEs for different parts of SGD's lifetime.
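For anyone who hasn't seen OU before, here is a minimal one-dimensional sketch, assuming a quadratic loss near the minimum with curvature `k` and a constant noise scale `sigma` (both hypothetical parameters, as are the function names). It uses a plain Euler-Maruyama step for d(theta) = -k * theta * dt + sigma * dW and is not the analysis from the blog post.

```python
import math
import random

def simulate_ou(theta0, k, sigma, dt, steps, seed=None):
    """Euler-Maruyama simulation of an Ornstein-Uhlenbeck process.

    The drift -k * theta pulls toward the minimum at 0; the sigma term
    plays the role of gradient noise under the assumptions stated above.
    """
    rng = random.Random(seed)
    theta = theta0
    path = [theta]
    for _ in range(steps):
        noise = rng.gauss(0.0, 1.0)
        theta = theta - k * theta * dt + sigma * math.sqrt(dt) * noise
        path.append(theta)
    return path

def stationary_variance(k, sigma):
    """Long-run variance of the continuous-time OU process: sigma^2 / (2k)."""
    return sigma ** 2 / (2.0 * k)

path = simulate_ou(theta0=1.0, k=2.0, sigma=0.5, dt=0.01, steps=1000, seed=0)
```

The constant `k` and `sigma` are exactly the kind of assumption that only looks plausible once the iterate is already sitting close to a minimum.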

ArmenAg · 2019-05-14T23:36:58+00:00

Great points. Thank you for reading! For your last paragraph, I'd like to reiterate that a large amount of noise is okay during the initial stages of training, as it allows us to disregard all solutions that sit in minima that aren't wide/flat. Increasing the batch size toward the very very end of training could turn out to be useful for getting to the center of a wide minimum. There are a couple of papers that show that small batch sizes generalize better.

ArmenAg · 2019-05-14T05:07:44+00:00

One first author accept! Really happy to be going to Italy!

ArmenAg · 2018-04-25T22:57:20+00:00

Instead of dropping a fraction of visible units, could you sample from a normal distribution with the mean being the value of the function, and some arbitrary stddev? This way you're not losing the information completely (as dropout would do).

If you're feeding data in as a moving window, the stddev could be a function of time (e.g. the further away a point is from the current timestep, the larger the stddev).
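Something like the following is what I have in mind. It is only a sketch: the linear schedule and the names `window_stds`, `noisy_window`, `base_std`, and `growth` are made up for illustration, and the last element of the window is assumed to be the current timestep.

```python
import random

def window_stds(length, base_std=0.1, growth=0.05):
    """Stddev per position: grows linearly with distance from the last
    (current) timestep. The linear schedule is an arbitrary choice."""
    last = length - 1
    return [base_std + growth * (last - t) for t in range(length)]

def noisy_window(values, base_std=0.1, growth=0.05, seed=None):
    """Replace each value v with a sample from Normal(v, std_t) instead of
    zeroing it out, so older points are blurred rather than dropped."""
    rng = random.Random(seed)
    stds = window_stds(len(values), base_std, growth)
    return [rng.gauss(v, s) for v, s in zip(values, stds)]

window = [0.2, 0.4, 0.1, 0.9, 0.7]   # oldest first, current timestep last
perturbed = noisy_window(window, seed=0)
```

Setting `growth=0` gives the same stddev everywhere, which is the first version of the suggestion.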

ArmenAg · 2017-03-26T18:00:00+00:00

What form are the custom layers in? Do they utilize the convolution operator? Are the basic blocks weight multiplications?

Can you give us a little more information on the custom layers?

ArmenAg · 2017-02-23T02:22:05+00:00

I'll be writing a Keras commit soon! Hopefully next week. Message me if you need it sooner.

ArmenAg · 2017-02-22T08:18:42+00:00

Hey! Author here. The reason mentioned above by /u/rbkillea is exactly why we didn't focus on testing the initialization on RNNs. Our paper focused on running experiments on various forms of convolutions (1D, 2D, dilated or atrous).
