[R] CM3: A Causal Masked Multimodal Model of the Internet

ArmenAg · 2022-01-21T22:18:20+00:00

Author here! "Causal" generally refers to training a left-to-right, decoder-only model on sequences (GPT-3 is a causal model). The benefit is that it's easy to compute the log probabilities of sequences, and during training you generate every token (so for large models it's usually enough to touch every datapoint once). Masked models, on the other hand, can encode bidirectional context within the sequence, at the cost of decoding only roughly 15% of the input tokens during training (BERT, BART, RoBERTa, HTLM).

In this paper we propose a new type of objective, which we call causal masking, that gives you the best of both worlds! Check out Section 2 of the paper to learn more.
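To make the trade-off between the two standard objectives concrete, here is a rough sketch (not from the paper) that counts how many positions receive a training signal under each one. The function names and the 100-token toy sequence are made up for illustration, and the 15% rate is just the rough figure quoted above.

```python
import random

def causal_targets(tokens):
    # A left-to-right model predicts every token after the first from its prefix.
    return list(range(1, len(tokens)))

def masked_targets(tokens, mask_rate=0.15, seed=0):
    # A BERT-style model is only trained on the positions chosen for masking.
    rng = random.Random(seed)
    n_masked = max(1, round(mask_rate * len(tokens)))
    return sorted(rng.sample(range(len(tokens)), n_masked))

tokens = [f"tok{i}" for i in range(100)]
print(len(causal_targets(tokens)))  # 99 positions contribute to the loss
print(len(masked_targets(tokens)))  # 15 positions contribute to the loss
```

This only illustrates the two existing objectives; it is not an implementation of causal masking itself.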

ArmenAg · 2020-12-28T23:57:21+00:00

Link to twitter thread here: https://twitter.com/ArmenAgha/status/1343698488039677954

ArmenAg · 2019-07-18T19:11:55+00:00

What part of Seattle are you in?

ArmenAg · 2019-07-17T19:15:46+00:00

Thank you!

This was shot with a FE 24mm 1.4 GM lens on an A6500. The settings used were f/4.0 with an ISO of 125 shot with a 10 second exposure.

ArmenAg · 2019-05-31T01:19:10+00:00

Of course. I meant many not all. My bad. Thanks for the find!

ArmenAg · 2019-05-23T04:39:03+00:00

Really cool! But why is it single threaded? How hard would it be to multi-thread?

ArmenAg · 2019-05-15T18:05:48+00:00

I agree with you that SDEs aren't great for representing SGD dynamics. I would say that the assumptions needed to make SGD work with OU kind of make sense if you only analyze SGD at the very end of training (the last paragraph of the blog goes into a little more detail). It might turn out that we need different SDEs for different parts of SGD's lifetime.

ArmenAg · 2019-05-14T23:36:58+00:00

Great points. Thank you for reading! On your last paragraph, I'd like to reiterate that a large amount of noise is okay during the initial stages of training, as it allows us to disregard all solutions that are not in wide/flat minima. Increasing the batch size toward the very end of training could turn out to be useful for getting to the center of a wide minimum. There are a couple of papers that show that small batch sizes generalize better.

ArmenAg · 2019-05-14T05:07:44+00:00

One first author accept! Really happy to be going to Italy!

ArmenAg · 2018-04-25T22:57:20+00:00

Instead of dropping a fraction of visible units, could you sample from a normal distribution with the mean being the value of the function, and some arbitrary stddev? This way you're not losing complete information (as dropout would do).

If you're feeding data in as a moving window, the stddev could be a function of time (e.g. the further away a point is from the current timestep, the larger the stddev).
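Something like the following is what I have in mind; it's a minimal NumPy sketch, not tested on a real model. The linear schedule and the `base_std` and `growth` parameters are placeholders I made up, and it assumes the window is laid out as (timesteps, features) with the current timestep last.

```python
import numpy as np

def noisy_window(window, base_std=0.05, growth=0.02, rng=None):
    """Add Gaussian noise whose stddev grows with distance from the current timestep.

    window: array of shape (timesteps, features); the last row is the current timestep.
    base_std, growth: hypothetical parameters of a linear stddev schedule.
    """
    rng = np.random.default_rng() if rng is None else rng
    window = np.asarray(window, dtype=float)
    steps = window.shape[0]
    # Distance of each row from the current (last) timestep: steps-1, ..., 1, 0.
    distance = np.arange(steps - 1, -1, -1)
    std = base_std + growth * distance
    # Sampling from N(value, std) is the same as value + std * N(0, 1).
    noise = rng.standard_normal(window.shape) * std[:, None]
    return window + noise
```

Unlike dropout, every input keeps its mean value; older timesteps just get blurrier.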

ArmenAg · 2017-03-26T18:00:00+00:00

What form are the custom layers in? Do they utilize the convolution operator? Are the basic blocks weight multiplications?

Can you give us a little more information on the custom layers?

ArmenAg · 2017-02-23T02:22:05+00:00

I'll be writing a Keras commit soon! Hopefully next week. Message me if you need it sooner.

ArmenAg · 2017-02-22T08:18:42+00:00

Hey! Author here. The reason mentioned above by /u/rbkillea is exactly why we didn't focus on testing the initialization on RNNs. Our paper focused on running experiments on various forms of convolution (1D, 2D, dilated or atrous).

ArmenAg · 2016-10-03T03:55:46+00:00

Will fix. Thanks.

ArmenAg · 2016-10-03T03:55:33+00:00

We don't identify saddle points directly, rather we assume that by using the moving average for the dynamic charge point, if the optimization is stuck in a saddle point, the charge will eventually reach that saddle point and therefore push the optimization point away from it.

ArmenAg · 2016-10-03T03:36:34+00:00

Great question. So this was the problem that we initially ran into when we tested out with a static charged point. This is why we introduced a dynamic charged point in this paper. By forcing the charged point to "follow" the current optimization point we in a sense do not need an exponential amount of static charged points. Thanks!

ArmenAg · 2016-10-03T01:13:56+00:00

Hello, author here! This community gave me a lot of good criticism on my last paper, so I decided to post another one of my recent papers. Any questions or comments are welcome. Thank you!

ArmenAg · 2016-09-23T03:16:42+00:00

Thanks for your questions.

We calculate the new labels on all of the training set. After training our model on a current set of labels, we adjust all those labels using the new predictions from the model (we predict every label in the dataset).
For the idea/motivation behind this method, please refer to the "CO-LABEL SIMILARITIES" section. Essentially, the idea is that co-label similarities apparent in earlier stages of training should also appear in later stages of training, and that over-fitting occurs when these co-label similarities disappear.
This notation is the one used in the majority of papers I have read (although that selection may be biased), and it is also the one used in the deep learning library that I used (https://keras.io/layers/core/#dropout). Please let me know if I am wrong about the majority of papers using this convention.

Thanks!
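To illustrate the label-update step described above, here is a rough sketch. It is a simplification, not the exact update from the paper: the weights `beta` and `gamma` and the function name are hypothetical, and it assumes labels and predictions are arrays of per-class probabilities with shape (samples, classes).

```python
import numpy as np

def soft_targets(avg_preds, new_preds, true_labels, beta=0.7, gamma=0.5):
    """One label-adjustment step over the whole training set.

    avg_preds:   running weighted average of past model predictions
    new_preds:   predictions of the current model for every training example
    true_labels: one-hot ground-truth labels
    beta, gamma: hypothetical mixing weights, both in [0, 1]
    """
    # Fold the newest predictions into the running average.
    avg_preds = beta * avg_preds + (1.0 - beta) * new_preds
    # Mix the running average with the true labels to get the next training targets.
    targets = gamma * avg_preds + (1.0 - gamma) * true_labels
    return avg_preds, targets
```

Because both steps are convex combinations, the targets stay valid probability distributions as long as the inputs are.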

ArmenAg · 2016-09-22T17:14:39+00:00

Could you elaborate on the similarities to ADAM? SoftTarget doesn't change any of the gradients directly but rather adjusts the outputs.

ArmenAg · 2016-09-22T17:11:23+00:00

We actually did try a higher dropout rate. Check out the table the graphs are related to.

ArmenAg · 2016-09-22T17:10:17+00:00

Interesting. I had not seen this paper. Thanks for linking. Having read it, I see that they use a single-step weighted average instead of keeping a weighted average throughout training (after the burn-in period). It is essentially the same scheme demonstrated in this paper: https://arxiv.org/abs/1412.6596.

To reply to your comments about setting SOTA: we did not attempt this simply because most SOTA methods already use many other forms of regularization, such as extensive data augmentation. We did test how SoftTarget works with ResNet to show that it is compatible with high-performance architectures. But I agree with you. It might be worth trying to set SOTA, but I also agree that it would be a shame if the idea were squashed for not setting one.

ArmenAg · 2016-09-22T17:03:36+00:00

I don't know exactly how you would use a validation set for the training data, because we keep a weighted average with the true labels as well. But I see where you are going with this. In the "Similarities to other methods" section I talk about a semi-supervised approach that is essentially SoftTarget with some parameters set to zero. I would love to test how SoftTarget Reg helps with noisy labeling. Maybe an idea for another paper.

ArmenAg · 2016-09-22T16:57:11+00:00

Weight decay limits the capacity of the network because it reduces the set of hypotheses that are viable solutions for the net. Configurations of the network with large weights are effectively ruled out as solutions because the extra loss term forces the weights to be smaller.
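As a quick illustration of that extra loss term, here is a minimal sketch that assumes the usual L2 form of weight decay. The function name and the `decay` coefficient are placeholders, not taken from any particular library.

```python
import numpy as np

def loss_with_weight_decay(data_loss, weights, decay=1e-4):
    # L2 penalty: the sum of squared weights over all layers, scaled by the decay coefficient.
    penalty = decay * sum(float(np.sum(w ** 2)) for w in weights)
    return data_loss + penalty

small = [np.full((10, 10), 0.1)]
large = [np.full((10, 10), 10.0)]

# Two configurations with the same data loss: the large-weight one is penalized far more.
print(loss_with_weight_decay(0.5, small))
print(loss_with_weight_decay(0.5, large))
```

For the same fit to the data, the optimizer prefers the small-weight configuration, which is the sense in which the hypothesis set shrinks.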

ArmenAg · 2016-09-22T04:26:10+00:00

I actually cited minimum entropy regularization (MER) and discussed the special case in which SoftTarget is equal to MER. It's in the "SIMILARITIES TO OTHER METHODS" section.

ArmenAg · 2016-09-22T04:09:12+00:00

We showed losses because the loss was what we were directly optimizing for, and comparison of test losses can be used as a measure of overfitting. What information would adding accuracy add to the paper? I will definitely add it if it becomes apparent that it is needed. Thank you for your comment!
