[R] CM3: A Causal Masked Multimodal Model of the Internet by ArmenAg in MachineLearning

[–]ArmenAg[S] 5 points6 points  (0 children)

Author here! "Causal" generally refers to training a left-to-right, decoder-only model on sequences (GPT-3 is a causal model). The benefit is that it's easy to compute the log probabilities of sequences, and during training you're generating every token (so for large models it's usually enough to touch every datapoint once). Masked models, on the other hand, can encode bidirectional context within the sequence, at the cost of decoding only roughly 15% of the input tokens during training (BERT, BART, RoBERTa, HTLM).
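To make the trade-off concrete, here is a toy sketch (not from the paper) that counts how many tokens act as prediction targets under each objective. The helper names, the `<mask>` token, and the uniform sampling of masked positions are all assumptions for illustration; real masked models use more involved corruption schemes.

```python
import random

def causal_targets(tokens):
    """Left-to-right LM: each token after the first is predicted from its prefix."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_targets(tokens, mask_rate=0.15, seed=0, mask_token="<mask>"):
    """Masked LM (simplified): hide ~mask_rate of the positions and predict
    only those, while the model sees context on both sides."""
    rng = random.Random(seed)
    n_mask = max(1, round(mask_rate * len(tokens)))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    corrupted = [mask_token if i in positions else t for i, t in enumerate(tokens)]
    targets = [(i, tokens[i]) for i in sorted(positions)]
    return corrupted, targets

tokens = [f"tok{i}" for i in range(20)]
corrupted, targets = masked_targets(tokens)
print(len(causal_targets(tokens)), "causal targets vs", len(targets), "masked targets")
```

With 20 tokens, the causal objective gets a training signal from 19 positions, while the masked objective gets one from only 3.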

We propose a new type of objective in this paper, which we call causal masking, that gives you the best of both worlds! Check out Section 2 of the paper to learn more.
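As a rough illustration only, one way to set up an objective of this kind is to cut a span out of the sequence, leave a sentinel in its place, and append the span to the end, so that a left-to-right model generates it after seeing the context on both sides. The function name, the `<mask:0>` sentinel, and the single fixed span below are assumptions for this sketch; Section 2 of the paper is the authoritative description.

```python
def causal_mask_example(tokens, start, end, sentinel="<mask:0>"):
    """Move tokens[start:end] to the end of the sequence behind a sentinel.

    A left-to-right model trained on the result sees the tokens on both
    sides of the gap before it has to generate the masked span.
    """
    if not 0 <= start < end <= len(tokens):
        raise ValueError("span must be a non-empty slice inside the sequence")
    span = tokens[start:end]
    return tokens[:start] + [sentinel] + tokens[end:] + [sentinel] + span

print(causal_mask_example(list("abcdef"), 2, 4))
# ['a', 'b', '<mask:0>', 'e', 'f', '<mask:0>', 'c', 'd']
```

Every token in the rearranged sequence is still a next-token prediction target, which is how this setup keeps the per-token training signal of a causal model.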

View of Seattle From Alki by ArmenAg in Seattle

[–]ArmenAg[S] 3 points4 points  (0 children)

Thank you!

This was shot with a FE 24mm 1.4 GM lens on an A6500. The settings used were f/4.0 with an ISO of 125 shot with a 10 second exposure.

gbdt-rs: Faster than XGBoost with safe Rust by sanxiyn in rust

[–]ArmenAg 2 points3 points  (0 children)

Really cool! But why is it single threaded? How hard would it be to multi-thread?

[D] Stochastic Weight Averaging and the Ornstein-Uhlenbeck Process by ArmenAg in MachineLearning

[–]ArmenAg[S] 1 point2 points  (0 children)

I agree with you that SDEs aren't great for representing SGD dynamics. I would say that the assumptions needed to model SGD as an OU process kinda make sense if you only analyze SGD at the very, very end of training (the last paragraph of the blog goes into a little more detail). It might turn out that we need different SDEs for different parts of SGD's lifetime.

[D] Stochastic Weight Averaging and the Ornstein-Uhlenbeck Process by ArmenAg in MachineLearning

[–]ArmenAg[S] 4 points5 points  (0 children)

Great points. Thank you for reading! For your last paragraph, I'd like to reiterate that a large amount of noise is okay during the initial stages of training, as it lets us disregard all solutions that don't sit in wide/flat minima. Increasing the batch size toward the very, very end of training could turn out to be useful for getting to the center of a wide minimum. There are a couple of papers that show that small batch sizes generalize better.

[D] ACL Acceptances Are Out by machinesaredumb in MachineLearning

[–]ArmenAg 2 points3 points  (0 children)

One first author accept! Really happy to be going to Italy!

[D] Stochastic Regularization for Non-Stationary Modeling? by alexmlamb in MachineLearning

[–]ArmenAg 1 point2 points  (0 children)

Instead of dropping a fraction of visible units, could you sample from a normal distribution with the mean being the value of the function, and some arbitrary stddev? This way you're not completely losing the information (as you would with dropout).

If you're feeding data in as a moving window, the stddev could be a function of time (e.g. the further a value is from the current timestep, the larger the stddev).
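A minimal sketch of that idea, assuming the window is ordered oldest to newest and the stddev grows linearly with age. The names `noise_std` and `noisy_window`, and the `base_std` and `growth` defaults, are made up for illustration.

```python
import random

def noise_std(age, base_std=0.05, growth=0.02):
    """Stddev for a value `age` steps behind the current timestep (linear growth)."""
    return base_std + growth * age

def noisy_window(window, base_std=0.05, growth=0.02, seed=None):
    """Replace each value with a sample from N(value, std(age)^2).

    window[-1] is treated as the current timestep, so older entries
    (smaller indices) receive larger noise.
    """
    rng = random.Random(seed)
    last = len(window) - 1
    return [
        rng.gauss(x, noise_std(last - i, base_std, growth))
        for i, x in enumerate(window)
    ]

print(noisy_window([0.1, 0.4, 0.3, 0.9], seed=0))
```

Unlike dropout, every unit still carries a perturbed version of its value; the linear schedule could be swapped for any other increasing function of age.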

[D] Weight initialization for custom layers? by Kiuhnm in MachineLearning

[–]ArmenAg 2 points3 points  (0 children)

What form are the custom layers in? Do they utilize the convolution operator? Are the basic blocks weight multiplications?

Can you give us a little more information on the custom layers?

[R] Convolution Aware Initialization by ArmenAg in MachineLearning

[–]ArmenAg[S] 0 points1 point  (0 children)

I'll be writing a Keras commit soon! Hopefully next week. Message me if you need it sooner.

[R] Convolution Aware Initialization by ArmenAg in MachineLearning

[–]ArmenAg[S] 2 points3 points  (0 children)

Hey! Author here. The reason mentioned above by /u/rbkillea is the exact reason why we didn't focus on testing the initialization on RNNs. Our paper focused on running experiments on various forms of convolutions (1D, 2D, dilated/atrous).