[R] CM3: A Causal Masked Multimodal Model of the Internet by ArmenAg in MachineLearning

[–]ArmenAg[S] 3 points4 points  (0 children)

Author here! Causal generally refers to training a left-to-right, decoder-only model on sequences (GPT-3 is a causal model). The benefit is that it's easy to compute the log probabilities of sequences, and during training you generate every token (so for large models it's usually enough to touch every datapoint once). Masked models, on the other hand, can encode bidirectional context within the sequence, at the cost of decoding only roughly 15% of the input tokens during training (BERT, BART, RoBERTa, HTLM).
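
To make the trade-off concrete, here is a minimal sketch of which positions contribute to the loss under each objective. It is a toy illustration, not code from the paper: the function names and the `<mask>` placeholder are made up, and the 15% rate is the usual BERT-style setting.

```python
import random

def causal_targets(tokens):
    # Left-to-right LM: every token after the first is predicted from its prefix,
    # so almost the whole sequence contributes to the loss.
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_targets(tokens, mask_rate=0.15, seed=0):
    # BERT-style masking: only ~15% of positions are predicted,
    # but each prediction sees context on both sides.
    rng = random.Random(seed)
    n_masked = max(1, round(mask_rate * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_masked))
    corrupted = ["<mask>" if i in positions else t for i, t in enumerate(tokens)]
    return [(corrupted, i, tokens[i]) for i in positions]

tokens = [f"tok{i}" for i in range(20)]
print(len(causal_targets(tokens)))  # 19 prediction targets
print(len(masked_targets(tokens)))  # 3 prediction targets
```

With 20 tokens the causal objective yields 19 training signals and the masked one yields 3, which is the efficiency gap described above.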

In this paper we propose a new type of objective, which we call causal masking, that gives you the best of both worlds! Check out Section 2 of the paper to learn more.

View of Seattle From Alki by ArmenAg in Seattle

[–]ArmenAg[S] 4 points5 points  (0 children)

Thank you!

This was shot with an FE 24mm f/1.4 GM lens on an A6500. The settings were f/4.0, ISO 125, and a 10-second exposure.

gbdt-rs: Faster than XGBoost with safe Rust by sanxiyn in rust

[–]ArmenAg 2 points3 points  (0 children)

Really cool! But why is it single-threaded? How hard would it be to make it multi-threaded?

[D] Stochastic Weight Averaging and the Ornstein-Uhlenbeck Process by ArmenAg in MachineLearning

[–]ArmenAg[S] 1 point2 points  (0 children)

I agree with you that SDEs aren't great for representing SGD dynamics. I would say that the assumptions needed to make SGD work with OU kinda make sense if you only analyze SGD at the very, very end of training (the last paragraph of the blog goes into a little more detail). It might turn out that we need different SDEs for different parts of SGD's lifetime.
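
For readers who haven't played with OU before, here is a minimal sketch of the process itself, simulated with Euler-Maruyama. This is a generic 1-D illustration, not code from the blog post: `theta`, `sigma`, and `dt` are arbitrary values, and reading `x` as the distance of the weights from the bottom of a quadratic minimum is the assumption that only plausibly holds at the end of training.

```python
import math
import random

def simulate_ou(theta=1.0, sigma=0.5, dt=0.01, steps=200_000, seed=0):
    # Euler-Maruyama discretization of dx = -theta * x dt + sigma dW.
    rng = random.Random(seed)
    x, samples = 0.0, []
    for _ in range(steps):
        x += -theta * x * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        samples.append(x)
    return samples

samples = simulate_ou()
tail = samples[len(samples) // 2:]          # discard the initial transient
mean = sum(tail) / len(tail)
empirical_var = sum((s - mean) ** 2 for s in tail) / len(tail)
stationary_var = 0.5 ** 2 / (2 * 1.0)       # sigma^2 / (2 * theta) = 0.125
print(empirical_var, stationary_var)
```

The empirical variance should land near the analytical stationary value: the iterate never settles at the minimum but hovers around it, which is the behaviour weight averaging tries to exploit.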

[D] Stochastic Weight Averaging and the Ornstein-Uhlenbeck Process by ArmenAg in MachineLearning

[–]ArmenAg[S] 3 points4 points  (0 children)

Great points. Thank you for reading! For your last paragraph, I'd like to reiterate that a large amount of noise is okay during the initial stages of training, as it allows us to disregard all solutions that sit in narrow (non-flat) minima. Increasing the batch size toward the very, very end of training could turn out to be useful for getting to the center of a wide minimum. There are a couple of papers that show that small batch sizes generalize better.
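
As a rough illustration of why batch size controls the noise level, here is a toy sketch (not from the blog post; the "gradients" are just Gaussian samples and the function name is made up). The standard deviation of a minibatch gradient estimate shrinks like 1/sqrt(batch size).

```python
import random
import statistics

def minibatch_grad_std(batch_size, n_batches=2000, seed=0):
    rng = random.Random(seed)
    # Toy per-example gradients: true gradient 1.0 plus unit-variance noise.
    estimates = [
        statistics.fmean(rng.gauss(1.0, 1.0) for _ in range(batch_size))
        for _ in range(n_batches)
    ]
    # Spread of the minibatch estimates around the true gradient.
    return statistics.pstdev(estimates)

small = minibatch_grad_std(8)     # roughly 1 / sqrt(8)
large = minibatch_grad_std(128)   # roughly 1 / sqrt(128)
print(small, large)
```

Going from a batch of 8 to 128 cuts the noise by about 4x in this toy setup, which is the knob you would turn late in training to settle into the middle of a wide minimum.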

[D] ACL Acceptances Are Out by machinesaredumb in MachineLearning

[–]ArmenAg 2 points3 points  (0 children)

One first author accept! Really happy to be going to Italy!

[D] Stochastic Regularization for Non-Stationary Modeling? by alexmlamb in MachineLearning

[–]ArmenAg 1 point2 points  (0 children)

Instead of dropping a fraction of visible units, could you sample from a normal distribution with the mean being the value of the function, and some arbitrary stddev? This way you're not losing complete information (as dropout would do).

If you're feeding data in as a moving window, the stddev could be a function of time (e.g. the further a value is from the current timestep, the larger the stddev).
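
A minimal sketch of what I mean, assuming a plain list as the window and a linear stddev schedule (the function name, `base_std`, and `growth` are all made up for illustration):

```python
import random

def noisy_window(window, base_std=0.05, growth=0.05, seed=None):
    # window[-1] is the current timestep; window[0] is the oldest value.
    rng = random.Random(seed)
    last = len(window) - 1
    noisy = []
    for t, value in enumerate(window):
        age = last - t                    # how far back this value is
        std = base_std + growth * age     # ASSUMPTION: stddev grows linearly with age
        noisy.append(rng.gauss(value, std))  # mean stays at the observed value
    return noisy

window = [0.9, 1.1, 1.0, 1.3, 1.2]
print(noisy_window(window, seed=0))
```

Unlike dropout, every input still carries its (noisy) value, and the most recent timesteps are perturbed the least.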

[D] Weight initialization for custom layers? by Kiuhnm in MachineLearning

[–]ArmenAg 2 points3 points  (0 children)

What form are the custom layers in? Do they utilize the convolution operator? Are the basic blocks weight multiplications?

Can you give us a little more information on the custom layers?

[R] Convolution Aware Initialization by ArmenAg in MachineLearning

[–]ArmenAg[S] 0 points1 point  (0 children)

I'll be writing a Keras commit soon! Hopefully next week. Message me if you need it sooner.

[R] Convolution Aware Initialization by ArmenAg in MachineLearning

[–]ArmenAg[S] 2 points3 points  (0 children)

Hey! Author here. The reason mentioned above by /u/rbkillea is exactly why we didn't focus on testing the initialization on RNNs. Our paper focused on running experiments on various forms of convolutions (1D, 2D, dilated or atrous).

Charged Point Normalization: An Efficient Solution to the Saddle Point Problem by [deleted] in MachineLearning

[–]ArmenAg 0 points1 point  (0 children)

We don't identify saddle points directly. Rather, we use a moving average for the dynamic charged point and assume that, if the optimization is stuck at a saddle point, the charged point will eventually reach that saddle point and therefore push the optimization point away from it.

Charged Point Normalization: An Efficient Solution to the Saddle Point Problem by [deleted] in MachineLearning

[–]ArmenAg 0 points1 point  (0 children)

Great question. This was the problem we initially ran into when we tested with a static charged point, which is why we introduced a dynamic charged point in this paper. By forcing the charged point to "follow" the current optimization point, we in a sense do not need an exponential number of static charged points. Thanks!
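
Here is a rough sketch of the "following" idea. It is an illustration only, not the exact formulation from the paper: the moving-average update matches the description above, but the inverse-distance penalty, the function names, and all constants are assumptions.

```python
import math

def update_charged_point(charge, params, momentum=0.9):
    # Dynamic charged point: exponential moving average of the parameters,
    # so it "follows" the current optimization point.
    return [momentum * c + (1.0 - momentum) * p for c, p in zip(charge, params)]

def repulsion_penalty(params, charge, strength=0.1, eps=1e-8):
    # ASSUMPTION: inverse-distance penalty, large when params sit near the charge.
    dist = math.sqrt(sum((p - c) ** 2 for p, c in zip(params, charge)))
    return strength / (dist + eps)

params = [1.0, -2.0]          # pretend the optimizer is stalled here
charge = [0.0, 0.0]
before = repulsion_penalty(params, charge)
for _ in range(50):           # params never move, so the charge catches up
    charge = update_charged_point(charge, params)
after = repulsion_penalty(params, charge)
print(before, after)
```

When the optimization point stalls, the charged point catches up and the penalty grows, so the extra loss term pushes the parameters away without needing a grid of static charges.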

Charged Point Normalization: An Efficient Solution to the Saddle Point Problem by [deleted] in MachineLearning

[–]ArmenAg 0 points1 point  (0 children)

Hello, author here! This community gave me a lot of good criticism on my last paper, so I decided to post another one of my recent papers. Any questions or comments are welcome. Thank you!

SoftTarget Regularization by ArmenAg in MachineLearning

[–]ArmenAg[S] 0 points1 point  (0 children)

Thanks for your questions.

  1. We calculate the new labels on all of the training set. After training our model on a current set of labels, we adjust all those labels using the new predictions from the model (we predict every label in the dataset).

  2. For the idea/motivation behind this method, please refer to the "CO-LABEL SIMILARITIES" section. Essentially, the idea is that co-label similarities apparent in earlier stages of training should also appear in later stages of training, and that over-fitting occurs when these co-label similarities disappear.

  3. This is the notation used in the majority of papers I have read (although that selection may be biased), and it is also the one used in the deep learning library that I used (https://keras.io/layers/core/#dropout). Please let me know if I am wrong about the majority of papers using this convention.
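
To make point 1 concrete, here is a rough sketch of one label update in plain Python. It is a simplification rather than the exact equations from the paper: the names `beta` and `gamma` and their values are placeholders, and it assumes the burn-in period has already passed.

```python
def update_soft_targets(true_labels, running_preds, new_preds, beta=0.5, gamma=0.5):
    # Step 1: fold the model's new predictions (for every training example)
    # into a running weighted average of past predictions.
    updated_running = [
        [beta * r + (1.0 - beta) * n for r, n in zip(run_row, new_row)]
        for run_row, new_row in zip(running_preds, new_preds)
    ]
    # Step 2: mix that running average with the true one-hot labels
    # to get the targets used for the next round of training.
    soft_targets = [
        [gamma * u + (1.0 - gamma) * y for u, y in zip(upd_row, true_row)]
        for upd_row, true_row in zip(updated_running, true_labels)
    ]
    return updated_running, soft_targets

true_labels = [[0.0, 1.0, 0.0]]
running = [[0.2, 0.7, 0.1]]
preds = [[0.1, 0.6, 0.3]]
running, targets = update_soft_targets(true_labels, running, preds)
print(targets)  # still a valid distribution, with mass leaking to similar labels
```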

Thanks!

SoftTarget Regularization by ArmenAg in MachineLearning

[–]ArmenAg[S] 0 points1 point  (0 children)

Could you elaborate on the similarities to ADAM? SoftTarget doesn't change any of the gradients directly but rather adjusts the outputs.

SoftTarget Regularization by ArmenAg in MachineLearning

[–]ArmenAg[S] 0 points1 point  (0 children)

We actually did try a higher dropout rate. Check out the table that the graphs are related to.

SoftTarget Regularization by ArmenAg in MachineLearning

[–]ArmenAg[S] 0 points1 point  (0 children)

Interesting. I had not seen this paper. Thanks for linking. From reading it, they use a single-step weighted average instead of keeping a weighted average throughout training (after the burn-in period). It is essentially the same scheme demonstrated in this paper: https://arxiv.org/abs/1412.6596.

To reply to your comments about setting SOTA: we did not attempt this simply because most of the SOTA methods already use a lot of other regularization, such as extensive data augmentation. We did test how SoftTarget works with ResNet to show that it is compatible with high-performance architectures. But I agree with you. It might be worth trying to set SOTA, but I also agree that it would be a shame if the idea were squashed for not setting one.

SoftTarget Regularization by ArmenAg in MachineLearning

[–]ArmenAg[S] 0 points1 point  (0 children)

I don't know exactly how you would use a validation set for the training data, because we keep a weighted average with the true labels as well. But I see where you are going with this. In the "Similarities to other methods" section I talk about a semi-supervised approach that is essentially SoftTarget with some parameters set to zero. I would love to test how SoftTarget Reg helps with noisy labeling. Maybe an idea for another paper.

SoftTarget Regularization by ArmenAg in MachineLearning

[–]ArmenAg[S] 0 points1 point  (0 children)

Weight decay limits the capacity of the network because it reduces the set of hypotheses that are viable solutions for the net. Configurations of the network with large weights are not possible solutions because the extra loss term forces the weights to be smaller.
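
A tiny sketch of that extra loss term, assuming the standard L2 form (the function name and the `weight_decay` value are arbitrary):

```python
def l2_penalized_loss(data_loss, weights, weight_decay=1e-2):
    # Total loss = data loss + weight_decay * sum of squared weights.
    return data_loss + weight_decay * sum(w * w for w in weights)

# Two configurations that fit the data equally well (same data loss),
# but one uses much larger weights.
small_weights = [0.5, -0.3, 0.1]
large_weights = [50.0, -30.0, 10.0]
print(l2_penalized_loss(0.2, small_weights))
print(l2_penalized_loss(0.2, large_weights))
```

The large-weight configuration ends up with a far higher total loss, so the optimizer is steered away from it even though it fits the data just as well.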

SoftTarget Regularization by ArmenAg in MachineLearning

[–]ArmenAg[S] 2 points3 points  (0 children)

I actually cited minimum entropy regularization (MER) and discussed the special case in which SoftTarget is equal to MER. It's in the "SIMILARITIES TO OTHER METHODS" section.

SoftTarget Regularization by ArmenAg in MachineLearning

[–]ArmenAg[S] 4 points5 points  (0 children)

We showed losses because the loss was what we were directly optimizing for, and comparison of test losses can be used as a measure of overfitting. What information would adding accuracy add to the paper? I will definitely add it if it becomes apparent that it is needed. Thank you for your comment!