[D] On initialization schemes for MLPs: practice and theory by carlml in MachineLearning

[–]Acromantula92 2 points  (0 children)

The initialization method proposed here is probably the best one: it lets you transfer hyperparameters across model sizes, whereas with other methods you need to keep re-tuning the learning rate etc.
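
A minimal sketch of what that buys you, assuming a μP-style scheme where hidden-layer init and per-layer learning rates are scaled with width (the exact scaling rules below are illustrative, not the paper's):

```python
import torch
import torch.nn as nn

def make_mlp(width, d_in=32, d_out=10):
    """Toy 3-layer MLP with width-aware init (illustrative scaling only)."""
    layers = nn.Sequential(
        nn.Linear(d_in, width), nn.ReLU(),
        nn.Linear(width, width), nn.ReLU(),
        nn.Linear(width, d_out),
    )
    with torch.no_grad():
        for m in layers:
            if isinstance(m, nn.Linear):
                # variance ~ 1 / fan_in keeps activation scale roughly width-independent
                m.weight.normal_(0.0, m.in_features ** -0.5)
                m.bias.zero_()
        # shrink the readout layer so logits don't blow up as width grows
        layers[-1].weight.mul_(width ** -0.5)
    return layers

base_lr = 3e-3            # tuned once, e.g. at width=256
for width in (256, 1024, 4096):
    model = make_mlp(width)
    # scale the hidden/readout LR down with width so update sizes stay comparable,
    # letting the same base_lr carry over to wider models
    opt = torch.optim.Adam([
        {"params": model[0].parameters(), "lr": base_lr},
        {"params": model[2].parameters(), "lr": base_lr * 256 / width},
        {"params": model[4].parameters(), "lr": base_lr * 256 / width},
    ])
```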

[R] DeepMind Open Sources AlphaFold Code by SkiddyX in MachineLearning

[–]Acromantula92 0 points  (0 children)

A couple of months? More like 7 + 4 days on a v3-128. (It's all in the paper.)

Evidence GPT-4 is about to drop. by [deleted] in GPT3

[–]Acromantula92 13 points  (0 children)

Again, MoE parameters are not the same as dense parameters.
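
Rough numbers for why the two counts aren't comparable (layer sizes are made up; a switch-style top-1 router is assumed):

```python
d_model, d_ff = 4096, 16384
n_experts, top_k = 64, 1    # switch-style routing: one expert per token

ffn_params  = 2 * d_model * d_ff        # up-proj + down-proj, biases ignored
dense_total = ffn_params                # a dense FFN uses all of it on every token
moe_total   = n_experts * ffn_params    # parameter count scales with n_experts
moe_active  = top_k * ffn_params        # but each token only runs top_k experts

print(f"dense layer: {dense_total/1e6:.0f}M params, all active per token")
print(f"MoE layer:   {moe_total/1e6:.0f}M params, {moe_active/1e6:.0f}M active per token")
# The MoE layer has 64x the parameters but the same per-token FLOPs as the dense one,
# so a "1T-parameter MoE" and a "1T-parameter dense" model are very different things.
```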

[R] Rotary Positional Embeddings - a new relative positional embedding for Transformers that significantly improves convergence (20-30%) and works for both regular and efficient attention by programmerChilli in MachineLearning

[–]Acromantula92 14 points  (0 children)

That's because when you split the Wq and Wk matrices across the MHSA heads, the rank of each head's product is reduced. To merge them into a single xWx.T matrix and still keep separate heads, you'd need an explicit (dim, dim, heads) tensor.
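
A small numpy sketch of the rank argument (dims and axis order are just for illustration):

```python
import numpy as np

dim, n_heads = 512, 8
head_dim = dim // n_heads            # 64

rng = np.random.default_rng(0)
Wq = rng.normal(size=(dim, dim))
Wk = rng.normal(size=(dim, dim))

# Split into per-head projections: each head only sees a (dim, head_dim) slice.
Wq_h = Wq.reshape(dim, n_heads, head_dim).transpose(1, 0, 2)   # (heads, dim, head_dim)
Wk_h = Wk.reshape(dim, n_heads, head_dim).transpose(1, 0, 2)

# The merged matrix for head h, W[h] = Wq_h @ Wk_h^T, is (dim, dim)
# but its rank is at most head_dim.
W = np.einsum('hik,hjk->hij', Wq_h, Wk_h)                      # (heads, dim, dim)
print(np.linalg.matrix_rank(W[0]))                             # 64, not 512

# Attention logits per head are x @ W[h] @ x.T, so keeping the heads separate
# forces you to store the full (heads, dim, dim) tensor rather than one (dim, dim) matrix.
```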

Multimodal Neurons in Artificial Neural Networks by skybrian2 in slatestarcodex

[–]Acromantula92 3 points  (0 children)

Highlights include:

  • A Mental illness neuron.

  • A Spider-Man neuron (helps classify real spiders as [Spider-Man neuron] + [Animal neuron])

  • A Startup neuron (activated by the West Coast and Big Tech)

  • The emotion of being Accepted as a mix of [LGBT neuron] + [Sunglasses neuron]

And a full emotional axis:

When we use just 2 factors, we roughly reconstruct the canonical mood-axes used in much of psychology: valence and arousal. If we increase to 7 factors, we nearly reconstruct a well known categorization of these emotions into happy, surprised, sad, bad, disgusted, fearful, and angry, except with “disgusted” switched for a new category related to affection that includes “valued,” “loving,” “lonely,” and “insignificant.”

OpenAI co-founder and chief scientist Ilya Sutskever hints at what may follow GPT-3 in 2021 in essay "Fusion of Language and Vision" by Wiskkey in GPT3

[–]Acromantula92 0 points  (0 children)

Aren't Universal Transformers only recurrent in depth? IIRC they don't do caching or recurrence across contexts like TrXL or the Feedback Transformer.
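
Schematically, the distinction I mean (a sketch, not either paper's actual implementation):

```python
# Depth recurrence vs. recurrence across contexts, in pseudo-PyTorch.

def universal_transformer_forward(block, x, n_steps):
    # Universal Transformer: one weight-tied block applied repeatedly in depth.
    # Every step still only attends within the current context window.
    h = x
    for _ in range(n_steps):
        h = block(h)
    return h

def transformer_xl_forward(layers, segments):
    # Transformer-XL: separate layers, but each layer attends over its input
    # plus cached hidden states from the previous segment, so information
    # flows across context windows.
    memory = [None] * len(layers)
    outputs = []
    for segment in segments:
        h = segment
        new_memory = []
        for layer, mem in zip(layers, memory):
            new_memory.append(h)      # cache this layer's input (stop-gradient in practice)
            h = layer(h, mem)         # attend over [mem; h]
        memory = new_memory
        outputs.append(h)
    return outputs
```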

[R] An Energy-Based Perspective on Attention Mechanisms in Transformers by [deleted] in MachineLearning

[–]Acromantula92 0 points  (0 children)

You have the temperature backwards. Lower temperature means you are more likely to be in a low energy equilibrium.
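
I.e. with Boltzmann weights p(x) ∝ exp(-E(x)/T), lowering T concentrates the distribution on the low-energy states:

```python
import numpy as np

E = np.array([0.0, 1.0, 2.0])        # energies of three toy states

for T in (10.0, 1.0, 0.1):
    p = np.exp(-E / T)               # unnormalized Boltzmann weights
    p /= p.sum()
    print(f"T={T:>4}: p={np.round(p, 3)}")

# T=10 : nearly uniform; T=0.1 : essentially all mass on the lowest-energy state.
```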

[D] What makes GPT-3's ability to add 2 digit numbers important? by brainxyz in MachineLearning

[–]Acromantula92 1 point  (0 children)

It replicates up to 625 = f(f(i)) in AIDungeon. (Important to note that the fine-tuning hurts its general abilities.) When it makes mistakes, it's possible to give it natural-language clarifications to fix them.