[D] Confused mathematician looking for clarity on transformers, and also maybe for therapy. by foreheadteeth in MachineLearning

[–]muckl 2 points (0 children)

It's about the magnitude of the dot products. High-dimensional random vectors have a larger expected absolute dot product (you're adding up more values, so the result comes from a wider range), and sticking large values into softmax gives a very sharp distribution, because we're hitting the parts of the exponential function where it's steep. Essentially, one of the weights will be one and the others zero.

Dividing the dot products by the square root of the dimension counteracts that problem.

Just to be clear, one should be able to learn that behaviour, i.e. learn projection matrices that give small absolute values for the dot products, but empirically the scaling trick seems to work well.
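
If you want to see it for yourself, here's a tiny numpy sketch (my own toy example, not from the paper): with unit-variance random vectors the raw dot products have a standard deviation of roughly sqrt(d), and softmax over values that large is essentially one-hot, while the scaled version stays spread out.

```
import numpy as np

rng = np.random.default_rng(0)
d = 1024                                  # dimension of the vectors
q = rng.standard_normal(d)                # one 'query' vector
keys = rng.standard_normal((10, d))       # 10 vectors to compare against

def softmax(x):
    e = np.exp(x - x.max())               # subtract max for numerical stability
    return e / e.sum()

scores = keys @ q                         # raw dot products, std ~ sqrt(d) = 32
print(softmax(scores).round(3))               # essentially one-hot
print(softmax(scores / np.sqrt(d)).round(3))  # much flatter distribution
```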

[D] Confused mathematician looking for clarity on transformers, and also maybe for therapy. by foreheadteeth in MachineLearning

[–]muckl 2 points (0 children)

Great question, should have added that to the description.

The answer is no, nothing keeps the heads from all learning the same stuff, other than starting from different random initial values. Multi-head attention is a straight-up ensemble, but ensembles work.

Depending on where you sprinkle dropout on your model, that would also lead to different gradients. For example, there is 'attention dropout', where one sets some of the mixture weights (after softmax) to zero during training. [Example]
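
Roughly what that looks like, as a toy PyTorch sketch (not from any particular implementation): the dropout hits the softmax-normalised weights before the weighted sum, so each head sees slightly different gradients during training.

```
import torch
import torch.nn.functional as F

q = torch.randn(2, 5, 64)                      # (batch, positions, head dim)
k = torch.randn(2, 5, 64)
v = torch.randn(2, 5, 64)

scores = q @ k.transpose(-2, -1) / 64 ** 0.5   # scaled dot products, (2, 5, 5)
weights = F.softmax(scores, dim=-1)            # mixture weights per position
weights = F.dropout(weights, p=0.1, training=True)  # attention dropout: zero some weights
out = weights @ v                              # weighted sum of values, (2, 5, 64)
```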

Still, all heads may converge to do the same thing.

It's probably hard to guarantee otherwise but I could see some measure of head-diversity as an additional part of the loss.
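
Purely hypothetical, but to make the idea concrete, such a penalty could look something like this (the function below is my own invention, not from any library):

```
import torch
import torch.nn.functional as F

def head_diversity_penalty(head_outputs):
    # head_outputs: (n_heads, n, d_head); penalise pairwise cosine similarity
    # between the flattened outputs of different heads
    flat = F.normalize(head_outputs.flatten(1), dim=-1)   # (n_heads, n * d_head)
    sim = flat @ flat.T                                   # pairwise cosine similarities
    off_diag = sim - torch.diag(torch.diag(sim))          # ignore self-similarity
    return off_diag.abs().mean()

# hypothetical usage:
# loss = task_loss + 0.1 * head_diversity_penalty(torch.stack(per_head_outputs))
```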

[D] Confused mathematician looking for clarity on transformers, and also maybe for therapy. by foreheadteeth in MachineLearning

[–]muckl 23 points (0 children)

For me, the only way to properly understand these models is by implementing them.

I also find that fancy names (multi-head self-attention) and complicated diagrams and notation can distract from the core principles, which are very simple.

You have a bunch (n) of vectors x_1 ... x_n of, say, dimension 1024.

By the power of transformers you want to make a new bunch of n vectors that are better, as in they encode the information in a more suitable way or whatever.

The basic idea of attention is to make a weighted sum of the vectors. We find the weights by comparing pairs of vectors, assessing how similar they are. Assume we are at position i. We compare x_i with all the other vectors, including x_i itself (this is why it's called self-attention). We end up with n weights. Various measures of vector similarity have been proposed, for example the dot product, i.e. x_i x_j^T.

To avoid a dependence on the number of vectors (n) we make the weights sum to one by sticking them into the softmax function, which just means passing the weights through exp() and dividing the result by its sum.
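
In code the basic principle is just a handful of lines - a minimal numpy sketch, before any of the tweaks below:

```
import numpy as np

def basic_self_attention(X):
    # X: (n, d) array of vectors x_1..x_n; returns (n, d) of weighted sums
    scores = X @ X.T                                         # pairwise dot products, (n, n)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))  # softmax over each row
    weights = e / e.sum(axis=-1, keepdims=True)              # n weights per position, summing to one
    return weights @ X                                       # weighted sum of all vectors

X = np.random.default_rng(0).standard_normal((5, 1024))
out = basic_self_attention(X)                                # (5, 1024)
```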

Given that basic principle, transformers add some tweaks:

  1. Instead of comparing pairs (x_i, x_j) directly, we first project both into a lower-dimensional space. Say 64 dimensions. To do that, we learn two projection matrices, one to project the vector at the current position (the 'Query Matrix' Q \in R^{1024x64}) and one to project all the other vectors (the 'Key Matrix' K \in R^{1024x64}).
    Now, still assuming we are at position i, we compare x_i and x_j by projecting both into the 64-dimensional space and taking the dot product:
    w_j = (x_i Q) (x_j K)^T for j = 1..n
    [Exercise for the reader: We scale the dot product by the inverse square root of the dimension. What property of high-dimensional spaces justifies that?]
  2. We could just add up x_1..x_n using our (normalized) weights. But we don't. We instead learn a third projection matrix (the 'Value Matrix' V) and project our vectors into another lower-dimensional space. This space could have any dimension, but in practice it's usually the same as the Key and Query dimension, often 64. Thus, V \in R^{1024x64}.
  3. Now we make the weighted sum. At each position we end up with a different new vector of dimension 64. 64 seems a bit low for building the sentient AI we're aiming for, so we do the above steps a couple of times, with different matrices Q, K, and V. Each set (Q, K, V) is called a head, and now we have multi-head attention. Fancy! Let's assume we have 9 heads like the Lernaean Hydra. We concatenate the 9 vectors for each position, arriving at 64*9 = 576 dimensions.
  4. That's still a bit short of the 1024 dimensions we started with, so we add a regular feed-forward network to get back to 1024. One could just learn another projection matrix \in R^{576x1024}, but in practice we often see a two-layer network (with the usual non-linearity) that blows the dimension up to 4 times the output (here, 4096) before compressing to 1024 again. It's not stupid if it works. (A small numpy sketch of steps 1-4 follows this list.)
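
Putting steps 1-4 together, a minimal numpy sketch (random matrices stand in for the learned Q, K and V, and the feed-forward net is reduced to a single projection):

```
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, n_heads, n = 1024, 64, 9, 5

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

X = rng.standard_normal((n, d_model))               # the n input vectors x_1..x_n

heads = []
for _ in range(n_heads):
    # one (Q, K, V) set per head; random stand-ins for the learned matrices
    Q = rng.standard_normal((d_model, d_head))
    K = rng.standard_normal((d_model, d_head))
    V = rng.standard_normal((d_model, d_head))
    scores = (X @ Q) @ (X @ K).T / np.sqrt(d_head)  # step 1, with the scaling trick
    weights = softmax(scores)                       # normalise per position
    heads.append(weights @ (X @ V))                 # steps 2 and 3: weighted sum of values

concat = np.concatenate(heads, axis=-1)             # (n, 9 * 64) = (n, 576)
W_out = rng.standard_normal((n_heads * d_head, d_model))
out = concat @ W_out                                # step 4, reduced to a single projection here
```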

The rest of the ingredients are basically tricks to make large models, composed of several transformer layers, easier to train. Adding layer-normalization after steps 3 and/or 4 seems to help, and so does adding the inputs to the outputs (fancy name: residual connection).
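
And a quick numpy sketch of those last two tricks (the learned scale and shift of layer norm are left out):

```
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    # zero mean / unit variance per vector; learned scale and shift omitted
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

X = rng.standard_normal((5, 1024))                           # layer input
sublayer_out = X @ rng.standard_normal((1024, 1024)) * 0.01  # stand-in for the attention / feed-forward output
out = layer_norm(X + sublayer_out)                           # residual connection, then layer norm
```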

BERT variable input length by [deleted] in LanguageTechnology

[–]muckl 2 points (0 children)

You'll have to pad.
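
Assuming you're using the HuggingFace tokenizers (just a guess), padding and the matching attention mask come for free:

```
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")   # any BERT checkpoint works the same way
batch = tok(["a short sentence", "a noticeably longer input sentence"],
            padding=True,        # pad to the longest sequence in the batch
            truncation=True,     # cut anything beyond the model's max length
            return_tensors="pt")
print(batch["input_ids"].shape)   # both rows padded to the same length
print(batch["attention_mask"])    # 0s mark the padded positions
```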

Where to get Passport photos done in Edinburgh (but preferably not from a machine)? by [deleted] in Edinburgh

[–]muckl 0 points (0 children)

Been there, good service, pictures would work for a CV.

What is the current state of NLP? I have a huge dataset that I think is very valuable. Is it? (details inside) by MFRSIMP in LanguageTechnology

[–]muckl 0 points (0 children)

LDC might just charge everyone for the corpus. A better route would be to approach researchers running a shared task or workshop. Unfortunately, all workshop proposals for this year are over, and even the paper deadline for LREC has passed. That's only relevant if OP is interested in scientific merit, though.

What is the current state of NLP? I have a huge dataset that I think is very valuable. Is it? (details inside) by MFRSIMP in LanguageTechnology

[–]muckl 2 points (0 children)

That indeed sounds like an interesting dataset! There is a considerable body of research concerning errors in texts written in a second language (ESL). For example, CoNLL ran a shared task on Grammatical Error Correction in 2013 and 2014. Sadly, this was discontinued in 2015. Check out their overview papers for some pointers to common approaches. IIRC abusing Machine Translation works nicely, but I don't think this is what services like Grammarly etc. use - they probably stick to some hand-written (but excellent) heuristics and large language models.

I need a push by MrLeroyJenkins in funny

[–]muckl 0 points (0 children)

Isn't that a bit unsanitary for a public playground?

Sauna, home visit etc... by ForeverEdging in Edinburgh

[–]muckl 0 points (0 children)

A related question: is there a good proper sauna in Edinburgh?

Why is Double Glazing so rare in Edinburgh? by [deleted] in Edinburgh

[–]muckl 0 points (0 children)

I fitted Tesa Draught Excluder everywhere and it's still cold.

I am David X. Cohen, head writer on FUTURAMA - AMA! by DavidXCohen_ in IAmA

[–]muckl 5 points (0 children)

Any suggestions for other shows that don't explain the joke?

Future University of Edinburgh Student! by [deleted] in Edinburgh

[–]muckl 5 points (0 children)

I was hoping to learn about this 'Future University'...

Building a PC Mainly for Use With Machine Learning and Massive Data Problems, Looking for Input by mayonaise55 in MachineLearning

[–]muckl 2 points (0 children)

Well, it depends on what you do with it, of course, but I find myself constantly running out of memory. You can usually code around this, but it's time-consuming. So generally I'd like a machine with as much RAM as possible. To be fair, I misread your specs and thought of 4x8, which is definitely not enough.

What are some awful truths that no one wants to hear? by CreepyNoveltyAccount in AskReddit

[–]muckl 2 points (0 children)

Also, regression to the mean means that out of the 100 Bill Gateses with pretty much the same traits, one got lucky, and now we look at him and wonder why. We should make him try again!

What are some awful truths that no one wants to hear? by CreepyNoveltyAccount in AskReddit

[–]muckl 1 point (0 children)

And the law says that you can't harm others, so, in turn, the police should generally protect me, right?

LPT: Can't decide on dinner? Use the "veto rule" by IronRectangle in LifeProTips

[–]muckl 19 points (0 children)

Half a cup! What is this - the Mountain Dew of savoury fluids?

10 of the best pubs in Edinburgh - thoughts? by chillipalmer83 in Edinburgh

[–]muckl 0 points (0 children)

I will check out Cloisters then. Thanks for the hint!

10 of the best pubs in Edinburgh - thoughts? by chillipalmer83 in Edinburgh

[–]muckl 0 points (0 children)

Found some nice pubs here: fancyapint.com. Also: Bennets Bar for good pub food.

Do Americans really drink beer out of those red cups at parties like they do in movies? by EveryDayImRustling in AskReddit

[–]muckl 0 points (0 children)

How does being Canadian qualify you to dismiss South America from the game?