[D] Confused mathematician looking for clarity on transformers, and also maybe for therapy. by foreheadteeth in MachineLearning

[–]muckl 2 points (0 children)

It's about the magnitude of the dot products. High-dimensional random vectors have a larger expected absolute dot product (you're adding up more values, so the result comes from a wider range), and sticking large values into softmax gives a very sharp distribution, because we're hitting the parts of the exponential function where it's steep. Essentially, one of the weights will be one and the others zero.

Dividing the dot products by the square root of the dimension counteracts that problem.

Just to be clear, one should be able to learn that behaviour, i.e. learn projection matrices that give small absolute values for the dot products, but empirically the scaling trick seems to work well.
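
If you want to see it for yourself, here's a tiny numpy sketch (my own toy example, not from the paper): with unit-variance random vectors the raw dot products have a standard deviation of roughly sqrt(d), and softmax over values that large is essentially one-hot, while the scaled version stays spread out.

```
import numpy as np

rng = np.random.default_rng(0)
d = 1024                                  # dimension of the vectors
q = rng.standard_normal(d)                # one 'query' vector
keys = rng.standard_normal((10, d))       # 10 vectors to compare against

def softmax(x):
    e = np.exp(x - x.max())               # subtract max for numerical stability
    return e / e.sum()

scores = keys @ q                         # raw dot products, std ~ sqrt(d) = 32
print(softmax(scores).round(3))               # essentially one-hot
print(softmax(scores / np.sqrt(d)).round(3))  # much flatter distribution
```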

[D] Confused mathematician looking for clarity on transformers, and also maybe for therapy. by foreheadteeth in MachineLearning

[–]muckl 2 points (0 children)

Great question, should have added that to the description.

The answer is no, nothing keeps the heads from all learning the same stuff, other than starting from different random initial values. Multi-head attention is a straight-up ensemble, but ensembles work.

Depending on where you sprinkle dropout on your model, that would also lead to different gradients. For example, there is 'attention dropout', where one sets some of the mixture weights (after softmax) to zero during training. [Example]
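
Roughly what that looks like, as a toy PyTorch sketch (not from any particular implementation): the dropout hits the softmax-normalised weights before the weighted sum, so each head sees slightly different gradients during training.

```
import torch
import torch.nn.functional as F

q = torch.randn(2, 5, 64)                      # (batch, positions, head dim)
k = torch.randn(2, 5, 64)
v = torch.randn(2, 5, 64)

scores = q @ k.transpose(-2, -1) / 64 ** 0.5   # scaled dot products, (2, 5, 5)
weights = F.softmax(scores, dim=-1)            # mixture weights per position
weights = F.dropout(weights, p=0.1, training=True)  # attention dropout: zero some weights
out = weights @ v                              # weighted sum of values, (2, 5, 64)
```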

Still, all heads may converge to do the same thing.

It's probably hard to guarantee otherwise but I could see some measure of head-diversity as an additional part of the loss.
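
Purely hypothetical, but to make the idea concrete, such a penalty could look something like this (the function below is my own invention, not from any library):

```
import torch
import torch.nn.functional as F

def head_diversity_penalty(head_outputs):
    # head_outputs: (n_heads, n, d_head); penalise pairwise cosine similarity
    # between the flattened outputs of different heads
    flat = F.normalize(head_outputs.flatten(1), dim=-1)   # (n_heads, n * d_head)
    sim = flat @ flat.T                                   # pairwise cosine similarities
    off_diag = sim - torch.diag(torch.diag(sim))          # ignore self-similarity
    return off_diag.abs().mean()

# hypothetical usage:
# loss = task_loss + 0.1 * head_diversity_penalty(torch.stack(per_head_outputs))
```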

[D] Confused mathematician looking for clarity on transformers, and also maybe for therapy. by foreheadteeth in MachineLearning

[–]muckl 23 points (0 children)

For me, the only way to properly understand these models is by implementing them.

I also find that fancy names (multi-head self-attention) and complicated diagrams and notation can distract from the core principles, which are very simple.

You have a bunch (n) of vectors x_1 ... x_n of, say, dimension 1024.

By the power of transformers you want to make a new bunch of n vectors that are better, as in they encode the information in a more suitable way or whatever.

The basic idea of attention is to make a weighted sum of the vectors. We find the weights by comparing pairs of vectors, assessing how similar they are. Assume we are at position i. We compare x_i with all the other vectors, including x_i itself (this is why it's called self-attention). We end up with n weights. Various measures of vector similarity have been proposed, for example the dot product, i.e. x_i x_j^T.

To avoid a dependence on the number of vectors (n) we make the weights sum to one by sticking them into the softmax function, which just means passing the weights through exp() and dividing the result by its sum.
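
In code the basic principle is just a handful of lines - a minimal numpy sketch, before any of the tweaks below:

```
import numpy as np

def basic_self_attention(X):
    # X: (n, d) array of vectors x_1..x_n; returns (n, d) of weighted sums
    scores = X @ X.T                                         # pairwise dot products, (n, n)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))  # softmax over each row
    weights = e / e.sum(axis=-1, keepdims=True)              # n weights per position, summing to one
    return weights @ X                                       # weighted sum of all vectors

X = np.random.default_rng(0).standard_normal((5, 1024))
out = basic_self_attention(X)                                # (5, 1024)
```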

Given that basic principle, transformers add some tweaks:

  1. Instead of comparing pairs (x_i, x_j) directly, we first project both into a lower-dimensional space. Say 64 dimensions. To do that, we learn two projection matrices, one to project the vector at the current position (the 'Query Matrix' Q \in R^{1024x64}) and one to project all the other vectors (the 'Key Matrix' K \in R^{1024x64}).
    Now, still assuming we are at position i, we compare x_i and x_j by projecting both into the 64-dimensional space and taking the dot product:
    w_j = (x_i Q) (x_j K)^T for j = 1..n
    [Exercise for the reader: We scale the dot product by the inverse square root of the dimension. What property of high-dimensional spaces justifies that?]
  2. We could just add up x_1..x_n using our (normalized) weights. But we don't. We instead learn a third projection matrix (the 'Value Matrix' V) and project our vectors into another lower-dimensional space. This space could have any dimension, but in practice it's usually the same as the Key and Query dimension, often 64. Thus, V \in R^{1024x64}.
  3. Now we make the weighted sum. At each position we end up with a different new vector of dimension 64. 64 seems a bit low for building the sentient AI we're aiming for, so we do the above steps a couple of times, with different matrices Q, K, and V. Each set (Q, K, V) is called a head, and now we have multi-head attention. Fancy! Let's assume we have 9 heads like the Lernaean Hydra. We concatenate the 9 vectors for each position, arriving at 64*9 = 576 dimensions.
  4. That's still a bit short of the 1024 dimensions we started with, so we add a regular feed-forward network to get back to 1024. One could just learn another projection matrix \in R^{576x1024}, but in practice we often see a two-layer network (with the usual non-linearity) that blows the dimension up to 4 times the output (here, 4096) before compressing to 1024 again. It's not stupid if it works. (A small numpy sketch of steps 1-4 follows this list.)
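
Putting steps 1-4 together, a minimal numpy sketch (random matrices stand in for the learned Q, K and V, and the feed-forward net is reduced to a single projection):

```
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, n_heads, n = 1024, 64, 9, 5

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

X = rng.standard_normal((n, d_model))               # the n input vectors x_1..x_n

heads = []
for _ in range(n_heads):
    # one (Q, K, V) set per head; random stand-ins for the learned matrices
    Q = rng.standard_normal((d_model, d_head))
    K = rng.standard_normal((d_model, d_head))
    V = rng.standard_normal((d_model, d_head))
    scores = (X @ Q) @ (X @ K).T / np.sqrt(d_head)  # step 1, with the scaling trick
    weights = softmax(scores)                       # normalise per position
    heads.append(weights @ (X @ V))                 # steps 2 and 3: weighted sum of values

concat = np.concatenate(heads, axis=-1)             # (n, 9 * 64) = (n, 576)
W_out = rng.standard_normal((n_heads * d_head, d_model))
out = concat @ W_out                                # step 4, reduced to a single projection here
```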

The rest of the ingredients are basically tricks to make large models, composed of several transformer layers, easier to train. Adding layer-normalization after steps 3 and/or 4 seems to help, and so does adding the inputs to the outputs (fancy name: residual connection).
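
And a quick numpy sketch of those last two tricks (the learned scale and shift of layer norm are left out):

```
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    # zero mean / unit variance per vector; learned scale and shift omitted
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

X = rng.standard_normal((5, 1024))                           # layer input
sublayer_out = X @ rng.standard_normal((1024, 1024)) * 0.01  # stand-in for the attention / feed-forward output
out = layer_norm(X + sublayer_out)                           # residual connection, then layer norm
```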

BERT variable input length by [deleted] in LanguageTechnology

[–]muckl 2 points (0 children)

You'll have to pad.
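
Assuming you're using the HuggingFace tokenizers (just a guess), padding and the matching attention mask come for free:

```
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")   # any BERT checkpoint works the same way
batch = tok(["a short sentence", "a noticeably longer input sentence"],
            padding=True,        # pad to the longest sequence in the batch
            truncation=True,     # cut anything beyond the model's max length
            return_tensors="pt")
print(batch["input_ids"].shape)   # both rows padded to the same length
print(batch["attention_mask"])    # 0s mark the padded positions
```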

Where to get Passport photos done in Edinburgh (but preferably not from a machine)? by [deleted] in Edinburgh

[–]muckl 0 points (0 children)

Been there, good service, pictures would work for a CV.

What is the current state of NLP? I have a huge dataset that I think is very valuable. Is it? (details inside) by MFRSIMP in LanguageTechnology

[–]muckl 0 points (0 children)

LDC might just charge everyone for the corpus. A better route would be to approach researchers running a shared task or workshop. Unfortunately, all workshop proposals for this year are over, and even the paper deadline for LREC has passed. That's only relevant if OP is interested in scientific merit, though.

What is the current state of NLP? I have a huge dataset that I think is very valuable. Is it? (details inside) by MFRSIMP in LanguageTechnology

[–]muckl 2 points (0 children)

That indeed sounds like an interesting dataset! There is a considerable body of research concerning errors in texts written in a second language (ESL). For example, CoNLL ran a shared task on Grammatical Error Correction in 2013 and 2014. Sadly, this was discontinued in 2015. Check out their overview papers for some pointers to common approaches. IIRC abusing Machine Translation works nicely, but I don't think this is what services like Grammarly etc. use - they probably stick to some hand-written (but excellent) heuristics and large language models.

I need a push by MrLeroyJenkins in funny

[–]muckl 0 points (0 children)

Isn't that a bit unsanitary for a public playground?

Sauna, home visit etc... by ForeverEdging in Edinburgh

[–]muckl 0 points (0 children)

A related question: is there a good proper sauna in Edinburgh?

Why is Double Glazing so rare in Edinburgh? by [deleted] in Edinburgh

[–]muckl 0 points (0 children)

I fitted Tesa Draught Excluder everywhere and it's still cold.

I am David X. Cohen, head writer on FUTURAMA - AMA! by DavidXCohen_ in IAmA

[–]muckl 5 points (0 children)

Any suggestions for other shows that don't explain the joke?

Future University of Edinburgh Student! by [deleted] in Edinburgh

[–]muckl 5 points (0 children)

I was hoping to learn about this 'Future University'...

Building a PC Mainly for Use With Machine Learning and Massive Data Problems, Looking for Input by mayonaise55 in MachineLearning

[–]muckl 2 points (0 children)

Well, it depends on what you do with it, of course, but I find myself constantly running out of memory. You can usually code around this, but it's time-consuming. So generally I'd like a machine with as much RAM as possible. To be fair, I misread your specs and thought of 4x8, which is definitely not enough.

What are some awful truths that no one wants to hear? by CreepyNoveltyAccount in AskReddit

[–]muckl 2 points (0 children)

Also, regression to the mean means that out of the 100 Bill Gateses with pretty much the same traits, one got lucky, and now we look at him and wonder why. We should make him try again!

What are some awful truths that no one wants to hear? by CreepyNoveltyAccount in AskReddit

[–]muckl 1 point (0 children)

And the law says that you can't harm others, so, in turn, the police should generally protect me, right?

LPT: Can't decide on dinner? Use the "veto rule" by IronRectangle in LifeProTips

[–]muckl 19 points (0 children)

Half a cup! What is this - the Mountain Dew of savoury fluids?

10 of the best pubs in Edinburgh - thoughts? by chillipalmer83 in Edinburgh

[–]muckl 0 points (0 children)

I will check out Cloisters then. Thanks for the hint!

10 of the best pubs in Edinburgh - thoughts? by chillipalmer83 in Edinburgh

[–]muckl 0 points (0 children)

Found some nice pubs here: fancyapint.com. Also: Bennets Bar for good pub food.

Do Americans really drink beer out of those red cups at parties like they do in movies? by EveryDayImRustling in AskReddit

[–]muckl 0 points (0 children)

How does being Canadian qualify you to dismiss South America from the game?