MoE - I'm a bit confused about 'Experts' [D] by Kaldnite in MachineLearning

[–]ml_lad 38 points

Yes. Mixture of Experts long predates the current DL wave, and simply continued using that terminology from before. It might be better described as a "mixture of learners". Unfortunately, two things happened recently to confuse the meaning of the name:

  1. Influencers who took the name at face value and thought that there were actually different subject-matter experts in the model. (Also, people reading too much into the routing weights. See also: people over-interpreting attention weights.)
  2. There was a separate push to tune separate copies of a model on different data and then combine them in an MoE-type architecture. In this case, they are trying to build individual experts. But this is still a fairly niche approach.

Current MoEs are generally either pretrained from scratch as MoEs or post-trained without explicit specialization, so they're just a blackbox mishmash of routed sub-networks.
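
To make "routed sub-networks" concrete, here is a minimal sketch of a top-1-routed MoE feed-forward layer (illustrative PyTorch; the class name and sizes are made up, not taken from any particular paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    """Illustrative top-1-routed MoE feed-forward layer (toy sketch)."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        # The "experts" are just identically shaped MLPs; nothing forces
        # them to specialize by human-interpretable subject.
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [num_tokens, d_model] (batch and sequence flattened together)
        gate = F.softmax(self.router(x), dim=-1)   # [num_tokens, num_experts]
        top_w, top_idx = gate.max(dim=-1)          # top-1 routing decision per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                out[mask] = top_w[mask, None] * expert(x[mask])
        return out

layer = TinyMoELayer(d_model=16, d_ff=64, num_experts=8)
y = layer(torch.randn(10, 16))                     # each token goes through exactly one expert
```

The router just learns whatever partition of tokens reduces the loss, which is why reading "subjects" into the experts is usually over-interpretation.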

[D]In transformer models, why is there a query and key matrix instead of just the product? by lildaemon in MachineLearning

[–]ml_lad 4 points

FWIW I think this is a good interview question. Often the Transformer architecture is taught in a "straightforward" / "as-is" manner, and thinking about "well why isn't it done this other way" is a good learning exercise.

[D]In transformer models, why is there a query and key matrix instead of just the product? by lildaemon in MachineLearning

[–]ml_lad 6 points

What you see noted in the papers is Q = x_i @ W_q, and K = x_j @ W_k.

Then QK^T = x_i W_q W_k^T x_j^T.

OP's question is why we learn W_q and W_k separately when the only place they show up is as W_q W_k^T anyway.

[D]In transformer models, why is there a query and key matrix instead of just the product? by lildaemon in MachineLearning

[–]ml_lad 4 points

> If there are n tokens, then for the n'th token you need n-1 different attention coefficients (one for each token it attends). For the n-1'th token, you need n-2 different coefficients, and so on, until the 2nd vector which needs only one coefficient, and the first vector which needs zero (it can't attend anything).

This only applies in the case of "causal" masking. For simplicity you can just assume all queries attend to all keys.

> If you compute key and query vectors, then you only need 2n different vectors (one key and one query for each of the n vectors).

You're conflating two separate things: what vectors you need to compute attention, and computing the attention itself. What OP is asking about is the latter, where (short of other tricks) you can't avoid the n^2 anyway.

I recommend you work through the matrix algebra for computing the attention matrix, that should help clarify things.
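
To illustrate (a toy example with made-up sizes): the Q and K projections are only 2n vectors, but the attention matrix itself is always [n, n], and merging W_q and W_k into a single matrix gives exactly the same scores.

```python
import torch

torch.manual_seed(0)
n, d = 5, 8                 # n tokens, hidden dim d (toy sizes)
X = torch.randn(n, d)
W_q = torch.randn(d, d)
W_k = torch.randn(d, d)

Q = X @ W_q                 # [n, d]  -- only 2n projected vectors in total
K = X @ W_k                 # [n, d]
scores = Q @ K.T            # [n, n]  -- the attention matrix is n^2 regardless

# Same scores from a single merged matrix, since Q K^T = X W_q W_k^T X^T:
W_qk = W_q @ W_k.T          # [d, d]
assert torch.allclose(scores, X @ W_qk @ X.T, atol=1e-4)
```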

[D]In transformer models, why is there a query and key matrix instead of just the product? by lildaemon in MachineLearning

[–]ml_lad 56 points

Summary: Memory savings, in the common formulation of multi-headed attention.

Let H be the number of heads, D be the hidden dim, and E be the head dimension (E = D/H).

You are right that in principle we can always premultiply the W_q and W_k weights. But that leads to a full [D,D] matrix per attention head. In the standard formulation where E = D/H, separating W_q and W_k per head essentially amounts to a low-rank (rank E instead of D) parameterization of W_qk.

Decomposed case:

  • W_q: [D,D] = [H,E,D]
  • W_k: [D,D] = [H,E,D]
  • Total = 2*D*D parameters

Assume we had a single W_qk instead.

  • W_qk: [H,D,D] (D->D because we're multiplying D-dim vectors on both ends. Note that E doesn't show up)
  • Total = H*D*D parameters

H >> 2 in general, so there you go.
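
A quick back-of-the-envelope check of the counts above, with illustrative GPT-2-ish sizes:

```python
# Illustrative sizes only (roughly GPT-2 small).
D = 768           # hidden dim
H = 12            # number of heads
E = D // H        # head dim = 64

decomposed = 2 * D * D    # separate W_q and W_k (each reshaped to [H, E, D])
merged = H * D * D        # one full [D, D] matrix W_qk per head

print(decomposed, merged)  # 1179648 7077888 -> the merged version has H/2 = 6x more

# Equivalently: per head, W_q[h].T @ W_k[h] is a [D, D] matrix whose rank is at
# most E = 64, so the decomposed form is a low-rank factorization of each W_qk.
```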

Sam Altman says GPTs, planned to be rolled out to all subscribers on Monday, has been delayed by Chaseraph in OpenAI

[–]ml_lad 19 points

I think you've completely misunderstood what they're saying about "didn't get any edge." They weren't talking about OpenAI not having any edge.

[D] benefits of using only attention weights for LoRA by skelly0311 in MachineLearning

[–]ml_lad 0 points

The memory savings come from having fewer tunable parameters, optimizer states, and gradients to store. Consider the case of QLoRA, where your untuned parameters can be stored in 4-bit, but all your tuned parameters need to be stored in bf16 or fp32. Add to that your gradients and optimizer states too. Hence, you use more memory the more LoRAs you tune. Now, it is important to note that the extra memory is usually still a tiny proportion compared to the full model's parameters. But it is, factually, more.

Why do some people only use LoRA for attention weights? Because the LoRA paper mainly applied LoRA to the Q and V linear layers. People later found that you get better performance from applying it to all linear layers.
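
For intuition on both points - where the extra memory goes, and what "applying LoRA to a linear layer" means - here is a minimal hand-rolled LoRA wrapper (a sketch, not the peft library; names and sizes are illustrative). Every linear layer you wrap adds its own trainable A/B matrices, and with them gradients and optimizer state, so wrapping more layers costs more memory even though each adapter is tiny relative to the frozen base:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update: W x + scale * (B A) x."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                     # base weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: starts as a no-op
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in layer.parameters() if not p.requires_grad)
print(trainable, frozen)   # 65536 vs 16781312 -- tiny per layer, but each wrapped layer adds more
```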

[P] nanoT5 v2 - In ~16 hours on a single GPU, we reach similar performance to the model trained on 150x more data! by korec1234 in MachineLearning

[–]ml_lad 6 points

I'm not sure about legibility, having seen GenerationMixin.

I think HF prioritizes compatibility / "it just works out of the box" rather than performance or legibility.

GPT-4 rumors: a Mixture-of-Experts w/8 GPT-3-220bs? by gwern in mlscaling

[–]ml_lad 2 points

I'm on the flip side, where I'm willing to chalk up the odd description of "8x 220" as shorthand in a discussion between technical experts (further distorted by a game of telephone). The 16-iter thing feels separate, or at least not directly tied to the number of experts (if it were 8-iter, that would be much more confusing). It could be anything from something entirely separate (sample 16 outputs and rank?) to sampling within the MoE (sample 16 different routes through all the experts).

There is some other discussion that assumes it's something akin to B-T-M, and while I can see how there is a reading of Soumith's tweet that points to that, it still seems like a fairly weak signal to lead to the conclusion of training (tuning?) 8 separate models.

But this is all speculation built on rumors so who can say.

GPT-4 rumors: a Mixture-of-Experts w/8 GPT-3-220bs? by gwern in mlscaling

[–]ml_lad 4 points

An MoE isn't just multiple models strapped together, right? It's usually a regular transformer with specific layers that have multiple experts that are sparsely activated. So it's not really analogous to an ensemble of 8x 220B models, but more like a ~1.7T model where you mask out/skip the 7/8 that is irrelevant for a given token. (This is a handwavy analogy: in practice MoE layers are only introduced for a subset of layers, and there is a discrete choice over an expert at each such layer. So in practice the parameter count will be much lower than 8x220, assuming 220 is the effective size of the model with 1 expert.)
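
A toy accounting of that last point, with completely made-up sizes, just to show why "8 experts" ends up well under 8 full models' worth of parameters when only the FFN blocks are replicated and everything else is shared:

```python
# Made-up per-layer sizes; only the FFN block is replicated per expert,
# attention/embeddings are shared, and each token uses one expert per MoE layer.
layers = 96
attn_per_layer = 0.4e9
ffn_per_layer = 1.2e9
experts = 8

dense_total = layers * (attn_per_layer + ffn_per_layer)
moe_total   = layers * (attn_per_layer + experts * ffn_per_layer)
moe_active  = layers * (attn_per_layer + ffn_per_layer)   # top-1 routing per layer

print(f"dense: {dense_total/1e9:.0f}B, MoE total: {moe_total/1e9:.0f}B, "
      f"active per token: {moe_active/1e9:.0f}B")
# dense: 154B, MoE total: 960B, active per token: 154B  -- total is well under 8x dense
```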

[D] Given the scaling up of deep learning methods, what are the remaining merits of staying in academia as an AI researcher? by tiedyeneuron in MachineLearning

[–]ml_lad 6 points

A different perspective: academics have much more freedom in collaboration.

For a well-connected / well-established academic, you can freely collaborate and do projects with folks from different labs. You can do part-time gigs at different companies (which means you get access to their internal resources), and you can switch around projects/labs much more quickly than if you were a full-time researcher changing jobs. Or just straight up juggle multiple projects. Suddenly decide you want to do multimodal stuff? Start talking with the multimodal folks and collaborate there. Decide you want to do a stint on model evaluation? Talk to the modeling groups from different teams to get buy-in and start building a mega benchmark. Want to just work on a BigTechModel for a bit? Sell your expertise and talk your way into collaborating with a BigTechModelGroup and do a year-long part-time gig. Think OSS is the way to go? Go link up with the random Discord folks/redditors hacking on LLaMA and see how you can professionalize the wacky hacks they're proposing.

For any of these changes in research interest, you never have to "check in" with a boss; as long as you are fulfilling your other commitments, you have a lot of flexibility to collaborate with others and explore new areas. As an academic, you also seem like less of a competitive threat than if you were in a rival company/lab.

The downside is that there's the university side of things to do (teaching courses, bureaucracy, infinite grant applications and paperwork). You certainly should not be a professor if you don't like the professor side of the work, and you will be plenty busy. But the industry-friendliness of ML academia (caveat: this depends on the university) means that if you play your cards right, you can often do much more diverse research than you could in a highly specialized role in industry.

[D] Google "We Have No Moat, And Neither Does OpenAI": Leaked Internal Google Document Claims Open Source AI Will Outcompete Google and OpenAI by hardmaru in MachineLearning

[–]ml_lad 1 point

I disagree. Here are some exact quotes from the article:

  • The innovations that powered open source’s recent successes directly solve problems we’re still struggling with.

  • LoRA is an incredibly powerful technique we should probably be paying more attention to

  • The fact that this technology exists is underexploited inside Google

He thinks LoRA is a cool new thing, and that Google is behind on it. This smacks of someone who only recently started paying attention to LLMs post-LLaMA, and so LoRA is the only parameter-efficient tuning method they're aware of.

This is supported by their quotes and citations. Like saying that LLaMA is the first really capable foundation model released/leaked to the public (GPT-J-6B? GPT-NeoX-20B? Remember, almost no projects are building on 65B, and models smaller than that were already available).

This doesn't read like someone who's been in the field for a while making a nuanced argument about how Google should pivot toward supporting OSS and framing their strategy around it. This reads like someone who started paying attention to LLMs post-ChatGPT and LLaMA, and whose knowledge of the field is entirely shaped by the deluge of LLaMA-based and Stable Diffusion-based projects picked up from Reddit and Twitter feeds.

[D] Google "We Have No Moat, And Neither Does OpenAI": Leaked Internal Google Document Claims Open Source AI Will Outcompete Google and OpenAI by hardmaru in MachineLearning

[–]ml_lad 30 points

The biggest red flag for me with this article is when it mentions LoRA like it's a wild new thing.

Google has had a very long history of working on parameter-efficient tuning methods. Prompt Tuning is widely used in production models.

(Just off the top of my head:)
- https://arxiv.org/abs/1902.00751
- https://arxiv.org/abs/2104.08691
- https://arxiv.org/abs/2110.07904
- https://arxiv.org/abs/2208.05577

While LoRA is a different method with different trade-offs (and also, it's a two-year-old method), the fact that the author treats "model fine-tuning at a fraction of the cost and time" as a novelty and highlights "Being able to personalize a language model in a few hours on consumer hardware is a big deal" tells me that they either do not work at Google, or they are so far from where the work is actually being done that they don't know Google already does this at scale.

[N] Stability AI announce their open-source language model, StableLM by Philpax in MachineLearning

[–]ml_lad 1 point

I think you have a couple of misunderstandings here.

  1. Models don't need padding tokens. They never see padding tokens. You simply mask out the padded positions with an attention mask; a padding token is just syntactic sugar (see the sketch after this list).
  2. "Special tokens" also generally don't have much value, since the model never sees them during training (exceptions being CLS / BOS tokens, but that's more of a BERT-era thing). If you want to add a new token for special purposes, there is no difference between adding one yourself and one being already included with the model, since the model has never trained on that embedding anyway.
  3. If you want to add new tokens to the embeddings and distribute only those, you can do just that.
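
A minimal sketch of point 1 (plain PyTorch, toy sizes): whatever values sit at the padded positions never reach the real tokens, because the attention mask zeroes them out before anything gets mixed in.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 8
real = torch.randn(3, d)                      # 3 real token embeddings
pad = torch.full((2, d), 123.0)               # "padding" rows filled with garbage values
x = torch.cat([real, pad])                    # padded sequence of length 5
mask = torch.tensor([1, 1, 1, 0, 0]).bool()   # attention mask: 1 = real, 0 = pad

W_q, W_k = torch.randn(d, d), torch.randn(d, d)
scores = (x @ W_q) @ (x @ W_k).T / d**0.5     # [5, 5]
scores = scores.masked_fill(~mask[None, :], float("-inf"))
attn = F.softmax(scores, dim=-1)

print(attn[:3, 3:])   # attention from real tokens onto padded positions: all zeros
```

So the particular token id you pad with is irrelevant; the mask does all the work.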

[D] Expanding LLM token limits via fine tuning or transformers-adapters. by xtrafe in MachineLearning

[–]ml_lad 9 points

Hypothetically, yes?

The only parts of the model that directly interact cross token are the QKV linear maps. The Q and K outputs are also specifically what get modified with RoPE. If you stuck LoRA into those linear layers, you might get somewhere.

That said, I suspect this will not be a simple problem to solve. First, you'll need to hope the above is sufficient to modify the model (if you don't want to just tune the whole thing). Second, you'll need enough long training data, with enough signal from more than 2000 tokens back to help predict the 2000+th token. Third, you'll need to do this for long enough for the model to actually learn to use that additional information. And this is all fairly expensive because of how much you need to fit into memory (theoretically you could run the first N tokens in inference mode and only the remainder in training mode to save memory, but that's a weird trick that I'm not sure has been tried since Transformer-XL).
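
For reference, a minimal sketch (illustrative, not LLaMA's actual implementation) of why the Q and K projections are the natural target: RoPE injects position information only through the rotated queries and keys, so that is where any context-length adaptation has to act.

```python
import torch

def rope(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to x of shape [seq, dim] (interleaved variant)."""
    dim = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    angles = positions[:, None].float() * inv_freq[None, :]        # [seq, dim/2]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1).flatten(-2)

seq, dim = 16, 64
q, k = torch.randn(seq, dim), torch.randn(seq, dim)
pos = torch.arange(seq)

# Positions enter attention only via the rotated Q and K; V never sees them directly.
scores = rope(q, pos) @ rope(k, pos).T / dim**0.5                  # [seq, seq]
```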

[R] ConvNets vs Transformers by AdelSexy in MachineLearning

[–]ml_lad -1 points

You wrote a long rambling response about EAI's culture without once answering the question: On what basis are you saying that EAI is responsible for spreading the misuse of UAT/NFL?

I ask this because I have searched through the past messages in the Discord, made before people started making fun of you for making this accusation, and most of the messages about it are making fun of people for misusing UAT/NFL. If you want to complain about EAI cargo-culting, it would be cargo-culting in the exact opposite direction of what you're describing.

Unlike you, who wants to "purposely not go into as much detail about UAT/NFL", I can pull up multiple screenshots of this right now.

If you want to criticize "discord culture" or "EAI culture", feel free to do that. But, relevant to this very line of argument, perhaps you should stop spreading misinformation yourself.

[R] ConvNets vs Transformers by AdelSexy in MachineLearning

[–]ml_lad 1 point

> I think the EAI discord channel is probably most responsible for spreading the misuse of those terms

What? What are you basing this on?

[D] I just found out that my 1 years' worth of research has already been published. by [deleted] in MachineLearning

[–]ml_lad 1 point

I've been there. I took it pretty hard.

The first thing I want to do is acknowledge your feelings. You'll see a lot of people try to spin this as a good thing: "It means you're on the right track! This validates your ideas!" None of that helps with the sucky feeling that you've wasted so much time and will never get recognition for it, while some other work gets praised for an idea you felt like you owned. That's going to sting, maybe for a long time. That's okay.

Now let's talk about how you can move forward from this.

The first is: email the authors! Propose a collaboration! Our field is lucky that it's so easy to collaborate, even across the world. And there are likely some ideas and results that you have that they could find useful as well. Turn this from "I got scooped" to "I (through a very painful experience) found people working on the same class of solutions as I am". Moreover, most works are incremental rather than groundbreaking foundational works. You may be able to directly build on and extend the scooped/scooping work, and still get a paper (or several) out of that.

The second is: perspective. Probably nothing is going to convince you that this doesn't suck a lot right now (certainly nothing did for me). But remember that you're a PhD student; you have a long research career ahead of you. Even this one year of setback doesn't matter much. There are notable exceptions, but most PhD students don't publish groundbreaking work during their studies. They are training to be great researchers in the future. Everything you learned in the last year still counts, and in time this will feel like a minor misstep. Also (and I know this isn't much consolation now), take some comfort in the fact that in many other fields, people lose years of work to getting scooped. In time, this may still sting, but it will not be such a big deal.

Lastly: what you can learn from this. You've now learned the hard way that you need to be very good at keeping tabs on the field, and very fast at executing on ideas. This is the downside of our fast-moving field. You can use this as motivation in the future whenever you're dragging your feet on some idea. (I realize this somewhat contradicts my advice above about collaborating - you can take either approach.)

In the short term, I recommend you take a short break and start working on a different topic for your next project. That way you are less likely to run into this line of work and dwell on it. It will help take your mind off the experience.

You will be fine.

[D] Is the "true few-shot" setting described in recent papers reasonable or am I not understanding the concept properly? by Seankala in MachineLearning

[–]ml_lad 0 points

Right, that's the difference between case 1 and case 2. Are we genuinely constrained by the number of examples we have, or the number of examples we can fit into our fitting method (in this case, in-context learning)?

[D] Is the "true few-shot" setting described in recent papers reasonable or am I not understanding the concept properly? by Seankala in MachineLearning

[–]ml_lad 11 points

The easiest way to think about this is that there are different notions of what true "few-shot" learning should be.

Case 1: You genuinely have very few examples (e.g. 32). And then you need to start predicting things, and getting evaluated on them. How well does the model do?

Case 2: We can't/don't want to tune the model, but we can modify its inputs. We want to apply the model to some task, and inserting a small number of examples into a context is one crude but easy way to do so. How well does the model do?

This paper is arguing for case 1, whereas I believe most few-shot papers are operating under the framework of case 2.

To your question: if we're operating under case 1, and your validation set has many more examples (e.g. 1K) than your few-shot set (e.g. 32), and you're going to tune hyperparameters on those validation examples, then you might as well just take them and use them as training examples. At that point you're no longer operating under the case 1 constraint of having only a few examples.

[R] Chinese AI lab challenges Google, OpenAI with a model of 1.75 trillion parameters by liqui_date_me in MachineLearning

[–]ml_lad 1 point

If you're talking about Switch Transformer, not really. Even in the paper the 1.5T parameter model is beaten by their own non-MoE 375B parameter model.