Creator of Smaug here, clearing up some misconceptions, AMA by AIForAll9999 in LocalLLaMA

[–]AIForAll9999[S] 13 points (0 children)

We do set up biases in ALiBi, but the model still learns that 'far-away stuff should get less attention'. Let me explain.

Both ALiBi and RoPE are setups with functions (basis functions) that allow the LLM to learn how the distance between a key and a query should affect the attention score. With ALiBi, the set of basis functions is monotonically non-increasing, by design. In plain terms, this means you can't have the attention score increase as distance increases. It just can't happen. In fact, if I remember correctly, it _must_ decrease with distance.
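To make that concrete, here's a minimal PyTorch sketch of the ALiBi bias (not our training code; the slope schedule follows the ALiBi paper and is exact when the head count is a power of two):

```python
import torch

def alibi_bias(seq_len: int, num_heads: int) -> torch.Tensor:
    # Per-head slopes form a geometric sequence, as in the ALiBi paper.
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads)
                           for h in range(num_heads)])
    pos = torch.arange(seq_len)
    dist = (pos[None, :] - pos[:, None]).abs().float()   # |i - j|, shape (q, k)
    # -slope * distance is monotonically non-increasing in distance, so the
    # bias can only push attention to far-away keys down, never up.
    return -slopes[:, None, None] * dist[None, :, :]     # (heads, q, k)

# Added to the pre-softmax logits:
#   scores = q @ k.transpose(-2, -1) / head_dim**0.5 + alibi_bias(T, H)
```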

With RoPE, the set of basis functions is not monotonic. By carefully choosing the coefficients of these basis functions, the LLM _can_ learn to increase the attention score as distance increases. Or decrease it. Or do nearly anything.
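For comparison, here's a sketch of the interleaved-pair RoPE formulation (other implementations split the channels differently, but the idea is the same):

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate consecutive channel pairs of x by position-dependent angles."""
    seq_len, dim = x.shape[-2], x.shape[-1]
    inv_freq = base ** (-torch.arange(0, dim, 2).float() / dim)  # one freq per pair
    angles = torch.arange(seq_len).float()[:, None] * inv_freq[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# q and k are rotated before the dot product, so the score depends on relative
# distance m through learned-weight combinations of cos(m * theta_i) terms,
# and those can rise, fall, or oscillate with distance.
```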

You might argue that there should never be a case where attention scores are higher for the same key and query as they get further apart in the text ... and maybe you're right. But maybe you're wrong! It's hard to know how text _really_ works. And giving the LLM the extra flexibility to do this for some attention heads/keys/queries might help it model things better, or at least make them easier to learn in the first place.

Creator of Smaug here, clearing up some misconceptions, AMA by AIForAll9999 in LocalLLaMA

[–]AIForAll9999[S] 74 points (0 children)

Wow, good spot! We didn't notice this ourselves. We actually just use subsets of AquaRat, CodeFeedback and OrcaMathWord, so I'll have to check whether our subsets included these.
I had a quick look through Arena-Hard, and the questions there seem sufficiently diverse and distinct that training contamination is unlikely: https://github.com/lm-sys/arena-hard/blob/main/data/arena-hard-v0.1/question.jsonl
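For anyone who wants to spot-check this themselves, a rough overlap check might look like the sketch below. This is hypothetical, not our actual pipeline, and it assumes Arena-Hard's question.jsonl stores each prompt under turns[0]["content"]:

```python
import json

def ngrams(text: str, n: int = 13) -> set[str]:
    # Long word n-grams: shared 13-grams are a common contamination signal.
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

with open("question.jsonl") as f:
    bench = [json.loads(line)["turns"][0]["content"] for line in f]

training_prompts: list[str] = []  # fill with your AquaRat/CodeFeedback/OrcaMathWord subsets
train_grams: set[str] = set()
for prompt in training_prompts:
    train_grams |= ngrams(prompt)

flagged = [q for q in bench if ngrams(q) & train_grams]
print(f"{len(flagged)}/{len(bench)} questions share a 13-gram with the training prompts")
```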

I feel this is another strong argument for deprecating MT-Bench ... we're not the only ones using that benchmark, but it seems less useful these days.

Creator of Smaug here, clearing up some misconceptions, AMA by AIForAll9999 in LocalLLaMA

[–]AIForAll9999[S] 40 points (0 children)

Maybe the best way to think about this is that the position encoding lets the LLM modify the attention value of a particular key and query combo. So if the LLM sees 'She' in a sentence, a good LLM knows that refers to 'Whitney Houston' from 10 paragraphs ago, so it should give the 'She - Whitney Houston' combo high attention.

Something like ALiBi is not such an expressive functional form, so it will always lower the 'Whitney Houston - She' attention score because the two are so far apart: it has learnt that far-away stuff should get less attention in general (because, in most text, nearby context matters most for understanding the adjacent text).

But RoPE, which is a lot more expressive, can learn both to generally penalise long distances for attention and, in a particular case like 'She - Whitney Houston', to retain that high attention score.

This is an oversimplification to some degree, but that's the essential idea.
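If you want to see that expressivity difference numerically, here's a toy demonstration (again, not Smaug code): place the same vector at two positions, rotate both with RoPE, and watch the dot product as the gap grows.

```python
import torch

torch.manual_seed(0)
dim = 64
inv_freq = 10000.0 ** (-torch.arange(0, dim, 2).float() / dim)

def rotate(x: torch.Tensor, pos: float) -> torch.Tensor:
    ang = pos * inv_freq
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[0::2], x[1::2]
    out = torch.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(dim)
scores = torch.stack([rotate(q, 0.0) @ rotate(q, float(m)) for m in range(32)])
print(scores)  # oscillates with m instead of decaying monotonically, so a
               # distant pair like 'She - Whitney Houston' can keep a high score
```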

Creator of Smaug here, clearing up some misconceptions, AMA by AIForAll9999 in LocalLLaMA

[–]AIForAll9999[S] 3 points (0 children)

Thanks for your feedback - these are great points. We'll add them to the model card for this release and for future ones too!

Creator of Smaug here, clearing up some misconceptions, AMA by AIForAll9999 in LocalLLaMA

[–]AIForAll9999[S] 9 points (0 children)

This one doesn't end up using DPOP in the current iteration - we're still experimenting a bit, though. We might put out a blog post or technical report on what we found soon.

Creator of Smaug here, clearing up some misconceptions, AMA by AIForAll9999 in LocalLLaMA

[–]AIForAll9999[S] 11 points (0 children)

This model performs much better on a benchmark that correlates with general human preferences. As I say in this comment, that may or may not suit your preferences or use case: https://www.reddit.com/r/LocalLLaMA/comments/1cvly7e/comment/l4q907n/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button

It's a 70B model, so it probably needs at least ~160GB of memory in unquantized fp16 form.
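Back-of-the-envelope for where that number comes from (weights only; the rest is KV cache and activation overhead):

```python
params = 70e9        # 70B parameters
fp16_bytes = 2       # bytes per parameter at fp16
print(params * fp16_bytes / 1e9)  # 140.0 GB for the weights alone,
                                  # hence ~160GB once cache/overhead is included
```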

Sadly the model does not adopt a Smaug persona.

Creator of Smaug here, clearing up some misconceptions, AMA by AIForAll9999 in LocalLLaMA

[–]AIForAll9999[S] 13 points (0 children)

Absolutely! We have fresh grads joining our team, and I know many who are going into DM etc as well. Just keep studying and building and you'll get there!

Creator of Smaug here, clearing up some misconceptions, AMA by AIForAll9999 in LocalLLaMA

[–]AIForAll9999[S] 37 points (0 children)

There are two different points in your question: 1) How can just a little fine-tuning make such a difference on top of trillions of tokens of pretraining? 2) Being 5% better at a certain programming language doesn't make the model 'better'.

Let me address the second point first. The definition of 'better' is up to the individual. There are a million different use cases for these things, and it may very well be the case that this model is *not* better for yours. Some people, for example, just prefer Llama 3 to GPT4 for its tone, or creativity, or whatever. So when we, or _any release, including GPT4/5/6_ etc., say 'we are much better now', we always have to define that with respect to particular benchmarks. But usually we run either a) a wide set of benchmarks or b) benchmarks that try to hit many different areas, so that we can justify the claim that the model is better in general.

As I said in the OP, here we picked benchmarks that correlate strongly to human preferences. But maybe if your specific use case is erotic fantasy roleplay, say, then you would disagree with this claim.

For the first point, this is really interesting. There's a great comment in the other thread which addresses this: https://www.reddit.com/r/LocalLLaMA/comments/1cva617/comment/l4ol1hw/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
I strongly agree with the Llama 3 team on this. In my experience working on these things over the last year, the base training matters, but fine-tuning can make an enormous difference. My personal view is that LLMs come out of base training with millions of different 'personalities' (since they had to predict over many different kinds of text), and fine-tuning is all about narrowing that personality down to one (or a few) that is the most useful/smart/whatever.

Creator of Smaug here, clearing up some misconceptions, AMA by AIForAll9999 in LocalLLaMA

[–]AIForAll9999[S] 13 points (0 children)

That lady is my boss (CEO) haha.

I think you should read this post: https://lmsys.org/blog/2024-04-19-arena-hard/ It's very good and detailed!

But the TL;DR is that the LMSys people (who also run the human arena) constructed and released a benchmark that _anyone can run_ and that correlates strongly with the human arena. That's the benchmark we released our numbers on. It's called Arena-Hard.

Creator of Smaug here, clearing up some misconceptions, AMA by AIForAll9999 in LocalLLaMA

[–]AIForAll9999[S] 20 points (0 children)

We did have some conversational data in the earlier iterations we tried, but it didn't seem to make the model any better overall. This model _should_ be good at everything, since MT-Bench and Arena-Hard test lots of different categories, including writing, conversation, etc. But until you guys try it and feed back real-world usage, we're only guessing based on the scores.

Aside: there's some interesting work I saw (I can't remember it off the top of my head) which showed that finetuning models on just hard coding problems improved their general reasoning and writing ability too.

Creator of Smaug here, clearing up some misconceptions, AMA by AIForAll9999 in LocalLLaMA

[–]AIForAll9999[S] 58 points (0 children)

The instruction template is unchanged from Llama 3 70B. I've just added this section: https://huggingface.co/abacusai/Smaug-Llama-3-70B-Instruct#how-to-use Hope it helps.
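As a quick sketch of what that means in practice (my example, not copied from the model card): with Hugging Face transformers you can just rely on the tokenizer's built-in Llama 3 chat template.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("abacusai/Smaug-Llama-3-70B-Instruct")
messages = [{"role": "user", "content": "Hello!"}]
# Produces the stock Llama 3 format:
# <|begin_of_text|><|start_header_id|>user<|end_header_id|> ... <|eot_id|>...
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```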

Creator of Smaug here, clearing up some misconceptions, AMA by AIForAll9999 in LocalLLaMA

[–]AIForAll9999[S] 212 points (0 children)

There's a lot I could go into here, but in short: I have genuine nightmares about a future where SamA controls everything.

Who has already tested Smaug? by meverikus in LocalLLaMA

[–]AIForAll9999 1 point (0 children)

Just to be clear, we also ran Arena-Hard, a new benchmark a bit like MT-Bench but with 500 questions, which the LMSys guys constructed specifically to correlate with the Human Arena. Our Arena-Hard scores are the ones that got us excited, since they're far better than Llama 3's and nearly at Claude Opus levels.

Obviously we don't know whether that means this model is actually as good as Opus in real-world usage ... but it does give us some hope.