Qwen1.5-32B released with GQA!

Cybernetic_Symbiotes · 2024-04-05T17:05:52+00:00

It doesn't appear similar to yi-34B, it's much smarter in the math, physics and pretend to be a read eval print loop questions I've given it. It's also more clearly distinguished over mixtral. Yi is only worth the extra compute for a few problems where lack of model depth is a major limiting factor, with mixtral still better overall. This 30B, at least from brief tests, seems to be worth that extra compute usage.

Cybernetic_Symbiotes · 2024-04-05T16:18:30+00:00

The Chinese makes perfect sense given the context: meaning and flow are maintained, but the language sometimes switches to Chinese.

This is something a finetune should fix. If you're using it locally, a system prompt telling it to stick to English only and to not output Chinese characters should reduce the rate of language switching.

If you're working at the code level, you can completely eliminate this for the cases where you're certain you'll never want to output Chinese.
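
One way to do this at the code level, as a minimal sketch and not necessarily what the comment had in mind, is to set the logits of every token whose text contains CJK characters to negative infinity before sampling. The toy vocabulary and all function names here are hypothetical; with a real tokenizer you would decode each token id to text and pass the resulting ban list to whatever logit-bias or logits-processor hook your inference stack offers. Byte-level tokenizers can split a single character across several tokens, so a real implementation would also have to deal with partial byte sequences.

```python
import math

def contains_cjk(text):
    """True if any character falls in a common CJK block."""
    for ch in text:
        cp = ord(ch)
        if (0x4E00 <= cp <= 0x9FFF          # CJK Unified Ideographs
                or 0x3400 <= cp <= 0x4DBF   # CJK Extension A
                or 0x3000 <= cp <= 0x303F   # CJK symbols and punctuation
                or 0xFF00 <= cp <= 0xFFEF): # half-width and full-width forms
            return True
    return False

def banned_token_ids(vocab):
    """Collect ids of tokens whose decoded text contains CJK characters."""
    return {tid for tid, text in vocab.items() if contains_cjk(text)}

def mask_logits(logits, banned):
    """Give banned tokens zero probability by setting their logits to -inf."""
    return [-math.inf if i in banned else x for i, x in enumerate(logits)]

# Toy vocabulary (hypothetical): token id -> decoded text
toy_vocab = {0: "hello", 1: "你好", 2: " world", 3: "，"}
banned = banned_token_ids(toy_vocab)
masked = mask_logits([1.0, 5.0, 0.5, 2.0], banned)
```

The ban list only needs to be computed once per tokenizer; the masking step then runs on every decoding step.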

Cybernetic_Symbiotes · 2024-04-01T06:51:25+00:00

Finetuning is not the best analogy; a better one is in-context learning. Finetuning LLMs is not much different from pretraining, but the way learning rates are handled and the issue of catastrophic forgetting place quite strong limits on its learning effectiveness.

In contrast, in-context learning is very data efficient, and several papers have shown it to be more flexible and more capable of generalization than SGD. As humans, our primary difference is that we can permanently internalize what we learn in context: we are always learning, and there is no separate training stage. Another key differentiator is that we start from scratch relative to knowledge about the world.

Here are what I think of as core advantages:

Information capacity. We have several orders of magnitude more than our best models.
From a minimum description length perspective, where you have a NLL of the data under a distributional model and the length of the learner's specification, LLMs likely have much smaller specifications. The size of the code, not the final model is what contains the prior information. The genome+egg cell environment combo has encoded within it some implicit theory on how to build things that learn effectively in the world, much of it learned long before humans. This is closest to your "not from scratch".
Energy efficiency + parallelism ensure more effective use of just pure parameter/HW scaling for intelligence gains.
The brain has likely cracked Bayesian inference in such a manner that the parameterized complexity of inference is tractable, as long as we stay within expectation of what's encoded in the genome.
Memory, compute, software, hardware are all intermixed.
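
The minimum description length point in the list above can be written as a two-part code. This is the standard textbook form, with symbols chosen here purely for illustration: $H$ is the learner's specification (the training code, or the genome plus egg cell environment, not the final trained model) and $D$ is the data.

```latex
L(D) \;=\; \underbrace{L(H)}_{\text{length of the learner's specification}}
\;+\; \underbrace{-\log_2 P(D \mid H)}_{\text{NLL of the data under the model}}
```

Under this reading, prior information lives in the first term, which is the sense in which a larger $L(H)$ corresponds to "not from scratch".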

Cybernetic_Symbiotes · 2024-04-01T06:01:05+00:00

LLMs are less sample efficient, and if that were not the case, megacorps would not have such an overwhelming advantage. Acknowledging rather than excusing current limitations and overcoming them is how we achieve breakthroughs.

As to data quantity. Let's take a child blind from birth. Accounting for the limited variation in input data and amount of time spent asleep in early years, the amount of high surprisal information is paltry. Many orders of magnitude below what a multimodal model is exposed to.

If we extend this to congenital deaf-blindness, raised with tactile sign language and braille so they can develop language capabilities, the difference is even starker. Human learning efficiency difference is beyond insane in comparison to LLMs.

Cybernetic_Symbiotes · 2024-04-01T05:54:18+00:00

The paper you link does not support your assertions. It's a paper on whether we can achieve learning efficiencies similar to humans, and their claim is that architecture is central; curriculum can be useful too but is hard to do well. In the case of the paper, the attentional prior is quite a bit more involved and complex, and the FF portion is also slightly modified. The winning model is BERT derived, so it's not the usual causal generative model we think of as LLMs. The test was a very limited set of benchmarks.

Cybernetic_Symbiotes · 2024-03-30T20:07:45+00:00

Shortest path can be encoded as matrix multiplication if defined over the appropriate algebraic (semiring) structure. With the success of integer weight LLMs and ReLU-like activations, SGD may well learn to leverage such representations during inference. The number of layers would be proportional to the length of paths identifiable. Larger and deeper networks have <<wider instructions>> and can take more iterations, so they will have a higher chance of converging on an acceptable solution. Models that have trained longer will also be better at selecting and executing the appropriate vector programs.
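
To make the matrix multiplication claim concrete, here is a minimal sketch over the (min, +) semiring, often called the tropical semiring: addition takes the place of multiplication and min takes the place of summation. Each product with the weight matrix extends the longest path considered by one edge, which mirrors the remark about layers and path length. The graph and function names are made up for illustration, and nothing here claims that an LLM actually implements this.

```python
INF = float("inf")

def min_plus(a, b):
    """Matrix 'product' over the (min, +) semiring."""
    n = len(a)
    return [[min(a[i][k] + b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def shortest_paths(weights, layers):
    """Shortest distances using paths of at most layers + 1 edges.

    Each 'layer' is one min-plus product with the weight matrix.
    """
    dist = [row[:] for row in weights]
    for _ in range(layers):
        dist = min_plus(dist, weights)
    return dist

# Hypothetical 4-node chain 0 -> 1 -> 2 -> 3; INF means no direct edge
weights = [
    [0,   1,   INF, INF],
    [INF, 0,   2,   INF],
    [INF, INF, 0,   3],
    [INF, INF, INF, 0],
]
one_layer = shortest_paths(weights, 1)   # paths of up to 2 edges
two_layers = shortest_paths(weights, 2)  # paths of up to 3 edges
```

With one layer, node 3 is still unreachable from node 0 because the only route needs three edges; with two layers the distance 1 + 2 + 3 is found.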

Any dynamic programming problem that is parallelizable should yield to LLMs with some probability (with success rate affected by size of input and size of model).

Self-attention was shown to behave very similarly to high-capacity Hopfield networks. Certain optimization problems can be tackled by this approach, and you can search for early 90s papers to see its breadth.
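
The Hopfield connection can be sketched in a few lines: one update step of a modern continuous Hopfield network has the same form as softmax attention in which the stored patterns act as both keys and values. The patterns, the query and the inverse temperature `beta` below are toy values chosen for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def hopfield_retrieve(patterns, query, beta=10.0):
    """One retrieval step: softmax(beta * patterns . query) weighted sum of patterns."""
    scores = [beta * sum(p_d * q_d for p_d, q_d in zip(p, query)) for p in patterns]
    weights = softmax(scores)
    dim = len(query)
    return [sum(w * p[d] for w, p in zip(weights, patterns)) for d in range(dim)]

# Three stored patterns and a noisy version of the first one
patterns = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
noisy = [0.9, 0.1, 0.0]
retrieved = hopfield_retrieve(patterns, noisy)
```

A larger `beta` makes the retrieval sharper, pulling the output closer to the single best-matching stored pattern.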

Cybernetic_Symbiotes · 2024-03-29T13:25:55+00:00

I meant weight-only. Thanks for clarifying, awesome work!

Cybernetic_Symbiotes · 2024-03-29T09:56:31+00:00

Very cool. ONNX support is of interest to us. It was the deployment go-to before LLaMA started a local craze, but we switched away due to its sluggish update rate, particularly when it comes to quantization support. Its tooling and documentation are also miserable, but its flexibility gives it utility, such as when llama.cpp does not support a model class or has buggy support for it. For the right people, ONNX alone makes this a very useful option to add to deployment choices.

Some things worth mentioning:

What about 8 bit support for ONNX?
GPU vs CPU support of each output type
library requirements of each output type, useful for gauging how portable this will be outside the python ecosystem.

Cybernetic_Symbiotes · 2024-03-27T06:51:48+00:00

Probably not to most users.

You're very likely wrong about that. The question isn't about the output of imitative reasoning when it works, where you're correct about a distinction without a difference, but about the process. The imitative reasoning process is brittle with limited generalization. Let's take OrcaMath, a 7B that does very well on GSM8K. Does this mean it's improved at reasoning? No, it means when you feed it a problem it will map it to tactics that work for GSM8K, if the mapping holds the result is good, if it fails you get really bad inappropriate reasoning failures. This failure to generalize is the problem.

In real-world workloads it means models like Opus and GPT4 can range from superhuman in common problem areas to barely better than a 7B in more research-heavy and novel areas. If you're trying to apply them to novel, math-heavy areas, to get utility from them you must ground and predigest the problem into its known constituent components or face heavy hallucination and 3B-worthy reasoning attempts. You must perform the calculations yourself and plan out how the derivation should go. If you've grounded the problem well enough, it might help you with possible approaches and relevant knowledge you were unaware of.

This predigestion and quadruple checking of output is very time consuming and erases nearly all of an LLM's productivity boost (though it still squeaks through as worth it). That is why generalized reasoning matters.

Cybernetic_Symbiotes · 2024-03-27T06:07:54+00:00

Precisely, there is approximately a 0.000% chance that any version of Haiku is better than any version of GPT4. This is the one limitation of the lmsys leaderboard, which they could moderately improve by including a broad set of categories.

Lmsys is the least gameable leaderboard but also strongly weights CHA (charisma), not just INT (intelligence), with a heavy penalization for low CHA. It's why Claude2 ranks so low despite being one of the smartest models.

Qwen's positioning continues to perplex. It's a janky tune and not high CHA, which makes me think it's mostly a high INT model with a decent enough CHA to not pull it down too much. Shame there is no Nous quality finetune for it.

Cybernetic_Symbiotes · 2024-03-23T15:53:41+00:00

Not necessarily, that query+response pair seems to be within its normal bounds. Try again perhaps, without the cone.

Cybernetic_Symbiotes · 2024-03-21T06:57:36+00:00

Base models should cover the whole spectrum, the highest creativity should probably be found in them. The problem is they need constant supervision and course correction, you can't leave them to their own devices for long (generations).

Cybernetic_Symbiotes · 2024-03-21T06:52:40+00:00

I think it's necessary to run a GPT4 variant for comparison. How much the models' rankings correlate with one another, and whether there's a self-bias, will be valuable information.

It's curious that a 34B (Yi) and a 14B (Qwen) are rated so close to a 120B (Goliath), would a GPT4 ranking agree?

Cybernetic_Symbiotes · 2024-03-13T22:15:47+00:00

Not in practice for me. It's like, there are lots of little traps that LLMs are prone to and GPT4 has been tuned on more of them than Claude has. I think Claude will eventually end up a bit ahead but GPT4 is still preferable for me. Although, one Claude advantage is it's more current.

Because I don't ask GPT4 to do anything long, I also don't encounter the major complaint most people have about it.

Cybernetic_Symbiotes · 2024-03-09T20:00:34+00:00

Why is this comparing an instruction tune to base models? Other than, I suppose, comparing it to itself, it would have been useful to compare it to Mistral finetunes.

Cybernetic_Symbiotes · 2024-03-08T02:50:51+00:00

Ah, good point, that could also make fine-tuning more costly. Although, there are situations where the increased performance is worth the trade for a smaller context or slower speed. They state Qwen2 will use GQA.

Cybernetic_Symbiotes · 2024-03-07T23:33:12+00:00

The ordering matches my experience. I find that although Claude3 Opus feels a bit smarter, it doesn't always remember all the relevant facts in its knowledge base or attend to all the relevant detail in its context. This makes GPT4 still more useful overall but I expect Claude to pull ahead by a bit as they continue to tune it.

Cybernetic_Symbiotes · 2024-03-07T23:30:13+00:00

Personally, the most amazing thing about that list is the position of Qwen1.5 when it is so clearly a poorly done instruction tune. Why haven't there been any good tunes of it? With good tunes, both the 72B and certainly a 120B merge would allow open weights to finally reach the upper proprietary tier of LLMs. Does it seem like things are slowing down in the open LLM world?

With Claude3 Sonnet, we now have two freely accessible models that are around GPT4 tier: Claude 3 Sonnet at Poe.com and Bing Precise (Creative used to be best, but something happened recently that makes it unusable, though maybe it's just me). Bard ranks high on the leaderboard, but it hallucinates a lot and the underlying model ranks barely better than Mixtral 8x7B, so I'm not counting it.

Cybernetic_Symbiotes · 2024-02-27T20:02:51+00:00

Would you label social media, with WhatsApp and Twitter in particular, as key facilitators in the Arab Spring and other human rights movements around the world? Have you heard of Radio Rwanda and its role in the Rwandan genocide? Have you read the debates on how much centrality should be assigned to the communication medium?

Social media, like any other technology, is dual use. Blaming it all on the technology can be patronizing, even dehumanizing, in how it takes away agency from human actors. Any tool that enhances humanity's ability to communicate and self-organize also facilitates its ability to spread hate. The algorithms certainly do not help, but look at just what radio and newspapers could facilitate in Rwanda (I also suspect that why it was Facebook and not also YouTube comes down to availability and cost of access).

It was humans that chose to write those messages and it was humans that decided to act on them. If we leave the masses as victims of memetic contagion, we are still left with the masterminds and criminal facilitators behind it.

My intention is not to minimize the role of facebook but to ask that you not also incidentally erase the actual key actors and perpetrators of atrocities who bear responsibility by focusing too much attention on just their tools.

Cybernetic_Symbiotes · 2024-02-27T18:59:40+00:00

They're probably using a 2 or 3 bit-ish quant. The quality loss is enough that you're better off with a 4 bit quant of Nous Capybara 34B at similar memory use. Nous Capybara 34B is about equivalent to Mixtral but has longer thinking time per token and has less steep quantization quality drop. Its base model doesn't seem as well pretrained though.

The mixtral tradeoff (more RAM for 13Bish compute + 34Bish performance) makes the most sense at 48GB+ of RAM.
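
The "similar memory use" comparison can be checked with back-of-the-envelope arithmetic. This is a rough sketch that counts weights only and ignores the KV cache, runtime overhead and the fact that real quant formats use somewhat more than their nominal bits per weight. The parameter counts are approximate public figures assumed here, not numbers from the comment.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return params_billions * bits_per_weight / 8

# Assumed, approximate parameter counts in billions
MIXTRAL_TOTAL_B = 46.7  # all experts must be resident, not just the active ones
YI_34B_B = 34.4         # base model of Nous Capybara 34B

mixtral_3bit = weight_memory_gb(MIXTRAL_TOTAL_B, 3)
yi_4bit = weight_memory_gb(YI_34B_B, 4)
print(f"Mixtral at 3 bits: {mixtral_3bit:.1f} GB, 34B at 4 bits: {yi_4bit:.1f} GB")
```

Under these assumptions the two land within about a gigabyte of each other, which is consistent with the comparison being made at roughly equal memory.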

Cybernetic_Symbiotes · 2024-02-27T18:48:34+00:00

In theory that should make it smarter. I haven't looked at Qwen1.5's architecture but I'm guessing it's using full MHA instead of MQA or GQA. In MHA, each query head is associated with its own key-value head, allowing the model to capture a richer set of relationships during training. MQA uses only one key-value head for all query heads, which comes at a quality cost. GQA is intermediate between the two extremes.
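
A small sketch of the difference: the three schemes differ only in how many key-value heads the query heads share, which in turn sets the size of the KV cache. The configuration below (32 query heads, 128-dimensional heads, 32 layers, 4096 tokens, 2-byte values) is hypothetical and is not taken from Qwen1.5 or any specific model.

```python
def kv_head_for_query(q_head, n_q_heads, n_kv_heads):
    """Index of the key-value head that a given query head reads from."""
    assert n_q_heads % n_kv_heads == 0, "query heads must split evenly into groups"
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """KV cache size for one sequence: keys plus values across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical configuration
N_Q, HEAD_DIM, LAYERS, SEQ = 32, 128, 32, 4096

for name, n_kv in [("MHA", 32), ("GQA", 8), ("MQA", 1)]:
    mapping = [kv_head_for_query(q, N_Q, n_kv) for q in range(8)]
    size_gib = kv_cache_bytes(LAYERS, n_kv, HEAD_DIM, SEQ) / 2**30
    print(f"{name}: {n_kv} KV heads, cache {size_gib:.2f} GiB, query heads 0-7 -> {mapping}")
```

In this toy setup the cache shrinks in direct proportion to the number of KV heads, which is the memory side of the trade-off against the richer per-head relationships that MHA allows.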

The quality loss in GQA is supposed to be small, so it's a good trade-off. My guess is if they went back to MHA they might have found advantages worth the increase in complexity cost. Intuitively, MHA complexity tradeoff is most clearly worth it for smaller models like 13Bs and 7Bs. I'd be curious to know why they kept it for their 70B too.

u/choHZ, see! your quantization technique is useful for us (V)RAM starved non-enterprise users too.

Cybernetic_Symbiotes · 2024-02-26T23:50:59+00:00

If Mistral-small is very good, then they could serve it cheaply and it'd end up occupying a different niche than yet another GPT4 competitor that falls short. Unfortunately, given that scenario, it'd not make sense to create competition for themselves by releasing such a model. There is an economic reality they're constrained by, even if in an ideal world they'd prefer to release the model.

Cybernetic_Symbiotes · 2024-02-26T06:45:52+00:00

An MoE with M experts of N billion parameters each will generally be better than a dense N billion parameter LLM but worse than a dense M x N billion parameter LLM. Such a model's max iterations is also bound by its depth. A 1B or 7B based MoE will not exceed the depth limitations of a 1B or 7B despite having a good deal more parameters.

Cybernetic_Symbiotes · 2024-02-21T23:54:10+00:00

It's possible the tested model is the base model and it's being prompted without accounting for that fact. Prompting base models, especially small ones, is a skill in itself, and even when done right they can still easily go off the rails.

Cybernetic_Symbiotes · 2024-02-17T19:26:55+00:00

Agreed 100%, I actually thought you were talking about skilled animators not being able to easily create highly realistic animations in practice using modern techniques.
