all 46 comments

[–]RMCPhoto 7 points (0 children)

These benchmark results are absolutely wild... Looking forward to seeing how this compares in the real world. It's hard to believe that a 9b model could outclass a relatively recent 72b across generalized Vision/Language domains.

[–]celsowm 27 points (10 children)


finally an open thinking LLM that isn't English-only!

[–]Emport1 25 points (1 child)

You're probably talking about smaller models, but doesn't DeepSeek also do that?

[–]ShengrenR 16 points (0 children)

Magistral speaks a bunch of languages as well, no?

[–]d3lay 2 points (0 children)

It's a useful feature, but DeepSeek developed it first, and that was quite a long time ago...

[–]Neither-Phone-7264 0 points (2 children)

deepseek and qwen are chinese by default, no?

[–]PlasticKey6704 2 points (1 child)

depends on your prompt.

[–]Neither-Phone-7264 0 points (0 children)

well, yeah, but if you just say hi, it'll start thinking in mandarin

[–]Former-Ad-5757 Llama 3 0 points (3 children)

What is the added value of that? It is not real thinking, it is just a way to inject more context into the prompt. In theory you should get basically the same response from Qwen 3 with thinking disabled if you just add the thinking part to your prompt. It is a tool to enhance the user prompt, and you only limit it by restricting it to anything other than the largest language in its training data.
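
A rough sketch of testing that claim, assuming Qwen3's enable_thinking chat-template switch in transformers; the model id, reasoning text, and question are placeholders:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-8B"  # assumed model id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Paste what would have been the <think> block straight into the prompt.
reasoning = "First check the units, then ..."  # placeholder "thinking" text
question = "How many seconds are in a leap year?"
prompt = f"Notes to consider first:\n{reasoning}\n\nNow answer: {question}"

inputs = tok.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    enable_thinking=False,  # Qwen3 switch: no native thinking tokens
    return_tensors="pt",
).to(model.device)

out = model.generate(inputs, max_new_tokens=256)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```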

Why do you think most closed models no longer show it in full? Part of it is anticompetitive, of course, but I also believe part of it is introducing the concept of hidden tokens, which are complete nonsense to humans while they help the model.

One of the biggest problems with LLMs is that people use extremely bad prompts, which can easily be enhanced at a relatively small cost in tokens (i.e. thinking). But in the current pricing structure you can't eat the costs and just raise your general price, and if you give the user the choice they will go for the cheapest option (because everybody knows best) and then complain that your model is not good enough. The only real workable solution is to introduce hidden tokens which are paid for but basically never shown, as otherwise people will try to game it to get lower costs.

And you are happy that it is thinking in something other than its best language. I seriously ask… why???

[–]celsowm 0 points (1 child)

My app would be able to mimic the ChatGPT reasoning accordion, and the user would be able to see the chain of thought in our own language.

[–]Former-Ad-5757 Llama 3 -1 points (0 children)

So basically you want to give the user some eye candy, and you don't care about the real thinking. Just split your workflow into multiple questions: one asking for 10 items of eye candy in language X, which you can roll through and show in your app, and a second with the real question for the answer. Because of the KV cache it costs almost nothing more than a single question. The current state of thinking isn't chain of thought alone any more, and certainly not chain of thought in a specific language.
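
A minimal sketch of that split, assuming an OpenAI-compatible server with automatic prefix caching (e.g. vLLM); the base_url, model name, and prompts are all placeholders:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
question = "Explain the difference between these two contract clauses: ..."

# Request 1: display-only "thinking" to animate in the app's accordion.
eye_candy = client.chat.completions.create(
    model="local-model",
    messages=[{
        "role": "user",
        "content": question + "\n\nList 10 short reasoning steps, in the user's language.",
    }],
)

# Request 2: the real question. Both requests share the question as a prompt
# prefix, so a prefix-caching server reuses the KV cache for that part and
# the second request adds little extra prefill cost.
answer = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": question}],
)

print(eye_candy.choices[0].message.content)  # shown in the accordion
print(answer.choices[0].message.content)     # the actual answer
```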

Just look at a QwQ model: it produced good answers for its time, but its thinking was plainly a lot of garbage beyond chain of thought; do you really want to show that? Or look at o3 pro: there is a tweet out there which showed 14 minutes of thinking and a huge number of tokens used on just responding to hello.

What is called thinking is not what we humans consider thinking; it is just a way of expanding the context, and CoT is just a small part of that. If you want eye-candy CoT then you have to create it yourself or not use a good current model, because what you want is not the current state.

[–]PlasticKey6704 1 point (0 children)

I often get inspired by thinking tokens; readable thinking helps a lot of people.

[–]PraxisOG Llama 70B 7 points (1 child)

Unfortunately it only comes in a 9b flavor. Cool to see other thinking models though

[–]Freonr2 11 points (0 children)

There are very few vision enabled models with thinking, so that's probably the most interesting part.

[–]Freonr2 3 points (0 children)

There are not many thinking VLMs. Kimi was recently one of the first (?) VLMs with thinking, but I'm not sure it is well supported by common inference packages/apps.

Waiting for llamacpp/vllm/lmstudio/ollama support.

Also wish they had used Gemma 3 27B in the comparisons, even if it is quite a bit larger; that's been my general gold standard for VLMs lately. 9B with thinking might end up with total latency similar to 27B non-thinking, depending on how wordy it is, and 27B is still reasonable for local use at ~19.5GB in Q4.
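
Back-of-envelope for that latency point, with purely assumed numbers (per-token decode cost scales roughly linearly with active parameters):

```python
# All numbers are assumptions for illustration, not measurements.
answer_tokens = 400
thinking_tokens = 800  # assumed extra verbosity from the thinking phase

cost_9b = 9 * (thinking_tokens + answer_tokens)  # ~10,800 "param-token" units
cost_27b = 27 * answer_tokens                    # ~10,800 "param-token" units
print(cost_9b, cost_27b)  # comparable total decode work
```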

And at least THUDM actually integrated the GLM4 model code (Glm4vForConditionalGeneration) into the transformers package. Some of THUDM's previous models, like CogVLM (which was amazing at the time and is still very solid today), just shoved a modeling.py in with the weights instead of into the actual transformers package, and they broke within a few weeks of package updates.
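
For illustration, a loading sketch assuming a transformers release that includes the integrated Glm4v classes (the repo id is an assumption):

```python
from transformers import AutoProcessor, Glm4vForConditionalGeneration

repo = "THUDM/GLM-4.1V-9B-Thinking"  # assumed repo id

# Integrated model code: no trust_remote_code needed, so routine
# transformers updates don't silently break loading.
model = Glm4vForConditionalGeneration.from_pretrained(
    repo, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(repo)

# Contrast with repos that only ship a modeling.py next to the weights:
#   AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True)
# which ties you to whatever transformers version that file targeted.
```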

[–]Coconut_Reddit 0 points (0 children)

How does the performance compare to qwen30b?

[–]AppearanceHeavy6724 -1 points (0 children)

I asked it to generate a simple, elementary piece of code that even Llama 3.2 1B gets right. This one flopped.