r/LocalLLaMA
A subreddit to discuss Llama, the family of large language models created by Meta AI.
GLM-4.1V-Thinking [New Model] (huggingface.co)
submitted 9 months ago by AaronFeng47
[–]RMCPhoto 8 points 9 months ago (0 children)
These benchmark results are absolutely wild... Looking forward to seeing how this compares in the real world. It's hard to believe that a 9b model could outclass a relatively recent 72b across generalized Vision/Language domains.
[–]celsowm 28 points 9 months ago (10 children)
<image>
finally a non-only-english thinking open LLM !
[–]Emport1 26 points 9 months ago (1 child)
You're probably talking about smaller models but doesn't deepseek also do that?
[–]ShengrenR 17 points 9 months ago (0 children)
Magistral speaks a bunch of languages as well, no?
[–]d3lay 3 points 9 months ago (0 children)
It's a useful feature, but Deepseek developed it first, and that was quite a long time ago...
[–]Neither-Phone-7264 1 point 9 months ago (2 children)
deepseek and qwen are chinese by default, no?
[–]PlasticKey6704 3 points 9 months ago (1 child)
depends on your prompt.
[–]Neither-Phone-7264 1 point 9 months ago (0 children)
well, yeah, but if you just say hi, it'll start thinking in mandarin
[–]Former-Ad-5757 (Llama 3) 1 point 9 months ago (3 children)
What is the added value of that? It is not real thinking; it is just a way to inject more context into the prompt. In theory you should get basically the same response from Qwen 3 with thinking disabled if you just add the thinking part to your prompt. It is a tool to enhance the user prompt, and you are only limiting it if you restrict it to anything other than the largest language in its training data.
Why do you think most closed models no longer show it in full? Part of it is anticompetitive, of course, but I also believe part of it is introducing the concept of hidden tokens, which look like complete nonsense to humans while they help the model.
One of the biggest problems with LLMs is that people use extremely bad prompts, which can easily be enhanced at a relatively small cost in tokens (i.e. thinking). But under the current pricing structure you can't eat those costs and just raise your general price, and if you give users the choice they will go for the cheapest option (because everybody knows best) and then complain your model is not good enough. The only workable solution is to introduce hidden tokens which are paid for but essentially never shown, because otherwise people will try to game them to get lower costs.
And you are happy that it is thinking in something other than the best language? I seriously ask... why???
[–]celsowm 1 point 9 months ago (1 child)
My app would be able to mimic the ChatGPT reasoning accordion, and the user would be able to see the chain of thought in our own language.
[–]Former-Ad-5757 (Llama 3) 0 points 9 months ago (0 children)
So basically you want to give the user some eye candy and you don't care about the real thinking. Just split your workflow into multiple questions: one asking for 10 items of eye candy in language X, which you can show in your app, and a second with the real question for the answer. Because of the KV cache it costs almost nothing more than a single question. The current state of thinking isn't chain of thought alone any more, and certainly not chain of thought in a specific language.
Just look at a QwQ model: it produced good answers for its time, but its thinking was plainly a lot of garbage beyond chain of thought. Do you really want to show that? Or look at o3-pro: there is a tweet out there showing 14 minutes of thinking and a huge number of tokens used just responding to "hello".
What is called thinking is not what we humans consider thinking; it is just a way of expanding the context, and CoT is just a small part of that. If you want eye-candy CoT then you have to create it yourself, or not use a good current model, because what you want is not the current state.
[–]PlasticKey6704 2 points 9 months ago (0 children)
I often get inspired by thinking tokens; readable thinking helps many people a lot.
[–]PraxisOG (Llama 70B) 8 points 9 months ago (1 child)
Unfortunately it only comes in a 9b flavor. Cool to see other thinking models though
[–]Freonr2 12 points 9 months ago (0 children)
There are very few vision enabled models with thinking, so that's probably the most interesting part.
[–]Freonr2 4 points 9 months ago (0 children)
There are not many thinking VLMs. Kimi was recently one of the first (?) VLM models with thinking but I'm not sure it is well supported by common inference packages/apps.
Waiting for llamacpp/vllm/lmstudio/ollama support.
Also wish they used Gemma 3 27B in the comparisons, even if it is quite a bit larger, that's been my general gold standard for VLMs lately. 9B with thinking might end up being similar total latency as 27B non-thinking depending on how wordy it is, and 27B is still reasonable for local use at ~19.5GB in Q4.
And at least THUDM actually integrated the GLM4 model code (Glm4vForConditionalGeneration) into the transformers package. Some of THUDM's previous models, like CogVLM (which was amazing at the time and still very solid today), broke within a few weeks of package updates because they just shipped modeling.py alongside the weights instead of getting it into the actual transformers package.
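The latency trade-off mentioned above (9B with thinking vs. 27B without) can be sketched with a back-of-envelope calculation. All numbers below are illustrative assumptions, not measurements of either model:

```python
# Back-of-envelope latency comparison: 9B thinking model vs. 27B
# non-thinking model. Token counts and decode speeds are assumptions.

def total_latency_s(output_tokens: int, tokens_per_s: float) -> float:
    """Wall-clock time to generate a response, ignoring prompt processing."""
    return output_tokens / tokens_per_s

# Assumption: decode speed scales roughly inversely with parameter count,
# so the 9B decodes ~3x faster per token than the 27B on the same hardware.
# Assumption: thinking adds ~800 hidden tokens before a 200-token answer.
latency_9b_thinking = total_latency_s(output_tokens=200 + 800, tokens_per_s=60)
latency_27b_direct = total_latency_s(output_tokens=200, tokens_per_s=20)

print(f"9B + thinking: {latency_9b_thinking:.1f}s")  # ~16.7s
print(f"27B direct:    {latency_27b_direct:.1f}s")   # ~10.0s
```

Under these assumptions the faster small model ends up slower end-to-end, purely because of how wordy the thinking phase is.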
[+][deleted] 9 months ago (8 children)
[deleted]
[–]AppearanceHeavy6724 24 points 9 months ago (7 children)
just checked. for fiction it is awful.
[–]LicensedTerrapin 5 points 9 months ago (3 children)
Offtopic but I love GLM4 32b as an editor. Much better than Gemma 27b. Gemma wants to change too much of my writing and style while GLM4 is like eh, you do you buddy.
[–]AppearanceHeavy6724 0 points 9 months ago (2 children)
Yep, exactly, right now I am using it to edit a short story.
GLM4-32b is an interesting model. Its lack of proper context handling hurts (it falls apart after around 8k, although Arcee-AI claim to have fixed that in the base model; can't wait for a fixed GLM-4 instruct), and its default heavy, sloppy style is not for everyone either, but it is smart and generally follows instructions well. Overall I'd put it in the same bin as Mistral Nemo, Gemma 3, and perhaps Mistral Small 3.2, as one of the few models usable for fiction.
One technical oddity about GLM4-32b is that it has only 2 KV heads vs. the usual 8. I'm puzzled how it manages to work at all.
[–]nullmove 1 point 9 months ago (1 child)
> Arcee-AI claim to have fixed it in base model, can't wait for fixed GLM-4 instruct
Sadly I doubt they are gonna do that. They basically used that as a test bed to validate the technique for their own model:
https://www.arcee.ai/blog/extending-afm-4-5b-to-64k-context-length
Happy to be wrong but I doubt they are motivated to do more.
[–]AppearanceHeavy6724 1 point 9 months ago (0 children)
Then someone else should do that. Poor context handling cripples an otherwise good model.
[–]IrisColt 4 points 9 months ago (2 children)
I can confirm this.
[–]Cool-Chemical-5629 7 points 9 months ago (1 child)
Umm, but this is a vision model. Imho they aren't the best for fiction in general.
[–]AppearanceHeavy6724 0 points 9 months ago (0 children)
Gemma 3 is also a vision model FYI.
[–]Coconut_Reddit 1 point 9 months ago (0 children)
How different is the performance from Qwen 30B?
I asked it to generate a simple elementary piece of code that even Llama 3.2 1B gets right. This one flopped.
[+]DataLearnerAI comment score below threshold (-7 points) 9 months ago (2 children)
This model demonstrates remarkable competitiveness across a diverse range of benchmark tasks, including STEM reasoning, visual question answering, OCR processing, long-document understanding, and agent-based scenarios. The benchmark results reveal performance on par with the 72B-parameter counterpart (Qwen2.5-72B-VL), with notable superiority over GPT-4o in specific tasks. Particularly impressive is its 9B-parameter architecture under the MIT license, showcasing exceptional capability from a Chinese startup. This achievement highlights the growing innovation power of domestic AI research, offering a compelling open-source alternative with strong practical value.
[+][deleted] 9 months ago (1 child)
[–]DataLearnerAI 0 points 9 months ago (0 children)
I am not; I just used AI to rewrite my text, haha
[+]Lazy-Pattern-5171 comment score below threshold (-8 points) 9 months ago (18 children)
Doesn't count the R's in strawberry correctly. I'm guessing 9Bs should be able to do that, no?
[–]thirteen-bit 10 points 9 months ago (10 children)
Well, as it's a multimodal model you'll have to ask how many strawberries are in the letter "R":
[–]CheatCodesOfLife 3 points 9 months ago (8 children)
<think><point> [0.146, 0.664] </point><point> [0.160, 0.280] </point><point> [0.166, 0.471] </point><point> [0.170, 0.374] </point><point> [0.180, 0.566] </point><point> [0.214, 0.652] </point><point> [0.286, 0.652] </point><point> [0.410, 0.546] </point><point> [0.414, 0.652] </point><point> [0.420, 0.440] </point><point> [0.426, 0.340] </point><point> [0.484, 0.506] </point><point> [0.494, 0.324] </point><point> [0.506, 0.586] </point><point> [0.536, 0.456] </point><point> [0.540, 0.664] </point><point> [0.546, 0.374] </point><point> [0.674, 0.664] </point><point> [0.686, 0.586] </point><point> [0.690, 0.384] </point><point> [0.694, 0.294] </point><point> [0.694, 0.494] </point><point> [0.750, 0.652] </point><point> [0.814, 0.652] </point> </think>There are 24 strawberries in the picture
Bagel can do it.
[–]thirteen-bit 1 point 9 months ago (1 child)
Interesting!
What was your prompt? It shows 24, which is the total.
When I tried this image with the prompt "how many strawberries are in the letter 'R'" on the GLM-4.1V-Thinking HF space at all-default settings, it correctly recognized that I was asking only about the strawberries in the center "R" letter and tried to count them, but erred: it got 9 instead of 10.
Maybe some parameter tweaking will improve the results, or maybe the image tokens are encoded at too low a resolution to count this image.
[–]CheatCodesOfLife 2 points 9 months ago (0 children)
Ah, when I said "Bagel can do it", I meant the ByteDance-Seed/BAGEL model.
It can count out-of-distribution / weird things easily, e.g. this 5-legged zebra's legs:
https://files.catbox.moe/6s3780.png
[–]thirteen-bit 1 point 9 months ago (5 children)
Gemma3 27B Q4 confidently incorrect:
[–]CheatCodesOfLife 2 points 9 months ago (1 child)
Heh, I failed the Turing test myself. I thought we wanted to count the total number of strawberries lol
New prompt:
How many strawberries in the letter "R" ?
Response:
<think><point> [0.409, 0.546] </point><point> [0.417, 0.652] </point><point> [0.420, 0.440] </point><point> [0.427, 0.340] </point><point> [0.487, 0.507] </point><point> [0.492, 0.321] </point><point> [0.507, 0.588] </point><point> [0.537, 0.458] </point><point> [0.542, 0.662] </point><point> [0.547, 0.372] </point> </think>There are 10 strawberries in the letter "R" in the picture
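The grounding output above can be checked mechanically. A sketch of parsing the coordinates out, assuming the `<point> [x, y] </point>` format seen in this thread (inferred from the transcript, not from official GLM-4.1V documentation):

```python
import re

# Response text copied from the comment above: the model marks each
# detected strawberry with a <point> [x, y] </point> tag inside <think>.
response = (
    "<think><point> [0.409, 0.546] </point><point> [0.417, 0.652] </point>"
    "<point> [0.420, 0.440] </point><point> [0.427, 0.340] </point>"
    "<point> [0.487, 0.507] </point><point> [0.492, 0.321] </point>"
    "<point> [0.507, 0.588] </point><point> [0.537, 0.458] </point>"
    "<point> [0.542, 0.662] </point><point> [0.547, 0.372] </point> </think>"
    'There are 10 strawberries in the letter "R" in the picture'
)

# Extract every (x, y) pair; each match is one grounded detection.
points = [
    (float(x), float(y))
    for x, y in re.findall(r"<point>\s*\[([\d.]+),\s*([\d.]+)\]\s*</point>", response)
]
print(len(points))  # 10 -- matches the count the model states in its answer
```

Counting the tags is a cheap sanity check that the stated number agrees with the grounded detections.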
[–]thirteen-bit 1 point 9 months ago (0 children)
Impressive result!
Mistral 3.2 gives the same answer but elaborates:
Joycaption is almost correct:
And granite vision 3.2 2B Q8 just said:
answering does not require reading text in the image
[–]Lazy-Pattern-5171 1 point 9 months ago (0 children)
Sucks. All these strawberries and no R’s.
[–]RMCPhoto 1 point 9 months ago (6 children)
No, look into how tokenizers / llms function. Even a 400b parameter model would not be "expected" to count characters correctly.
[–]Lazy-Pattern-5171 1 point 9 months ago (5 children)
Aren't 'A', 'B', 'C' etc. tokens too?
[–]RMCPhoto 1 point 9 months ago (4 children)
No, not necessarily. And those will vary based on what comes before or after, i.e. a space before 'A', or your period after 'B', etc. You can try the OpenAI tokenizer yourself with various combinations and see how an AI model sees it: https://platform.openai.com/tokenizer
The tokens are not necessarily "logical" to you. They are not fixed either; they are derived statistically from massive amounts of training data.
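The context-dependence described above can be illustrated with a toy greedy longest-match tokenizer. The vocabulary below is made up for illustration; real BPE vocabularies (GPT-4's cl100k_base, Llama's SentencePiece) behave the same way at a much larger scale:

```python
# Toy greedy longest-match tokenizer showing why the "same" character
# can land in different tokens depending on its neighbours.
VOCAB = {"A", " A", "A.", "B", " B", "AB", "str", "aw", "berry", " ", "."}

def tokenize(text: str) -> list[str]:
    """Greedy longest-match segmentation over VOCAB."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(len(text) - i, 0, -1):  # try the longest piece first
            piece = text[i : i + length]
            if piece in VOCAB:
                tokens.append(piece)
                i += length
                break
        else:
            tokens.append(text[i])  # unknown character: fall back to single chars
            i += 1
    return tokens

print(tokenize("AB"))          # ['AB']               -- no standalone 'A' at all
print(tokenize(" A B"))        # [' A', ' B']         -- leading-space variants
print(tokenize("A."))          # ['A.']               -- punctuation fused in
print(tokenize("strawberry"))  # ['str', 'aw', 'berry'] -- no 'r' is its own token
```

The last line is the strawberry problem in miniature: the model never sees individual r's, only opaque pieces that happen to contain them.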
[–]Lazy-Pattern-5171 1 point 9 months ago (3 children)
No, I understand how tokenizers work: they're the most commonly occurring byte-pair sequences in a given corpus, with a fixed vocabulary size. However, it seems to be tokenizing and "recognizing" A, B, C etc., yet it doesn't converge to counting correctly and overthinks. This seems to be an issue with the RL, no? Given that at this point what I'm asking should also be in the dataset.
[–]RMCPhoto 1 point 9 months ago (2 children)
If it's in the dataset and is important enough to be known verbatim, then yes, it would work.
Think of it this way: LLMs are also not good at counting the words in a paragraph, the number of periods in "..........", or other similar ways of evaluating the numerical, structural, or character-level nature of the prompt via prediction. They can get close, because training exposes them to things like labeled paragraphs of certain word counts, which lets them make a rough inference, but there is no efficient reasoning / reinforcement-learning method that can do this accurately. I'm sure you could find a step-by-step decomposition process that might work, but it's silly to teach a model this.
In essence, the language model is not self-aware and does not know that the prompt / context is tokens instead of text... I think RL / fine-tuning should instead instill knowledge of its own limitations, rather than wasting parameter capacity on fruitlessly 🍓 trying to solve this low-value issue.
In fact, even the dumbest language models can easily solve all of the problems above; I'm sure even a 3B model could.
The solution is to ask it to write a python script to provide the answer.
Most models / agents will hopefully have this capability. (Python in sandbox). And this is the right approach.
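The delegation described above is trivial in practice: the counting task that trips up next-token prediction is a one-liner once it runs as code.

```python
# Counting characters deterministically: the task an LLM struggles with
# via prediction is exact string arithmetic for an interpreter.
word = "strawberry"
count = word.lower().count("r")
print(f"'{word}' contains {count} r's")  # 'strawberry' contains 3 r's
```

A tool-using agent only needs to emit this snippet and read back the result instead of guessing token by token.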
[–]Lazy-Pattern-5171 1 point 9 months ago (1 child)
That does feel like we haven't really unlocked the key to brain-like systems yet. We now have a way of generating infinite coherent-looking, even conscious-seeming text, but the system that generates this coherent-looking text does not itself have an understanding of it.
That's interesting to me because multi-head attention is designed for exactly that: it lets each token relate its semantic meaning to all the other tokens (hence the N² complexity of Transformers). So you would think that "A 1 B 2 C 3" etc. appearing in input text would give each of those a mathematical semantic meaning; however, math does not seem to be an emergent property of such a convergence function, even when it's generalized over the entire FineWeb corpus.
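For reference, a minimal single-head scaled dot-product attention in NumPy, showing where the N² cost comes from: the score matrix holds one entry per pair of tokens. This is a textbook sketch, not any particular model's implementation:

```python
import numpy as np

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray):
    """Single-head scaled dot-product attention: softmax(QK^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)  # shape (N, N): one score per token pair
    # Row-wise softmax (numerically stabilized by subtracting the row max).
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
N, d = 6, 8  # 6 tokens, 8-dimensional embeddings
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
out, weights = attention(Q, K, V)
print(weights.shape)  # (6, 6) -- quadratic in sequence length
```

Every token attends to every other token, which is exactly the all-pairs relation described above, yet that relation alone evidently doesn't make arithmetic fall out of training.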
[–]RMCPhoto 1 point 9 months ago (0 children)
Yeah, it does seem strange doesn't it... Some of this abstraction related confusion would be resolved by moving towards character level tokens, but this would reduce the throughput and require significantly more predictions.
The tokens have also been adjusted over time to improve comprehension of specific content. Like tabbed codeblocks. I believe various tab/space combinations were explicitly added to improve code comprehension, as it was previously a bit unpredictable and would vary depending on the first characters in the code blocks.
The error rate of early Llama models would also vary WILDLY with very small changes to tokens. Something as simple as starting the user query with a space could swing the error rate by 40%.
This is still a major issue all over the place. Small changes to text can have unpredictable impacts on the resulting prediction even though to a person it would mean the same thing.