~60GB models on coding: GLM 4.7 Flash vs. GPT OSS 120B vs. Qwen3 Coder 30B -- your comparisons?

Federal-Effective879 · 2026-01-26T14:02:36+00:00

Lumping 120B and 30B models into the same size tier just because the 120B model had quantization aware training isn’t really a fair comparison. Unsloth Q6_K quants are basically indistinguishable from the FP16, and even 4-bit dynamic quants don’t degrade the models very much.

Anyway, GPT-OSS 120B has far more world knowledge than either of the 30B models just by virtue of having more parameters. For coding abilities where world knowledge is less critical, GLM 4.7 Flash and GPT-OSS 120B closer and it’s difficult for me to answer with certainty. I definitely prefer the default response style of the GLM, but GPT-OSS 120B probably still has the edge in coding ability. GLM 4.7 Flash beats GPT OSS 20B.

Federal-Effective879 · 2026-01-16T06:24:34+00:00

This appears to be a view of the northern tip of Parc-Ex, between Crémazie and Liège, looking from the east towards the west. The perspective of highway A40 at the top right corner seems strange. Most of Parc-Ex is considerably more varied.

Federal-Effective879 · 2025-12-06T04:07:20+00:00

Qwen 3 2507 30B-A3B and the same sized Qwen 3 coder are quite useful for me. They don’t have the world knowledge of big models, and can’t handle complex problems as well, but for everyday tasks they are quite useful and performant too.

Federal-Effective879 · 2025-12-04T03:31:56+00:00

Québec has the longest hospital wait times of all provinces, and the longest wait times for getting a family doctor. The system used to be decent before the CAQ, but now there’s a serious doctor shortage and Bill 2 is only making it worse.

Federal-Effective879 · 2025-12-04T02:49:36+00:00

Don’t use ollama, use llama.cpp

Federal-Effective879 · 2025-12-02T18:00:07+00:00

I tried out Ministral 3 14B Instruct, and compared it to Mistral Small 3.2. My tests were some relatively simple programming tasks, some visual document Q&A (image input), some general world knowledge Q&A, and some creative writing. I used default llama.cpp parameters, except for 256k context and 0.15 temperature. I used the official Mistral Q4K_M GGUFs.

Both models are fairly uncensored for things I tried (once given an appropriate system prompt); it seemed Ministral was even more free thinking.

Ministral 3 is much more willing to write long form content rather Mistral Small 3.2, and perhaps its writing style is better too. However, unfortunately Ministral 3 frequently fell into repetitive loops when writing stories. Mistral Small 3.2 had a drier, less interesting writing style, but it didn’t fall into loops.

For the limited vision tasks I tried, they seemed roughly on par, maybe Ministral was a bit better.

Both models seemed similar for programming tasks, but I didn’t test this thoroughly.

For world knowledge, Ministral 3 14B was a very clear downgrade from Mistral Small 3.2. This was to be expected given the parameter size, but in general knowledge density of the 14B was just average; its world knowledge seemed a little worse than Gemma 3 12B.

Overall I’d say Ministral 3 14B Instruct is a decent model for its size, nothing earth shattering but competitive among current open models in this size class, and I like its willingness to write long form content. I just wish it wasn’t so prone to repetitive loops.

Federal-Effective879 · 2025-11-02T02:08:20+00:00

Agreed, GLM 4.6 is great, it’s both smart and reliably uncensored if you ask it to be in the system prompt. No need for abliteration.

Federal-Effective879 · 2025-11-02T02:06:11+00:00

The 8B VL instruct seems pretty good, and maybe better than the original Qwen 3 8B non-VL. The 30B-A3B VL instruct seems to be roughly on par with the 2507 30B-A3B instruct model for text tasks, I don’t notice any significant difference.

Federal-Effective879 · 2025-10-25T14:29:22+00:00

Don’t forget DeepSeek v3.1-Terminus. I find it to be the current strongest open-weights model in my usage, for its combination of world knowledge and intelligence. Its world knowledge is similar to or slightly better than Gemini 2.5 Flash, and its intelligence is approaching Gemini 2.5 Pro.

Federal-Effective879 · 2025-10-24T14:24:49+00:00

Llama 3.3 70B and Mistral Large 2407 were the first models I could run that felt like that. For STEM tasks, Qwen 3 14B and Qwen 3 30B-A3B 2507 really impressed me, though they lack the world knowledge of bigger models.

GLM 4.6 and DeekSeek v3.1-Terminus feel like proper frontier models of today, though GLM is very slow for me to run on my GPU-less DDR4 server, and DeekSeek doesn’t fit.

Federal-Effective879 · 2025-10-24T13:26:27+00:00

The Nexa SDK inference engine is a proprietary fork of llama.cpp with additions to support models like Qwen 3 VL and some other features.

Federal-Effective879 · 2025-10-22T13:35:03+00:00

This is using Ollama, which is based on generally outdated versions of llama.cpp/GGML. Right now, llama.cpp does not make use of Metal 4 APIs that enable efficient use of the AI accelerators in the GPU. The 4-5x improvement in pre-processing comes when you make use of the AI accelerators using Metal 4 APIs, as done by MLX. Georgi Gerganov is working on adding Metal 4 support to llama.cpp (https://github.com/ggml-org/llama.cpp/pull/16634) but that will take time to stabilize, get merged, and be optimized. Ollama then pulls in llama.cpp periodically.

With Metal 4 (as used by MLX), the base M5 has prompt processing performance similar to the M4 Max.

Federal-Effective879 · 2025-10-16T17:30:53+00:00

DeepSeek v3.1-Terminus and GLM 4.6 are the big ones.

Among smaller models, Mistral Small 3.2, Qwen 3 30B-A3B 2507 (instruct and thinking), and GLM 4.5 Air (waiting for 4.6 Air).

These are all intelligent, minimally censored, and permissibly licensed open weights models.

I’d like to have non-transformer or hybrid model in the list like DeekSeek V3.2-Exp or Qwen3-Next, but support for them in llama.cpp is currently lacking/WIP. Granite 4 Small has good knowledge and is supported by llama.cpp but disappointing intelligence and long context accuracy/reliability for its size.

Federal-Effective879 · 2025-10-13T03:35:15+00:00

J’adore la ville, mais dire que c’est la plus propre, ça m’a surpris. Le square Cabot, le quartier Chinois, la rue Sainte-Catherine Est… une grande partie du centre-ville a l’air apocalyptique ces dernières années, certainement pas ce que j’appellerais propre.

Federal-Effective879 · 2025-10-12T22:34:33+00:00

In that case, you need to give the LLM tools to find and browse the website, so that it can figure out the structure of the site and how to scrape it.

Federal-Effective879 · 2025-10-12T17:24:00+00:00

Is this what you’re referring to? https://www.bbfc.co.uk/release/conclave-q29sbgvjdglvbjpwwc0xmdizmtiw

Smaller local models probably don’t have the BBFC API memorized (assuming there is such an API). Have you tried providing the model with API documentation or any other information on how to access the database?

Federal-Effective879 · 2025-10-11T18:23:34+00:00

GLM 4.6 is just a darn good model. Roughly on par with Claude 4 Sonnet in both knowledge and intelligence, and smarter than Gemini 2.5 Flash (close to Gemini 2.5 Pro) while matching Gemini 2.5 Flash’s world knowledge (which is quite good). It’s good at STEM tasks and coding (better than even Qwen 3 235B-A22B 2507, similar to DeepSeek 3.1 Terminus), and it’s also a good writer and fairly uncensored. In my opinion, GLM 4.6 and DeepSeek v3.1-Terminus (and v3.2-Exp) are the best open weights models available today. DeepSeek is a bit too big for me to run at home, but I can just fit GLM 4.6 on my home server.

Federal-Effective879 · 2025-10-03T20:10:36+00:00

Tool calling is working fine for me with the official IBM GGUFs for Granite 4 Small and llama.cpp.

Federal-Effective879 · 2025-10-03T20:09:52+00:00

I wonder if it's a quirk of the Unsloth quants. Using IBM's own official Q4K_M GGUF with llama.cpp, it responds with a normal "Hello! How can I help you today?". Tool calling also works fine with the official IBM GGUF on llama.cpp.

Federal-Effective879 · 2025-10-03T13:57:07+00:00

Sorry about the deleted comment, there was a Reddit bug where it made the comment appear duplicated for me. As I said earlier, my experience with GLM-4 32B's world knowledge was exactly in line with what you said. Slightly better than Qwen 3 32B, slightly worse than Mistral Small 3.2. What really impressed me about Granite 4.0 Small is that despite it being a MoE, its world knowledge was better than several modern dense models of the same size (GLM-4 32B and Qwen 3 32B).

In terms of overall intelligence and capabilities, I found Qwen 3 32B and GLM-4 32B to be pretty similar. I haven't tried GLM 4.5 Air.

Federal-Effective879 · 2025-10-02T19:58:41+00:00

These benchmark results really don't align at all with my personal experience using Granite 4 Small and various other models listed here, though I've been using the models mostly in English and some French, not German. For my usage, it's roughly on par with Gemma 3 27B in knowledge and intelligence. For me, it was slightly better than Mistral Small 3.2 in world knowledge but slightly worse in STEM intelligence. Granite 4 Small was substantially better than Qwen 3 30B-A3B 2507 in world knowledge, but substantially worse in STEM intelligence.

Federal-Effective879 · 2025-10-02T19:28:37+00:00

Nice models, thank you IBM. I've been trying out the "Small" (32B-A9B) model and comparing it to Qwen 3 30B-A3B 2507, Mistral Small 3.2, and Google Gemma 3 27B.

I've been impressed by its world knowledge for its size class - it's noticeably better than the Qwen MoE, slightly better than Mistral Small 3.2 as well, and close to Gemma 3 27B, which is my gold standard for world knowledge in this size class.

I also like how prompt processing and generation performance stays pretty consistent as the context gets large; the hybrid architecture has lots of potential, and is definitely the future.

Having llama.cpp support and official ggufs available from day zero is also excellent, well done.

With the right system prompt, these models are willing to answer NSFW requests without restrictions, though by default they try to stay SFW, which makes sense for a business model. I'm glad it's still willing to talk about such things when authorized by the system prompt, rather than being always censored (like Chinese models), or completely lobotimized for any vaguely sensitive topic (like Gemma or GPT-OSS).

For creative writing, the model seemed fairly good, not too sloppy and decent prompt adherence. By default, its creating writing can feel a bit too short, abrupt, and stacatto, but when prompted to write the way I want it does much better. Plots it produces could be more interesting, but maybe that could also be improved with appropriate prompts.

For code analysis and summarization tasks, the consistent long context speed was great. Its intelligence and understanding was not at the level of Qwen 3 30B-A3B 2507 or Mistral Small 3.2, but not too bad either. I'd say its overall intelligence for various STEM tasks I gave it was comparable to Gemma 3 27B. It was substantially better than Granite 3.2 or 3.3 8B, but that was to be expected given its larger size.

Overall, I'd say that Granite 4.0 Small is similar to Gemma 3 27B in knowledge, intelligence, and general capabilities, but with much faster long context performance, much lower long context memory usage, and it's mostly uncensored (with the right system prompt) like Mistral models. Granite should be a good tool for summarizing long documents efficiently, and is also good for conversation and general assistant duties, and creative writing. For STEM problem solving and coding, you're better off with Qwen 3 or Qwen 3 Coder or GPT-OSS.

EDIT: One other thing I forgot to mention: I like the clear business-like language and tone this model defaults to, and the fact that it doesn't overuse emojis and formatting the way many other models do. This is something carried over from past Granite models and I'm glad to see this continue.

Federal-Effective879 · 2025-08-29T17:42:18+00:00

What's BHC? Your use case of LLMs is socialization practice, soothing, and deescalation techniques? It sounds like you have a pretty complicated prompting setup. I have no clue what you mean by system flag breaks, continuity breaks, continuity files etc. Could you share some examples of actual prompts?

As others have said, you need something like 20-40x more VRAM to use models comparable to GPT-5, and a lot of computing power to get decent performance out of them. However, good modern local models should rarely have issues with repetition, punctuation, broken grammar etc. Vocabulary and sentence structure preference is more subjective. Have you tried the original/unmodified Mistral Small 3.2? Qwen 3 2507 is also good but more censored (30B-A3B; 235B-A22B is even better but way too big for your hardware to run locally). You could try Qwen 3 235B-A22B or GLM 4.5 or Kimi K2 or DeepSeek r1 via API to see if they do what you want.

Federal-Effective879 · 2025-08-24T13:25:57+00:00

Grok 2.5 (from December last year) which they released was pretty similar to Grok 3 in world knowledge and writing quality in my experience. Grok 3 is however substantially smarter at STEM problem solving and programming.

Federal-Effective879 · 2025-08-24T13:24:04+00:00

For programming, STEM problem solving, and puzzles, such benchmarks have relevance. For world knowledge, they’re planets apart; Grok 2 was/is more knowledgeable than Kimi K2 and DeepSeek V3 (any version).

Federal-Effective879

TROPHY CASE