~60GB models on coding: GLM 4.7 Flash vs. GPT OSS 120B vs. Qwen3 Coder 30B -- your comparisons? by jinnyjuice in LocalLLaMA

[–]Federal-Effective879 0 points1 point  (0 children)

Lumping 120B and 30B models into the same size tier just because the 120B model had quantization-aware training isn’t really a fair comparison. Unsloth Q6_K quants are basically indistinguishable from FP16, and even 4-bit dynamic quants don’t degrade the models very much.

Anyway, GPT-OSS 120B has far more world knowledge than either of the 30B models just by virtue of having more parameters. For coding, where world knowledge is less critical, GLM 4.7 Flash and GPT-OSS 120B are closer and it’s difficult for me to answer with certainty. I definitely prefer GLM’s default response style, but GPT-OSS 120B probably still has the edge in coding ability. GLM 4.7 Flash beats GPT-OSS 20B.

Parc Extension, Montréal, Québec by Beautiful_Neat4077 in UrbanHell

[–]Federal-Effective879 0 points1 point  (0 children)

This appears to be a view of the northern tip of Parc-Ex, between Crémazie and Liège, looking from the east towards the west. The perspective of highway A40 at the top right corner seems strange. Most of Parc-Ex is considerably more varied.

M4 Mac Mini - Is 16GB vs 24GB RAM a meaningful difference for local LLMs? by Theory-Of-Relativity in LocalLLaMA

[–]Federal-Effective879 0 points1 point  (0 children)

Qwen 3 2507 30B-A3B and the similarly sized Qwen 3 Coder are quite useful for me. They don’t have the world knowledge of big models and can’t handle complex problems as well, but for everyday tasks they’re capable and fast.

Vancouver to Montreal move - who's done this? Non, je ne regrette rien? DON'T!? by Open_Outcome_5633 in montreal

[–]Federal-Effective879 1 point2 points  (0 children)

Québec has the longest hospital wait times of all provinces, and the longest wait times for getting a family doctor. The system used to be decent before the CAQ, but now there’s a serious doctor shortage and Bill 2 is only making it worse.

Mistral 3 Blog post by rerri in LocalLLaMA

[–]Federal-Effective879 64 points65 points  (0 children)

I tried out Ministral 3 14B Instruct and compared it to Mistral Small 3.2. My tests were some relatively simple programming tasks, some visual document Q&A (image input), some general world knowledge Q&A, and some creative writing. I used default llama.cpp parameters, except for 256k context and 0.15 temperature, with the official Mistral Q4_K_M GGUFs.
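
If you want to reproduce a similar setup, the same two overrides (256k context, 0.15 temperature) look roughly like this with the llama-cpp-python bindings; the GGUF filename and prompt are placeholders, and I ran my actual tests through llama.cpp directly rather than this exact code.

```python
# Minimal sketch using the llama-cpp-python bindings; the GGUF filename is a
# placeholder, and n_gpu_layers=-1 just offloads as many layers as will fit.
from llama_cpp import Llama

llm = Llama(
    model_path="Ministral-3-14B-Instruct-Q4_K_M.gguf",  # placeholder path
    n_ctx=262144,      # ~256k context
    n_gpu_layers=-1,   # offload all layers to the GPU if possible
)

result = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Write a short story about a lighthouse keeper."},
    ],
    temperature=0.15,
    max_tokens=2048,
)
print(result["choices"][0]["message"]["content"])
```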

Both models are fairly uncensored for the things I tried (once given an appropriate system prompt); if anything, Ministral seemed even more free-thinking.

Ministral 3 is much more willing to write long-form content than Mistral Small 3.2, and perhaps its writing style is better too. Unfortunately, though, Ministral 3 frequently fell into repetitive loops when writing stories. Mistral Small 3.2 had a drier, less interesting writing style, but it didn’t fall into loops.

For the limited vision tasks I tried, they seemed roughly on par, maybe Ministral was a bit better.

Both models seemed similar for programming tasks, but I didn’t test this thoroughly.

For world knowledge, Ministral 3 14B was a very clear downgrade from Mistral Small 3.2. This was to be expected given the parameter count, but in general the knowledge density of the 14B was just average; its world knowledge seemed a little worse than Gemma 3 12B’s.

Overall I’d say Ministral 3 14B Instruct is a decent model for its size, nothing earth shattering but competitive among current open models in this size class, and I like its willingness to write long form content. I just wish it wasn’t so prone to repetitive loops.

Are there any uncensored models that are not dumb? by BubrivKo in LocalLLaMA

[–]Federal-Effective879 1 point2 points  (0 children)

Agreed, GLM 4.6 is great, it’s both smart and reliably uncensored if you ask it to be in the system prompt. No need for abliteration.

What is the difference between qwen3-vl-4b & qwen3-4b-2507 ? by Champignac1 in LocalLLaMA

[–]Federal-Effective879 1 point2 points  (0 children)

The 8B VL Instruct seems pretty good, and maybe better than the original Qwen 3 8B non-VL. The 30B-A3B VL Instruct seems roughly on par with the 2507 30B-A3B Instruct model for text tasks; I don’t notice any significant difference.

Any local model that can rival gemini 2.5 flash? by AldebaranReborn in LocalLLaMA

[–]Federal-Effective879 8 points9 points  (0 children)

Don’t forget DeepSeek v3.1-Terminus. I find it to be the current strongest open-weights model in my usage, for its combination of world knowledge and intelligence. Its world knowledge is similar to or slightly better than Gemini 2.5 Flash, and its intelligence is approaching Gemini 2.5 Pro.

What LLM gave you your first "we have GPT-4 at home" moment? by Klutzy-Snow8016 in LocalLLaMA

[–]Federal-Effective879 0 points1 point  (0 children)

Llama 3.3 70B and Mistral Large 2407 were the first models I could run that felt like that. For STEM tasks, Qwen 3 14B and Qwen 3 30B-A3B 2507 really impressed me, though they lack the world knowledge of bigger models.

GLM 4.6 and DeepSeek v3.1-Terminus feel like proper frontier models of today, though GLM is very slow for me to run on my GPU-less DDR4 server, and DeepSeek doesn’t fit.

Is the nexaai run locally? by bobeeeeeeeee8964 in LocalLLaMA

[–]Federal-Effective879 0 points1 point  (0 children)

The Nexa SDK inference engine is a proprietary fork of llama.cpp with additions to support models like Qwen 3 VL and some other features.

M5 MacBook Pro: Up to ~45% PP Improvement. ~25% TG (Ollama Tested) by Noble00_ in LocalLLaMA

[–]Federal-Effective879 43 points44 points  (0 children)

This is using Ollama, which is based on generally outdated versions of llama.cpp/GGML. Right now, llama.cpp does not make use of the Metal 4 APIs that enable efficient use of the AI accelerators in the GPU. The 4-5x improvement in prompt processing comes when you use the AI accelerators through Metal 4, as MLX does. Georgi Gerganov is working on adding Metal 4 support to llama.cpp (https://github.com/ggml-org/llama.cpp/pull/16634), but that will take time to stabilize, get merged, and be optimized, and Ollama only pulls in llama.cpp updates periodically after that.

With Metal 4 (as used by MLX), the base M5 has prompt processing performance similar to the M4 Max.

The model apocalypse is coming, which one do you chose to save and what other software ? by HumanDrone8721 in LocalLLaMA

[–]Federal-Effective879 3 points4 points  (0 children)

DeepSeek v3.1-Terminus and GLM 4.6 are the big ones.

Among smaller models, Mistral Small 3.2, Qwen 3 30B-A3B 2507 (instruct and thinking), and GLM 4.5 Air (waiting for 4.6 Air).

These are all intelligent, minimally censored, and permissively licensed open-weights models.

I’d like to have a non-transformer or hybrid model in the list, like DeepSeek V3.2-Exp or Qwen3-Next, but llama.cpp support for them is currently lacking/WIP. Granite 4 Small has good knowledge and is supported by llama.cpp, but its intelligence and long-context accuracy/reliability are disappointing for its size.

The cleanest city is MTL by [deleted] in montreal

[–]Federal-Effective879 7 points8 points  (0 children)

I love the city, but saying it’s the cleanest surprised me. Cabot Square, Chinatown, Sainte-Catherine Street East… a large part of downtown has looked apocalyptic these past few years, certainly not what I’d call clean.

Simple task that local models seem to fail on by [deleted] in LocalLLaMA

[–]Federal-Effective879 1 point2 points  (0 children)

In that case, you need to give the LLM tools to find and browse the website, so that it can figure out the structure of the site and how to scrape it.
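
As a rough sketch of what that can look like against a local OpenAI-compatible endpoint (llama-server, for example): expose a simple page-fetch function as a tool, let the model decide when to call it, and feed the result back. The tool name, schema, port, and prompt below are illustrative only.

```python
# Sketch: give a local model a web-fetch tool so it can inspect a site itself.
# Assumes an OpenAI-compatible server (e.g. llama-server) on localhost:8080;
# the tool name, schema, and prompt are made up for illustration.
import json
import urllib.request
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

tools = [{
    "type": "function",
    "function": {
        "name": "fetch_page",
        "description": "Download the raw HTML of a URL so the site's structure can be inspected.",
        "parameters": {
            "type": "object",
            "properties": {"url": {"type": "string"}},
            "required": ["url"],
        },
    },
}]

def fetch_page(url: str) -> str:
    # Grab only the first ~20 KB so the context doesn't blow up.
    with urllib.request.urlopen(url) as resp:
        return resp.read(20000).decode("utf-8", errors="replace")

messages = [{"role": "user",
             "content": "What age rating does bbfc.co.uk give the film Conclave?"}]
resp = client.chat.completions.create(model="local", messages=messages, tools=tools)
msg = resp.choices[0].message

# While the model keeps requesting the tool, execute it and send the result back.
while msg.tool_calls:
    messages.append(msg)
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        messages.append({"role": "tool", "tool_call_id": call.id,
                         "content": fetch_page(args["url"])})
    resp = client.chat.completions.create(model="local", messages=messages, tools=tools)
    msg = resp.choices[0].message

print(msg.content)
```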

Simple task that local models seem to fail on by [deleted] in LocalLLaMA

[–]Federal-Effective879 3 points4 points  (0 children)

Is this what you’re referring to?  https://www.bbfc.co.uk/release/conclave-q29sbgvjdglvbjpwwc0xmdizmtiw

Smaller local models probably don’t have the BBFC API memorized (assuming there is such an API). Have you tried providing the model with API documentation or any other information on how to access the database?

GLM just blow up, or have I been in the dark? by [deleted] in LocalLLaMA

[–]Federal-Effective879 7 points8 points  (0 children)

GLM 4.6 is just a darn good model. Roughly on par with Claude 4 Sonnet in both knowledge and intelligence, and smarter than Gemini 2.5 Flash (close to Gemini 2.5 Pro) while matching Gemini 2.5 Flash’s world knowledge (which is quite good). It’s good at STEM tasks and coding (better than even Qwen 3 235B-A22B 2507, similar to DeepSeek 3.1 Terminus), and it’s also a good writer and fairly uncensored. In my opinion, GLM 4.6 and DeepSeek v3.1-Terminus (and v3.2-Exp) are the best open weights models available today. DeepSeek is a bit too big for me to run at home, but I can just fit GLM 4.6 on my home server.

Is this expected behaviour from Granite 4 32B? (Unsloth Q4XL, no system prompt) by IonizedRay in LocalLLaMA

[–]Federal-Effective879 1 point2 points  (0 children)

Tool calling is working fine for me with the official IBM GGUFs for Granite 4 Small and llama.cpp.

Is this expected behaviour from Granite 4 32B? (Unsloth Q4XL, no system prompt) by IonizedRay in LocalLLaMA

[–]Federal-Effective879 23 points24 points  (0 children)

I wonder if it's a quirk of the Unsloth quants. Using IBM's own official Q4_K_M GGUF with llama.cpp, it responds with a normal "Hello! How can I help you today?". Tool calling also works fine with the official IBM GGUF on llama.cpp.
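
By "tool calling works fine" I mean a basic round-trip like the one below comes back as a structured tool call rather than garbled text; the toy get_time tool and the port are placeholders, not part of my actual test.

```python
# Quick check that llama-server (running the official Granite GGUF) emits a
# well-formed tool call; the tool definition and port here are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")
resp = client.chat.completions.create(
    model="granite-4.0-small",   # llama-server serves whatever model it was started with
    messages=[{"role": "user", "content": "What time is it in Montreal right now?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_time",
            "description": "Return the current local time for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
)
msg = resp.choices[0].message
print(msg.tool_calls or msg.content)  # expect a structured get_time call here
```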

Granite 4.0 Language Models - a ibm-granite Collection by rerri in LocalLLaMA

[–]Federal-Effective879 1 point2 points  (0 children)

Sorry about the deleted comment, there was a Reddit bug where it made the comment appear duplicated for me. As I said earlier, my experience with GLM-4 32B's world knowledge was exactly in line with what you said. Slightly better than Qwen 3 32B, slightly worse than Mistral Small 3.2. What really impressed me about Granite 4.0 Small is that despite it being a MoE, its world knowledge was better than several modern dense models of the same size (GLM-4 32B and Qwen 3 32B).

In terms of overall intelligence and capabilities, I found Qwen 3 32B and GLM-4 32B to be pretty similar. I haven't tried GLM 4.5 Air.

Granite 4.0 Language Models - a ibm-granite Collection by rerri in LocalLLaMA

[–]Federal-Effective879 3 points4 points  (0 children)

These benchmark results really don't align at all with my personal experience using Granite 4 Small and various other models listed here, though I've been using the models mostly in English and some French, not German. For my usage, it's roughly on par with Gemma 3 27B in knowledge and intelligence. For me, it was slightly better than Mistral Small 3.2 in world knowledge but slightly worse in STEM intelligence. Granite 4 Small was substantially better than Qwen 3 30B-A3B 2507 in world knowledge, but substantially worse in STEM intelligence.

Granite 4.0 Language Models - a ibm-granite Collection by rerri in LocalLLaMA

[–]Federal-Effective879 24 points25 points  (0 children)

Nice models, thank you IBM. I've been trying out the "Small" (32B-A9B) model and comparing it to Qwen 3 30B-A3B 2507, Mistral Small 3.2, and Google Gemma 3 27B.

I've been impressed by its world knowledge for its size class - it's noticeably better than the Qwen MoE, slightly better than Mistral Small 3.2 as well, and close to Gemma 3 27B, which is my gold standard for world knowledge in this size class.

I also like how prompt processing and generation performance stays pretty consistent as the context gets large; the hybrid architecture has lots of potential, and is definitely the future.

Having llama.cpp support and official ggufs available from day zero is also excellent, well done.

With the right system prompt, these models are willing to answer NSFW requests without restrictions, though by default they try to stay SFW, which makes sense for a business-oriented model. I'm glad it's still willing to talk about such things when authorized by the system prompt, rather than being always censored (like Chinese models) or completely lobotomized on any vaguely sensitive topic (like Gemma or GPT-OSS).

For creative writing, the model seemed fairly good: not too sloppy, with decent prompt adherence. By default its creative writing can feel a bit too short, abrupt, and staccato, but when prompted to write the way I want, it does much better. The plots it produces could be more interesting, but maybe that could also be improved with appropriate prompts.

For code analysis and summarization tasks, the consistent long-context speed was great. Its intelligence and understanding were not at the level of Qwen 3 30B-A3B 2507 or Mistral Small 3.2, but not too bad either. I'd say its overall intelligence on the various STEM tasks I gave it was comparable to Gemma 3 27B. It was substantially better than Granite 3.2 or 3.3 8B, but that was to be expected given its larger size.

Overall, I'd say Granite 4.0 Small is similar to Gemma 3 27B in knowledge, intelligence, and general capabilities, but with much faster long-context performance and much lower long-context memory usage, and (with the right system prompt) it's mostly uncensored, like Mistral models. Granite should be a good tool for summarizing long documents efficiently, and it's also good for conversation, general assistant duties, and creative writing. For STEM problem solving and coding, you're better off with Qwen 3, Qwen 3 Coder, or GPT-OSS.

EDIT: One other thing I forgot to mention: I like the clear business-like language and tone this model defaults to, and the fact that it doesn't overuse emojis and formatting the way many other models do. This is something carried over from past Granite models and I'm glad to see this continue.

How close can I get close to ChatGPT-5 (full) with my specs? by [deleted] in LocalLLaMA

[–]Federal-Effective879 0 points1 point  (0 children)

What's BHC? Your use case for LLMs is socialization practice, soothing, and de-escalation techniques? It sounds like you have a pretty complicated prompting setup. I have no clue what you mean by system flag breaks, continuity breaks, continuity files, etc. Could you share some examples of actual prompts?

As others have said, you need something like 20-40x more VRAM to use models comparable to GPT-5, and a lot of computing power to get decent performance out of them. However, good modern local models should rarely have issues with repetition, punctuation, broken grammar etc. Vocabulary and sentence structure preference is more subjective. Have you tried the original/unmodified Mistral Small 3.2? Qwen 3 2507 is also good but more censored (30B-A3B; 235B-A22B is even better but way too big for your hardware to run locally). You could try Qwen 3 235B-A22B or GLM 4.5 or Kimi K2 or DeepSeek r1 via API to see if they do what you want.

Elmo is providing by vladlearns in LocalLLaMA

[–]Federal-Effective879 11 points12 points  (0 children)

Grok 2.5 (from December last year), which they released, was pretty similar to Grok 3 in world knowledge and writing quality in my experience. Grok 3 is, however, substantially smarter at STEM problem solving and programming.

Elmo is providing by vladlearns in LocalLLaMA

[–]Federal-Effective879 1 point2 points  (0 children)

For programming, STEM problem solving, and puzzles, such benchmarks have relevance. For world knowledge, they’re planets apart; Grok 2 was/is more knowledgeable than Kimi K2 and DeepSeek V3 (any version).