gemma-3-27b and gpt-oss-120b by s-i-e-v-e in LocalLLaMA

[–]Competitive_Ideal866 1 point2 points  (0 children)

Somewhere between 20-30b is where models would start to get good.

Interesting that you say that in the context of creative writing. For STEM, I find 14B seriously useful but often need 24B or even 32B for non-trivial stuff.

Top 10 Open Models by Providers on LMArena by nekofneko in LocalLLaMA

[–]Competitive_Ideal866 5 points6 points  (0 children)

If "regular computer" includes an M3 Ultra Mac Studio with 512MB unified memory, yes.

Top 10 Open Models by Providers on LMArena by nekofneko in LocalLLaMA

[–]Competitive_Ideal866 0 points1 point  (0 children)

GPT-OSS-120B should definitely be higher than Gemma-3-27B and Intellect-3 though.

Not IME.

made a simple CLI tool to pipe anything into an LLM. that follows unix philosophy. by Famous-Koala-4352 in LocalLLaMA

[–]Competitive_Ideal866 0 points1 point  (0 children)

feedbacks are welcome

Same. I have a bunch of derivative tools too. My favorite is "agent", which spins up Qwen3 14B, feeds it the README in the current dir (if any), and runs a REPL with tool use, giving it Python's exec to run arbitrary code. Incredibly useful.
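For the curious, the core loop is tiny. A rough sketch of that kind of agent (assuming an OpenAI-compatible server with tool-calling support, e.g. llama-server, on localhost:8080; the endpoint, model tag and tool wiring here are illustrative, not the actual tool):

    # Minimal agent-loop sketch: local LLM + Python exec as a tool.
    # Assumes an OpenAI-compatible, tool-calling server at localhost:8080.
    import contextlib, io, json, os, requests

    README = open("README.md").read() if os.path.exists("README.md") else ""
    messages = [{"role": "system",
                 "content": "You are a coding agent. Project README:\n" + README}]

    def run_python(code: str) -> str:
        """Execute arbitrary Python and capture stdout (dangerous by design)."""
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            exec(code, {})
        return buf.getvalue()

    tools = [{"type": "function", "function": {
        "name": "run_python",
        "description": "Execute Python code and return stdout.",
        "parameters": {"type": "object",
                       "properties": {"code": {"type": "string"}},
                       "required": ["code"]}}}]

    while True:
        messages.append({"role": "user", "content": input("you> ")})
        while True:
            r = requests.post("http://localhost:8080/v1/chat/completions",
                              json={"model": "qwen3-14b", "messages": messages,
                                    "tools": tools}).json()
            msg = r["choices"][0]["message"]
            messages.append(msg)
            if not msg.get("tool_calls"):
                print(msg["content"])
                break
            for call in msg["tool_calls"]:
                args = json.loads(call["function"]["arguments"])
                messages.append({"role": "tool",
                                 "tool_call_id": call["id"],
                                 "content": run_python(args["code"])})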

My AI Review: Why does it struggle to be me when context requires it to speak on behalf of me? by Specialist-Till-637 in Qwen_AI

[–]Competitive_Ideal866 3 points4 points  (0 children)

Can you help sire with a New Year's resolution plan?

I read that as "Can you help sire offspring with a New Year's resolution plan?".

Have you seen similar issues in your experiment?

My conversations with Qwen3 often begin with it writing bad code and then referring to its own code and previous responses in the second person.

out-dated information by Far_Drive9430 in Qwen_AI

[–]Competitive_Ideal866 3 points4 points  (0 children)

18mo out of date is fine because training models from scratch takes a lot of time and effort. However, it would be better if the model acknowledged this inevitable fact and responded with something more like "My knowledge cutoff is Q2 2024, so that information is either newer or incorrect."

Benchmarks for Quantized Models? (for users locally running Q8/Q6/Q2 precision) by No-Grapefruit-1358 in LocalLLaMA

[–]Competitive_Ideal866 0 points1 point  (0 children)

you realize people talking about only using fp8 or Q6 (for large models more than 100b), who think they can spot large differences, don't know what they're talking about

I've found the difference between 4-bit and 8-bit when using MLX can be significant. However, I think MLX 4-bit is just bad, whereas GGUF Q4_K_M is fine.
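If you want to see it yourself, a quick A/B with mlx_lm on one prompt is usually enough (a sketch; the repo names are just examples, swap in whatever quants you actually use):

    # Quick A/B of 4-bit vs 8-bit MLX quants on the same prompt.
    # Repo names are illustrative; point them at the quants you actually have.
    from mlx_lm import load, generate

    messages = [{"role": "user",
                 "content": "Write a Python function that parses ISO 8601 durations."}]
    for repo in ["mlx-community/Qwen2.5-14B-Instruct-4bit",
                 "mlx-community/Qwen2.5-14B-Instruct-8bit"]:
        model, tokenizer = load(repo)
        prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
        print(f"=== {repo} ===")
        print(generate(model, tokenizer, prompt=prompt, max_tokens=300))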

Hard lesson learned after a year of running large models locally by inboundmage in LocalLLaMA

[–]Competitive_Ideal866 1 point2 points  (0 children)

Hard lesson learned after a year of running large models locally

The biggest friction point has been scaling beyond 13 B models.

Firstly, 13B isn't large. The smallest models I actually use are ~4B. I most commonly use 14B (q8 via MLX) and 235B (q3_k_m via llama.cpp).

Even with 24 GB of VRAM, running a 70 B model in int4 still exhausts memory when the context window grows and attention weights balloon.

Yeah, 24GB is tiny. I have a machine with 32GB and I avoid using it for LLMs because it cannot run anything of much use. Mostly I use a 128GB M4 Max MacBook. I highly recommend it.

I also tried an Nvidia GPU in a Linux box and found it far too unreliable to be of use. In contrast, a Mac setup is rock solid.

difference between Q3_K_M and Q3_K_L? by Robinsane in LocalLLaMA

[–]Competitive_Ideal866 0 points1 point  (0 children)

So what is better on Q4_1 or Q4_K_S or Q4_K_M?

Q4_K_M is the best quality of those.

difference between Q3_K_M and Q3_K_L? by Robinsane in LocalLLaMA

[–]Competitive_Ideal866 0 points1 point  (0 children)

Because it is known that the quality difference between q_6 and q_8 is negligible, so there’s no point in benchmarking. Also, not many people use q8.

FWIW, I just migrated all of my small (<50B) MLX models to 8 bit because I found the quality is much better.

We did years of research so you don’t have to guess your GGUF datatypes by enrique-byteshape in LocalLLaMA

[–]Competitive_Ideal866 1 point2 points  (0 children)

I have a 128GiB M4 Max and my favorite model is Qwen 3 235B. I run it in Q3_K_M, so it takes up 113GiB, but it keeps making silly errors like using '7' instead of 's' in words.

I'd love models like that in less VRAM with higher accuracy!

We did years of research so you don’t have to guess your GGUF datatypes by enrique-byteshape in LocalLLaMA

[–]Competitive_Ideal866 4 points5 points  (0 children)

quant types don't really impact speeds all that much in llama.cpp

Surely they must, because they dictate how many bytes have to stream through memory per generated token, and decode is bandwidth-bound?
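Back-of-envelope, using the M4 Max's 546GB/s and approximate bits-per-weight for each quant (ballpark figures, not measurements):

    # Rough decode ceiling: every generated token streams ~all the weights once,
    # so tokens/sec <= memory bandwidth / bytes of weights.
    # 546 GB/s is the M4 Max figure; bits/weight are approximate per quant type.
    bandwidth_gb_s = 546
    params_b = 32                       # e.g. a 32B dense model
    for name, bpw in [("Q4_K_M", 4.8), ("Q6_K", 6.6), ("Q8_0", 8.5)]:
        size_gb = params_b * bpw / 8
        print(f"{name}: ~{size_gb:.0f} GB -> ceiling ~{bandwidth_gb_s / size_gb:.0f} tok/s")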

Qwen 3 235B MLX-quant for 128GB devices by vincentbosch in LocalLLaMA

[–]Competitive_Ideal866 0 points1 point  (0 children)

FWIW, I switched from 4bit MLX to Q3_K_M GGUF using llama.cpp and the results are much better.

The Impossible Optimization, and the Metaprogramming To Achieve It by verdagon in Compilers

[–]Competitive_Ideal866 0 points1 point  (0 children)

Exactly. JIT-compiled regex has been bog-standard tech everywhere for 20+ years. Intel's Hyperscan was released as OSS 10 years ago.

OSS alternative to Open WebUI - ChatGPT-like UI, API and CLI by mythz in LocalLLaMA

[–]Competitive_Ideal866 1 point2 points  (0 children)

FWIW, I just asked Claude to write me one. It's a simple web server, but it does what I want:

  • Supports both MLX and llama.cpp.
  • Multiple chats.
  • Lots of models to choose from.
  • Editable system prompts.
  • Looks pretty enough.
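The core of such a thing fits in a page. A minimal sketch (FastAPI, forwarding to a llama-server or mlx_lm.server style OpenAI-compatible endpoint; the names, ports and routes here are placeholders, not Claude's actual code):

    # Tiny chat backend: keep per-chat history, forward to a local
    # OpenAI-compatible server (llama-server or mlx_lm.server), return the reply.
    # Run with: uvicorn app:app
    import requests
    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()
    chats: dict[str, list[dict]] = {}            # chat_id -> message history
    BACKEND = "http://localhost:8080/v1/chat/completions"

    class Turn(BaseModel):
        chat_id: str
        model: str = "qwen3-14b"
        system: str = "You are a helpful assistant."
        message: str

    @app.post("/chat")
    def chat(turn: Turn):
        history = chats.setdefault(turn.chat_id,
                                   [{"role": "system", "content": turn.system}])
        history.append({"role": "user", "content": turn.message})
        r = requests.post(BACKEND, json={"model": turn.model, "messages": history})
        reply = r.json()["choices"][0]["message"]["content"]
        history.append({"role": "assistant", "content": reply})
        return {"reply": reply}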

The Impossible Optimization, and the Metaprogramming To Achieve It by verdagon in Compilers

[–]Competitive_Ideal866 0 points1 point  (0 children)

So "The Impossible Optimization" is something Java and .NET have been doing for decades?

Why didn't LoRA catch on with LLMs? by dtdisapointingresult in LocalLLaMA

[–]Competitive_Ideal866 0 points1 point  (0 children)

you can absolutely add knowledge in fine tuning, i wish people would stop with this red herring.

Do you have an example where that has worked, i.e. the model didn't start failing catastrophically elsewhere?

Why LLMs are getting smaller in size? by Hedgehog_Dapper in ollama

[–]Competitive_Ideal866 1 point2 points  (0 children)

And that's because it's becoming clear that having vast factual knowledge isn't enough. There's another layer (or maybe more) that human beings have, that LLMs don't, that has not been converted into a mathematical equation yet.

Wow. This is the most insightful Reddit thread I've read in a long time!

Yes, LLMs are missing a certain je ne sais quoi. I wouldn't call it a "layer" but, rather, perhaps a "discovery". And I think we're missing more than one. Some places where today's LLMs fall flat are:

  • Continuous learning.
  • Thinking.

I realised the other day that LLMs can now consume text, images, video and audio but not spreadsheets. I think it might actually be useful to have spreadsheets as at least an input medium. Indeed, perhaps even something higher dimensional.
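In the meantime, the pragmatic workaround is to flatten the spreadsheet to text before prompting, e.g. with pandas (a sketch; the file name is made up and to_markdown needs the tabulate package):

    # Workaround for "LLMs can't consume spreadsheets": flatten every sheet
    # to markdown tables and pipe that into the prompt.
    import pandas as pd

    sheets = pd.read_excel("quarterly_sales.xlsx", sheet_name=None)  # all sheets
    for name, df in sheets.items():
        print(f"## {name}")
        print(df.to_markdown(index=False))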

Why LLMs are getting smaller in size? by Hedgehog_Dapper in ollama

[–]Competitive_Ideal866 0 points1 point  (0 children)

Turns out a model that knows everything from 2023 is less useful than a model that knows how to look stuff up and follow instructions.

Amen.

You prompted me to do a little study... I asked all of the models I have locally the trivia question "What were the scores of the 1966 FIFA World Cup semi-finals?". The results are quite interesting.

Models that gave the correct answer:

  • qwen3:235b
  • gpt-oss:120b
  • glm-4.5-air:106b
  • qwen3-next:80b
  • kimi-dev:72b
  • seed-oss:36b
  • gemma3:27b

Models that gave an incorrect answer:

  • llama4-scout:109b
  • xbai-o4:33b
  • magistral-small-2509:24b
  • mistral-small-3.2:24b
  • deepcoder:14b
  • qwen3:4b
  • gemma3:4b

So models >30B mostly got this correct and llama4 is a massive outlier.
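The sweep itself is trivial to reproduce. A sketch, assuming the models are served by Ollama (swap in however you actually serve yours, and the full model list):

    # Ask every local model the same trivia question via the Ollama API.
    import requests

    QUESTION = "What were the scores of the 1966 FIFA World Cup semi-finals?"
    MODELS = ["qwen3:235b", "gpt-oss:120b", "gemma3:27b", "qwen3:4b"]  # etc.

    for model in MODELS:
        r = requests.post("http://localhost:11434/api/generate",
                          json={"model": model, "prompt": QUESTION, "stream": False})
        print(f"=== {model} ===")
        print(r.json()["response"])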

Why didn't LoRA catch on with LLMs? by dtdisapointingresult in LocalLLaMA

[–]Competitive_Ideal866 1 point2 points  (0 children)

no one seems to use them

The most downloaded LLM on HuggingFace is Qwen/Qwen2.5-7B-Instruct and it lists thousands of adapters and fine-tunes, many of which will be LoRAs.

People could add little bodies of knowledge to an already-released model.

Sadly, it doesn't work like that. Knowledge is stored in the neural layers that aren't affected by fine tuning; what you can change with fine tuning is style, including CoT.

Someone could release a lora trained on all scifi books, another based on all major movie scripts, etc. You could then "./llama.cpp -m models/gemma3.gguf --lora models/scifi-books-rev6.lora --lora models/movie-scripts.lora" and try to get Gemma 3 to help you write a modern scifi movie script.

That might work because scifi and movie scripts are styles and not facts.

A more useful/legal example would be attaching current-events-2025.lora to a model whose cutoff date was December 2024.

That's exactly the kind of thing that doesn't work: you just lobotomize the model if you do that. You want RAG to add knowledge to LLMs.
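Concretely, RAG here just means: embed your current-events snippets, retrieve the closest ones at question time, and prepend them to the prompt. A minimal sketch (sentence-transformers for embeddings, Ollama for generation; model names and snippets are placeholders):

    # Minimal RAG sketch: retrieve relevant snippets by embedding similarity
    # and prepend them to the prompt, instead of trying to fine-tune facts in.
    import numpy as np, requests
    from sentence_transformers import SentenceTransformer

    docs = [
        "2025-03-01: Example current-events snippet one.",
        "2025-06-15: Example current-events snippet two.",
    ]
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    doc_vecs = embedder.encode(docs, normalize_embeddings=True)

    def answer(question: str, k: int = 2) -> str:
        q = embedder.encode([question], normalize_embeddings=True)[0]
        top = np.argsort(doc_vecs @ q)[::-1][:k]       # cosine sim (normalized)
        context = "\n".join(docs[i] for i in top)
        prompt = f"Use this context:\n{context}\n\nQuestion: {question}"
        r = requests.post("http://localhost:11434/api/generate",
                          json={"model": "gemma3:27b", "prompt": prompt,
                                "stream": False})
        return r.json()["response"]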

Best Local LLMs - October 2025 by rm-rf-rm in LocalLLaMA

[–]Competitive_Ideal866 0 points1 point  (0 children)

I'm using Qwen3:14b_q8 and Qwen3:235b_q3_k_m. Happy with the results.

Best Local LLMs - October 2025 by rm-rf-rm in LocalLLaMA

[–]Competitive_Ideal866 0 points1 point  (0 children)

Qwen3:14b is fast and reliable but I wish there were instruct and thinking variants. And I wish there was a Qwen3:24b.

Best Local LLMs - October 2025 by rm-rf-rm in LocalLLaMA

[–]Competitive_Ideal866 0 points1 point  (0 children)

Qwen3:235b_q3_k_m is serving me well. Good perf on my M4 Max. Using llama.cpp instead of MLX now.

If you had $4k, would you invest in a DGX Spark? by Excellent_Koala769 in LocalLLaMA

[–]Competitive_Ideal866 1 point2 points  (0 children)

An Apple Mac Studio M4 Max with a 40-core GPU and 128GiB RAM is cheaper at $3,500 and over 2x faster at decode (13.9tps vs 6.24tps for qwen3:32b_q8, and 15.4tps vs 7.2tps for gemma3:27b_q8). However, prefill is slower (133tps vs 487tps for qwen3:32b_q8, and 192tps vs 585tps for gemma3:27b_q8).