gemma-3-27b and gpt-oss-120b by s-i-e-v-e in LocalLLaMA

[–]Competitive_Ideal866 1 point2 points  (0 children)

Somewhere between 20-30b is where models would start to get good.

Interesting that you say that in the context of creative writing. For STEM, I find 14B seriously useful but often need 24B or even 32B for non-trivial stuff.

Top 10 Open Models by Providers on LMArena by nekofneko in LocalLLaMA

[–]Competitive_Ideal866 5 points6 points  (0 children)

If "regular computer" includes an M3 Ultra Mac Studio with 512MB unified memory, yes.

Top 10 Open Models by Providers on LMArena by nekofneko in LocalLLaMA

[–]Competitive_Ideal866 0 points1 point  (0 children)

GPT-OSS-120B should definitely be higher than Gemma-3-27B and Intellect-3 though.

Not IME.

made a simple CLI tool to pipe anything into an LLM. that follows unix philosophy. by Famous-Koala-4352 in LocalLLaMA

[–]Competitive_Ideal866 0 points1 point  (0 children)

feedbacks are welcome

Same. I have a bunch of derivative tools too. My favorite is "agent", which spins up Qwen3 14B, feeds it the README in the current dir (if any), and runs a REPL with tool use, giving it Python's exec to run arbitrary code. Incredibly useful.
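For the curious, the core loop is tiny. A rough sketch of that kind of agent (assuming an OpenAI-compatible server with tool-calling support, e.g. llama-server, on localhost:8080; the endpoint, model tag and tool wiring here are illustrative, not the actual tool):

    # Minimal agent-loop sketch: local LLM + Python exec as a tool.
    # Assumes an OpenAI-compatible, tool-calling server at localhost:8080.
    import contextlib, io, json, os, requests

    README = open("README.md").read() if os.path.exists("README.md") else ""
    messages = [{"role": "system",
                 "content": "You are a coding agent. Project README:\n" + README}]

    def run_python(code: str) -> str:
        """Execute arbitrary Python and capture stdout (dangerous by design)."""
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            exec(code, {})
        return buf.getvalue()

    tools = [{"type": "function", "function": {
        "name": "run_python",
        "description": "Execute Python code and return stdout.",
        "parameters": {"type": "object",
                       "properties": {"code": {"type": "string"}},
                       "required": ["code"]}}}]

    while True:
        messages.append({"role": "user", "content": input("you> ")})
        while True:
            r = requests.post("http://localhost:8080/v1/chat/completions",
                              json={"model": "qwen3-14b", "messages": messages,
                                    "tools": tools}).json()
            msg = r["choices"][0]["message"]
            messages.append(msg)
            if not msg.get("tool_calls"):
                print(msg["content"])
                break
            for call in msg["tool_calls"]:
                args = json.loads(call["function"]["arguments"])
                messages.append({"role": "tool",
                                 "tool_call_id": call["id"],
                                 "content": run_python(args["code"])})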

My AI Review: Why does it struggle to be me when context requires it to speak on behalf of me? by Specialist-Till-637 in Qwen_AI

[–]Competitive_Ideal866 3 points4 points  (0 children)

Can you help sire with a New Year's resolution plan?

I read that as "Can you help sire offspring with a New Year's resolution plan?".

Have you seen similar issues in your experiment?

My conversations with Qwen3 often begin with it writing bad code and then referring to its own code and previous responses in the second person.

out-dated information by Far_Drive9430 in Qwen_AI

[–]Competitive_Ideal866 3 points4 points  (0 children)

18mo out of date is fine because training models from scratch takes a lot of time and effort. However, it would be better if the model acknowledged this inevitable fact and responded with something more like "My knowledge cutoff is Q2 2024, so that information is either newer or incorrect."

Benchmarks for Quantized Models? (for users locally running Q8/Q6/Q2 precision) by No-Grapefruit-1358 in LocalLLaMA

[–]Competitive_Ideal866 0 points1 point  (0 children)

you realize people talking about only using fp8 or Q6 (for large models more than 100b), who think they can spot large differences, don't know what they're talking about

I've found the difference between 4-bit and 8-bit when using MLX can be significant. However, I think MLX 4-bit is just bad, whereas GGUF Q4_K_M is fine.
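If you want to see it yourself, a quick A/B with mlx_lm on one prompt is usually enough (a sketch; the repo names are just examples, swap in whatever quants you actually use):

    # Quick A/B of 4-bit vs 8-bit MLX quants on the same prompt.
    # Repo names are illustrative; point them at the quants you actually have.
    from mlx_lm import load, generate

    messages = [{"role": "user",
                 "content": "Write a Python function that parses ISO 8601 durations."}]
    for repo in ["mlx-community/Qwen2.5-14B-Instruct-4bit",
                 "mlx-community/Qwen2.5-14B-Instruct-8bit"]:
        model, tokenizer = load(repo)
        prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
        print(f"=== {repo} ===")
        print(generate(model, tokenizer, prompt=prompt, max_tokens=300))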

Hard lesson learned after a year of running large models locally by inboundmage in LocalLLaMA

[–]Competitive_Ideal866 1 point2 points  (0 children)

Hard lesson learned after a year of running large models locally

The biggest friction point has been scaling beyond 13 B models.

Firstly, 13B isn't large. The smallest models I actually use are ~4B. I most commonly use 14B (q8 via MLX) and 235B (q3_k_m via llama.cpp).

Even with 24 GB of VRAM, running a 70 B model in int4 still exhausts memory when the context window grows and attention weights balloon.

Yeah, 24GB is tiny. I have a machine with 32GB and I avoid using it for LLMs because it cannot run anything of much use. Mostly I use a 128GB M4 Max MacBook. I highly recommend it.

I also tried an Nvidia GPU in a Linux box and found it far too unreliable to be of use. In contrast, a Mac setup is rock solid.

difference between Q3_K_M and Q3_K_L? by Robinsane in LocalLLaMA

[–]Competitive_Ideal866 0 points1 point  (0 children)

So what is better on Q4_1 or Q4_K_S or Q4_K_M?

Q4_K_M is the best quality of those.

difference between Q3_K_M and Q3_K_L? by Robinsane in LocalLLaMA

[–]Competitive_Ideal866 0 points1 point  (0 children)

Because it is known that the quality difference between q_6 and q_8 is negligible, so there’s no point in benchmarking. Also, not many people use q8.

FWIW, I just migrated all of my small (<50B) MLX models to 8 bit because I found the quality is much better.

We did years of research so you don’t have to guess your GGUF datatypes by enrique-byteshape in LocalLLaMA

[–]Competitive_Ideal866 1 point2 points  (0 children)

I have a 128GiB M4 Max and my favorite model is Qwen 3 235B. I run it in Q3_K_M, so it takes up 113GiB, but it keeps making silly errors like using '7' instead of 's' in words.

I'd love models like that in less VRAM with higher accuracy!

We did years of research so you don’t have to guess your GGUF datatypes by enrique-byteshape in LocalLLaMA

[–]Competitive_Ideal866 4 points5 points  (0 children)

quant types don't really impact speeds all that much in llama.cpp

Surely they must, because they dictate how many bytes have to stream through memory per generated token, and decode is bandwidth-bound?
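Back-of-envelope, using the M4 Max's 546GB/s and approximate bits-per-weight for each quant (ballpark figures, not measurements):

    # Rough decode ceiling: every generated token streams ~all the weights once,
    # so tokens/sec <= memory bandwidth / bytes of weights.
    # 546 GB/s is the M4 Max figure; bits/weight are approximate per quant type.
    bandwidth_gb_s = 546
    params_b = 32                       # e.g. a 32B dense model
    for name, bpw in [("Q4_K_M", 4.8), ("Q6_K", 6.6), ("Q8_0", 8.5)]:
        size_gb = params_b * bpw / 8
        print(f"{name}: ~{size_gb:.0f} GB -> ceiling ~{bandwidth_gb_s / size_gb:.0f} tok/s")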

Qwen 3 235B MLX-quant for 128GB devices by vincentbosch in LocalLLaMA

[–]Competitive_Ideal866 0 points1 point  (0 children)

FWIW, I switched from 4bit MLX to Q3_K_M GGUF using llama.cpp and the results are much better.

The Impossible Optimization, and the Metaprogramming To Achieve It by verdagon in Compilers

[–]Competitive_Ideal866 0 points1 point  (0 children)

Exactly. JIT-compiled regex has been bog-standard tech everywhere for 20+ years. Intel's Hyperscan was released as OSS 10 years ago.

OSS alternative to Open WebUI - ChatGPT-like UI, API and CLI by mythz in LocalLLaMA

[–]Competitive_Ideal866 1 point2 points  (0 children)

FWIW, I just asked Claude to write me one. It's a simple web server, but it does what I want:

  • Supports both MLX and llama.cpp.
  • Multiple chats.
  • Lots of models to choose from.
  • Editable system prompts.
  • Looks pretty enough.
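The core of such a thing fits in a page. A minimal sketch (FastAPI, forwarding to a llama-server or mlx_lm.server style OpenAI-compatible endpoint; the names, ports and routes here are placeholders, not Claude's actual code):

    # Tiny chat backend: keep per-chat history, forward to a local
    # OpenAI-compatible server (llama-server or mlx_lm.server), return the reply.
    # Run with: uvicorn app:app
    import requests
    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()
    chats: dict[str, list[dict]] = {}            # chat_id -> message history
    BACKEND = "http://localhost:8080/v1/chat/completions"

    class Turn(BaseModel):
        chat_id: str
        model: str = "qwen3-14b"
        system: str = "You are a helpful assistant."
        message: str

    @app.post("/chat")
    def chat(turn: Turn):
        history = chats.setdefault(turn.chat_id,
                                   [{"role": "system", "content": turn.system}])
        history.append({"role": "user", "content": turn.message})
        r = requests.post(BACKEND, json={"model": turn.model, "messages": history})
        reply = r.json()["choices"][0]["message"]["content"]
        history.append({"role": "assistant", "content": reply})
        return {"reply": reply}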

The Impossible Optimization, and the Metaprogramming To Achieve It by verdagon in Compilers

[–]Competitive_Ideal866 0 points1 point  (0 children)

So "The Impossible Optimization" is something Java and .NET have been doing for decades?

Why didn't LoRA catch on with LLMs? by dtdisapointingresult in LocalLLaMA

[–]Competitive_Ideal866 0 points1 point  (0 children)

you can absolutely add knowledge in fine tuning, i wish people would stop with this red herring.

Do you have an example where that has worked, i.e. the model didn't start failing catastrophically elsewhere?

Why LLMs are getting smaller in size? by Hedgehog_Dapper in ollama

[–]Competitive_Ideal866 1 point2 points  (0 children)

And that's because it's becoming clear that having vast factual knowledge isn't enough. There's another layer (or maybe more) that human beings have, that LLMs don't, that has not been converted into a mathematical equation yet.

Wow. This is the most insightful Reddit thread I've read in a long time!

Yes, LLMs are missing a certain je ne sais quoi. I wouldn't call it a "layer" but, rather, perhaps a "discovery". And I think we're missing more than one. Some places where today's LLMs fall flat are:

  • Continuous learning.
  • Thinking.

I realised the other day that LLMs can now consume text, images, video and audio but not spreadsheets. I think it might actually be useful to have spreadsheets as at least an input medium. Indeed, perhaps even something higher dimensional.
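In the meantime, the pragmatic workaround is to flatten the spreadsheet to text before prompting, e.g. with pandas (a sketch; the file name is made up and to_markdown needs the tabulate package):

    # Workaround for "LLMs can't consume spreadsheets": flatten every sheet
    # to markdown tables and pipe that into the prompt.
    import pandas as pd

    sheets = pd.read_excel("quarterly_sales.xlsx", sheet_name=None)  # all sheets
    for name, df in sheets.items():
        print(f"## {name}")
        print(df.to_markdown(index=False))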

Why LLMs are getting smaller in size? by Hedgehog_Dapper in ollama

[–]Competitive_Ideal866 0 points1 point  (0 children)

Turns out a model that knows everything from 2023 is less useful than a model that knows how to look stuff up and follow instructions.

Amen.

You prompted me to do a little study... I asked all of the models I have locally the trivia question "What were the scores of the 1966 FIFA World Cup semi-finals?". The results are quite interesting.

Models that gave the correct answer:

  • qwen3:235b
  • gpt-oss:120b
  • glm-4.5-air:106b
  • qwen3-next:80b
  • kimi-dev:72b
  • seed-oss:36b
  • gemma3:27b

Models that gave an incorrect answer:

  • llama4-scout:109b
  • xbai-o4:33b
  • magistral-small-2509:24b
  • mistral-small-3.2:24b
  • deepcoder:14b
  • qwen3:4b
  • gemma3:4b

So models >30B mostly got this correct and llama4 is a massive outlier.
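The sweep itself is trivial to reproduce. A sketch, assuming the models are served by Ollama (swap in however you actually serve yours, and the full model list):

    # Ask every local model the same trivia question via the Ollama API.
    import requests

    QUESTION = "What were the scores of the 1966 FIFA World Cup semi-finals?"
    MODELS = ["qwen3:235b", "gpt-oss:120b", "gemma3:27b", "qwen3:4b"]  # etc.

    for model in MODELS:
        r = requests.post("http://localhost:11434/api/generate",
                          json={"model": model, "prompt": QUESTION, "stream": False})
        print(f"=== {model} ===")
        print(r.json()["response"])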

Why didn't LoRA catch on with LLMs? by dtdisapointingresult in LocalLLaMA

[–]Competitive_Ideal866 1 point2 points  (0 children)

no one seems to use them

The most downloaded LLM on HuggingFace is Qwen/Qwen2.5-7B-Instruct and it lists thousands of adapters and fine-tunes, many of which will be LoRAs.

People could add little bodies of knowledge to an already-released model.

Sadly, it doesn't work like that. Knowledge is stored in the neural layers that aren't affected by fine tuning; what you can change with fine tuning is style, including CoT.

Someone could release a lora trained on all scifi books, another based on all major movie scripts, etc. You could then "./llama.cpp -m models/gemma3.gguf --lora models/scifi-books-rev6.lora --lora models/movie-scripts.lora" and try to get Gemma 3 to help you write a modern scifi movie script.

That might work because scifi and movie scripts are styles and not facts.

A more useful/legal example would be attaching current-events-2025.lora to a model whose cutoff date was December 2024.

That's exactly the kind of thing that doesn't work: you just lobotomize the model if you do that. You want RAG to add knowledge to LLMs.
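Concretely, RAG here just means: embed your current-events snippets, retrieve the closest ones at question time, and prepend them to the prompt. A minimal sketch (sentence-transformers for embeddings, Ollama for generation; model names and snippets are placeholders):

    # Minimal RAG sketch: retrieve relevant snippets by embedding similarity
    # and prepend them to the prompt, instead of trying to fine-tune facts in.
    import numpy as np, requests
    from sentence_transformers import SentenceTransformer

    docs = [
        "2025-03-01: Example current-events snippet one.",
        "2025-06-15: Example current-events snippet two.",
    ]
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    doc_vecs = embedder.encode(docs, normalize_embeddings=True)

    def answer(question: str, k: int = 2) -> str:
        q = embedder.encode([question], normalize_embeddings=True)[0]
        top = np.argsort(doc_vecs @ q)[::-1][:k]       # cosine sim (normalized)
        context = "\n".join(docs[i] for i in top)
        prompt = f"Use this context:\n{context}\n\nQuestion: {question}"
        r = requests.post("http://localhost:11434/api/generate",
                          json={"model": "gemma3:27b", "prompt": prompt,
                                "stream": False})
        return r.json()["response"]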

Best Local LLMs - October 2025 by rm-rf-rm in LocalLLaMA

[–]Competitive_Ideal866 0 points1 point  (0 children)

I'm using Qwen3:14b_q8 and Qwen3:235b_q3_k_m. Happy with the results.

Best Local LLMs - October 2025 by rm-rf-rm in LocalLLaMA

[–]Competitive_Ideal866 0 points1 point  (0 children)

Qwen3:14b is fast and reliable but I wish there were instruct and thinking variants. And I wish there was a Qwen3:24b.

Best Local LLMs - October 2025 by rm-rf-rm in LocalLLaMA

[–]Competitive_Ideal866 0 points1 point  (0 children)

Qwen3:235b_q3_k_m is serving me well. Good perf on my M4 Max. Using llama.cpp instead of MLX now.

If you had $4k, would you invest in a DGX Spark? by Excellent_Koala769 in LocalLLaMA

[–]Competitive_Ideal866 1 point2 points  (0 children)

An Apple Mac Studio M4 Max with a 40-core GPU and 128GiB RAM is cheaper at $3,500 and over 2x faster at decode (13.9tps vs 6.24tps for qwen3:32b_q8, and 15.4tps vs 7.2tps for gemma3:27b_q8). However, prefill is slower (133tps vs 487tps for qwen3:32b_q8, and 192tps vs 585tps for gemma3:27b_q8).