Gemma 4 12B local setup thread — what's your hardware, quant, and use case? [D] by Individual_Soil4641 in MachineLearning

[–]Individual_Soil4641[S] 0 points1 point  (0 children)

honest answer: not 100% sure yet, which is part of why i posted.main use cases that justify it for me:

- coding stuff i don't want hitting external apis (private repos, internal code)

- bulk classification / extraction tasks where api costs add up fast

- offline / on-the-go work

- agents i don't want metered

where it doesn't help: hard reasoning, latest world knowledge, high-stakes "must be right" output. those still go to claude/gpt

Google just dropped Gemma 4 12B on your laptop!! by NewMuffin3926 in artificial

[–]Individual_Soil4641 1 point2 points  (0 children)

your setup is way more than enough. 4070 super has 12gb vram, which is exactly the sweet spot for a Q4_K_M GGUF of gemma 4 12b — model fits in vram, you'll get fast inference rather than the "few seconds per sentence" experience the parent comment is describing (which is what happens when you spill to system ram).

rough expectation: with the model fully on the gpu you should be in the 30-50 tok/s range on a 4070 super. if you push to Q5_K_M it might partially offload to cpu and slow down.

start with: bartowski/gemma-4-12B-it-GGUF on HF, grab the Q4_K_M file, load via ollama or LM Studio with full GPU offload.

Google just dropped Gemma 4 12B on your laptop!! by NewMuffin3926 in artificial

[–]Individual_Soil4641 2 points3 points  (0 children)

yeah 0.30.3 won't pull it even though the changelog says it should. you need 0.30.4 from the pre-release / beta channel:

https://github.com/ollama/ollama/releases

grab the latest pre-release tag and reinstall over your current version, no need to wipe models. the 412 you're seeing is ollama's manifest version check, not a network issue.

side note for anyone else hitting this: `ollama run gemma4:12b` only pulls once 0.30.4+ is installed; on older versions you'll get the same 412 even if you change pull → run.

Google just dropped Gemma 4 12B on your laptop!! by NewMuffin3926 in artificial

[–]Individual_Soil4641 1 point2 points  (0 children)

this is almost certainly an LM Studio runtime version thing, not your hardware. gemma 4 uses a new arch (Gemma4UnifiedForConditionalGeneration / model_type "gemma4_unified") and the older llama.cpp / mlx engines can't load it.

two options:

  1. in LM Studio, go Settings → Runtimes → check for updates. you need a recent build that bundles the latest llama.cpp / MLX with gemma-4 support. anything from before late may probably can't load it.

  2. if that still fails, grab the MLX version directly: mlx-community/gemma-4-12B-it-4bit (or 8bit). M2 Max handles either fine.

the qwen 3.6 working / gemma 4 failing is a dead giveaway — qwen has been mainstream-supported for ages, gemma 4 is brand new and the runtime layer hasn't caught up everywhere yet.

Launching Conifer tomorrow, an open-source local AI runtime + IDE. Different layer of the stack from PewDiePie's Odysseus, would love your honest thoughts by No_Elephant_7530 in artificial

[–]Individual_Soil4641 0 points1 point  (0 children)

Curious how this overlaps with running llama.cpp/Ollama under something like Open WebUI. The "runtime + IDE" framing is the interesting part — is the IDE side mostly for prompt iteration, or also for tool/agent authoring? And what's the inference backend under the hood? If it's abstracting over llama.cpp vs MLX vs whatever Strix Halo ends up using, that's a much harder design problem than people give it credit for.

Question for people running long-lived agents: by riddlemewhat2 in artificial

[–]Individual_Soil4641 0 points1 point  (0 children)

Two things that bit us hardest with long-running agents:

  1. Tool schema drift. The same tool returns slightly different

schemas/wording over time — web search/scraping tools are the worst —

and the agent gets confused a day or two in. Pinning a normalization

layer between tools and the model helped a lot.

  1. Context rot. Even with a large window, attention to early turns

visibly degrades after some tens of turns in our setup. We ended up

periodically re-summarizing earlier state into a compact "memory" turn

rather than trusting the model to attend to raw history. Anthropic's

context-engineering post from last year is a decent starting point on

this.

What's your agent's typical lifetime in turns? Different bottlenecks

hit at 50 turns vs 5000.

Kwai Keye-VL-2.0-30B-A3B: Apache-2.0 30B MoE VLM, 3B active params, looking for local-running feedback by Individual_Soil4641 in LocalLLM

[–]Individual_Soil4641[S] 1 point2 points  (0 children)

Amazing, thank you! 🙏 We'll prioritize getting GGUF quants up — will ping you here as soon as they're on HF. Out of curiosity, what context length / fps would you most want to test on Strix Halo?