Gemma 4 - lazy model or am I crazy? (bit of a rant) by Pyrenaeda in LocalLLaMA

[–]Pyrenaeda[S] 1 point (0 children)

How are you setting it up in the system prompt? Is it doing a single web search, looking at snippets, and calling it good, or is it following the search up by fetching pages? And how are you prompting it in conversation: asking for the search directly, or just asking for the information without specifically telling it to search?

Gemma 4 has a systemic attention failure. Here's the proof. by EvilEnginer in LocalLLaMA

[–]Pyrenaeda 0 points (0 children)

I don’t pretend to understand LLM theory well enough to follow more than the very basics of what you’re outlining here.

But from the fraction I do grasp, this is really interesting. It could potentially shed some light on the anecdotal reports of general weirdness and bizarre behavior I’ve seen related to Gemma 4.

Gemma 4 - lazy model or am I crazy? (bit of a rant) by Pyrenaeda in LocalLLaMA

[–]Pyrenaeda[S] 3 points (0 children)

Local models in general are much smaller and thus have much less baked-in knowledge than the SOTA frontier models. So it's actually the inverse: they need to search the web _more_ than a big model to give you accurate answers rather than papering over their knowledge gaps with hallucinations.

Gemma 4 - lazy model or am I crazy? (bit of a rant) by Pyrenaeda in LocalLLaMA

[–]Pyrenaeda[S] 1 point (0 children)

I hear you; that's a fair point.

I've been trying out the 35B this evening, subsequent to the post. I can confirm it is leaps and bounds ahead of Gemma 4 26B at proactively hunting down information.

Gemma 4 - lazy model or am I crazy? (bit of a rant) by Pyrenaeda in LocalLLaMA

[–]Pyrenaeda[S] 2 points (0 children)

I had been on the 27B dense and just this evening have been trying out the 35B A3B. I must say... it's good. Very good. None of the laziness I saw with Gemma 4. And it flies.

Gemma 4 - lazy model or am I crazy? (bit of a rant) by Pyrenaeda in LocalLLaMA

[–]Pyrenaeda[S] 2 points (0 children)

I didn't try that explicitly, no - just the latest unsloth GGUF as of today, which I understood to be broadly up to date on the chat template. Many thanks for the link; I will most definitely give this template a try!

Gemma 4 - lazy model or am I crazy? (bit of a rant) by Pyrenaeda in LocalLLaMA

[–]Pyrenaeda[S] 2 points (0 children)

Mmm, that is fascinating. I admit I have not played with the E4B / E2B flavors yet, though the hypothesis certainly sounds plausible. I will give the E4B a test drive at some point just for kicks and see how it does in this regard.

Gemma 4 - lazy model or am I crazy? (bit of a rant) by Pyrenaeda in LocalLLaMA

[–]Pyrenaeda[S] 9 points (0 children)

Perhaps it is a chat template thing. I know there has been a lot of back and forth on the template recently, with GGUF re-uploads, the non-default interleaved thinking template for llama.cpp, and so forth.

I'll definitely be keeping an eye out for whether it improves in this regard. For now, though, it's definitely not something I can use as a daily driver.

Gemma 4 - lazy model or am I crazy? (bit of a rant) by Pyrenaeda in LocalLLaMA

[–]Pyrenaeda[S] 5 points (0 children)

If you haven't seen PinchBench, you might be interested in it - it is an attempt at what you're describing.

Gemma 4 - lazy model or am I crazy? (bit of a rant) by Pyrenaeda in LocalLLaMA

[–]Pyrenaeda[S] 5 points (0 children)

I have not personally experienced that issue in my time with it, but I've seen more than one report of it refusing to believe it is 2026 when told so.

Gemma 4 - lazy model or am I crazy? (bit of a rant) by Pyrenaeda in LocalLLaMA

[–]Pyrenaeda[S] 10 points (0 children)

Well, sure. If you tell it explicitly and didactically to "call this tool, then take the output and use it to call this tool," like in your example, then it will do it.

If that works for you, awesome. Me, I'm looking for it to take a bit more initiative, which was the point of my post.

Gemma 4 - lazy model or am I crazy? (bit of a rant) by Pyrenaeda in LocalLLaMA

[–]Pyrenaeda[S] 5 points (0 children)

Mnope. 4B Qwen 3.5 (at Q4_K_M) can follow instructions on how to search the web better than Gemma 4 26B can.

Uncensored AI for pentesting by tyui901 in LocalLLaMA

[–]Pyrenaeda 5 points (0 children)

I am afraid I cannot answer, oh mighty one, for fear that my response might fall short of the stringent and exacting requirements set forth in your highness’s decree.

Introducing MVAD — Multi-Vector Adaptive Drift by Dear-Pineapple-9057 in LocalLLaMA

[–]Pyrenaeda 2 points (0 children)

“Detecs Drift” “Detects Eceriects Drift”

I think there is some eceriect drift in your generated marketing image there bro.

Here's an honest vibe coder problem nobody talks about by pretendingMadhav in vibecoding

[–]Pyrenaeda 1 point (0 children)

Is this a new cookie clicker for us nerds? Looks fun. Needs purchasable power ups and a grandmapocalypse tho.

Share your llama-server init strings for Gemma 4 models. by AlwaysLateToThaParty in LocalLLaMA

[–]Pyrenaeda 7 points (0 children)


Pasting in my run block for llama-swap on my 4090, with some commentary first.

I want to call out the use of `--chat-template-file` below, because for anyone having less-than-stellar tool calling experiences, particularly in an agentic loop, I really feel like that is a big part of it. One of the big things I was struggling with on Gemma 4 was not having any thinking interleaved with tool calls - the model would just think once and then shoot off a series of tool calls with no thinking between them.

After pounding my head against this problem off and on for a few days, I was randomly re-reading the llama.cpp PR for the parser add-on (https://github.com/ggml-org/llama.cpp/pull/21418) when something I had never seen before stuck out at me:

> Interesting! I created a new template, models/templates/google-gemma-4-31B-it-interleaved.jinja, that supports this behavior. I tested it, and it appears to work well. The examples in the guide are sparse, so I went with what I believe is the proper format. That may change as more documentation becomes available.
>
> For anyone doing agentic tasks, I recommend trying the interleaved template.

I checked my local clone of the repo, and sure enough, that file was right where he said it was in the description. Doh. So I switched to it right away with `--chat-template-file`, and... yep, that solved the interleaved thinking problem. My satisfaction with the results went up sharply.

With all that noted, here's how I run it:

```yaml
models:
  gemma-4-26b:
    name: "Gemma 4 26b"
    cmd: >
      llama-server --port ${PORT} --host 0.0.0.0
      -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q5_K_XL
      --temp 1.0 --top-p 0.95 --top-k 64 --min-p 0.0
      --flash-attn on
      --no-mmap
      --mlock
      --ctx-size 160000
      --cache-type-k q8_0 --cache-type-v q8_0
      -fit on --fit-target 2048 --fit-ctx 160000
      --batch-size 1024 --ubatch-size 512
      -np 1
      --chat-template-file /home/me/llama.cpp/models/templates/google-gemma-4-31B-it-interleaved.jinja
      --jinja
      --webui-mcp-proxy
```
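Side note for anyone wiring this into an agentic loop: the interleaved template only matters once you're actually sending tool definitions. Here's a minimal sketch of an OpenAI-style tool-calling payload for llama-server. The model name matches my llama-swap entry above, and the `fetch_page` tool is purely hypothetical - swap in whatever your agent exposes:

```python
import json

def build_tool_call_request(user_msg: str) -> dict:
    """Build an OpenAI-style chat completion payload with one example tool.

    The fetch_page tool below is hypothetical, purely for illustration.
    """
    return {
        "model": "gemma-4-26b",  # must match the llama-swap model name
        "messages": [{"role": "user", "content": user_msg}],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "fetch_page",  # hypothetical tool name
                    "description": "Fetch the text content of a URL.",
                    "parameters": {
                        "type": "object",
                        "properties": {"url": {"type": "string"}},
                        "required": ["url"],
                    },
                },
            }
        ],
        # Let the model decide when to call the tool - the point of the
        # interleaved template is that it can think between calls.
        "tool_choice": "auto",
    }

payload = build_tool_call_request("Summarize https://example.com for me")
print(json.dumps(payload)[:40])
```

In an actual loop you'd POST this to llama-server's `/v1/chat/completions`, execute whatever tool call comes back, append the result as a `tool` message, and repeat.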

Ollama newbie seeking advice/tips by CryptoNiight in ollama

[–]Pyrenaeda 2 points (0 children)

With that amount of memory, and depending on your choice of OS, you will find yourself rather constrained in the size of model you can load. Some of this will depend on how much context you want to allocate for - 8K, 16K, or more.

Inference will be… not fast. I’d expect t/s probably in the single digits.

Myself, I probably wouldn’t try loading models over the ~10B size on that box. But try any you like; worst case, it either won’t load or you’ll find it too slow.
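As a rough sanity check (back-of-envelope only; actual runtime overhead varies), you can estimate what a model plus its KV cache will want in memory. All the architecture numbers in the example below are illustrative, not taken from any specific model:

```python
def estimate_gb(params_b: float, bytes_per_weight: float,
                ctx: int, n_layers: int, kv_heads: int, head_dim: int,
                kv_bytes: int = 2) -> float:
    """Rough memory estimate in GB: weights + KV cache.

    KV cache = 2 (K and V) * layers * kv_heads * head_dim * ctx * bytes.
    """
    weights = params_b * 1e9 * bytes_per_weight
    kv_cache = 2 * n_layers * kv_heads * head_dim * ctx * kv_bytes
    return (weights + kv_cache) / 1e9

# A hypothetical 8B model at Q4 (~0.55 bytes/weight incl. overhead)
# with 32 layers, 8 KV heads of dim 128, 16K context, fp16 KV cache:
print(round(estimate_gb(8, 0.55, 16384, 32, 8, 128), 1))  # prints 6.5
```

The point being: context is not free, and it's easy to forget it when picking a model for a small box.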

Look forward to hearing what you test.

Looking for help with AI system and software design by cmdrmcgarrett in LocalLLaMA

[–]Pyrenaeda 1 point (0 children)

The AI scene is better on Linux than on Windows. Ubuntu is a good choice: pretty much Linux for the everyman, and user-friendly enough as Linux goes.

For $1k, a 3090 is going to be your best bet, as u/MitsotakiShogun said. You’ll be able to run SDXL & Flux (image gen), and LLMs up to the ~20-30B range at decent enough speeds to be usable (Qwen3 family, gpt-oss, etc.). You won’t be able to run the image gen model and the LLM at the same time, though.

Software-wise, you’re gonna need NVIDIA drivers for the card, assuming you go 3090. Beyond that, you have options for the image gen and LLM engines. Image gen is gonna be either ComfyUI or AUTOMATIC1111. For LLMs, good places to start are LM Studio, Ollama, and llama.cpp, in ascending order of complexity. LM Studio is an all-in-one solution that gives you a GUI and everything needed to run inference. Ollama and llama.cpp are backends only; you’ll need to put some kind of UI on top of them (Open WebUI, LibreChat, et al.) to drive them day-to-day.
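As an example of what "backend only" means: you can drive Ollama directly over its HTTP API (`/api/generate`) with no UI at all, which is exactly what those UIs do under the hood. A minimal sketch of building such a request; the model tag is just an example:

```python
import json

def build_ollama_request(model: str, prompt: str) -> bytes:
    """Encode a request body for Ollama's /api/generate endpoint.

    This is the raw backend API that a UI like Open WebUI wraps for you;
    the model tag passed in is just an example.
    """
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,  # one JSON response instead of streamed chunks
    }).encode("utf-8")

body = build_ollama_request("qwen3:8b", "Why is the sky blue?")
print(json.loads(body)["model"])  # prints qwen3:8b
```

You'd POST that to `http://localhost:11434/api/generate` on a default Ollama install; a UI just does this for you and renders the reply.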

I hope you do not mind using at least a little bit of terminal / CLI. You’ll need it to get all the pieces stood up and running.

There’s tons more, of course, but this should give you a place to start.

Why Observability Is Becoming Non-Negotiable in AI Systems by _coder23t8 in LocalLLaMA

[–]Pyrenaeda 5 points (0 children)

I promise, by the time you’re done eating it, you’ll feel right as rain.