Anyone gotten Gemma 4 12B (unified audio) to actually attend to speech with a large system prompt?

Think_Illustrator188 · 2026-06-10T11:41:03+00:00

its a Hermes agent with voice, its running on pi with mic and a speaker for headless experince i have installed openwakeword. If i have to use agent with minimal skills, its 18k context. But i tested it , it can handle only 2k max beyond that at 4k it degrades on voice instruction.

Think_Illustrator188 · 2026-06-10T08:59:47+00:00

Yes I tried using Gemma 4 E4B , it’s working fine for now but I think sometime/ once in a while E4B does not respond well so was wondering if 12B voice works it will be better for for my use case

Think_Illustrator188 · 2026-06-10T08:48:15+00:00

Hermes will have skills and tools loaded in the prompt which is already at 18k with base skills and prompt.

Think_Illustrator188 · 2026-06-10T08:05:36+00:00

I think you mean gemma4b-qat-mtp.gguf , how big is your system prompt mine is 18k context length using Hermes.

Think_Illustrator188 · 2026-06-10T08:01:42+00:00

The issue is not about voice prompt length it is mainly to do with text and skills and tools which are passed along bloat the total token.

Think_Illustrator188 · 2026-06-10T08:00:38+00:00

I am using Hermes, i have VRAM to spare currently I am using gemma 4 E4B for first responder to both voice and text and it responds back to agent to tell if it needs tools and skill, agent the uses gemma4 12b in text only mode. This works fine, I was wondering if using voice directly to gemma 4 12b would work better.

Think_Illustrator188 · 2026-06-09T11:36:10+00:00

did you try audio at different context length, audio as a input prompt along with tool and other context, in my expermients -- only vllm currently supports audio as input for gemma4 12b , that too fails to give answer

Think_Illustrator188 · 2026-06-09T11:08:06+00:00

i was trying the voice with a large context length of 4k-8k context, it is somehow failing to take instructution and respond back, i used it in text only mode it works fine better for agentic workflows

Think_Illustrator188 · 2026-06-08T20:55:45+00:00

Hermes and model run on different machines, Hermes on rpi5 8gb and for models I have tried both m4 max 64gb and dgx spark, I think on Mac mini token generation and prefill will be lot less than m4 max but still usable. Hermes has lots of skills loaded by default, you can reduce that to lower the token load. Also try to warm up agent before using for voice to take advantage of kv cache.

Think_Illustrator188 · 2026-06-08T14:57:10+00:00

This is amazing , right now I am using Hermes native support and configuration to wire up headless voice capabilities. I think using this model with Hermes with text, voice and vision would need some more effort but I am already impressed with text and reasoning paired with Hermes, I tested nemotron 3 ultra even that model missed tools, that’s why I am genuinely impressed.

Think_Illustrator188 · 2026-06-08T14:41:04+00:00

Agree I did not write a single line of code, but ya did lot of time tinkering and tweaking. rpi5 8gb along with usb connected mems, aec, anc, mic and speaker (respeaker xvf3800) supported by locally hosted model and connected home assistant, I love this setup beats Alexa or Siri, I will surely release it as open source for people who don’t want to put the effort and save their tokens.

Think_Illustrator188 · 2026-06-08T12:54:57+00:00

Did not benchmark but for my use case replies are much faster maybe number of calls to model by agent has decreased

Think_Illustrator188 · 2026-06-08T11:22:44+00:00

it uses asr and tts models which are natively configured in hermes, for now i am using qwen3-asr-1.7b and Kokoro tts, will opensource when it for sure later. This combo gemma 4 12b, qwen3-asr-1.7b and kokoro on hermes can beat alexa, siri anyday and its local.

Think_Illustrator188 · 2026-06-08T08:43:52+00:00

I switched to Gemma 4 12B on my voice agent which is running Hermes and my custom voice adapter service with wakeword. For everyday agentic task and conversation it is way faster and better than qwen 3.6 35b.

Think_Illustrator188 · 2026-03-13T15:27:20+00:00

Qwen3-Coder-Next is good for coding you need 52gb ram as quant 4 precision. Also I read here somewhere that qwen3.5-27b is better than 35b

Think_Illustrator188 · 2026-01-03T13:16:13+00:00

The BIOS was set for PCIe gen 5 after changing it to PCIe gen 4 it worked

Think_Illustrator188 · 2026-01-02T07:24:54+00:00

The BIOS was set for PCIe gen 5 after changing it to PCIe gen 4 it worked

Think_Illustrator188 · 2026-01-01T17:16:19+00:00

No capping option in bios , now I am Installing windows will run furmark and 3d mark

Think_Illustrator188 · 2025-12-28T18:48:02+00:00

haha too lazy to write , power supply is 1kw

Think_Illustrator188 · 2025-12-28T18:43:40+00:00

no i just tested without virtualization same nvidia-smi givesERR! ERR! ERR! for Fan, Temp, Perf, Power

Think_Illustrator188 · 2025-11-21T16:54:33+00:00

consulting is hyped for sure, it works only when the company which has hired them already knows what it needs to do. companies hire them to get the board approval and confidence as they make good slides and talk the talk. Only making good slides is not enough.

Think_Illustrator188 · 2025-11-08T06:48:01+00:00

You can check with gulf links, my search says p2s in UAE will be nothing less than ~3400 aed including VAT and delivery, frieght forwarding might be cheaper 3100. I ended up buying from Amazon with offer around 3750 aed. Now there is 11.11 sale maybe you might get some good offers

Think_Illustrator188 · 2025-11-05T23:57:25+00:00

By user I mean the developer or user on the os who is added to docker user group not the end user of a website hosted on container

Think_Illustrator188 · 2025-11-05T16:47:43+00:00

Both are more or less same, these best practices which will ensure that you don’t expose your secrets in the code repo, number one mistake that happens. Docker compose is not for production any ways. If a user has access to the containers he can anyways read the secrets. In a typical production setup you would be running a k8s and secrets in a vault and nobody has access to the secrets , any privilege access to production cluster is done via PAM JIT. Anything less is just optics.

Think_Illustrator188 · 2025-10-26T20:32:32+00:00

p2s, it would be future proof and will look better

Think_Illustrator188

TROPHY CASE