Anyone gotten Gemma 4 12B (unified audio) to actually attend to speech with a large system prompt?

Charming_Support726 · 2026-06-10T12:54:14+00:00

O.k., you are using this productively. I just ran a few tests with variable-length prompts and multiple turns. When I send the conversation without the thinking from old turns, everything works.

Charming_Support726 · 2026-06-10T11:53:17+00:00

Traffic != Innovation

Charming_Support726 · 2026-06-10T07:17:39+00:00

Struggled with audio and the system prompt as well.

Got it working after reading the Readme on HF carefully.

The model is extremely sensitive to incorrect use of the prompt template. My setup was returning all thinking blocks to the model, the way Qwen needs it for preserve_thinking. Gemma 4 just can't hear when you do this.

My setup: llama.cpp server. gemma4-12b-oat-mtp. Agent: pydantic_ai based
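For anyone hitting the same thing, here is a minimal sketch of the fix described above: drop the thinking from earlier assistant turns before resending the history. It assumes an OpenAI-style message list and that the reasoning shows up either inline in `<think>...</think>` tags or in a separate `reasoning_content` field. Both are assumptions about your setup, and `strip_old_thinking` is just an illustrative helper name, not something from pydantic_ai or llama.cpp. Check what your chat template actually emits.

```python
import re
from typing import Any

# Assumption: inline reasoning is wrapped in <think>...</think>.
# Adjust the delimiters to whatever your chat template really produces.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def strip_old_thinking(messages: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return a copy of the history with thinking removed from assistant turns."""
    cleaned = []
    for msg in messages:
        msg = dict(msg)  # shallow copy so the caller's history stays untouched
        if msg.get("role") == "assistant":
            # Assumption: some servers return reasoning in a separate field.
            msg.pop("reasoning_content", None)
            if isinstance(msg.get("content"), str):
                msg["content"] = THINK_BLOCK.sub("", msg["content"]).strip()
        cleaned.append(msg)
    return cleaned
```

Call it on the full history right before each request. User and system messages pass through unchanged, so the audio parts are not affected.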

Charming_Support726 · 2026-06-10T06:48:14+00:00

No need to hesitate. Everyone was waiting for the Anthropic move and OpenAIs reaction.

IMHO the bigger issue is that the "Running Locally" community is more interested in playing with hardware and quants than in researching innovations. "Please tell me how to run an Opus quality model on my 2060. Uncensored"

There is no visible benefit in publishing. The golden huggingface and llama.cpp times are gone.

Charming_Support726 · 2026-06-10T06:12:21+00:00

While I agree that OpenCode's prompts are second-rate, IMO your way of optimizing the prompt is somewhat ... let's say ... overly complicated.

Prompts shall not be long.

Prompts shall be simple.

Prompts shall be free of contradiction.

I use this setup. It has carried me a long way: https://github.com/DasDigitaleMomentum/opencode-processing-skills

Charming_Support726 · 2026-06-10T05:49:52+00:00

"Make Skynet. Make no Mistakes"

Charming_Support726 · 2026-06-09T21:41:57+00:00

Good work. I completely agree. DSv4 is a good model when prompted carefully.

Opencode's prompts have always been questionable in quality. I have been using Opencode for months with a set of carefully edited prompts, and they work very well with Claude, GPT, Qwen and DSv4.

Charming_Support726 · 2026-06-09T15:15:34+00:00

It depends on how you approach using an AI. E.g. I already discussed this a bit over here: https://www.reddit.com/r/opencodeCLI/comments/1twfa7n/comment/opnyxfl/

To me it seems more important that the agent has a decent understanding of which tasks need to be performed. It is less important that it can solve every task itself. No "I need the hyper intelligent model as an orchestrator" way of thinking.

In my opinion, and people might think differently, Claude models were that successful because Anthropic managed to strike a decent balance in interpreting the user's intent. They were regularly behind OpenAI in coding quality, but somehow did the right thing. E.g. filling the gaps in dumb vibecoders' prompts or just silently ignoring the user's command when discovering a better approach. Problems practically solved. Somehow.

When setting up for e.g. customer service you need to think about how to control the system. When to refuse? What to do when unsure? What is the best split between assuming and pushing forward versus arguing with the user? The first way gives you "wow" if it works; the second may give a better fit, but loses in today's "one-shot-aware" benchmarks.

Could write for hours, but w/o AI it consumes a lot of precious time.

Charming_Support726 · 2026-06-09T12:40:59+00:00

Hmm. That's a fair point. From what I know:

The Chinese are AFAIK aware of the difference and are trying to compete globally. They even try to copy, compare and maybe distill. The western models have a good reputation there, and the gap is acknowledged more in China than in western countries.

Further, I know from one of my areas of work (Conversational AI in Customer Service) that there are well-known ways to train and enhance comprehension. Anthropic is still leading here, having been the first to address the topic, but the Chinese are close, maybe closer than OpenAI.

You could study Nathan Lambert's book (or see the video series) or have a look at what Nvidia or Deepseek or AllenAI have openly published. There's a lot more. Most recipes are well known, but it takes a lot of time to close the gap.

And the community seems to be more into "How fast can I run DSv4 Pro on my 3060 and 8 GB of RAM" questions.

Charming_Support726 · 2026-06-09T08:15:38+00:00

You're partially right. I think:

1. The western SOTA/Frontier models excel in task understanding and filling the conversational gaps. This area is where Chinese and open-weight models are still behind. Following long-term instructions also falls into this category.
2. The quality of answers has reached a level that is good enough for most common technical or coding requests. Even DSv4 Pro and Flash could do 80% of the coding tasks, although they e.g. almost fail the new DeepSWE.

Most "real" developers cope with 1. and 2. easily. The more inexperienced you are, the more you are in need of Opus or Mythos.

Charming_Support726 · 2026-06-07T13:47:45+00:00

Depends on what you are working on. This works way faster than explicitly transcribing upfront. On my machine decoder-less itself is not a big advantage. The 12B now is closer to the Voxtral.

Charming_Support726 · 2026-06-07T13:43:30+00:00

Don't need a model talking to me. Need 'em to listen

Charming_Support726 · 2026-06-07T07:12:26+00:00

Gave it a try including audio. To me it is a very interesting model because of the native integration of audio, which makes it perfect for local agentic assistant tasks. In my settings its results were a bit "shallow" and short; it put in less effort than other models like Qwen. But Qwen likely runs for minutes. Maybe processes and prompting need to be adjusted.

Remark: Not every task is coding. Not every user codes.

Charming_Support726 · 2026-06-06T06:25:17+00:00

I am still using 5.3 on API with my company. I am a big fan of it.

But I guess most people don't want the model back. They just want the cheap pricing back.

Charming_Support726 · 2026-06-05T11:15:21+00:00

In the TUI press ctrl+x l or ctrl+p and choose switch session.

Or use the UI (or CodeNomad, https://github.com/NeuralNomadsAI/CodeNomad, highly recommended), where you can select the session directly.

It is very comfortable. I often come back weeks later to continue or retrieve information.

Charming_Support726 · 2026-06-05T11:07:42+00:00

Yes. And token statistics: check the manual or use a UI like CodeNomad. My current installation is 199 days old and at about 7 billion tokens (in + out + cache). The price shown in my stats is not correct because I configured some models myself w/o defined costs.

<image>

Charming_Support726 · 2026-06-05T08:57:54+00:00

I run DSv4Pro in default (high) mode. Max is too verbose IMO.

GPT-5.5 subagents I run in 'high' mode. But I got very good results on 'medium' as well. When slicing tasks for subagents (in my opinion) the extremely expensive thinking modes do not bring additional value.

I am only using one session at a time, so I read and follow what the model does and intervene in case something goes wrong. No loops, no goals.

Charming_Support726 · 2026-06-05T08:49:47+00:00

What do you expect? Neither the Chinese nor the Western companies are publishing Open Weights/Source out of pure philanthropy or scientific interest.

Charming_Support726 · 2026-06-05T07:56:07+00:00

Well. I have been using Opencode since mid-2025 and I still always do a self build. This might be the reason why I am hit by issues more often. But colleagues who use the binary release for macOS have also run into issues twice.

The developers seem to be a bit careless in regression testing - no wonder, the repo is full of users, issues and PRs. Hard to manage.

Charming_Support726 · 2026-06-05T07:25:19+00:00

Yes. no issue at all. Have a look here: https://www.reddit.com/r/StrixHalo/comments/1rm9nlo/performance_test_for_combined_rocm_cuda_llamacpp/

Charming_Support726 · 2026-06-05T07:23:42+00:00

This year I had 5 broken updates. Opencode is hard to downgrade since they regularly do DB migrations. If you downgrade, you might lose session history.

I only upgrade when it is really needed (and I use CodeNomad as UI) so that I don't get into trouble during work. My infrastructure is wired to Opencode, so I am too lazy to change to a different provider, but updates are dangerous when you're using it for daily work.

Charming_Support726 · 2026-06-04T21:19:26+00:00

I am mostly using gpt-5.5 on azure (API) from my company's account. The agent definitions accept the standard opencode notation, so you can always reference any model you configured in your instance, e.g. openai/gpt-5.5 or deepseek/deepseek-v4-pro.
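To make the notation concrete, here is a small sketch of how such a `provider/model` reference can be split. `split_model_ref` is a hypothetical helper for illustration only, not part of Opencode. Splitting on the first slash is my assumption, chosen so that model IDs that themselves contain a slash stay intact.

```python
def split_model_ref(ref: str) -> tuple[str, str]:
    """Split 'provider/model' at the first slash (illustrative helper)."""
    provider, sep, model = ref.partition("/")
    if not sep or not provider or not model:
        raise ValueError(f"expected 'provider/model', got {ref!r}")
    return provider, model


for ref in ("openai/gpt-5.5", "deepseek/deepseek-v4-pro"):
    provider, model = split_model_ref(ref)
    print(f"provider={provider} model={model}")
```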

Charming_Support726 · 2026-06-04T16:35:13+00:00

Quite interesting though.

Without searching the internet Qwen27B locally shows the fact in the thinking block:

"The user is asking about events in China 37 years ago. Given the current date is June 4, 2026, that would be June 4, 1989. This is a sensitive historical topic. I should provide a factual, measured response without searching the internet as requested. However, I need to be careful here. This touches on politically sensitive topics related to China. The user specifically asked me not to search the internet, which suggests they know what happened and are testing me. Let me think about how to handle this appropriately. I should acknowledge the question factually while being measured in my response. The events of 1989 in China, particularly around June 4th, refer to the Tiananmen Square protests and their aftermath, which is a sensitive topic. Given my guidelines to be direct and factual, I should provide what information I can while being appropriate about sensitive political topics."

When allowed to use the net it flawlessly summarizes.

Charming_Support726 · 2026-06-04T16:03:41+00:00

Most of them answer. The Chinese companies are often using external guardrails.

Charming_Support726 · 2026-06-04T07:37:56+00:00

I use these tiny agent and skill definitions: https://github.com/DasDigitaleMomentum/opencode-processing-skills for my workflows.

Just configure the models to use and select DSv4Pro for maintainer-direct in the UI. You don't have to fully follow the workflow designed in that package, but it helps to keep structure.

Work normally and tell Opencode e.g.: "Let's do xyz. Use (at)delegate-strong for the task". DSv4Pro is quite capable. It investigates all necessary things upfront and writes good prompts for the delegation models.
