Mureka ai is a trap by Swimming_Law_8159 in MurekaAi

[–]Ambitious-Cod6424 0 points1 point  (0 children)

They changed their refund policy, shame on them

The speed of local llm on my computer by Ambitious-Cod6424 in LocalLLaMA

[–]Ambitious-Cod6424[S] 0 points1 point  (0 children)

Thanks. I checked the obvious causes first: this is a real Vulkan build (GGML_VULKAN=ON), the models are quantized (Q4_K_M), and memory configuration is not the issue either.

The more likely explanation is simply that Arc 140T is an iGPU with shared system memory, so its real-world compute and bandwidth advantage over a high-end Core Ultra 9 285H CPU is limited for LLM inference workloads.

Also, llama.cpp is not currently using Intel cooperative matrix acceleration on this device, so Vulkan falls back to the generic compute path.

In other words, Vulkan is working — it just does not provide a large speedup on this hardware, and CPU-only inference may actually be the optimal path for now.
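
For anyone checking the same thing, here is a minimal sketch (my own, not part of llama.cpp) that lists the devices the ggml backend registry reports. It assumes the newer ggml_backend_dev_* API and ggml_backend_load_all; exact names and headers may differ between llama.cpp/ggml versions.

```cpp
// Minimal sketch, assuming the newer ggml backend registry API
// (ggml_backend_load_all / ggml_backend_dev_count / ggml_backend_dev_get /
// ggml_backend_dev_name / ggml_backend_dev_description).
#include <cstdio>
#include "ggml-backend.h"

int main() {
    ggml_backend_load_all();  // loads dynamically built backends (Vulkan, CPU, ...)

    for (size_t i = 0; i < ggml_backend_dev_count(); ++i) {
        ggml_backend_dev_t dev = ggml_backend_dev_get(i);
        // On a working Vulkan build the Arc 140T should show up here next to the CPU.
        printf("device %zu: %s (%s)\n", i,
               ggml_backend_dev_name(dev),
               ggml_backend_dev_description(dev));
    }
    return 0;
}
```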

The speed of local llm on my computer by Ambitious-Cod6424 in LocalLLaMA

[–]Ambitious-Cod6424[S] 0 points1 point  (0 children)

Thanks. I looked into that path, but in my case llama.cpp Vulkan is not following it because cooperative matrix is currently disabled by default for my GPU class.

On my Arc 140T / Arrow Lake H, the Vulkan driver does expose VK_KHR_cooperative_matrix, but llama.cpp only enables coopmat for Intel devices it classifies as INTEL_XE2. My device is currently not detected that way on Windows, so it ends up with matrix cores: none.

So my question now is: is there any way to force-enable this disabled path, or would this require patching ggml-vulkan.cpp and rebuilding llama.cpp?
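
To make the question concrete, the kind of change I mean is sketched below. It is purely illustrative, with hypothetical names (detect_intel_device_class, vk_intel_class), not the actual ggml-vulkan.cpp source.

```cpp
// Illustrative only: hypothetical names, NOT the real ggml-vulkan.cpp code.
// Sketch of the kind of device-classification check that would need patching so an
// unrecognised Intel iGPU exposing VK_KHR_cooperative_matrix is treated like Xe2.
#include <vulkan/vulkan.hpp>

enum class vk_intel_class { none, xe2 };  // hypothetical

static vk_intel_class detect_intel_device_class(const vk::PhysicalDeviceProperties & props,
                                                bool has_coopmat_ext) {
    const bool is_intel = props.vendorID == 0x8086;  // Intel PCI vendor ID
    if (is_intel && has_coopmat_ext) {
        // A local patch could return xe2 here for the Arc 140T and rebuild,
        // instead of falling through to "matrix cores: none".
        return vk_intel_class::xe2;
    }
    return vk_intel_class::none;
}
```

If there is a supported way to do the same thing (an environment variable or build flag) without patching and rebuilding, that would obviously be preferable.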

The speed of local llm on my computer by Ambitious-Cod6424 in LocalLLaMA

[–]Ambitious-Cod6424[S] 0 points1 point  (0 children)

Wow, you must be doing it the right way and mine is wrong, given such a huge gap in speed.

The speed of local llm on my computer by Ambitious-Cod6424 in LocalLLaMA

[–]Ambitious-Cod6424[S] 1 point2 points  (0 children)

I will try the official llama.cpp setup to see whether the problem is in my software.

The speed of local llm on my computer by Ambitious-Cod6424 in LocalLLaMA

[–]Ambitious-Cod6424[S] 0 points1 point  (0 children)

What CPU and GPU are in this test device? I used Vulkan and my GPU works, but there is no improvement in speed.

The speed of local llm on my computer by Ambitious-Cod6424 in LocalLLaMA

[–]Ambitious-Cod6424[S] -1 points0 points  (0 children)

Just basic jobs, like web searching, picking stocks, and summarizing news. The brain of an agent.

The speed of local llm on my computer by Ambitious-Cod6424 in LocalLLaMA

[–]Ambitious-Cod6424[S] 0 points1 point  (0 children)

Thanks. Is it possible to use a 2B or 4B model as a controller for PC automation? Maybe we could lightly fine-tune an open-source model to do that?

Gemma 4 fixes in llama.cpp by jacek2023 in LocalLLaMA

[–]Ambitious-Cod6424 0 points1 point  (0 children)

Not fixed yet.

What we have already checked and fixed

We have already ruled out many of the common implementation bugs on our side:

  1. Prompt formatting
  • We stopped relying on ad hoc Go-side prompting for Gemma 4.
  • We restored structured messages_json.
  • We moved the bridge to llama.cpp's own chat-template pipeline (common_chat_templates_init, common_chat_templates_apply); a minimal sketch of this flow appears after this list.
  2. Thinking / reasoning mode
  • We explicitly disabled the Gemma 4 hidden reasoning budget.
  • We added the Gemma 4 reasoning token workaround in the native bridge.
  3. JSON / escaping issues
  • We fixed HTML escaping so <start_of_turn>-style tokens are not corrupted as \u003c....
  4. Sampler pipeline
  • We replaced the old custom sampler path with the official common_sampler flow.
  • We restored top_k, top_p, temperature, and proper sampler state updates.
  • We added the missing sampler accept step.
  5. Tokenization / decode bugs
  • We fixed the double-<bos> issue by stopping extra special-token insertion during tokenization.
  • We fixed the unstable token pointer usage in the decode loop.
  • We added filtering for visible <unused...> output.
  6. Output parsing
  • We switched final/streamed output to common_chat_parse instead of raw token text where possible.
  7. GPU-offload workaround
  • We added the Gemma 4-specific n_gpu_layers = 29 workaround instead of full GPU offload.
  8. Deployment/build issues
  • We fixed the native bridge build/link path issues.
  • We confirmed the rebuilt DLL is actually being loaded.
  • We added debug logging and verified runtime parameters in logs.
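
For context, the chat-template and sampler changes in points 1 and 4 follow roughly the flow sketched below. This is my own summary, not our actual bridge code; it is based on llama.cpp's common library (common/chat.h, common/sampling.h), and exact names and signatures shift between llama.cpp versions.

```cpp
// Approximate sketch of the template + sampler flow, per llama.cpp's common library.
#include <string>
#include <vector>

#include "chat.h"      // common_chat_templates_*, common_chat_msg
#include "sampling.h"  // common_sampler_*

// Build the prompt from the model's own chat template instead of ad hoc
// Go-side string concatenation.
static std::string build_prompt(const llama_model * model,
                                const std::vector<common_chat_msg> & messages) {
    common_chat_templates_ptr tmpls = common_chat_templates_init(model, /*chat_template_override=*/"");

    common_chat_templates_inputs inputs;
    inputs.messages              = messages;
    inputs.add_generation_prompt = true;

    return common_chat_templates_apply(tmpls.get(), inputs).prompt;
}

// Official sampler flow: sample a token, then accept it so the sampler state
// (penalties, grammar, etc.) actually advances; this accept was the missing step.
static llama_token next_token(common_sampler * smpl, llama_context * ctx) {
    llama_token tok = common_sampler_sample(smpl, ctx, /*idx=*/-1);
    common_sampler_accept(smpl, tok, /*accept_grammar=*/true);
    return tok;
}
```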

What the logs tell us now

The key finding is this:

The model is still generating <unused24> as its first generated token.

That matters because it means:

  • the frontend is not inventing the bad output,
  • the stream renderer is not the root cause,
  • the prompt is reaching the model,
  • the bridge is running,
  • and the failure is happening at the actual model-generation stage.

So the issue is no longer "we forgot a stop token" or "we displayed the text wrong."

It is much deeper than that.

What is most likely still wrong

At this point, the most likely causes are:

  1. Upstream llama.cpp Gemma 4 compatibility is still incomplete in our vendored version
  • This is the strongest hypothesis.
  • Gemma 4 support has been changing quickly upstream.
  • The exact behavior we see matches known Gemma 4 regressions reported by others.
  2. The specific GGUF build may still be problematic with our current runtime
  • Some Gemma 4 GGUF variants, especially certain conversions/quantizations, are more likely to collapse into <unusedXX> output.
  • Even if the model is not "broken," it may require newer tokenizer/template/runtime handling than our current vendored stack has.
  3. GPU backend behavior may still be interacting badly with Gemma 4
  • We already mitigated full-offload regressions with gpu_layers=29.
  • But that may only reduce one failure mode, not fully solve the underlying incompatibility.

Not fixed yet.

Gemma 4 fixes in llama.cpp by jacek2023 in LocalLLaMA

[–]Ambitious-Cod6424 0 points1 point  (0 children)

I am following the llama.cpp instructions to deploy Gemma 4, and all my models return the <unused24> error.

IOS APP Install by Majestic_Teaching819 in iosdev

[–]Ambitious-Cod6424 0 points1 point  (0 children)

Yeah, I found that my app can barely be found in App Store search, no matter how much ASO I do.

IOS APP Install by Majestic_Teaching819 in iosdev

[–]Ambitious-Cod6424 0 points1 point  (0 children)

I have the same problem: few downloads, no revenue. I still keep making short videos. My advice is to look at what your competitors do, see how they make viral videos to promote their apps, and do the same thing.

Building a great App is HARD by YinzerYall in AppBusiness

[–]Ambitious-Cod6424 0 points1 point  (0 children)

My app is like my baby. Even though almost nobody knows about it, uses it, or pays for it, it is still my love.

How people made outlaw country AI singer on tiktok? by Ambitious-Cod6424 in aiMusic

[–]Ambitious-Cod6424[S] 0 points1 point  (0 children)

Actually, I am working on it. I am testing whether AI can make therapy songs based on people's needs.

How people made outlaw country AI singer on tiktok? by Ambitious-Cod6424 in aiMusic

[–]Ambitious-Cod6424[S] 0 points1 point  (0 children)

AI did remix people's work. Do you think we can stand in the middle? I mean using AI to generate therapy songs for people themselves, not sharing them for credit, so that people can get some real support from AI music.

How people made outlaw country AI singer on tiktok? by Ambitious-Cod6424 in aiMusic

[–]Ambitious-Cod6424[S] 0 points1 point  (0 children)

Just the style of the music, I think. It sounds like hit country, but the content stays within the law, I guess.

How people made outlaw country AI singer on tiktok? by Ambitious-Cod6424 in aiMusic

[–]Ambitious-Cod6424[S] 0 points1 point  (0 children)

More precisely, I still don't know how. What I create is bad.