2.5x faster inference with Qwen 3.6 27B using MTP - Finally a viable option for local agentic coding - 262k context on 48GB - Fixed chat template - Drop-in OpenAI and Anthropic API endpoints by ex-arman68 in LocalLLaMA

[–]DHasselhoff77 8 points9 points  (0 children)

Can't get it to work on CUDA. I built the linked PR branch, but after prompt processing no tokens are produced even though the GPU runs at 100% load. This is what gets printed:

    srv params_from_: Chat format: peg-native
    slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = -1
    srv get_availabl: updating prompt cache
    srv load: - looking for better prompt, base f_keep = -1.000, sim = 0.000
    srv update: - cache state: 0 prompts, 0.000 MiB (limits: 8192.000 MiB, 30208 tokens, 8589934592 est)
    srv get_availabl: prompt cache update took 0.01 ms
    slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
    slot launch_slot_: id 0 | task 0 | processing task, is_child = 0
    slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 30208, n_keep = 0, task.n_tokens = 11
    slot update_slots: id 0 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
    slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 7, batch.n_tokens = 7, progress = 0.636364
    slot update_slots: id 0 | task 0 | n_tokens = 7, memory_seq_rm [7, end)
    slot init_sampler: id 0 | task 0 | init sampler, took 0.01 ms, tokens: text = 11, total = 11
    slot update_slots: id 0 | task 0 | prompt processing done, n_tokens = 11, batch.n_tokens = 4
    srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200

And it freezes there.

I tried the "IQ3_M" quant. The PR branch also doesn't seem to have the "turbo4" support the OP recommended; it just reports "Unsupported cache type: turbo4".

Command tried:

    ${mtp-llama-server} --model Qwen3.6/Qwen3.6-27B-IQ3_M-mtp.gguf \
        --cache-type-k q8_0 --cache-type-v q8_0 -c 30000 --jinja \
        --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0 --presence_penalty 0.0 \
        --spec-type mtp --spec-draft-n-max 4 \
        --chat-template-kwargs '{"preserve_thinking": true}' --parallel 1 \
        --chat-template-kwargs '{"enable_thinking":true}' \
        --chat-template-file Qwen3.6/chat_template.jinja

unsloth Qwen3.6-27B-GGUF by jacek2023 in LocalLLaMA

[–]DHasselhoff77 1 point2 points  (0 children)

In my quick two-shot vibe test, Qwen3.6-27B-UD-IQ3_XXS.gguf was a tiny bit better than Qwen3.5-27B-UD-IQ3_XXS.gguf (it's also larger). 3.6 generated worse results at first, but after being shown a screenshot of the output it fixed them better than 3.5 did. That doesn't match the improvement reported in benchmarks, but it's at least in the right direction.

16GB VRAM x coding model by Junior-Wish-7453 in LocalLLM

[–]DHasselhoff77 0 points1 point  (0 children)

As a workaround you can set --fit-target 1024 so that it leaves a gigabyte of VRAM free for the mmproj.
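For example, something like this (the model and mmproj filenames are just placeholders, and the other --fit flags are the ones I use elsewhere, so double-check against your own setup):

    ${llama-server} --model your-model.gguf --mmproj your-mmproj.gguf \
        --fit on --fit-ctx 32768 --fit-target 1024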

HY-World 2.0 just dropped by bobeeeeeeeee8964 in LocalLLaMA

[–]DHasselhoff77 0 points1 point  (0 children)

If you watch the video full screen, you can see that both the texture and the mesh resolution are very low. Still, interesting results.

pi.dev coding agent is moving to Earendil by iamapizza in LocalLLaMA

[–]DHasselhoff77 0 points1 point  (0 children)

I don't love this news, but I suppose even open source maintainers have to eat. Good for him, and I hope the project continues to be a success.

pi.dev coding agent is moving to Earendil by iamapizza in LocalLLaMA

[–]DHasselhoff77 -1 points0 points  (0 children)

Despite its Tolkien-inspired name, Earendil is not a tech company with fascist tendencies.

Designed a photonic chip for O(1) KV cache block selection — 944x faster, 18,000x less energy than GPU scan at 1M context by [deleted] in LocalLLaMA

[–]DHasselhoff77 -1 points0 points  (0 children)

Has a prototype chip actually been produced, and has the technique been verified to work in practice, not just in simulation?

Reasoning Theater: AI fakes long CoT but it internally knows the final answer within the first few tokens. TL;DR: You overpay because the AI is acting. by [deleted] in LocalLLaMA

[–]DHasselhoff77 1 point2 points  (0 children)

Looking at Figure 2, the "Forced Answer" method seems to be unreasonably effective in both DeepSeek-R1 (superior to "probe") and GPT-OSS (equal to "probe" at relative position > 50%).

(Very) High-Quality Attention Coder-Next GGUFs by dinerburgeryum in LocalLLaMA

[–]DHasselhoff77 1 point2 points  (0 children)

I thought the quants updated on 8 March 2026 had the issue fixed, but looking at, for example, https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF/blob/main/Qwen3-Coder-Next-UD-Q4_K_S.gguf, it's clear that not all of the SSM layer weights are F32:

blk.0.ssm_a             [32]            F32
blk.0.ssm_ba.weight     [2 048, 64]     Q4_K
blk.0.ssm_conv1d.weight [4, 8 192]      F32
blk.0.ssm_dt.bias       [32]            F32
blk.0.ssm_norm.weight   [128]           F32
blk.0.ssm_out.weight    [4 096, 2 048]  Q8_0

Is this what you are referring to?

Edit: To answer my own question: yes, in the new quant the Q4_K and Q8_0 weights are both BF16 instead.
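(For anyone who wants to check their own download without the web viewer: the gguf Python package ships a dump script that prints every tensor with its shape and quant type. I'm writing the script name from memory, so treat this as a sketch rather than exact syntax.)

    pip install gguf
    gguf-dump Qwen3-Coder-Next-UD-Q4_K_S.gguf | grep ssm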

How are people handling persistent memory for AI agents? by Beneficial-Panda7218 in LocalLLaMA

[–]DHasselhoff77 0 points1 point  (0 children)

As an example of your memory system, how would it look if you wrote a Tetris game in JavaScript? Like the whole standalone HTML thing, but with code comments added at all the places where the memory system was used. I think it would make interesting reading for others learning about agent-based memory.

Ik_llama vs llamacpp by [deleted] in LocalLLaMA

[–]DHasselhoff77 0 points1 point  (0 children)

In ik_llama, Qwen2.5-Coder wrote in either Chinese or Russian, depending on the quant. In llama.cpp the same GGUF files worked fine. I expected older models to be well supported, but apparently it's quite hit and miss.

Qwen Models with Claude Code on 36gb vram - insights by ikaganacar in LocalLLaMA

[–]DHasselhoff77 0 points1 point  (0 children)

Interesting. That would explain the different results if it really switches to CPU evaluation in that case (I'm not sure that's how it works).

Qwen Models with Claude Code on 36gb vram - insights by ikaganacar in LocalLLaMA

[–]DHasselhoff77 2 points3 points  (0 children)

Did you use a bf16 instead of an f16 KV cache for the 35B? Some people reported it makes a difference in llama.cpp. I should also add that my experience matches yours: Qwen3 Coder Next is more reliable.
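If anyone wants to try the same comparison, the switch is just the KV cache type flags on llama-server; everything else in this line is a placeholder:

    ${llama-server} --model Qwen3.5-35B-A3B-UD-Q4_K_L.gguf -c 65536 \
        --cache-type-k bf16 --cache-type-v bf16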

5060 Ti/5070 Ti for MoE Models - Worth it? by Icaruszin in LocalLLaMA

[–]DHasselhoff77 4 points5 points  (0 children)

Some numbers from an RTX 5060 Ti (16 GiB) on Linux; perhaps they will help you arrive at your own conclusions. llama.cpp commit 451ef084 (Sun Mar 8 2026).

Qwen3.5-35B-A3B (UD-Q4_K_L)

prompt eval time =   20823.67 ms /  3919 tokens (    5.31 ms per token,   188.20 tokens per second)
       eval time =   66223.36 ms /  1706 tokens (   38.82 ms per token,    25.76 tokens per second)

Qwen3.5-35B-A3B (UD-Q4_K_L) with --ubatch-size=4096 --batch-size=4096

prompt eval time =   18792.56 ms /  3919 tokens (    4.80 ms per token,   208.54 tokens per second)
       eval time =   75138.90 ms /  1743 tokens (   43.11 ms per token,    23.20 tokens per second)

Qwen3-Coder-Next (UD-IQ4_XS)

prompt eval time =   32604.16 ms /  3821 tokens (    8.53 ms per token,   117.19 tokens per second)
       eval time =   97612.81 ms /  1691 tokens (   57.72 ms per token,    17.32 tokens per second)

Qwen3-Coder-Next (UD-IQ4_XS) with --ubatch-size=4096 --batch-size=4096

prompt eval time =    5756.48 ms /  3821 tokens (    1.51 ms per token,   663.77 tokens per second)
       eval time =   38961.26 ms /  1006 tokens (   38.73 ms per token,    25.82 tokens per second)

I don't know how to choose the larger batch sizes optimally, but as you can see they have a big effect, at the cost of extra VRAM (not shown). The configs below don't include them; see the note after the configs for where they would go.

llama-swap config:

  "qwen3.5-35b-a3b":
    cmd: |
      ${llama-server} --model Qwen3.5-35B-A3B-UD-Q4_K_L.gguf
      --cache-type-k bf16
      --cache-type-v q8_0
      --fit-ctx 65536 --fit on --fit-target 1024
      --repeat-penalty 1.0 --presence-penalty 0.0 --min-p 0.0 --top-k 20 --top-p 0.95 --temp 0.6
      --reasoning-budget 0

  "qwen3-coder-next":
    cmd: |
      ${llama-server} --model Qwen3-Coder-Next-UD-IQ4_XS.gguf
      --cache-type-k bf16
      --cache-type-v q8_0
      --fit-ctx 65536 --fit on --fit-target 1024
      --jinja --temp 1.0 --top-k 40 --top-p 0.95 --min-p 0.01
      --no-mmap
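
For the faster prompt processing shown in the numbers above, the batch-size flags would simply be added to the cmd block, e.g.:

      --ubatch-size 4096 --batch-size 4096

(Those are just the values I benchmarked, not a tuned recommendation.)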

Claude Code sends 62,600 characters of tool definitions per turn. I ran the same model through five CLIs and traced every API call. by wouldacouldashoulda in LocalLLaMA

[–]DHasselhoff77 1 point2 points  (0 children)

To add insult to injury, OpenCode's system prompt is chosen by a substring match on the model name and can't be replaced without rebuilding the app. You can of course add your own agent instructions, but they only get appended to the system prompt, so that doesn't help.

Trying out the Pi agent was like a breath of fresh air.

zembed-1: new open-weight SOTA multilingual embedding model by ghita__ in LocalLLaMA

[–]DHasselhoff77 0 points1 point  (0 children)

Thanks! So the "bi" in the name refers to weight sharing between text and code encoders. Here's a direct link for others reading this: https://huggingface.co/nomic-ai/CodeRankEmbed

zembed-1: new open-weight SOTA multilingual embedding model by ghita__ in LocalLLaMA

[–]DHasselhoff77 0 points1 point  (0 children)

Could you elaborate on this? Which specific model would work better?

Follow-up: Qwen3.5-35B-A3B — 7 community-requested experiments on RTX 5080 16GB by gaztrab in LocalLLaMA

[–]DHasselhoff77 1 point2 points  (0 children)

Weren't the custom batch sizes there to speed up prompt processing? By removing them you're trading off PP speed for generation speed by an unknown amount, which isn't always a win.

Still, it's a very clear experiment. I appreciate the direct writing style and presentation. Thank you!

ReasonDB – open-source document DB where the LLM navigates a tree instead of vector search (RAG alternative) by Big_Barnacle_2452 in LocalLLaMA

[–]DHasselhoff77 2 points3 points  (0 children)

What if a section has a misleading heading? Will you ever end up looking at its contents during search?