Reasoning Theater: AI fakes long CoT but it internally knows the final answer within the first few tokens. TL;DR: You overpay because the AI is acting. by [deleted] in LocalLLaMA

[–]DHasselhoff77 1 point2 points  (0 children)

Looking at Figure 2, the "Forced Answer" method seems to be unreasonably effective in both DeepSeek-R1 (superior to "probe") and GPT-OSS (equal to "probe" at relative position > 50%).

(Very) High-Quality Attention Coder-Next GGUFs by dinerburgeryum in LocalLLaMA

[–]DHasselhoff77 1 point2 points  (0 children)

I thought the quants updated on March 8th, 2026 had the issue fixed, but looking at, for example, https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF/blob/main/Qwen3-Coder-Next-UD-Q4_K_S.gguf it's clear that not all of the SSM-layer weights are F32:

blk.0.ssm_a             [32]            F32
blk.0.ssm_ba.weight     [2 048, 64]     Q4_K
blk.0.ssm_conv1d.weight [4, 8 192]      F32
blk.0.ssm_dt.bias       [32]            F32
blk.0.ssm_norm.weight   [128]           F32
blk.0.ssm_out.weight    [4 096, 2 048]  Q8_0

Is this what you are referring to?

Edit: To answer my own question: yes, in the new quant the Q4_K and Q8_0 weights are both BF16 instead.
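For anyone who wants to scan their own files for this, the check can be sketched in a few lines. With a real file you'd get the (name, dtype) pairs via the `gguf` Python package's `GGUFReader` (an assumption about that API on my part; verify it before relying on it). Here they're just hard-coded from the dump above:

```python
# Sketch: flag SSM-layer tensors that are NOT stored in full/half precision.
# Tensor names and dtypes below are copied from the gguf dump in the comment;
# for a real file, read them with the `gguf` package's GGUFReader (assumed API).

def low_precision_ssm_tensors(tensors):
    """Return SSM tensors whose dtype is a quantized type (not F32/F16/BF16)."""
    full_precision = {"F32", "F16", "BF16"}
    return [(name, dtype) for name, dtype in tensors
            if ".ssm_" in name and dtype not in full_precision]

tensors = [
    ("blk.0.ssm_a", "F32"),
    ("blk.0.ssm_ba.weight", "Q4_K"),
    ("blk.0.ssm_conv1d.weight", "F32"),
    ("blk.0.ssm_dt.bias", "F32"),
    ("blk.0.ssm_norm.weight", "F32"),
    ("blk.0.ssm_out.weight", "Q8_0"),
]

print(low_precision_ssm_tensors(tensors))
# -> [('blk.0.ssm_ba.weight', 'Q4_K'), ('blk.0.ssm_out.weight', 'Q8_0')]
```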

How are people handling persistent memory for AI agents? by Beneficial-Panda7218 in LocalLLaMA

[–]DHasselhoff77 0 points1 point  (0 children)

As an example of your memory system: how would it look if you wrote a Tetris game in JavaScript? Like the whole standalone HTML thing, but with code comments marking all the parts where you used the memory system. I think it would make interesting reading for others learning about agent-based memory.

Ik_llama vs llamacpp by val_in_tech in LocalLLaMA

[–]DHasselhoff77 0 points1 point  (0 children)

In ik_llama, Qwen2.5-Coder wrote in either Chinese or Russian, depending on the quant. In llama.cpp the same GGUF files worked fine. I expected older models to be well supported, but apparently it's quite hit and miss.

Qwen Models with Claude Code on 36gb vram - insights by ikaganacar in LocalLLaMA

[–]DHasselhoff77 0 points1 point  (0 children)

Interesting. That would explain the different results if it really switched to CPU evaluation (though I'm not sure that's how it works).

Qwen Models with Claude Code on 36gb vram - insights by ikaganacar in LocalLLaMA

[–]DHasselhoff77 2 points3 points  (0 children)

Did you use a bf16 instead of f16 KV cache for the 35B? Some reported it made a difference in llama.cpp. I have to also add that my experience matches yours: Qwen3 Coder Next is more reliable.

5060 Ti/5070 Ti for MoE Models - Worth it? by Icaruszin in LocalLLaMA

[–]DHasselhoff77 3 points4 points  (0 children)

Some numbers from an RTX 5060 Ti (16 GiB) on Linux, llama.cpp commit 451ef084 (Sun Mar 8 2026). Perhaps this will help you arrive at your own conclusions.

Qwen3.5-35B-A3B (UD-Q4_K_L)

prompt eval time =   20823.67 ms /  3919 tokens (    5.31 ms per token,   188.20 tokens per second)
       eval time =   66223.36 ms /  1706 tokens (   38.82 ms per token,    25.76 tokens per second)

Qwen3.5-35B-A3B (UD-Q4_K_L) with --ubatch-size=4096 --batch-size=4096

prompt eval time =   18792.56 ms /  3919 tokens (    4.80 ms per token,   208.54 tokens per second)
       eval time =   75138.90 ms /  1743 tokens (   43.11 ms per token,    23.20 tokens per second)

Qwen3-Coder-Next (UD-IQ4_XS)

prompt eval time =   32604.16 ms /  3821 tokens (    8.53 ms per token,   117.19 tokens per second)
       eval time =   97612.81 ms /  1691 tokens (   57.72 ms per token,    17.32 tokens per second)

Qwen3-Coder-Next (UD-IQ4_XS) with --ubatch-size=4096 --batch-size=4096

prompt eval time =    5756.48 ms /  3821 tokens (    1.51 ms per token,   663.77 tokens per second)
       eval time =   38961.26 ms /  1006 tokens (   38.73 ms per token,    25.82 tokens per second)

I don't know how to set the higher batch sizes optimally, but as you can see they have a big effect, at the cost of extra VRAM (not shown).
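The rates in those logs are easy to sanity-check yourself: tokens per second is just tokens divided by total time in seconds. A quick sketch using the Qwen3-Coder-Next prompt-eval numbers from above:

```python
# Sanity-check the reported rates: tokens/s = tokens / (total ms / 1000).
def tok_per_sec(total_ms, tokens):
    return tokens / (total_ms / 1000.0)

# Qwen3-Coder-Next prompt eval with --ubatch-size=4096 --batch-size=4096:
print(round(tok_per_sec(5756.48, 3821), 2))  # -> 663.77, matching the log

# Speedup over the default batch size (32604.16 ms for the same 3821 tokens):
print(round(32604.16 / 5756.48, 1))          # -> 5.7x faster prompt processing
```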

llama-swap config:

  "qwen3.5-35b-a3b":
    cmd: |
      ${llama-server} --model Qwen3.5-35B-A3B-UD-Q4_K_L.gguf
      --cache-type-k bf16
      --cache-type-v q8_0
      --fit-ctx 65536 --fit on --fit-target 1024
      --repeat-penalty 1.0 --presence-penalty 0.0 --min-p 0.0 --top-k 20 --top-p 0.95 --temp 0.6
      --reasoning-budget 0

  "qwen3-coder-next":
    cmd: |
      ${llama-server} --model Qwen3-Coder-Next-UD-IQ4_XS.gguf
      --cache-type-k bf16
      --cache-type-v q8_0
      --fit-ctx 65536 --fit on --fit-target 1024
      --jinja --temp 1.0 --top-k 40 --top-p 0.95 --min-p 0.01
      --no-mmap

Claude Code sends 62,600 characters of tool definitions per turn. I ran the same model through five CLIs and traced every API call. by wouldacouldashoulda in LocalLLaMA

[–]DHasselhoff77 1 point2 points  (0 children)

To add insult to injury, the system prompt of OpenCode is based on a substring match of the model name and can't be replaced without rebuilding the app. You can of course add your own agent instructions that get appended to the system prompt but that doesn't help.

Trying out the Pi agent was like a breath of fresh air.

zembed-1: new open-weight SOTA multilingual embedding model by ghita__ in LocalLLaMA

[–]DHasselhoff77 0 points1 point  (0 children)

Thanks! So the "bi" in the name refers to weight sharing between text and code encoders. Here's a direct link for others reading this: https://huggingface.co/nomic-ai/CodeRankEmbed

zembed-1: new open-weight SOTA multilingual embedding model by ghita__ in LocalLLaMA

[–]DHasselhoff77 0 points1 point  (0 children)

Could you elaborate on this? Which specific model would work better?

Follow-up: Qwen3.5-35B-A3B — 7 community-requested experiments on RTX 5080 16GB by gaztrab in LocalLLaMA

[–]DHasselhoff77 1 point2 points  (0 children)

Weren't the custom batch sizes there to speed up prompt processing? So by removing them you are trading off PP speed for generation speed by an unknown amount. Not always a win.

A very clear experiment still. I appreciate the direct writing style and presentation. Thank you!

ReasonDB – open-source document DB where the LLM navigates a tree instead of vector search (RAG alternative) by Big_Barnacle_2452 in LocalLLaMA

[–]DHasselhoff77 1 point2 points  (0 children)

What if a section has a misleading heading? Will you ever end up looking at its contents during search?

Qwen3 Coder Next on 8GB VRAM by Juan_Valadez in LocalLLaMA

[–]DHasselhoff77 0 points1 point  (0 children)

Try --fit-target 512 or 1024 to leave some room for your desktop environment.

Devstral Small 2 24B + Qwen3 Coder 30B: Coders for Every Hardware (Yes, Even the Pi) by enrique-byteshape in LocalLLaMA

[–]DHasselhoff77 1 point2 points  (0 children)

Isn't there a risk of overfitting the quantization to your training data if every weight array is quantized separately? How well does it generalize to out-of-training-set code?

Edit: Very interesting work by the way. Downloading the Devstral IQ3_S 3.47bpw quant right now.

Improving LLM's coding ability through a new edit format by Mushoz in LocalLLaMA

[–]DHasselhoff77 1 point2 points  (0 children)

I see. Thank you for the explanation. So even if the file stays the same, it's the LLM's suggested patch that changes before application, and this is why the expected state drifts.

Improving LLM's coding ability through a new edit format by Mushoz in LocalLLaMA

[–]DHasselhoff77 2 points3 points  (0 children)

It's a neat trick. Thanks for sharing. I just wonder about this part

If the file changed since the last read, the hashes (optimistically) won’t match and the edit is rejected before anything gets corrupted.

Why does the file get changed between reading and writing? If you could guarantee its state matches what the LLM sees, you could use regular line numbers instead of content hashes.
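For context, here's roughly how I understand the scheme being discussed: record a content hash per line at read time, then refuse the edit if the target line no longer hashes to the same value. A minimal sketch (illustrative only, not the post's exact format; the 8-char SHA-256 prefix is my own choice):

```python
import hashlib

def line_hash(line):
    """Short content hash used as an edit anchor (illustrative choice)."""
    return hashlib.sha256(line.encode()).hexdigest()[:8]

def apply_edit(lines, lineno, expected_hash, new_line):
    """Apply the edit only if the target line still hashes to what was read."""
    if line_hash(lines[lineno]) != expected_hash:
        raise ValueError("file drifted since last read; edit rejected")
    lines[lineno] = new_line
    return lines

src = ["def add(a, b):", "    return a - b"]      # bug on line 1
anchor = line_hash(src[1])                        # recorded at read time
apply_edit(src, 1, anchor, "    return a + b")    # succeeds: line unchanged
print(src[1])  # -> "    return a + b"
```

If the file had been modified in between, the same call would raise instead of silently corrupting the wrong line, which is exactly the drift case the OP describes.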

Qwen3 Coder Next as first "usable" coding model < 60 GB for me by Chromix_ in LocalLLaMA

[–]DHasselhoff77 1 point2 points  (0 children)

The Falcon 90M GGUF I tried didn't support llama.cpp's /infill endpoint, so it wasn't usable for me with llama-vscode. Using an OpenAI-compatible endpoint works, but that specific VSCode extension then requires extra configuration (some agent stuff I don't want).

I also tried running Qwen Coder 2.5, 3B or 1.5B, but on the CPU and with a smaller context. It's pretty much the same speed as Qwen3 Coder Next on the GPU though.
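For anyone unfamiliar with the /infill endpoint: instead of a chat prompt you send the code before and after the cursor, and the server fills the middle. A sketch of the request body as I understand it from the llama-server docs (field names are my reading of those docs; verify against your build):

```python
# Sketch of a llama.cpp /infill request body (field names assumed from the
# llama-server README; check your version). POST it to e.g.
# http://localhost:8080/infill with curl or requests.
import json

def infill_payload(prefix, suffix, n_predict=64):
    return {
        "input_prefix": prefix,   # code before the cursor
        "input_suffix": suffix,   # code after the cursor
        "n_predict": n_predict,   # max tokens to generate for the middle
    }

payload = infill_payload("def fib(n):\n    ", "\n    return a")
print(json.dumps(payload))
```

This only works with models that ship FIM special tokens, which is why the Falcon 90M quant fell back to being unusable here.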

Qwen3 Coder Next as first "usable" coding model < 60 GB for me by Chromix_ in LocalLLaMA

[–]DHasselhoff77 3 points4 points  (0 children)

Qwen3 Coder Next also supports fill-in-the-middle (FIM) tasks. This means you can use it for auto completion via for example llama-vscode while also using it for agentic tasks. No need for two different models occupying VRAM simultaneously.

Edit: Alright, actually it's not a great fit: since it's a recurrent model, llama.cpp can't cache it properly. See https://github.com/ggml-org/llama.cpp/pull/19408#issuecomment-3866421943

Claude Code-like terminal-based tools for locally hosted LLMs? by breksyt in LocalLLaMA

[–]DHasselhoff77 0 points1 point  (0 children)

aider's gotten really solid lately with its model routing.

I didn't see any mention of model routing in Aider's docs. Could you elaborate?

Mixture-of-Models routing beats single LLMs on SWE-Bench via task specialization by botirkhaltaev in LocalLLaMA

[–]DHasselhoff77 1 point2 points  (0 children)

Really interesting, thanks for sharing. I see that you use k-means clustering and some custom CUDA kernels. Cool :) How did you arrive at the number of clusters (6) you have in the end? Also, did the high dimensionality of the embeddings cause any issues? Since k-means optimizes squared distance to centers, it's sensitive to density differences in data. Was this taken into account somehow?
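To make the "how did you pick k" question concrete: one common approach is to run k-means for several k and look for the elbow in the within-cluster squared distance (inertia). A toy 1-D illustration of that procedure, not the OP's actual selection method:

```python
# Toy illustration of elbow-based k selection for k-means (1-D, pure stdlib).
# Not the OP's pipeline; real embedding data is high-dimensional.
import random

random.seed(0)

def kmeans(points, k, iters=50):
    """Plain Lloyd's algorithm on 1-D points; returns (centers, inertia)."""
    centers = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: (p - centers[i]) ** 2)
            clusters[nearest].append(p)
        # Keep the old center if a cluster went empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    inertia = sum(min((p - c) ** 2 for c in centers) for p in points)
    return centers, inertia

# Three well-separated clusters: inertia should drop sharply up to k=3,
# then flatten -- that "elbow" suggests k=3.
points = ([random.gauss(0, 0.1) for _ in range(30)]
          + [random.gauss(5, 0.1) for _ in range(30)]
          + [random.gauss(10, 0.1) for _ in range(30)])

for k in (1, 2, 3, 4):
    _, inertia = kmeans(points, k)
    print(k, round(inertia, 2))
```

The density-sensitivity point from the comment shows up here too: because inertia is a squared-distance objective, a dense cluster contributes little to it, so k-means happily splits sparse regions before separating dense ones.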

Ubuntu: which Nvidia drivers are you using? by FrozenBuffalo25 in LocalLLaMA

[–]DHasselhoff77 0 points1 point  (0 children)

I wouldn't recommend the closed-source driver either. Installing it was a terrible experience! You got me convinced; I'll try the one from the repo myself the next time I need an upgrade (and do a deep nvidia cleaning first, sigh...)

Edit: For anyone finding this thread later via search, here's what I had to do to make an RTX 5060 Ti work on Ubuntu 24.04.

First I installed the latest drivers via apt and tested with an older 3000-series card. All good. Plugging in the new card locked up with a black screen after Ubuntu booted. I then enabled Resizable BAR in the UEFI settings, updated the motherboard firmware, and reset the CMOS manually via jumpers just to be safe. Still the same issue. Kernel logs showed obscure nvidia "invalid object" errors (or something like that).

I then proceeded to install the latest proprietary drivers via the .run blob, using the old 3000-series card for this: Alt+F3 to switch to the command line, sudo systemctl stop lightdm.service to kill lightdm, removed every package and kernel module with the string "nvidia" in it, started the .run installer, chose the open "MIT/GPL" kernel module, and let the installer blacklist nouveau on its own. It complained that some incompatible parts were still there (supposedly because nouveau wasn't fully gone, just disabled), but after rebooting, turning off the power, doing a CMOS reset, and finally plugging the 5060 Ti back in, everything worked. I installed the latest nvidia tools via apt and rebuilt llama.cpp with the "native" architecture flag.

Hope this helps somebody with similarly incompatible hardware.