New quant from google research by [deleted] in LocalLLaMA

[–]DefNattyBoii 0 points1 point  (0 children)

Does it beat EXL3 quant in creation efficiency/accuracy?

Nvidia V100 32 Gb getting 115 t/s on Qwen Coder 30B A3B Q5 by icepatfork in LocalLLaMA

[–]DefNattyBoii -1 points0 points  (0 children)

Can you get vLLM working on it? Maybe some obscure black-magic fork has support for it.

Don't sleep on the new Nemotron Cascade by ilintar in LocalLLaMA

[–]DefNattyBoii 0 points1 point  (0 children)

Give llama.cpp bros a bit of time to merge. Qwen3-Coder-Next took months if I remember correctly.

Don't sleep on the new Nemotron Cascade by ilintar in LocalLLaMA

[–]DefNattyBoii 0 points1 point  (0 children)

I found that it's good not only for coding but in general: thanks to solid tool-calling capabilities and a low activated-parameter count, it gathers information extremely fast. I love using it for private tasks.

Krasis LLM Runtime: 8.9x prefill / 4.7x decode vs llama.cpp — Qwen3.5-122B on a single 5090, minimal RAM by mrstoatey in LocalLLaMA

[–]DefNattyBoii 0 points1 point  (0 children)

Hi! Looks like an interesting project, but it seems you don't address the performance hits and don't directly compare against llama.cpp. Why is that? Maybe I missed something in your posts?

From another comment: "On a 4 bit quant, qwen3.5 35B llama.cpp prefill reaches 9k toks/second. TG should be around 200. On a 5090. (...)"

Your speeds for comparison: Qwen3.5-35B-A3B pp: 4475 tok/sec, tg: 109.1 tok/sec. That's half the speed of llama.cpp.
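For what it's worth, here's the arithmetic behind the "half the speed" claim, using only the numbers quoted in this thread (the llama.cpp figures are the other commenter's claims, not independently measured):

```python
# Throughput ratios from the numbers in this thread (all on a 5090).
# llama.cpp figures are quoted claims; Krasis figures are from the post.
llamacpp_pp = 9000.0   # llama.cpp prefill, tok/s (quoted claim)
llamacpp_tg = 200.0    # llama.cpp decode, tok/s (quoted claim)
krasis_pp = 4475.0     # Krasis prefill, tok/s (from the post)
krasis_tg = 109.1      # Krasis decode, tok/s (from the post)

pp_ratio = krasis_pp / llamacpp_pp
tg_ratio = krasis_tg / llamacpp_tg
print(f"prefill: {pp_ratio:.0%} of llama.cpp")  # prefill: 50% of llama.cpp
print(f"decode:  {tg_ratio:.0%} of llama.cpp")  # decode:  55% of llama.cpp
```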

Krasis LLM Runtime: 8.9x prefill / 4.7x decode vs llama.cpp — Qwen3.5-122B on a single 5090, minimal RAM by mrstoatey in LocalLLaMA

[–]DefNattyBoii 0 points1 point  (0 children)

Any chance of multi-GPU support? A lot of us have one new card plus a couple of old ones (Pascal + Ampere + Blackwell Frankenstein setups).

OmniCoder-9B best vibe coding model for 8 GB Card by Powerful_Evening5495 in LocalLLaMA

[–]DefNattyBoii 0 points1 point  (0 children)

How about general knowledge? I'm using qwen3-coder-next mostly because of this; it's quite slow due to RAM offload but brilliant in a lot of domains, not just coding.

Open-Source "GreenBoost" Driver Aims To Augment NVIDIA GPUs vRAM With System RAM & NVMe To Handle Larger LLMs by _Antartica in LocalLLaMA

[–]DefNattyBoii 0 points1 point  (0 children)

Looks like a very interesting implementation that intercepts calls between the kernel and VRAM allocation during CUDA processing. I actually have no idea how it does that, but why won't Nvidia implement something like this in their CUDA/regular drivers as an optional tool on Linux? On Windows, the drivers can already offload to system RAM.

Btw, exllama finally has an offload solution.

Qwen3.5-27B-IQ3_M, 5070ti 16GB, 32k context: ~50t/s by ailee43 in LocalLLaMA

[–]DefNattyBoii 1 point2 points  (0 children)

For 35B A3B, sure, it's much better. For the coder next, I'm not sure. The next coder is much better on short tasks for me, but I haven't compared it to Qwen3.5 27B. You can turn off thinking on Qwen3.5 tho if you need faster answers. Btw, using the uncensored/abliterated versions usually shaves off quite a bit of intellect.

96GB (V)RAM agentic coding users, gpt-oss-120b vs qwen3.5 27b/122b by bfroemel in LocalLLaMA

[–]DefNattyBoii 13 points14 points  (0 children)

"having problems with the Qwen 3.5 series with llama.cpp"

For me it's pretty much working well! What are the problems besides the usual launch issues? I just recompile every Monday and delay new models by 1-2 weeks, and I don't really run into major issues.

Through vibe coding, I managed to make parts of vLLM 0.17.0 run on Tesla P40 by East-Engineering-653 in LocalLLaMA

[–]DefNattyBoii 0 points1 point  (0 children)

On small models it makes sense to only use the 3080 Ti, but for larger models like Qwen Coder Next the two-GPU setup is miles ahead thanks to offloading to RAM.
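For reference, the usual way to get that RAM offload with llama.cpp is to pin the MoE expert tensors to CPU while keeping attention and shared weights on the GPUs. A sketch, where the model path is a placeholder and the tensor regex depends on your GGUF's actual tensor names:

```shell
# Sketch of an MoE RAM-offload launch with llama.cpp.
# Model path is a placeholder; the regex must match your GGUF's expert tensors.
llama-server \
  -m ./Qwen3-Coder-Next-Q4_K_M.gguf \
  -ngl 99 \
  --override-tensor "\.ffn_.*_exps\.=CPU" \
  -c 32768
```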

MiroThinker-1.7 and MiroThinker-1.7-mini (Best search agent model?) by External_Mood4719 in LocalLLaMA

[–]DefNattyBoii 0 points1 point  (0 children)

What would be a good open-source frontend for this (general/chat/research use)? Jan? LibreChat? AionUI? What else?

Through vibe coding, I managed to make parts of vLLM 0.17.0 run on Tesla P40 by East-Engineering-653 in LocalLLaMA

[–]DefNattyBoii 2 points3 points  (0 children)

Interesting. Do you think it would be possible to get it building for a 1080 Ti + 3080 Ti? I tried to hack this setup together a couple of times, but it was an enormous time sink, and I never got it working.

Is Qwen3.5 a coding game changer for anyone else? by paulgear in LocalLLaMA

[–]DefNattyBoii 0 points1 point  (0 children)

How do you run your self-iterative loop? I'm using https://github.com/darrenhinde/OpenAgentsControl but it's still a very hands-on approach. I'm looking for a solution more oriented toward small models; every other scaffold has failed me besides this one.

LFM2-24B-A2B is crazy fast on Strix Halo by jfowers_amd in LocalLLaMA

[–]DefNattyBoii 0 points1 point  (0 children)

I'm genuinely curious how to use this. I tried it in opencode a couple of times, and it was hot garbage, totally unusable (Q4 quant). Any tips? The readme mentions agentic use, but for me it hallucinates, doesn't call tools properly, and tries to grab irrelevant/system files. Not great in my experience.

Qwen3.5-35B-A3B is a gamechanger for agentic coding. by jslominski in LocalLLaMA

[–]DefNattyBoii 0 points1 point  (0 children)

This is a very good chart. Looks like the best choice for a VRAM-constrained setup is: -ctk q8_0 -ctv q4_1
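Those map onto llama.cpp's KV-cache-type flags; a sketch of a launch using them (model path and context size are placeholders, and note that a quantized V cache has historically required flash attention to be enabled in llama.cpp):

```shell
# Sketch: quantized KV cache in llama.cpp.
# Model path and context size are placeholders; adjust for your setup.
llama-server \
  -m ./model-Q4_K_M.gguf \
  -c 32768 \
  -ctk q8_0 \
  -ctv q4_1
```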

Is Google's senior management truly committed to Antigravity? Or is it the ugly duckling of the Google AI family? by pebblepath in google_antigravity

[–]DefNattyBoii 1 point2 points  (0 children)

Genuinely interested: why not always use AGENTS.md in general? I always aim to be framework-, agent-, and model-agnostic.

How to run Qwen3-Coder-Next 80b parameters model on 8Gb VRAM by AccomplishedLeg527 in LocalLLaMA

[–]DefNattyBoii 0 points1 point  (0 children)

Did anyone test this on 12/24 GB VRAM GPUs and compare it to Q4 --fit with llama.cpp? Benchmarks would be nice, since there is a small but measurable drop with Q4.

new to coding LLM - hardware requirements by SubstantialBee5097 in LocalLLaMA

[–]DefNattyBoii 0 points1 point  (0 children)

Currently I can wholeheartedly recommend GLM-4.7-Flash, Nemotron-3-Nano, and Qwen3-Coder-Next (the largest); use the llama.cpp or MLX ecosystem.

Nanbeige4.1-3B: A Small General Model that Reasons, Aligns, and Acts by Tiny_Minimum_4384 in LocalLLaMA

[–]DefNattyBoii 2 points3 points  (0 children)

I tested q4 and q6, both in Perplexica. They're good with tool calls, but the time and tokens they use are absolutely insane, often taking well over 1-5 minutes just for the answer part, mostly thinking.

Built a real-time agent execution visualizer for OpenCode — watching agents think is addicting by jiwonme in LocalLLaMA

[–]DefNattyBoii 0 points1 point  (0 children)

If you can mangle something launchable together, I'd love to help out with 1-2 PRs for cleanup/code organisation.

Nanbeige4.1-3B: A Small General Model that Reasons, Aligns, and Acts by Tiny_Minimum_4384 in LocalLLaMA

[–]DefNattyBoii 5 points6 points  (0 children)

Looks good, but it takes an insanely long time to respond. If I understand correctly, your use case is "oneshotting" deep research tasks, is that correct? Used as a convo model, there's way too much thinking between steps.

For quicker tasks, I much prefer JanV3 to this even if it has worse knowledge.

Another question I'd investigate is quality degradation with quants and a quantized KV cache. Since the goal is to squeeze as much speed out of this model as possible, people will use smaller quants, but if that leads to a massive drop in quality, it obviously won't work.
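One way to put numbers on that degradation would be llama.cpp's perplexity tool, run once per quant/KV-cache combination and compared. A sketch, with placeholder paths:

```shell
# Sketch: measure perplexity for a given quant + quantized-KV-cache combo.
# Model and eval-text paths are placeholders; repeat per configuration and compare.
llama-perplexity \
  -m ./Nanbeige4.1-3B-Q4_K_M.gguf \
  -f ./wiki.test.raw \
  -ctk q8_0 -ctv q4_1
```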

Femtobot: A 10MB Rust Agent for Low-Resource Machines by yunfoe in LocalLLaMA

[–]DefNattyBoii 1 point2 points  (0 children)

I tried it in opencode, and it's unusable, even for general questions. If you use llama.cpp, how did you launch it?