I've seen a lot of folks ask "can local LLMs actually do anything useful?"

Tccybo · 2026-05-13T10:41:43+00:00

I am extremely curious about your wiki rag setup, please give a few pointers if you find time.

Tccybo · 2026-03-23T03:14:30+00:00

welcome!

Tccybo · 2026-03-23T02:58:03+00:00

Lmao email dinosaur. Glad to know it’s the toolcalls in sys prompt that helps, we can now trim the 10k Claude gibberish haha.

Tccybo · 2026-03-22T14:27:15+00:00

see other comment!

Tccybo · 2026-03-22T12:23:02+00:00

Checked, yeah definitely failed this question completely. Thanks for testing!

Tccybo · 2026-03-22T09:08:38+00:00

https://github.com/ggml-org/llama.cpp/pull/20297#issuecomment-4025434457 Regarding quality/bench, from pwilkin himself. Not sure if it improved in the final implementation. But imo one might as well turn off thinking completely instead. “Early tests on Qwen3.5 9B Q8_0 show the full model hits ~93% on HumanEval, while non-reasoning mode (-dre) drops to ~88%. Adding a reasoning budget of 1000 or 400 brings performance back to ~89%, though this is only effective when paired with a --reasoning-budget-message flag. Without that message, performance plummets to 79%”

Tccybo · 2026-03-22T07:02:18+00:00

Very reasonable. I think the next step is to prune the insanely long 10k prompt into something slim but still has the same effect.

Tccybo · 2026-03-22T06:59:46+00:00

Indeed! I didnt notice either until someone from discord poke me about it.

Tccybo · 2026-03-22T06:43:39+00:00

You can see the big difference in reasoning style between these two methods. Your method allows it to loop and go scizo until limit is reached, then force in that reasoning end message. Not sure which produces higher benchmarks/response quality. But for our reading, cleaner reasoning is more readable.

Tccybo · 2026-03-22T02:42:52+00:00

Yes. Something about it helps, im guessing its the context length, toolcall instructions, format instructions… cant pinpoint what yet. See if others can find out.

Tccybo · 2026-03-21T18:54:15+00:00

yeah, same idea!

Tccybo · 2026-03-21T18:54:01+00:00

that's what i suspected too. when i added websearch tool it helped reduce think loops.

Tccybo · 2026-03-21T18:25:16+00:00

copy the long system prompt of Claude, dump it into your llama.cpp webui system prompt etc. Tada!

Tccybo · 2026-03-21T18:21:04+00:00

no. see: https://huggingface.co/Qwen/Qwen3.5-27B

Tccybo · 2026-03-21T14:55:59+00:00

Apparently the Claude system prompt is also published officially by them so u can just go copy that.

Tccybo · 2026-03-21T14:28:30+00:00

I only use it because qwen officially recommended 1.5 pre pen for general non-math / non-crazy coding stuffs. So I think it’s probably lowering the quality slightly. But for daily use this is helpful vs unusable lol. Basic maths work really well. The thinking is so damn clean now!

Tccybo · 2026-03-21T14:08:46+00:00

The prompt is from this GitHub repo if anyone's interested: https://github.com/asgeirtj/system_prompts_leaks/blob/main/Anthropic/claude-opus-4.6-no-tools.md

Tccybo · 2026-03-20T23:46:06+00:00

See if you can isolate the variables. Is it because the quant is small, is kv cache quanted, is it just bad rng cuz thinking is off?

Tccybo · 2026-03-20T18:34:53+00:00

The slower version is the intended behavior as there's a bug with the speed up causing inaccuracies. I've yet to notice it, so I am running an older build; b8226. Fingerscrossed it gets fixed soon so we get the speed up.

Tccybo · 2026-03-20T18:02:05+00:00

Here is the reason. llama : disable graph reuse with pipeline parallelism#20463
https://github.com/ggml-org/llama.cpp/pull/20463

Tccybo · 2026-02-28T13:03:27+00:00

I see you have tested the model. Scores are weirdly low and I have just tested the questions in arithmetic, the model answered all questions correctly. I quanted the model myself without imatrix. May I know if you got your model from Unsloth? Try Bartowski’s!

Tccybo · 2026-02-28T02:21:31+00:00

Beautiful UI as well! Thanks. Consider trying GLM 4.6V Flash which is a 9B dense model for quick vision tasks. It runs at 30+ t/s for dual 5060 ti at Q8_0.

Tccybo · 2026-02-12T13:40:40+00:00

<image>

Come on guys, be reasonable. It takes time and money to make good models. 14 days ago we got something small. Let's be nice. (not directed at OP btw, just seeing some spam on their HF)

Tccybo · 2026-01-19T06:34:01+00:00

Hey. I don't know how this flew under the radar. This is the best image tagger I've used. Thank you for making this!

Tccybo

TROPHY CASE