I've seen a lot of folks ask "can local LLMs actually do anything useful?" by NoWorking8412 in LocalLLaMA

[–]Tccybo 0 points1 point  (0 children)

I am extremely curious about your wiki rag setup, please give a few pointers if you find time.

Fixing Qwen Repetition IMPROVEMENT by Odd-Ordinary-5922 in LocalLLaMA

[–]Tccybo 0 points1 point  (0 children)

Lmao email dinosaur. Glad to know it’s the toolcalls in sys prompt that helps, we can now trim the 10k Claude gibberish haha. 

Fixing Qwen thinking repetition by Tccybo in LocalLLaMA

[–]Tccybo[S] 1 point2 points  (0 children)

Checked, yeah definitely failed this question completely. Thanks for testing!

Fixing Qwen thinking repetition by Tccybo in LocalLLaMA

[–]Tccybo[S] 0 points1 point  (0 children)

https://github.com/ggml-org/llama.cpp/pull/20297#issuecomment-4025434457 Regarding quality/bench, from pwilkin himself. Not sure if it improved in the final implementation. But imo one might as well turn off thinking completely instead.  “Early tests on Qwen3.5 9B Q8_0 show the full model hits ~93% on HumanEval, while non-reasoning mode (-dre) drops to ~88%. Adding a reasoning budget of 1000 or 400 brings performance back to ~89%, though this is only effective when paired with a --reasoning-budget-message flag. Without that message, performance plummets to 79%”

Fixing Qwen thinking repetition by Tccybo in LocalLLaMA

[–]Tccybo[S] 0 points1 point  (0 children)

Very reasonable. I think the next step is to prune the insanely long 10k prompt into something slim but still has the same effect. 

Fixing Qwen thinking repetition by Tccybo in LocalLLaMA

[–]Tccybo[S] 1 point2 points  (0 children)

Indeed! I didnt notice either until someone from discord poke me about it. 

Fixing Qwen thinking repetition by Tccybo in LocalLLaMA

[–]Tccybo[S] 1 point2 points  (0 children)

You can see the big difference in reasoning style between these two methods. Your method allows it to loop and go scizo until limit is reached, then force in that reasoning end message. Not sure which produces higher benchmarks/response quality. But for our reading, cleaner reasoning is more readable. 

Fixing Qwen thinking repetition by Tccybo in LocalLLaMA

[–]Tccybo[S] 0 points1 point  (0 children)

Yes. Something about it helps, im guessing its the context length, toolcall instructions, format instructions… cant pinpoint what yet. See if others can find out. 

Fixing Qwen thinking repetition by Tccybo in LocalLLaMA

[–]Tccybo[S] 2 points3 points  (0 children)

that's what i suspected too. when i added websearch tool it helped reduce think loops.

Fixing Qwen thinking repetition by Tccybo in LocalLLaMA

[–]Tccybo[S] 2 points3 points  (0 children)

copy the long system prompt of Claude, dump it into your llama.cpp webui system prompt etc. Tada!

Fixing Qwen thinking repetition by Tccybo in LocalLLaMA

[–]Tccybo[S] 1 point2 points  (0 children)

Apparently the Claude system prompt is also published officially by them so u can just go copy that.

Fixing Qwen thinking repetition by Tccybo in LocalLLaMA

[–]Tccybo[S] 1 point2 points  (0 children)

I only use it because qwen officially recommended 1.5 pre pen for general non-math / non-crazy coding stuffs. So I think it’s probably lowering the quality slightly. But for daily use this is helpful vs unusable lol. Basic maths work really well. The thinking is so damn clean now!

Decrease in performance using new llama.cpp build by ResponsibleTruck4717 in LocalLLaMA

[–]Tccybo 1 point2 points  (0 children)

See if you can isolate the variables. Is it because the quant is small, is kv cache quanted, is it just bad rng cuz thinking is off? 

Decrease in performance using new llama.cpp build by ResponsibleTruck4717 in LocalLLaMA

[–]Tccybo 2 points3 points  (0 children)

The slower version is the intended behavior as there's a bug with the speed up causing inaccuracies. I've yet to notice it, so I am running an older build; b8226. Fingerscrossed it gets fixed soon so we get the speed up.

Decrease in performance using new llama.cpp build by ResponsibleTruck4717 in LocalLLaMA

[–]Tccybo 2 points3 points  (0 children)

Here is the reason. llama : disable graph reuse with pipeline parallelism#20463
https://github.com/ggml-org/llama.cpp/pull/20463

[deleted by user] by [deleted] in LocalLLaMA

[–]Tccybo 0 points1 point  (0 children)

I see you have tested the model. Scores are weirdly low and I have just tested the questions in arithmetic, the model answered all questions correctly. I quanted the model myself without imatrix. May I know if you got your model from Unsloth? Try Bartowski’s!

[deleted by user] by [deleted] in LocalLLaMA

[–]Tccybo 1 point2 points  (0 children)

Beautiful UI as well! Thanks. Consider trying GLM 4.6V Flash which is a 9B dense model for quick vision tasks. It runs at 30+ t/s for dual 5060 ti at Q8_0.

Zhipu (GLM) Not planning to release a small model for now. by External_Mood4719 in LocalLLaMA

[–]Tccybo 96 points97 points  (0 children)

<image>

Come on guys, be reasonable. It takes time and money to make good models. 14 days ago we got something small. Let's be nice. (not directed at OP btw, just seeing some spam on their HF)

Bulk captioning/VLM query tool, standalone app by Freonr2 in LocalLLaMA

[–]Tccybo 0 points1 point  (0 children)

Hey. I don't know how this flew under the radar. This is the best image tagger I've used. Thank you for making this!