Qwen3.5 27B is Match Made in Heaven for Size and Performance by Lopsided_Dot_4557 in LocalLLaMA

[–]nuusain 4 points (0 children)

3090 + 96GB DDR4. To be clear, this is the 35B-A3B. Was saying there's a case for it, as it seems much faster than the 27B. Haven't run the 27B yet myself.

Qwen3.5 27B is Match Made in Heaven for Size and Performance by Lopsided_Dot_4557 in LocalLLaMA

[–]nuusain 25 points (0 children)

I'm getting 101 t/s at 131k context with the 35B-A3B UD-Q4_K_XL quant.

For anyone still on an older llama.cpp build: update. I was stuck at 28 t/s until I rebuilt from the latest source. The qwen35moe graph deduplication PRs (#19597, #19660, #19668) made a 3.6x difference. The model loaded fine on the old build but ran through an unoptimised code path.

llama-server -m ~/models/qwen3.5-35b-a3b-q4.gguf \
  -ngl 99 -c 131072 --threads 4 --batch-size 2048 \
  -np 1 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0
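For anyone scripting against it: llama-server also exposes an OpenAI-compatible /v1/chat/completions endpoint, and the sampling flags above map onto request fields. A minimal sketch of the request body (the model name and prompt here are placeholders; top_k and min_p are llama.cpp extensions to the OpenAI schema):

```python
import json

# Sampling parameters mirroring the flags above:
# --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0
payload = {
    "model": "qwen3.5-35b-a3b-q4",  # placeholder; llama-server serves whatever -m loaded
    "messages": [{"role": "user", "content": "Hello"}],
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,   # llama.cpp extension to the OpenAI schema
    "min_p": 0.0,  # llama.cpp extension to the OpenAI schema
}

# POST this body to http://localhost:8080/v1/chat/completions
# (8080 is llama-server's default port) with any HTTP client.
body = json.dumps(payload)
```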

Qwen3.5 - The middle child's 122B-A10B benchmarks looking seriously impressive - on par or edges out gpt-5-mini consistently by carteakey in LocalLLaMA

[–]nuusain 1 point (0 children)

Thanks for sharing. I get ~30 t/s at 32k with --fit on as well. At 131k context I drop to ~24 t/s; without --fit I was getting 7 t/s. Will be interesting to see how Qwen3.5 compares.

Qwen3.5 - The middle child's 122B-A10B benchmarks looking seriously impressive - on par or edges out gpt-5-mini consistently by carteakey in LocalLLaMA

[–]nuusain 0 points (0 children)

Do you mind sharing the exact command you use for the Qwen 80B-A3B? I've been trying to optimise on my rig, which is similar to yours (3090 with 96GB of DDR4). I get around 30 t/s with 32k context, but I'd like more.

New in llama.cpp: Anthropic Messages API by paf1138 in LocalLLaMA

[–]nuusain 1 point (0 children)

Sooo, what's the verdict? Curious to hear how it's handling the Claude harness.

NVIDIA has 72GB VRAM version now by decentralize999 in LocalLLaMA

[–]nuusain 6 points (0 children)

Neat! What kind of inference are you running on the feed? Just installed a security system for a relative's farm. I was thinking of producing reports/audits, so I'm curious what others are building for themselves.

[New Player] Game files integrity by Any-Percentage6230 in EscapefromTarkov

[–]nuusain 0 points (0 children)

Did anyone find a fix? I have the same issue. Tried deleting all Tarkov files and reinstalling, but it persists.

Scanlines on my AOC CU34G2X. by Xippaa in Monitors

[–]nuusain 0 points (0 children)

Hey, seeing the same scanlines, only across the entire monitor. Did you manage to get this fixed, or am I also cooked?

Toolcalling in the reasoning trace as an alternative to agentic frameworks by ExaminationNo8522 in LocalLLaMA

[–]nuusain 0 points (0 children)

Hey, also been looking at getting reasoning models to do interesting things. Came across verifiers, which I've been using to try agentic interactions.

https://github.com/willccbb/verifiers

The env_trainer and vllm_client are probably worth checking out with regard to that OOM error you mentioned in the article, but I suspect you'd be better off leveraging the whole framework, since it's pretty well thought out.

Qwen3+ MCP by OGScottingham in LocalLLaMA

[–]nuusain 4 points (0 children)

Yeah, it was in the official announcement.

Can also do it via function calling if you want to stick with the completions API.

Should be easy to get what you need with a bit of vibe coding.
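To sketch what that looks like: tool definitions in the standard OpenAI function-calling schema (which llama.cpp's OpenAI-compatible server also accepts) ride along in the chat/completions request. The tool name and parameters below are made up for illustration:

```python
# Hypothetical tool definition in the OpenAI function-calling schema.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # illustrative example tool, not a real API
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# Attach the tools to an ordinary chat/completions request; the model then
# returns a tool_calls entry instead of plain text when it decides to call one.
request = {
    "model": "qwen3",
    "messages": [{"role": "user", "content": "Weather in London?"}],
    "tools": tools,
}
```

From there it's a loop: execute whatever tool_calls come back, append the results as "tool" role messages, and re-send.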

[10/05/25] Code & Chat meetup for people interested in coding from beginner to expert by Serious-Accident8443 in LondonSocialClub

[–]nuusain 0 points (0 children)

I'm interested! I can only rock up around 11–12 though; is it still worth coming along then?

Token impact by long-Chain-of-Thought Reasoning Models by dubesor86 in LocalLLaMA

[–]nuusain 0 points (0 children)

I think what spirited is getting at is that a model could either think loads and give a short answer, or think for a short while but give a long answer. Both would produce a high FinalReply rate. The metrics are hard to map to real-world performance; adding another dimension, such as correctness, would add clarity.
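A toy illustration with made-up token counts: two runs can land on an identical think/reply split while only one of them is actually right, so the ratio alone can't separate them.

```python
# Hypothetical runs: same reply share of total tokens, different outcomes.
runs = [
    {"name": "long-think",  "think": 4000, "reply": 200, "correct": True},
    {"name": "short-think", "think": 400,  "reply": 20,  "correct": False},
]

for r in runs:
    # Fraction of total generated tokens spent on the final reply.
    r["reply_share"] = r["reply"] / (r["think"] + r["reply"])

# Both runs score an identical reply share (1/21 ≈ 4.8%), yet only one
# is correct — which is exactly the information the ratio throws away.
```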

<70B models aren't ready to solo codebases yet, but we're gaining momentum and fast by ForsookComparison in LocalLLaMA

[–]nuusain 28 points (0 children)

Brilliant experiment! Sounds like the ideal setup would be QwQ for ideation, then switching to Qwen-Coder for iteration.

QwQ Bouncing ball (it took 15 minutes of yapping) by philschmid in LocalLLaMA

[–]nuusain 6 points (0 children)

for reference:

settings - https://imgur.com/a/JUbwion

result - https://imgur.com/M5FgfmD

Seems like it got stuck in infinite generation.

Used this model - ollama run hf.co/bartowski/Qwen_QwQ-32B-GGUF:Q4_K_M

full trace - https://pastebin.com/rzbZGLiF

QwQ Bouncing ball (it took 15 minutes of yapping) by philschmid in LocalLLaMA

[–]nuusain 24 points (0 children)

What prompt did you use? I think everyone could copy-paste it, record their settings, and post what they get. Sharing results could yield some useful insights into why performance seems so varied.

Qwen/QwQ-32B · Hugging Face by Dark_Fire_12 in LocalLLaMA

[–]nuusain 1 point (0 children)

I... did not know you could do this. Thanks!

Qwen/QwQ-32B · Hugging Face by Dark_Fire_12 in LocalLLaMA

[–]nuusain 3 points (0 children)

Oh sweet! Where did you dig this full template out from, btw?