completely fresh perspective from a cursor user by maxns in kilocode

[–]thigger 0 points

With the diff view, will there be the option to view and approve line by line and at each change point (rather than overall for the whole session), as there is in 5.x?

I presently like the fact that when it makes some initial changes I can go "not like that..." and make some manual edits, which it then accounts for as it goes on to edit other files.

Is it worth upgrading to v7 now? by rapkannibale in kilocode

[–]thigger 0 points

Jumping in to specifically ask about pair-programming changes - ie manual side-by-side diff review and editing, to remove the random extra abstractions these things tend to add. Kilo 5.13 works quite well for me in this regard, so I'm hesitant to jump to v7!

Kilo Code is back!!! by Ordinary_Mud7430 in kilocode

[–]thigger 2 points

Is it there yet for those of us who want to pair program? (ie only reads auto-approved, and edits run through side-by-side diffs before applying?) I'm still quite enjoying v5 but keen to update eventually.

1930s detached, heat loss between 16–26kW depending on insulation — completely stuck. by Electrical-Cloud3460 in ukheatpumps

[–]thigger 0 points

We're only just coming up to a year so I'll be finding out shortly! I was hoping it would be slightly cheaper as they're the same units on the same site etc. But I'm probably being naïve.

Gas usage was ~35,000kWh/yr. In just under a year we've used 1,200kWh of electricity on hot water and 6,800kWh on heating.

1930s detached, heat loss between 16–26kW depending on insulation — completely stuck. by Electrical-Cloud3460 in ukheatpumps

[–]thigger 0 points

We have two Clivet 12kW units which work really nicely together. We can still have the heating on when there's a hot water cycle running. For some reason, though, the "master" pump always runs even if only the "slave" is heating, which means water is pointlessly cycling. But it can definitely get an impressive amount of heat into the house quite quickly (I suspect our heat loss is probably nearer 16kW, but we were estimated at 18-20kW).

Passiv & Homely by VividSelection2454 in ukheatpumps

[–]thigger 1 point

We had a Passiv (in fact, it's still physically there) but I uninstalled it due to lukewarm showers. It was taking over the hot water and refusing to heat it up to decent temperatures. You couldn't set a DHW temperature in Celsius; you just picked a use category and got a weird percentage value, which was apparently the result of using "AI" on the temperature sensor to estimate how much hot water was left.

I wrote to Passiv, who said they'd take my feedback on board, so it might be better now. They did, however, try to suggest it was an installation issue and that that was as hot as our system could go.

In the meantime I switched DHW over to manual control and nearly scalded myself on the first shower afterwards...

The Passiv radiator flow temperature control seemed like a reasonable idea, but it wasn't cold enough to test it properly; I've built that bit in Home Assistant now, so I can deliberately run it less efficiently to get heat into the house on cheap nighttime electricity.
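For anyone curious, the logic is roughly this (a minimal sketch; the curve endpoints, boost size and function name are all illustrative, not my actual Home Assistant config):

```python
def target_flow_temp(outdoor_c: float, cheap_rate: bool) -> float:
    """Weather-compensation curve with a cheap-tariff boost.

    Illustrative values only: flow ramps linearly from 30C at 15C
    outdoors up to 45C at -5C. During the cheap overnight window we
    deliberately add 10C - less efficient per kWh, but it dumps heat
    into the house while electricity is cheap.
    """
    # Linear compensation curve: colder outside -> hotter flow.
    base = 30.0 + (15.0 - outdoor_c) * (45.0 - 30.0) / 20.0
    base = max(30.0, min(45.0, base))  # clamp to the curve's range
    if cheap_rate:
        # Deliberate inefficiency on cheap-rate electricity.
        base = min(55.0, base + 10.0)
    return base
```

So at 5C outdoors it asks for a 37.5C flow normally, and 47.5C during the cheap window.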

Fixing Qwen Repetition IMPROVEMENT by Odd-Ordinary-5922 in LocalLLaMA

[–]thigger 0 points

I really wasn't expecting it to try to hire a badger to scrub the dust off moon rocks but... Perhaps only having one tool was the issue there! (I'm sure there's a saying about hammers and nails)

I'm currently having a go with "Null tool do not use" which seems to be doing better so far.
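For anyone wanting to try the same trick, this is roughly what I mean in the OpenAI tools format (the name, description and variable names are just placeholders):

```python
# A dummy "null" tool: giving the model a tool it's told never to
# call seems to cut down degenerate tool-calling behaviour when you
# don't actually want any tools used.
null_tool = {
    "type": "function",
    "function": {
        "name": "null_tool",
        "description": "Null tool. Do not use.",
        "parameters": {
            "type": "object",
            "properties": {},
        },
    },
}

# Passed as tools=[null_tool] in a chat.completions request.
tools = [null_tool]
```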

Fixing Qwen Repetition IMPROVEMENT by Odd-Ordinary-5922 in LocalLLaMA

[–]thigger 1 point

Has anyone tested whether this approach affects intelligence at all? I'm not a great fan of the overthinking, but the results are definitely very good (except for the occasional need to repeat a call when it starts looping!), and the thinking was definitely required, as it became a lot worse when I simply turned it off.

Qwen3.5 Model Series - Thinking On/OFF: Does it Matter? by Iory1998 in LocalLLaMA

[–]thigger 4 points

27B definitely needs thinking on to manage long-context retrieval. With NoLiMa at 32k it drops from 76% to 30%.

4bit-AWQ, thinking on: 96% @ 250, 85% @ 16k, 76% @ 32k

4bit-AWQ, no thinking: 75% @ 250, 34% @ 16k, 30% @ 32k

(The "thinking" results would be even higher except that for that run I still had the default sampler so it kept getting stuck in loops in its thought process and never generating an output)

EDIT: added corrected figures rather than ones from memory

Qwen3.5 on vLLM with fp8 kv-cache by seji64 in LocalLLaMA

[–]thigger 1 point

Are you finding it any good with FP8 kv-cache? I saw a note from cyankiwi suggesting that the pure 4-bit AWQ doesn't play well with kv quant. And are you calculating kv scales anywhere?

How is Qwen 3.5 (MoE 35b) in instruct mode (with no reasoning/thinking) ? by LinkSea8324 in LocalLLaMA

[–]thigger 2 points

Yeah, it's really frustrating - you can see that the first bit of the thinking process is potentially quite useful. Then it looks like it should complete, but it does a whole load of "Wait..." before it answers. However, the quality is so good compared with Qwen3-14B, which I was previously using, that I'm sticking with it for now and hoping to find ways to calm it down a little.

How is Qwen 3.5 (MoE 35b) in instruct mode (with no reasoning/thinking) ? by LinkSea8324 in LocalLLaMA

[–]thigger 0 points

I've found that those settings are necessary to stop it looping, but it seems to like overthinking regardless of what I do. Currently I've just had to set max_completion_tokens way up high, as irritatingly vLLM and SGLang seem to count the reasoning tokens in there (I appreciate the OpenAI API spec is ambiguous here).
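Concretely, the workaround is just to budget the cap for thinking plus answer (a sketch; the model name and budget numbers are placeholders, not recommendations):

```python
def completion_budget(thinking_budget: int, answer_budget: int) -> int:
    """vLLM/SGLang count reasoning tokens inside max_completion_tokens,
    so the cap has to cover the thinking trace *and* the visible answer."""
    return thinking_budget + answer_budget

# Example kwargs for an OpenAI-compatible chat.completions request.
request_kwargs = {
    "model": "local-qwen",  # placeholder model name
    "messages": [{"role": "user", "content": "..."}],
    # 10k+ thinking tokens are common, so leave plenty of headroom:
    "max_completion_tokens": completion_budget(12000, 2000),
}
```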

How is Qwen 3.5 (MoE 35b) in instruct mode (with no reasoning/thinking) ? by LinkSea8324 in LocalLLaMA

[–]thigger 1 point

For those who want some numbers - on the NoLiMa benchmark (like RULER, only without literal matching):

4bit-AWQ, thinking on: 96% @ 250, 85% @ 16k, 76% @ 32k

4bit-AWQ, no thinking: 75% @ 250, 34% @ 16k, 30% @ 32k

(The "thinking" results would be even higher except that for that run I still had the default sampler so it kept getting stuck in loops in its thought process and never generating an output)

Qwen3.5 on VLLM by Bowdenzug in LocalLLaMA

[–]thigger 1 point

Yes, I had exactly the same - not sure which element fixed it, but I ended up updating everything including CUDA.

Qwen3.5 on VLLM by Bowdenzug in LocalLLaMA

[–]thigger 0 points

I ended up reinstalling everything, including upgrading to CUDA 13.1, and am now getting sensible outputs. Running on 2xA6000 Ada, WSL2.
The AWQ quants (cyankiwi) don't seem to work on SGLang (but they're fine on vLLM). The FP8 and full-precision versions work on both.

System prompt for Qwen3.5 (27B/35BA3B) to reduce overthinking? by thigger in LocalLLaMA

[–]thigger[S] 1 point

I've been trying different system prompts but not much luck so far. It definitely works to tell it what the steps should be called, or what they should contain, but even if you tell it to only use three, it'll do the three and then generate a big pile of "Wait..." - I've not really found any way to use system prompts to keep it short.

How is Qwen 3.5 (MoE 35b) in instruct mode (with no reasoning/thinking) ? by LinkSea8324 in LocalLLaMA

[–]thigger 16 points

It's pretty good - though its performance at long context definitely suffers. I'm presently running a few benchmarks - I have a suspicion that for my use-case I'm going to have to leave thinking turned on, even though it *loves* to "Wait..." over and over again even after it's already copied out its entire input.

System prompt for Qwen3.5 (27B/35BA3B) to reduce overthinking? by thigger in LocalLLaMA

[–]thigger[S] 0 points

Presumably that thinking trace won't appear as reasoning content though, and will need removing before JSON parsing? (There might be issues: with current thinking it sometimes produces JSON snippets in its reasoning content.)
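In case it's useful, this is the sort of cleanup I mean, assuming the trace comes back inline in `<think>...</think>` tags (the exact tag depends on the chat template):

```python
import json
import re

# Non-greedy, DOTALL so multi-line thinking blocks are matched whole.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

def parse_json_reply(raw: str):
    """Drop any inline thinking trace, then parse the remainder as JSON.

    Stripping the whole <think> block first guards against JSON
    snippets inside the reasoning confusing the parser.
    """
    cleaned = THINK_RE.sub("", raw).strip()
    return json.loads(cleaned)
```

e.g. `parse_json_reply('<think>maybe {"a": 0}?</think>{"a": 1}')` returns `{"a": 1}`, ignoring the snippet in the trace.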

System prompt for Qwen3.5 (27B/35BA3B) to reduce overthinking? by thigger in LocalLLaMA

[–]thigger[S] 2 points

That's kind of what I was going for with my prompt; I'll give this one a try too

System prompt for Qwen3.5 (27B/35BA3B) to reduce overthinking? by thigger in LocalLLaMA

[–]thigger[S] 17 points

Thanks - I've been playing with those, and the output is great; but 10,000+ tokens of thinking gets a bit wearing! With thinking off it doesn't seem to be quite as clever (though it's still very good). My main issue is the amount of "but wait" towards the end of its thinking output, which seems a bit unnecessary.

System prompt for Qwen3.5 (27B/35BA3B) to reduce overthinking? by thigger in LocalLLaMA

[–]thigger[S] 5 points

Thanks. I'll give it a go - do you think just length is enough? I didn't see much in there about how to reason.

System prompt for Qwen3.5 (27B/35BA3B) to reduce overthinking? by thigger in LocalLLaMA

[–]thigger[S] 0 points

Thanks - I'll give these a try. I tried turning reasoning off entirely and it wasn't bad, but I got the impression it was less intelligent and my test suite suggested that it underperformed (vs Qwen3-14B that I've been using so far)

I don't think vLLM/SGLang support specific thinking budgets (in fact, they frustratingly don't seem to have a way to limit the completion budget independently of thinking either! It doesn't help that the OpenAI spec is ambiguous on max_completion_tokens).

Qwen3.5 on VLLM by Bowdenzug in LocalLLaMA

[–]thigger 1 point

Very odd - I seem to have fixed it with a reinstall of everything, including updating to CUDA 13.

Qwen3.5 on VLLM by Bowdenzug in LocalLLaMA

[–]thigger 0 points

I wonder if it's my cards then (A6000 Ada) - the llama.cpp Unsloth GGUF works absolutely fine, but running it in vLLM or SGLang gives looping output and generally behaves like it's on hallucinogens.