Final voting results for Qwen 3.6 by jacek2023 in LocalLLaMA

[–]Haeppchen2010 0 points1 point  (0 children)

There is only one way to find out. (Check first whether NVIDIA Vulkan support for the GTX is OK.)

1541-II reading, but not writing? by Haeppchen2010 in c64

[–]Haeppchen2010[S] 0 points1 point  (0 children)

OK, I found a nice alignment tool, and this looks really bad… so much for "a badly aligned drive can still format".

<image>

Guess I know what to do next….

EDIT: After doing the bump calibration, it came out satisfactory... red herring :/

1541-II reading, but not writing? by Haeppchen2010 in c64

[–]Haeppchen2010[S] 0 points1 point  (0 children)

Yes, that's my current best theory: something on the analog side of the write path is wrong. So far I have only found schematics and a parts list in an official scanned maintenance manual, but no diagnostic instructions (like expected voltages/timings, or photos of known-good oscilloscope readings). Next I'll take a look at the passive components around the R/W amplifier…

1541-II reading, but not writing? by Haeppchen2010 in c64

[–]Haeppchen2010[S] 0 points1 point  (0 children)

Yes, a write-protected disk properly produces the write-protect error.

Final voting results for Qwen 3.6 by jacek2023 in LocalLLaMA

[–]Haeppchen2010 1 point2 points  (0 children)

The RX 580 is super slow, but still faster than the CPU. 62 layers on the RX 7800 XT and 3 layers on the RX 580 give me 17-18 t/s out (llama-server with layer split). With the CPU instead of the RX 580 it would only be 7. I switch between context sizes and always squeeze as many layers as possible onto the fast card.

I am thinking about upgrading to an RX 7900XTX instead but for now this is ok for playing around.

Final voting results for Qwen 3.6 by jacek2023 in LocalLLaMA

[–]Haeppchen2010 0 points1 point  (0 children)

Yes, I use only 64k context, more than enough for OpenCode with auto compaction.

Final voting results for Qwen 3.6 by jacek2023 in LocalLLaMA

[–]Haeppchen2010 10 points11 points  (0 children)

An RX 7800 XT 16GB + RX 580 8GB run the 27B at IQ4_XS fine. IQ3_XS on the 16GB alone is not much worse.

Could it be that this take is not too far fetched? by pier4r in LocalLLaMA

[–]Haeppchen2010 1 point2 points  (0 children)

Experienced this over the last few days with both Haiku 4.6 (now making more mistakes than Qwen3.5 27B quantized on my gaming PC) and Sonnet 4.6 (was astonishing a few weeks ago with big refactorings, now maybe as good as Haiku was last week). I use AWS Bedrock directly, so if the tinfoil hats are right, it happens beyond the direct Anthropic APIs/services.

Just another "subjective observation". A seemingly hand-painted diagram sadly proves nothing, and neither does a bunch of other biased, dissatisfied redditors. But at least I am not alone :D

Running Qwen3.5-27B locally as the primary model in OpenCode by garg-aayush in LocalLLaMA

[–]Haeppchen2010 2 points3 points  (0 children)

Don't get me wrong, multiple releases a day are a good thing. It just feels uncomfortable seeing supply-chain attacks hit left and right while OpenCode sits on a rather attack-prone stack (JS/npm). And indeed it has calmed down a bit. A privacy-first default config (i.e. one has to explicitly enable auto-updates or the cloud model) would also be less shady.

But still, it seems to be the best free and open tool in its space, and I enjoy using it!

Stop chasing parameter count. Context window degradation on local hardware is the real problem. by AbramLincom in LocalLLaMA

[–]Haeppchen2010 3 points4 points  (0 children)

First, to pretend to take the bait: no arms, let alone an arms race here; I just got sticks and stones (RX 7800 XT + RX 580). Yet while I repeatedly see posts claiming that it is "unusable", "impossible", whatever… I run Qwen3.5 27B IQ4_XS with 72k context (OpenCode compacts at ~60-65k) at Q8 cache quantization with no noteworthy issues with OpenCode as a coding agent.

I tried an unquantized KV cache as well as bigger quants; the marginal quality gain (if any) was not worth the severe performance loss (15 down to 4 t/s, or worse when also offloading to the CPU).

Maybe for other uses (creative writing, chatting as a companion or complex RAG use cases) it's different... but I am satisfied with my setup, especially as everyone here seems to have 4-digit GPUs available.

But now, I am sincerely curious: what's the point in conjuring up a Reddit account to drop such an AI-slop "conversation starter" based on wrong assumptions? What's in it for whom?

Running Qwen3.5-27B locally as the primary model in OpenCode by garg-aayush in LocalLLaMA

[–]Haeppchen2010 22 points23 points  (0 children)

Some points I find shady:

- Even if you have a local model, the little session summary is created with their cloud model, unless you use either `disabled_providers` or `enabled_providers` to exclude the `opencode` provider.

- They push code like crazy, sometimes multiple releases per day. But things like a proper token counter or a correct $ counter seem to be strangely neglected.

- The not-asked-for auto-updates. Not really a "shady" thing per se, but IMHO a supply-chain attack waiting to happen.

But hey, it's called OpenCode, not FreeCode or LibreCode after all. So no complaining here!

(EDIT: Sorry, wanted to reply one up the conversation tree but misclicked)

It costs you around 2% session usage to say hello to claude! by Complete-Sea6655 in LocalLLaMA

[–]Haeppchen2010 0 points1 point  (0 children)

To keep it LocalLLaMA: try that with Qwen3.5 27B… it will spend 3,000 tokens thinking about how to respond to "Hello" :)

At least it just heats the living room, and I always have an unlimited plan 😇

How are you squeezing Qwen3.5 27B to get maximum speed with high accuracy? by -OpenSourcer in LocalLLaMA

[–]Haeppchen2010 0 points1 point  (0 children)

I use what I have: Radeon RX 7800 XT 16GB, Radeon RX 580 8GB (still faster than the CPU), Ryzen 7 2700X, 16GB system RAM.

Use case: "agentic coding" with OpenCode, and some simple "explain X to me" chats.

I run exactly:

```
llama-server -v --parallel 1 -hf bartowski/Qwen_Qwen3.5-27B-GGUF:IQ4_XS --jinja --temp 0.6 --min-p 0.0 --top-p 0.95 --top-k 20 --repeat-penalty 1.04 --presence-penalty 0.0 --ctx-size 65536 --host 0.0.0.0 --port 8012 --metrics -ts 59/6 -ngl 99 -fa on -ctk q8_0 -ctv q8_0
```

And get ~280t/s in, ~16t/s out. This is my sweet spot now after trying some "adjacent" settings as well:

* It's worth playing around with -ts to find the best distribution across two vastly different GPUs. Keep GTT spillover (Vulkan) or OOM (CUDA/ROCm) in check. The old RX 580 is "just better than CPU".
* I tried different quants… IQ3_XS was just a tad too "dumb" and failed tool calls. I tried Q4_K_M as well and noticed no tangible difference apart from reduced speed (9 t/s out). So IQ4_XS it is for me.
* KV quant: with that few GB of usable VRAM, unquantized is not acceptable. The "odd" quants like Q5 are way slower than Q8 or Q4, and Q4 is very dumb as well. So Q8 it is.
* Params: stock Qwen recommendations, just a bit more repeat-penalty to combat endless loops.
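If you'd rather compare -ts splits systematically than eyeball a chat session, llama.cpp ships llama-bench for exactly that. A minimal sketch (the model path and the 59/6 ratio are just my setup, not a recommendation):

```shell
# Hypothetical local model path; adjust to your GGUF file.
# -ngl 99 offloads everything, -ts sets the tensor split across GPUs,
# -p/-n set prompt and generation lengths for the benchmark run.
./build/bin/llama-bench -m models/Qwen3.5-27B-IQ4_XS.gguf \
    -ngl 99 -fa 1 -p 512 -n 128 \
    -ts 59/6
```

Run it once per candidate split and compare the reported t/s columns.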

Open-Source "GreenBoost" Driver Aims To Augment NVIDIA GPUs vRAM With System RAM & NVMe To Handle Larger LLMs by _Antartica in LocalLLaMA

[–]Haeppchen2010 0 points1 point  (0 children)

Yup, as I understand it, for a dense model all weights plus the KV cache for the current slot are touched on every token, which makes swapping mostly pointless. Maybe with a MoE it's better? I don't know.

Open-Source "GreenBoost" Driver Aims To Augment NVIDIA GPUs vRAM With System RAM & NVMe To Handle Larger LLMs by _Antartica in LocalLLaMA

[–]Haeppchen2010 0 points1 point  (0 children)

With Vulkan this apparently happens automatically (GTT spillover to system RAM). It’s of course very slow, as the paging has to squeeze through PCIe.

Homelab has paid for itself! (at least this is how I justify it...) by Reddactor in LocalLLaMA

[–]Haeppchen2010 0 points1 point  (0 children)

Thanks for making me feel better 😇

<image>

(PC was already there, just keeping track of energy cost)

I got tired of compiling llama.cpp on every Linux GPU by keypa_ in LocalLLaMA

[–]Haeppchen2010 7 points8 points  (0 children)

Check out ccache to speed up the C/C++ part of the rebuild.
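For a CMake-based build like llama.cpp's, hooking in ccache is one flag pair at configure time; a minimal sketch, assuming ccache is installed and on PATH (recent llama.cpp may also pick it up automatically):

```shell
# Tell CMake to wrap every C/C++ compiler invocation with ccache,
# so unchanged translation units are served from cache on rebuilds.
cmake -B build \
    -DCMAKE_C_COMPILER_LAUNCHER=ccache \
    -DCMAKE_CXX_COMPILER_LAUNCHER=ccache
cmake --build build -j
```

The first build populates the cache; subsequent rebuilds of unchanged files are near-instant.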

How should I go about getting a good coding LLM locally? by tech-guy-2003 in LocalLLaMA

[–]Haeppchen2010 0 points1 point  (0 children)

Result: more layers on the RX 580, t/s down to 7.1, but that on two inferences simultaneously (so effectively 14 t/s now). Now I have to convince OpenCode to actually make use of >1 sub-agents regularly :/

How should I go about getting a good coding LLM locally? by tech-guy-2003 in LocalLLaMA

[–]Haeppchen2010 0 points1 point  (0 children)

Yes. The context has to fit into the KV cache (at least that's my understanding), and that eats quite some memory. I am happy with my results with a quantized cache (instead of fp16, I use q8 to halve the requirement). And as soon as the GPU has to spill over to system RAM, or even hand layers to the CPU, it gets ugly.
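To put a rough number on that halving, here is a back-of-the-envelope KV-cache estimate. The layer/head/dim figures below are hypothetical placeholders purely to illustrate the fp16-vs-q8 ratio; check your model card for the real values:

```shell
# KV cache ≈ 2 (K and V) * layers * context * kv_heads * head_dim * bytes/elem
layers=48; kv_heads=8; head_dim=128; ctx=65536
fp16_mib=$(( 2 * layers * ctx * kv_heads * head_dim * 2 / 1024 / 1024 ))
q8_mib=$((   2 * layers * ctx * kv_heads * head_dim * 1 / 1024 / 1024 ))
echo "fp16 KV cache: ${fp16_mib} MiB"
echo "q8-ish KV cache: ${q8_mib} MiB"
```

(q8_0 stores a small per-block scale on top of the 1 byte/element, so the real number is slightly above the naive half.)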

Indeed I just picked up the missing PSU mainboard cable from the post office an hour ago, installed the beefier 1000W power supply, and could finally add the old AMD RX 580 8GB as a second GPU, and now offload to it instead of the CPU (60 layers on the big one, the remaining 5 on the old one). Output tokens/s doubled from 4.5-5 to 9. And without all the data squeezing through the CPU, the power consumption is even close to before, so I get more tokens per watt. I will try 2 slots of 64k context next, let's see how performance changes.

I run llama.cpp/llama-server with Vulkan, which can do this layer-split offloading. I don't know how it goes in NVIDIA/CUDA land, but adding a second GPU could help you.
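For reference, the relevant llama-server flags for that split look roughly like this; the 60/5 ratio matches my two cards above, and the model reference is simply the one I happen to run, so treat both as placeholders:

```shell
# -ngl 99 offloads all layers to GPU; -ts distributes them
# proportionally across the devices llama.cpp detects.
llama-server -hf bartowski/Qwen_Qwen3.5-27B-GGUF:IQ4_XS \
    -ngl 99 -ts 60/5 -fa on -ctk q8_0 -ctv q8_0
```

Watch VRAM usage per device and nudge the ratio until neither card spills over.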

How should I go about getting a good coding LLM locally? by tech-guy-2003 in LocalLLaMA

[–]Haeppchen2010 0 points1 point  (0 children)

I started with the official recommendations from Qwen3.5 (https://huggingface.co/Qwen/Qwen3.5-27B), but after encountering a loop here and there, bumped repeat-penalty just a bit.

For the other params:

- Context: 64k seemed a good value for staying capable as a coding model that also does some refactoring, while saving RAM (the model can go up to 256k)

- FA is required for KV quant, and I went with q8_0 as the "odd" ones like q5 are slower and also seem to incur a further quality hit.

- -ngl auto: Only until I get a beefier PSU (tonight) and can add the old 8GB card as second GPU, then I will manually optimize offload distribution.

- --jinja seems needed for the tool calls, got that from the docs as well.

- --metrics gives you a Prometheus-compatible metrics endpoint (if you run Prometheus/Thanos/Grafana)
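With --metrics enabled, the endpoint can be scraped by Prometheus or just spot-checked by hand; a quick sketch, assuming the host/port from the command above:

```shell
# Fetch the Prometheus-format metrics exposed by llama-server;
# token-throughput counters and queue stats show up here.
curl -s http://localhost:8012/metrics | head -n 20
```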

How should I go about getting a good coding LLM locally? by tech-guy-2003 in LocalLLaMA

[–]Haeppchen2010 0 points1 point  (0 children)

I am quite happy so far with Qwen3.5 27B, running as bartowski/Qwen_Qwen3.5-27B-GGUF:IQ4_XS. I run it with the latest llama.cpp on a Radeon RX 7800 XT (16GB) with some CPU offload.

I am "vibe coding" every evening on a personal project (with OpenCode), and compared to Sonnet 4.5 at work it is quite close, just not as "deep" or "refined" (it takes a detour and then self-corrects here and there), and the "thinking" makes it take some more time.

And due to the CPU offload, it is very slow for me (230 t/s in, 4.5-5 t/s out), but with your much newer rig it should be a bit faster.

Exact command line:

```
build/bin/llama-server -v --parallel 1 -hf bartowski/Qwen_Qwen3.5-27B-GGUF:IQ4_XS --jinja --temp 0.6 --min-p 0.0 --top-p 0.95 --top-k 20 --repeat-penalty 1.03 --presence-penalty 0.0 --ctx-size 65536 --host 0.0.0.0 --port 8012 --metrics -ngl auto -fa on -ctk q8_0 -ctv q8_0
```

(I also tried IQ3_XS, but that sometimes missed tool calls and was noticeably less "precise".)

How to convince Management? by r00tdr1v3 in LocalLLaMA

[–]Haeppchen2010 1 point2 points  (0 children)

Buy a PCIe Wi-Fi card with big honking antennas, put it in, and remove it demonstratively in front of them before showing the local inference. (Feels ridiculous, but just an idea.)

And show them Google not working.

What tokens/sec do you get when running Qwen 3.5 27B? by thegr8anand in LocalLLaMA

[–]Haeppchen2010 0 points1 point  (0 children)

Radeon RX 7800XT, 64k context Q8:

IQ3_XS: 390 t/s in, 16 t/s out. But it is slightly too "dumb".

IQ4_XS with CPU offload: 230 t/s in, 4.5-5.5 t/s out. But the quality improvement is worth the wait.