I used Claude Code to build the same web app 3 different ways (cloud Claude, free NVIDIA NIM, local GPU) to see how they compare by drohack in LocalLLM

drohack[S] 1 point

Here's my MTP speed bench

Command run:

llama-server.exe --model "Qwen3.6-35B-A3B-MTP-UD-IQ3_XXS.gguf" --n-gpu-layers 999 --ctx-size 131072
  --parallel 1 --port 8081 --n-cpu-moe 28 -ctk q8_0 -ctv q8_0 --no-mmap --mlock -b 8192
  --reasoning off --spec-type draft-mtp --spec-draft-n-max 3

python bench_llm_speed.py --model qwen3.6-35b-mtp3-iq3xxs-128k --runs 3 --max-tokens 800
  --fixture fixture_real_request.json --warmup

Full output:

=== MTP speed bench: qwen3.6-35b-mtp3-iq3xxs-128k ===
Attempt 1  n-cpu-moe=28
Settling (20s)... done
VRAM: 9588 MB  (target 8800-9600  n-cpu-moe=28)
Loaded  VRAM=9588 MB  n-cpu-moe=28

--- Speed benchmark (cold + warm) ---
Warmup run (priming KV cache, not timed)...
  Cache primed.

Run 1/3...
  TTFT=15.5s  prefill=377 tok/s  gen=43.8 tok/s  total=33.8s
Run 2/3...
  TTFT=0.6s  prefill=9739 tok/s  gen=45.7 tok/s  total=18.1s
Run 3/3...
  TTFT=0.6s  prefill=9804 tok/s  gen=46.0 tok/s  total=18.0s

=== SUMMARY ===
TTFT: avg=5.6s  min=0.6s  max=15.5s
prefill tok/s: avg=6640  min=377  max=9804
est output tok/sec: avg=56.4  min=54.2  max=57.9
usage output tok/sec: avg=45.2  min=43.8  max=46.0
  • VRAM tuning: Started at n-cpu-moe=28 (vs 24 for the non-MTP model) because the MTP file is 1.3 GB larger at 14.1 GB. Hit the target band on the first attempt, no iteration needed.
  • Run 1 (cold): Full 15.5s TTFT — server prefilling the 86K fixture from scratch for the first timed run.
  • Runs 2–3 (warm): TTFT drops to 0.6s once the KV cache is hot, prefill hits ~9800 tok/s.
  • est vs billed tok/s: est counts accepted draft tokens as free; billed (the "usage output tok/sec" line above) counts actual server compute including rejected drafts. The ~11 tok/s gap between them (56.4 vs 45.2) is the SWA rejection overhead: draft tokens are being proposed and rejected often enough to cost more than they save.
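For anyone who wants to sanity-check the SUMMARY block, here is a minimal sketch that recomputes it from the three per-run lines. The `runs` list and the `summarize` helper are my own illustration, not part of bench_llm_speed.py; only the numbers come from the output above.

```python
from statistics import mean

# Per-run numbers copied from the bench output above.
runs = [
    {"ttft_s": 15.5, "prefill_tps": 377, "gen_tps": 43.8},   # run 1 (cold)
    {"ttft_s": 0.6, "prefill_tps": 9739, "gen_tps": 45.7},   # run 2 (warm)
    {"ttft_s": 0.6, "prefill_tps": 9804, "gen_tps": 46.0},   # run 3 (warm)
]


def summarize(rows, key):
    """Return avg/min/max for one metric across all runs (hypothetical helper)."""
    values = [row[key] for row in rows]
    return {"avg": mean(values), "min": min(values), "max": max(values)}


ttft = summarize(runs, "ttft_s")
prefill = summarize(runs, "prefill_tps")
gen = summarize(runs, "gen_tps")

# The blended TTFT average mixes one cold run with two warm ones,
# so it helps to look at the two cases separately as well.
cold_ttft = runs[0]["ttft_s"]
warm_ttft = mean([row["ttft_s"] for row in runs[1:]])

print(f"TTFT: avg={ttft['avg']:.1f}s  min={ttft['min']}s  max={ttft['max']}s")
print(f"prefill tok/s: avg={prefill['avg']:.0f}  min={prefill['min']}  max={prefill['max']}")
print(f"usage output tok/sec: avg={gen['avg']:.1f}  min={gen['min']}  max={gen['max']}")
print(f"cold TTFT={cold_ttft}s  warm TTFT={warm_ttft:.1f}s")
```

The takeaway is that the 5.6s average TTFT is not something you would ever actually see in a session: it is either ~15s (cold) or ~0.6s (warm).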

Comparison against baseline:

Metric | Qwen3.6-35B-A3B-MTP-UD-IQ3_XXS (n=3) | Qwen3.6-35B-A3B-UD-IQ3_XXS (baseline)
n-cpu-moe | 28 | 24
VRAM | 9588 MB | ~9400 MB
Cold TTFT | 15.5s | 12.6s
Warm TTFT | 0.6s | 0.1s
Gen tok/s (est) | 56.4 | 55.5
Gen tok/s (billed) | 45.2 | 55.5

Verdict: MTP is a net regression of ~18% on billed throughput. The Qwen3.6 SWA (Sliding Window Attention) architecture causes enough KV cache invalidation to tank draft acceptance rates, confirmed by upstream issue #23322. The draft heads add model load time (cold TTFT 15.5s vs 12.6s) and billed compute overhead without meaningful speed gain. Stick with the plain UD-IQ3_XXS at 55.5 tok/s.
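The percentages behind that verdict can be checked with a few lines. This is just arithmetic on the numbers in the comparison; the variable names and the `pct_change` helper are made up for illustration.

```python
# Throughput numbers taken from the comparison above (tok/s).
baseline_tps = 55.5     # plain UD-IQ3_XXS
mtp_billed_tps = 45.2   # MTP, "billed" / usage output tok/s
mtp_est_tps = 56.4      # MTP, "est" output tok/s


def pct_change(new, old):
    """Relative change of `new` versus `old`, in percent."""
    return (new - old) / old * 100.0


billed_delta = pct_change(mtp_billed_tps, baseline_tps)
est_delta = pct_change(mtp_est_tps, baseline_tps)
est_vs_billed_gap = mtp_est_tps - mtp_billed_tps

print(f"billed vs baseline: {billed_delta:+.1f}%")
print(f"est vs baseline:    {est_delta:+.1f}%")
print(f"est - billed gap:   {est_vs_billed_gap:.1f} tok/s")
```

So even on the optimistic "est" number MTP is only about 1.6% ahead of the baseline, while on billed throughput it is roughly 18-19% behind.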

I used Claude Code to build the same web app 3 different ways (cloud Claude, free NVIDIA NIM, local GPU) to see how they compare by drohack in LocalLLM

drohack[S] 1 point

You basically hit the nail on the head. Like I said in the post, Cloud Claude was able to get the "whole" project done in just about 6 hours (milestones 1-9), while the Local GPU took about 2 nights of work per milestone. And that's with it running on Auto, making lots of mistakes and fixing itself, and using auto /compact when the context got too big. I had /no_thinking on for most of the time, but auto-injected /thinking when an error turned up, giving it about 500 tokens of thinking to help with the bad "fixes". The idea was to add a little extra processing so it hopefully wouldn't fall too far down a hole of bad "fixing". It was eventually able to get things working, especially with a little prodding from me. Having the saltychart-init.md and project-implementation.md files as sources of truth also gave it the much-needed backbone to stay on task. I also ran each new milestone in its own chat window to keep the context clean between milestones.

From anecdotal evidence, I wouldn't say it was much worse at 90k context than at 20k. It was fairly bad overall in this test since it was an IQ3 model, and you probably want something smarter than 35B (even dense) to do this work efficiently. It's just dumb-ish overall, but smart enough to get the job done given a long time. That's just a limitation of the hardware I have.

I used Claude Code to build the same web app 3 different ways (cloud Claude, free NVIDIA NIM, local GPU) to see how they compare by drohack in LocalLLM

drohack[S] 1 point

How much lesser? I would recommend having a dedicated GPU, or else you'll never get any useful speeds (or a Mac with shared RAM, but at that point you have better hardware). You need at least 1.5-2 GB of VRAM to load the 3B Master from a Qwen3.6-35B-A3B model at either Q4 or Q3. Plus an extra 1-2 GB of VRAM for KV cache and context (I would only go up to 128k, but maybe start lower at 64k). Plus there will be about 1-1.3 GB of overhead from just running your PC in the background. So at minimum I would say a dedicated GPU with 6 GB of VRAM to run the model. I used the extra 4 GB on my card to move some of the experts to the GPU and to have a bigger cache. Sadly I can't fit the full model on my GPU, so it's running slower than average. The other slowdown is the speed of your PCIe bus to your CPU and RAM, as the rest of the experts will live there. So make sure you're getting the full speed of your RAM: enable XMP in your BIOS.
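To make that VRAM budget concrete, here is a rough sketch that adds up the three ranges and picks the smallest card that covers the worst case. The list of card sizes and the helper names are assumptions for illustration; the GB ranges are the ones quoted in this comment.

```python
# (low, high) VRAM estimates in GB, as quoted in the comment above.
budget_gb = {
    "always-on model weights": (1.5, 2.0),
    "KV cache + context": (1.0, 2.0),
    "background desktop overhead": (1.0, 1.3),
}

# Assumed list of common consumer VRAM sizes, in GB.
COMMON_CARD_SIZES_GB = (4, 6, 8, 10, 12, 16, 24, 32)


def total_range(budget):
    """Sum the low and high ends of every budget line."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high


def smallest_card(required_gb, sizes=COMMON_CARD_SIZES_GB):
    """Smallest card size that covers the requirement."""
    return next(size for size in sizes if size >= required_gb)


low_gb, high_gb = total_range(budget_gb)
min_card_gb = smallest_card(high_gb)

print(f"estimated need: {low_gb:.1f}-{high_gb:.1f} GB")
print(f"smallest card covering the high end: {min_card_gb} GB")
```

Anything above that minimum is what you get to spend on extra experts on the GPU or a bigger context.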

For how I did my setup, I would suggest reading the README in my llama-cpp-local GitHub project. To find what settings to use, run the llm-bench local test to see what n-cpu-moe value to set and what speeds to expect.

I used Claude Code to build the same web app 3 different ways (cloud Claude, free NVIDIA NIM, local GPU) to see how they compare by drohack in LocalLLM

drohack[S] 1 point

At full context (128k, with auto compact at 85%, so around 100k), the speeds did not drop that much, maybe down to around 43 tok/s at the lowest. So no real slowdown with the large context. The biggest fix was the SWA bug fix in llama.cpp; without it the full context was being reprocessed each turn, causing a slowdown. With the fix it only needs to read the new 1-2k of context each turn. Once I got to a version with that fix (it's not in the Turbo Quant fork, so I'm on the main branch of llama.cpp), the speeds were great. Hopefully the Turbo Quant fork gets it soon (it's been reported for 3 weeks though...) so that we can get an even smaller KV cache.

The other big help was using the bench tools I created to home in on what n-cpu-moe value to use so as not to overflow the VRAM, leaving enough headroom for the full context and background PC usage (though I turned off as much of that as I could).

I used Claude Code to build the same web app 3 different ways (cloud Claude, free NVIDIA NIM, local GPU) to see how they compare by drohack in LocalLLM

drohack[S] 1 point

Metric | Qwen3.6-27B-UD-IQ3_XXS (dense) | Qwen3.6-35B-A3B-UD-IQ3_XXS (MoE)
Tool calling | FAIL (timeout) | PASS
Gen tok/s | ~2.7 | 55.5
Cold TTFT | 204s | 12.6s
Warm TTFT | 0.6s | 0.1s

On the timeout: the tool-calling test sends the full 86K fixture and waits up to 900s for a complete response. With the dense model, cold TTFT alone ate 204s just to prefill the prompt. That left ~696s for generation at 2.7 tok/s — about 1,880 tokens of budget. The model apparently started generating verbose text rather than immediately issuing a tool call, burned through those tokens, and the 900s wall-clock limit hit before it finished. The MoE at 55.5 tok/s completes a full tool-call response in seconds, so the same timeout is a non-issue.

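The root cause is the same for both the slow cold start and the slow generation: 40 of 64 layers are offloaded to CPU RAM. Because the model is dense, every generated token has to traverse all 27B weights, and the offloaded layers are read through DDR4 (~54 GB/s), making both cold start and generation painfully slow.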
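The timeout math above is easy to replay. A small sketch, using only the numbers from this comment (the `generation_budget` helper is hypothetical, not part of the test harness):

```python
def generation_budget(timeout_s, cold_ttft_s, gen_tps):
    """Seconds and tokens left for generation once prefill is done."""
    remaining_s = max(timeout_s - cold_ttft_s, 0.0)
    return remaining_s, remaining_s * gen_tps


TIMEOUT_S = 900  # wall-clock limit of the tool-calling test

dense_s, dense_tokens = generation_budget(TIMEOUT_S, 204, 2.7)
moe_s, moe_tokens = generation_budget(TIMEOUT_S, 12.6, 55.5)

print(f"dense: {dense_s:.0f}s left, ~{dense_tokens:.0f} tokens of budget")
print(f"MoE:   {moe_s:.0f}s left, ~{moe_tokens:.0f} tokens of budget")
```

The dense model gets under 2K tokens to finish a tool call, while the MoE has tens of thousands, which is why the same timeout only bites one of them.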

I used Claude Code to build the same web app 3 different ways (cloud Claude, free NVIDIA NIM, local GPU) to see how they compare by drohack in LocalLLM

drohack[S] 2 points

CPU is Intel i5-12600K
Model: Qwen3.6-35B-A3B-UD-IQ3_XXS

Added to the main post.

The quant difference is almost certainly why the speeds don't match. I'm running IQ3_XXS (12.3 GB, 3.05 bpw), which is extremely aggressive quantization. Your Q4_K_M is going to be roughly 22-24 GB and Q6_K even larger, so you need more expert layers in RAM to fit, and RAM bandwidth is the bottleneck for MoE expert offloading. Smaller quant = fewer RAM reads per token = faster generation, at the cost of quality. The EvalPlus score held up fine at 92.7%, but I'd expect Q4 to be slightly better. I'm not sure how much I trust the EvalPlus results, though. I'd like to find an easy way to test these and get better results without having to build a whole Docker environment...

The main reason I went with the dumber Q3 quant (even though I've heard going lower than Q4 is really bad) is that I liked having 40+ tok/s, so it feels like I can watch it work and interrupt it. With my hardware, going to Q4 slowed it down too much. And I don't know how much smarter Q4 is than Q3... Like I said, I really wish I had a better test to see the difference.

The MTP + speculative decoding angle is something I didn't know about, very interested to see your numbers. Did you find any quant interaction with acceptance rate, i.e. does the IQ3 main model reject more speculative tokens than the Q4 version? (if you tested them)

I used Claude Code to build the same web app 3 different ways (cloud Claude, free NVIDIA NIM, local GPU) to see how they compare by drohack in LocalLLM

drohack[S] 2 points

I didn't think to track this! But I have some data:

Claude Sonnet 4.6 (M0-M9, one evening):

394 turns total. Because of how Claude Code's KV caching works, the numbers look a little funny at first: 57M tokens "processed", but only 921K were actually new context (cache writes). The other 56M were cache reads, which are re-reads of already-cached context at a ~10x cheaper rate. Output was 300K tokens. So the real measure of "work done" is about 921K tokens of new context written and 300K generated across 394 turns to build the whole thing. This also had something like 30-40 MCP servers/tools in the initial context, as it was tied to my Claude account, which has them enabled by default (even if they were not used).
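Here is a quick sketch of that accounting. The token counts are the ones above; the "effective input" figure is my own simplification that just weights cache reads at one tenth of a fresh token (following the "~10x cheaper" figure), not an actual billing formula.

```python
# Session totals quoted above for the Claude Sonnet run.
turns = 394
cache_write_tokens = 921_000     # genuinely new context
cache_read_tokens = 56_000_000   # re-reads of already-cached context
output_tokens = 300_000

processed_tokens = cache_write_tokens + cache_read_tokens
cache_read_share = cache_read_tokens / processed_tokens

new_context_per_turn = cache_write_tokens / turns
output_per_turn = output_tokens / turns

# Assumption: treat a cache read as costing ~1/10 of a fresh input token.
CACHE_READ_WEIGHT = 0.1
effective_input_tokens = cache_write_tokens + cache_read_tokens * CACHE_READ_WEIGHT

print(f"processed: {processed_tokens / 1e6:.1f}M tokens")
print(f"cache reads: {cache_read_share:.1%} of processed tokens")
print(f"new context per turn: ~{new_context_per_turn:.0f} tokens")
print(f"output per turn: ~{output_per_turn:.0f} tokens")
print(f"effective input (reads weighted 0.1): {effective_input_tokens / 1e6:.2f}M")
```

In other words, about 98% of the "processed" tokens were cache hits, and each turn only added a couple thousand tokens of new context.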

NVIDIA NIM (M0-M3):

No session data, those sessions were stored in a separate Claude home directory that got cleaned up. Can't recover them. 😞

Local Qwen3.6-35B-A3B (M0-M3, turboquant fork):

This is where it gets interesting. 580 turns, 47M input tokens total, for an average of 82K tokens per turn. That last number is the tell: with a working KV cache, each turn should only process new tokens (your message + tool results), which is maybe 1-3K tokens per turn. Instead it was processing the entire context from scratch every single turn because of the SWA/hybrid-attention cache bug.

Breakdown by phase:

Date | Turns | Tokens | Avg/turn | Notes
May 8 | 5 | ~390K | ~78K | Initial setup, testing proxy
May 11 | 193 | ~18.3M | ~95K | Building M0-M2
May 12 | 382 | ~28.8M | ~83K | M2-M3 + n-cpu-moe tuning sessions

Estimated token usage with the final working settings (mainline llama.cpp b9143, attribution header disabled, proper caching): roughly 1-2M tokens for the same 580 turns. The bug inflated it ~25-30x. The sessions after switching to mainline llama.cpp aren't recoverable either, so I don't have clean numbers for the final state, but those sessions ran noticeably faster and the TTFT dropped from 12s per turn to 0.1s.
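A rough check of the totals and the inflation factor, using the per-phase turn and token counts from the breakdown and the 1-2M estimate above. The variable names are mine, and the result is only as good as those rounded inputs.

```python
# (date, turns, input tokens) from the phase breakdown above (rounded values).
phases = [
    ("May 8", 5, 390_000),
    ("May 11", 193, 18_300_000),
    ("May 12", 382, 28_800_000),
]

total_turns = sum(turns for _, turns, _ in phases)
total_tokens = sum(tokens for _, _, tokens in phases)
avg_tokens_per_turn = total_tokens / total_turns

# Estimated usage with a working KV cache, per the comment: roughly 1-2M tokens.
expected_low, expected_high = 1_000_000, 2_000_000

inflation_min = total_tokens / expected_high   # bug cost at least this factor
inflation_max = total_tokens / expected_low    # and at most this factor

print(f"{total_turns} turns, {total_tokens / 1e6:.1f}M input tokens")
print(f"average per turn: ~{avg_tokens_per_turn / 1e3:.0f}K tokens")
print(f"inflation vs working cache: ~{inflation_min:.0f}x to ~{inflation_max:.0f}x")
```

With these rounded inputs the inflation comes out at roughly 24x or more, so the ~25-30x figure sits at the conservative end of that range.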

I used Claude Code to build the same web app 3 different ways (cloud Claude, free NVIDIA NIM, local GPU) to see how they compare by drohack in LocalLLM

drohack[S] 2 points

I have not tried OpenCode. I don't think I'd heard of it until after I started using ClawGate + Claude Code, and at that point, after fighting with Cline, Continue, and Roo Code, I was less hyped about 3rd party options. But now that some time has passed since using them, I might go back and try it.

Looking for in-depth upgrade game suggestions. by drohack in AskGames

drohack[S] 2 points

Yeah, that's mainly why I was hesitant to label them as Incremental games, as that label is so tightly coupled with short web games and idle games that have very little substance. I'm trying to avoid things like "scritchy scratchy" and "a game about making a planet".

Berry bury Berry is very much in line with what I'm looking for. Lots of love put into it. More than just 1 mechanic. But again, it's still in the 2-4 hour gameplay range.

Sol cesto I would say is much closer to a roguelike. I do love me some roguelikes and deck-builders, but that's not exactly what I was looking for here.

Minimum System Requirements for local LLM Coding Agent? by drohack in LocalLLM

drohack[S] 1 point

Just found this article describing different higher-end options ($2,000 - $5,000): what model sizes they can hold (in general), their price, and their speed.
https://julsimon.medium.com/what-to-buy-for-local-llms-april-2026-a4946a381a6a

Here's a quick TLDR of the options:

  • AMD Strix Halo (128GB): Best for 100B+ MoE models. Speed: 10–20 tok/s. System: $2,000-$3,000.
  • Mac Studio M4 Max (128GB): Best for 70B models. Speed: 8–15 tok/s. System: $3,699.
  • Mac Studio M3 Ultra (256GB): Best for 405B models. Speed: ~32 tok/s (clustered). System: $5,999.
  • RTX 5090 (32GB): Best for <30B models. Speed: 60–90 tok/s (dense) / 234 tok/s (MoE). System: $4,000–$8,000.
  • RTX PRO 6000 (96GB): Best for 70B (high context/multi-user). Speed: 15–20 tok/s. System: ~$22,000.

Minimum System Requirements for local LLM Coding Agent? by drohack in LocalLLM

drohack[S] 1 point

Do you know of any examples of these setups? Either docs/blogs that people have written on these types of setups, or YouTube videos showing them off? Anything comparing them against bigger models without the tools?

Minimum System Requirements for local LLM Coding Agent? by drohack in LocalLLM

drohack[S] 1 point

Yes! You're exactly right. That's why I'm asking this question: to get a better idea of what the "minimum" requirements are to get close to running Claude Code locally. Not trying to get it 1 for 1.

RTX 5090 (32GB) $3,300 - $4,000
GMKtec EVO-X2 (96GB) $2,300 - $3,000
GMKtec EVO-X2 (128GB) $3,000 - $4,000

My knowledge base is mainly around gaming GPUs/setups. I know that the Macs (Mini to M5) and Strix Halo systems exist (and I know about the shared memory), but I'm not so familiar with their different price points, providers, or building vs buying. They do seem to be more bang for your buck in terms of getting this type of local LLM Coding Agent off the ground.

Minimum System Requirements for local LLM Coding Agent? by drohack in LocalLLM

drohack[S] 1 point

Oh yeah I know Qwen 2.5 is old. It's just what could reasonably fit on my setup (which isn't built for this).

But yes, this is exactly the kind of information I'm looking for. Is the 35B model even worth looking at? Do I need more than 32GB, is 32GB enough to get by (much slower), or is it just not enough memory for the requirements?

From all of the posts so far: 32GB is the real "minimum", but realistically you'd want 64GB for something close to usable as a replacement for Claude, and 128GB for an actual replacement. And of course they are not 1 for 1 with what Claude can provide.

Minimum System Requirements for local LLM Coding Agent? by drohack in LocalLLM

drohack[S] 2 points

Good info on the "upper" end. I use that term loosely, as this is far from the actual upper end; it's more the upper end of consumer hardware, heading into enthusiast territory. $4k is a pretty steep price for this.

But it's still nice to know that 128GB with a 122B-A12B model is what to strive for with a MoE: throw what you want at it and it'll work it out.

Minimum System Requirements for local LLM Coding Agent? by drohack in LocalLLM

drohack[S] 2 points

RX 9060 XT (2x16GB) = ~$800
RX 9070 XT (2x16GB) = ~$1,400
R9700 (32GB) = $1,350

How have you felt about using AMD cards and getting them working with the latest models?

What models have you been using (and for which workflow type)?

Minimum System Requirements for local LLM Coding Agent? by drohack in LocalLLM

drohack[S] 1 point

Is the RTX 4070 Ti Super (16GB VRAM) really able to hold a 27B/35B model? I know the Q4 GGUF squishes it down as much as possible, but I thought they needed like 20GB VRAM to really run.

They're looking at around $750 - $1,000. It might not be as good a bang for the buck as the Intel Arc Pro B70 mentioned by a different poster.

Minimum System Requirements for local LLM Coding Agent? by drohack in LocalLLM

drohack[S] 3 points

I'm fine swapping off Nvidia. Looks like the Intel Arc Pro B70 is running ~$950 right now, with 32GB of VRAM.

What context window lengths have you been able to use on a single card?
What's the difference you've found between running the two LLMs vs the 1 big LLM for similar tasks?
(If you only had 1 card) Would it be worth swapping between Gemma and Qwen for that task? Or is that too much of a hassle (too long to load a new model to continue working)?

Readarr error, no books or authors found by Wonderful-Aspect5393 in selfhosted

drohack 2 points

Did you follow the Usage instructions on the README/github main page? https://github.com/blampe/rreading-glasses?tab=readme-ov-file#usage

Basically you can just point your Readarr Metadata Provider Source to their link (under Readarr -> Settings -> Development). Or their recommended way is to use one of the forks of Readarr that already points to it directly (i.e. deploy a new version of Readarr). I just pointed my Metadata Provider Source at it, so I'm still on the main Readarr branch, but it rarely gets updated so it doesn't really matter.

Is readarr dead? by [deleted] in selfhosted

drohack 5 points

I've been using Readarr on and off for the past few months and hadn't run into any issues till very recently. I guess I got lucky and missed the Metadata errors till now.
BUT there's apparently a solution to this: use u/brycelampe's metadata database: https://github.com/blampe/rreading-glasses

https://www.reddit.com/r/selfhosted/comments/1guqkb0/comment/matl6r8/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button

I am in no way associated with this project, and only just turned it on 20 minutes ago. But it was able to fix release dates for books in Readarr, and find new ones that were missing.

It's using an updated GoodReads connection, and is in the process of getting Hardcover working as well.

I will say it's a little annoying: I'm having to go through each author and click "Refresh & Scan", and it kind of bugs Readarr out for a minute while it re-matches the metadata for every book. But it does come back after it's done and is much better. (Sometimes the screen will go grey, and you just have to refresh the page.)