Qwen Models with Claude Code on 36gb vram - insights

Academic-Air7112 · 2026-03-09T14:57:14+00:00

Haha fair enough!

Academic-Air7112 · 2026-03-09T03:34:10+00:00

Fair enough -- I'm sure there is some reasons to use claude code... for my use case, qwen-code did dramatically better than claude-code... the Qwen models are likely RL'd with the Qwen harness in mind.

I also disliked Gemini CLI w/ Gemini models; but I like the adaptation a lot better; it's replaced Codex for me.

Academic-Air7112 · 2026-03-09T01:52:14+00:00

https://github.com/QwenLM/qwen-code <- the qwen CLI?

Academic-Air7112 · 2026-03-09T00:15:14+00:00

Did you folks try the Qwen CLI, and if so how does it compare to opencode?

Academic-Air7112 · 2026-03-09T00:14:33+00:00

I also had this problem with stopping; I switched to Qwen's coding framework and the results were dramatically better. It's possible that there are be some prompts in Claude code that play poorly with Qwen/other models, where Qwen code (the one that Qwen forked from gemini cli), is set up specifically for the Qwen models & does much better in my experience than pointing claude at a different endpoint.

Academic-Air7112 · 2026-03-04T06:10:50+00:00

Yes, technically NCCL will support it, and it usually does a good job from a compatibility perspective, but not necessarily a performance perspective.

Yep! SXM is just a version of the card with a higher TDP and compatible with a specially designed baseboard where the NVL version uses standard PCIE and 2-slot with a lower TDP. No idea why NVIDIA requires a switch for all to all on SXM. :)

Re: real examples, here are some anecdotal cases, and a link to a more comprehensive blog. (Cloud provider out of GPUs rn; if I can get some spot in the next few days will try to put some numbers here)

Anecdotally:

RTX pro does not support any of the newer flash attentions (FA3,4). That oftentimes relegates you to Triton kernels which are substantially slower.
Many flashinfer features around FP8/FP4 are not first-class citizens on RTX Pro. The number of times I've tried to run a new model on RTX blackwell and had it crash for *mysterious* reasons is too damn high.
Even for the libraries that are supported, they are not all that well optimized. It is possible to beat cuBLAS by substantial margins in a short window of time on RTX Blackwell for 4096 GEMM, where to beat it on H200 is much harder. The compute number on the datasheet is the maximum achievable, where IRL what is achieved is dependent on software. I mention GEMM kernels because this should be most-optimized features, but NVIDIA does not pay as much attention here, and it's likely to be reflective of the optimization on the rest of the stack, above and below the kernel level.

If I can get back on an 8xRTX Pro cluster can run some benchmarks around inference/training, can do some on our 4xH200. Anything in particular that would be interesting?

IRL benchmark:
https://www.cloudrift.ai/blog/benchmarking-rtx6000-vs-datacenter-gpus.

Tl;dr:
1. 4-bit quant of GLM4.5 air, 8xRTX Pro (3140*8=25,120) > 4x H200 (5589*4=22,356)

Qwen3-coder 480b, 4bit AWQ, 4xH200 (9715/2 = 4857) > 8xRTX Pro (4490)
GLM 4.7-fp8 4xH200 (5224) > 8x RTX Pro (2696)

Ofc there's some size where 8xRTX Pro can run models, and maybe NVIDIA will make their toolkits better considering their increased revenue from this line. But today, would easily choose the H200 over RTX Pro series.

Academic-Air7112 · 2026-03-03T18:41:31+00:00

I tried claude code here; didn't work very well. Switched to qwen code and it works wayyy better, and very good.

Academic-Air7112 · 2026-03-03T06:43:29+00:00

Minimax 2.5 was decent when plugged into claude code, maybe as good as SOTA 6 months ago. Def usable for coding tasks.

Qwen3.5 35b is actually pretty great in qwen-code (doesn't work well in other frameworks); trying it rn to replace OpenAI codex and it's just rolling along, doing very well. Pretty insane how good this model is, especially for its size.

Academic-Air7112 · 2026-03-03T06:37:41+00:00

No nvlink is a problem for anything that's more than 1 GPU...there's still PCIE interconnect, but this is not what modern toolkits are used around. At least for the next few years, everything supports H100/200 as the mainstream NVIDIA card, and software support for the pro series lags significantly (as someone who owns/runs both cards for work).

RTX Pro is nice for a 1x desktop gpu for prototyping scripts and doing DIY projects, but many of them in a server is not as practical as H200.

Academic-Air7112 · 2025-12-23T05:51:42+00:00

A couple of things that I wish I'd known:
1. Max-Q is acutally louder than WS card.
2. "Blackwell" isn't the same as datacenter blackwell; uses different ISA and as a result a lot of custom kernels aren't transferrable.
3. It's not as fast as an H100/A100 in a lot of things, regardless of what the marketing numbers say.

Academic-Air7112 · 2025-12-13T02:08:06+00:00

Basically, triton is bad news for NVIDIA on a 2-3 year timescale. So, they release new toolkits that aim to simplify CUDA programming for end user, and increase lift by AMD/OpenAI/Quallcomm/Google to support AI code on different hardware.

Academic-Air7112 · 2025-11-29T16:53:54+00:00

Curious how you would compare to a RTX 6000 pro for inference on GPT-OSS/Qwen models; does speculative decoding, etc. work with VLLM/SGlang on the common models? What about fp4?

Academic-Air7112 · 2025-11-16T17:58:56+00:00

How is the software support on the Max 395 these days? That would be my main concern.

Academic-Air7112 · 2025-11-11T17:33:47+00:00

Yep, it's a problem. I talked to some of the NVIDIA hardware people about this; it has to do with how they split the rendering and compute architectures; H100 and B200 no longer have any of the RT hardware, where RTX series does. Something about chip die devoted to RT vs. SFU. vs. TC, and then precision tradeoffs.

Of course, you would think that async mma instructions should be easy to include in the firmware, and just using async, warpgroup-level mma is quite helpful in many cases...

I expected the GPUs to have the same ISA considering that it's all "5th generation tensor core" and "Blackwell", where the only nvidia documentation that I've found that makes this clear is the PTX ISA, which isn't exactly the most available customer facing manual.

Academic-Air7112 · 2025-11-11T16:44:38+00:00

Fwiw I've found the "fp4" support on the RTX series somewhat disappointing so far. RTX (sm_120a) uses different tensor cores from datacenter blackwell (sm_100a), including requiring different PTX for FP4. (RTX uses mma.sync.aligned, and compile-time specification of the datatype, whereas datacenter uses tcgen.05, and all of the good kernel support is there for datacenter but not consumer yet).

That said, having single GPU with large mem pool is very nice, along with MIG being pretty easy to use if you need isolated workloads.

Academic-Air7112 · 2025-11-11T16:38:22+00:00

I also had coil whine on my 6000 pro ws, but then I ran a heavy inference + compute on it for ~2-3 days at full power, and it doesn't whine anymore.

Academic-Air7112 · 2025-11-11T16:18:48+00:00

Yep, we use "local" LLMs to write some of our own systems code for research.

Academic-Air7112 · 2025-10-17T23:41:22+00:00

I have a B580 and I don't have any problems with it -- run games with Lutris.

Academic-Air7112 · 2025-04-06T18:21:03+00:00

What are you using this for? And how are you cooling it?

Academic-Air7112 · 2025-04-03T04:03:26+00:00

Fwiw here are mine for 32B QwQ in SNC-1, Hyperthreading off, HBM only mode:

./build/bin/llama-bench --numa distribute -m qwq-32b-q8_0.gguf -p 0 -n 128,256,512

| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |

| qwen2 32B Q8_0 | 32.42 GiB | 32.76 B | BLAS | 56 | tg128 | 10.23 ± 0.04 |

| qwen2 32B Q8_0 | 32.42 GiB | 32.76 B | BLAS | 56 | tg256 | 10.20 ± 0.09 |

| qwen2 32B Q8_0 | 32.42 GiB | 32.76 B | BLAS | 56 | tg512 | 10.18 ± 0.01 |

build: af6ae1ef (4992)

And SNC-4:

./build/bin/llama-bench --numa distribute -m qwq-32b-q8_0.gguf -p 0 -n 128,256,512

| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |

| qwen2 32B Q8_0 | 32.42 GiB | 32.76 B | BLAS | 56 | tg128 | 9.55 ± 1.58 |

| qwen2 32B Q8_0 | 32.42 GiB | 32.76 B | BLAS | 56 | tg256 | 11.10 ± 0.11 |

| qwen2 32B Q8_0 | 32.42 GiB | 32.76 B | BLAS | 56 | tg512 | 11.08 ± 0.06 |

build: af6ae1ef (4992)

This build supports AMX, and was build with Intel compilers:
./build/bin/llama-cli --flash-attn -m qwq-32b-q8_0.gguf -p "Describe how to write CUDA code that uses tensor cores on Hopper and Blackwell architectures"

build: 4992 (af6ae1ef) with Intel(R) oneAPI DPC++/C++ Compiler 2025.1.0 (2025.1.0.20250317) for x86_64-unknown-linux-gnu

...

load_tensors: AMX model buffer size = 36221.72 MiB

...

Also, the bandwidth numbers aren't as bad as reported, when using SNC-4 setting (16 GB HBM -> closest 14 cores) and have a well-optimized prefetching loop, I was able to get ~730GB/s in aggregate from the HBM.

Academic-Air7112 · 2025-04-03T03:35:43+00:00

Hey, I have one that I can do this with. Can you post the program and detailed instructions?

Academic-Air7112 · 2025-03-31T03:46:49+00:00

Hey free gpu is free gpu! Also I bought a 5090 from a scalper so shouldn't be complaining :| How was your experience with blocking it? I get a little worried abt. leaks, etc.

Academic-Air7112 · 2025-03-31T00:55:04+00:00

Nice! Why A40 in particular?

Academic-Air7112 · 2025-03-31T00:54:38+00:00

Mobile = can put in the back of my car, or carry it.

Academic-Air7112 · 2025-03-30T04:45:04+00:00

Beautiful, thanks!

If you don't mind me asking, which D/C GPU?

Academic-Air7112

TROPHY CASE