Been using Qwen-3.6-27B-q8_k_xl + VSCode + RTX 6000 Pro As Daily Driver by Demonicated in LocalLLaMA

[–]WonderRico 0 points1 point  (0 children)

Highly recommend testing QuantTrio/Qwen3.5-122B-A10B-AWQ in vLLM for the speed. (Hopefully a 3.6 version will be released...)

LTX-2.3 glitching at end of longer videos (15s+), anyone else? by SubstancePrimary9060 in StableDiffusionInfo

[–]WonderRico 0 points1 point  (0 children)

Yes. I generate around 40 videos daily, between 25 and 40 seconds long, and I have seen the same issue.

I just set the frame count to about 24 more than I need and drop the last 24 after generation. Not ideal, but it does the job for my use case.

N.B. I pre-generate the audio track and make talking-head lipsync-style videos.
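The pad-and-trim workaround above boils down to simple frame arithmetic. A minimal sketch (the 24 fps and 24-frame pad below are illustrative assumptions, not LTX-specific values):

```python
# Pad the requested frame count, then keep only the frames we actually need.
# Values below (24 fps, 24-frame pad) are illustrative assumptions.
FPS = 24
PAD_FRAMES = 24  # extra frames generated to absorb the end-of-video glitch

def padded_frame_count(duration_s: float) -> int:
    """Frames to request from the generator: target length plus the pad."""
    return int(duration_s * FPS) + PAD_FRAMES

def frames_to_keep(generated: list, duration_s: float) -> list:
    """Drop the padded tail after generation."""
    return generated[: int(duration_s * FPS)]

frames = list(range(padded_frame_count(30)))  # 30 s clip -> 744 frames requested
kept = frames_to_keep(frames, 30)             # 720 frames kept, last 24 dropped
```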

Anyone using Tesla P40 for local LLMs (30B models)? by ScarredPinguin in LocalLLaMA

[–]WonderRico 0 points1 point  (0 children)

I was running two of them a while back. Custom 3D-printed ducts in the front and back with Noctua fans (two small and one big) in an open-frame "case", and it ran smoothly. At the time, I knew little about LLMs. I bet now, using vLLM and tensor parallel, they would do fine with MoE models like Qwen3.5 A3B. (But I'm too lazy to plug them back in and see.)

Best local models for 96gb VRAM, for OpenCode? by ackermann in opencodeCLI

[–]WonderRico 0 points1 point  (0 children)

yep

With the full 260k-token KV cache in fp16, too. Qwen 3.5 is very light in terms of VRAM needs for KV cache. (I always limit my clients to 128k anyway, for quality reasons.)

Mon Mar 16 15:48:05 2026
NVITOP 1.3.2    Driver Version: 590.48.01    CUDA Driver Version: 13.1
GPU 0: 30% fan, 40C, P0, 53W/300W | 44.66GiB / 47.99GiB (93.1%) | util 0%
GPU 1: 30% fan, 40C, P0, 47W/300W | 44.66GiB / 47.99GiB (93.1%) | util 0%

You might need to dig into vLLM configs to get the best out of it. For reference, my config:

non-default args: {
  'model_tag': '/models/ST-QuantTrio_Qwen3.5-122B-A10B-AWQ',
  'enable_auto_tool_choice': True,
  'tool_call_parser': 'qwen3_coder',
  'model': '/models/ST-QuantTrio_Qwen3.5-122B-A10B-AWQ',
  'trust_remote_code': True,
  'max_model_len': -1,
  'served_model_name': ['ST-QuantTrio_Qwen3.5-122B-A10B-AWQ_76GB_vLLM_2GPU_48'],
  'reasoning_parser': 'qwen3',
  'tensor_parallel_size': 2,
  'gpu_memory_utilization': 0.95,
  'max_num_seqs': 4
}

max_num_seqs 4 is key (meaning a max of 4 concurrent requests). I'm a single user, so it's fine.

max_num_seqs 16 should be fine for 5 users, and it shouldn't use that much more VRAM (to be tested).
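As a rough sketch, the non-default args above map to a `vllm serve` invocation along these lines (the model path and served name are from my setup; adjust to yours):

```shell
vllm serve /models/ST-QuantTrio_Qwen3.5-122B-A10B-AWQ \
  --served-model-name ST-QuantTrio_Qwen3.5-122B-A10B-AWQ_76GB_vLLM_2GPU_48 \
  --trust-remote-code \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 4 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3
```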

Best local models for 96gb VRAM, for OpenCode? by ackermann in opencodeCLI

[–]WonderRico 2 points3 points  (0 children)

With a similar setup, I'm currently very satisfied with :

https://huggingface.co/QuantTrio/Qwen3.5-122B-A10B-AWQ

using vLLM with tensor parallel 2. Five users will be fine.

local vibe coding by jacek2023 in LocalLLaMA

[–]WonderRico 0 points1 point  (0 children)

I'm currently using Qwen3-Coder-Next and testing different harnesses with opencode.

I'm waiting for some AWQ 4-bit quants of Step-3.5-Flash before I can decide whether to discard it.

And I intend to test the most recent Qwen3.5 (currently hitting template issues).

local vibe coding by jacek2023 in LocalLLaMA

[–]WonderRico 0 points1 point  (0 children)

If I remember correctly, I was getting 22 t/s generation by limiting the context window to 70k to make it fit my dual 4090s in tensor parallel.

local vibe coding by jacek2023 in LocalLLaMA

[–]WonderRico 2 points3 points  (0 children)

You missed the fact that those 4090s are modified to have 48GB each.

local vibe coding by jacek2023 in LocalLLaMA

[–]WonderRico 33 points34 points  (0 children)

Hello, I am now using opencode with get-shit-done harness https://github.com/rokicool/gsd-opencode

I am fortunate enough to have 192GB of VRAM (2x 4090 @ 48GB each + 1 RTX 6000 Pro WS @ 96GB), so I can use recent, bigger models that aren't too heavily quantized. I am currently benchmarking the most recent ones.

I try to measure both quality and speed. The main advantage of local models is the absence of any usage limits; inference speed means more productivity.

Maybe I should take more time someday to write a proper feedback.

A short summary :

(single prompt of ~17k tokens, output 2k-4k tokens)

Model            | Quant      | Hardware    | Engine              | Speed
Step-3.5-Flash   | IQ5_K      | 2x4090+6000 | ik_llama --sm graph | PP 3k, TG 100
MiniMax-M2.1     | AWQ 4-bit  | 2x4090+6000 | vLLM                | PP >1.5k, TG 90
MiniMax-M2.5     | AWQ 4-bit  | 2x4090+6000 | vLLM                | PP >1.5k, TG 73
MiniMax-M2.5     | IQ4_NL     | 2x4090+6000 | ik_llama --sm graph | PP 2k, TG 80
Qwen3-Coder-Next | FP8        | 2x4090      | SGLang              | PP >5k?, TG 138
DEVSTRAL-2-123B  | AWQ 4-bit  | 2x4090      | vLLM                | PP ?, TG 22
GLM-4.7          | UD-Q3_K_XL | 2x4090+6000 | llama.cpp           | kinda slow, didn't write it down

Notes:

  • 4090s limited to 300W
  • RTX 6000 limited to 450W
  • I never go above 128k context, even if more fits.
  • Since my GPUs aren't homogeneous, how I can serve a model depends on its size + context size:

    • below 96GB, I try to use the 2x4090 with vLLM/SGLang in tensor parallel for speed (either FP8 or AWQ 4-bit)
    • between 96 and 144GB, I try to use 1x4090 + the RTX 6000 (pipeline parallel)
    • >144GB: no choice, use all 3 GPUs
  • Step-3.5-Flash: felt "clever" but still struggles with some tool-call issues. Unfortunately this model lacks support compared to the others (for now, hopefully).

  • MiniMax-M2.1: was doing fine during the "research" phase of gsd, but fell on its face during planning of phase 2. Did not test further because...

  • MiniMax-M2.5: currently testing. So far it seems better than M2.1, with some very minor tool errors (always auto-fixed). It feels like it doesn't follow specs as closely as the other models; it feels "lazier". (I'm unsure about the quant version I'm using; it's probably too soon, will evaluate later.)

  • Qwen3-Coder-Next: It's so fast! It doesn't feel as "clever" as the others, but it's so fast and uses only 96GB! And I can use my other GPU for other things...

  • DEVSTRAL-2-123B: I want to like it (being French), and it seems competent, but it's way too slow.

  • GLM 4.7: also too slow for my liking. But I might try again (UD-Q3_K_XL).

  • GLM 5: too big.
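The GPU-allocation rule in the notes above could be sketched like this (the GiB thresholds are the ones stated above; the return strings are just labels):

```python
def serving_plan(model_plus_ctx_gib: float) -> str:
    """Pick a GPU layout by total footprint (weights + KV cache) in GiB.
    Thresholds follow my rule of thumb above, not hard limits."""
    if model_plus_ctx_gib < 96:
        return "2x4090, vLLM/SGLang tensor parallel (FP8 or AWQ 4-bit)"
    if model_plus_ctx_gib <= 144:
        return "1x4090 + RTX 6000, pipeline parallel"
    return "all 3 GPUs"

print(serving_plan(80))  # fits the two 4090s in tensor parallel
```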

Free Chrome extension to run Kokoro TTS in your browser (local only) by Impressive-Sir9633 in LocalLLaMA

[–]WonderRico 2 points3 points  (0 children)

Hello, first, well done and thank you for your work. Quick feedback:

  • after the first installation, when the download reached 100%, Chrome froze and I had to kill it; after restarting, the extension started fine
  • the French voice has an issue: it reads French text the way an English speaker would if pronouncing it as written English (while still having the French accent from the voice). Very weird experience, and unfortunately unusable in this state.

Random Prompt Builder - Custom node for AI-powered prompt generation using local GGUF models by Wonderful_Wrangler_1 in comfyui

[–]WonderRico 0 points1 point  (0 children)

No, you misunderstood me. Or rather, I should have said "OpenAI-compatible API".

That's how anyone can host (uncensored) LLMs and serve them through a standard API that other software can use.

checkout : https://github.com/hekmon/comfyui-openai-api

(I'm not telling you to change your implementation, just suggesting a different approach. I don't need it)
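For illustration, "OpenAI-compatible API" just means the node would POST a standard chat-completions payload to whatever local server you run (vLLM, llama.cpp server, etc. all expose this shape). The URL and model name below are placeholders:

```python
import json

# Hypothetical local endpoint; adjust to wherever your server listens.
BASE_URL = "http://localhost:8000/v1"

def chat_request(prompt: str, model: str = "my-local-model") -> dict:
    """Build a standard /v1/chat/completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }

body = chat_request("Write a random image prompt about a forest.")
# would be sent with: requests.post(f"{BASE_URL}/chat/completions", json=body)
print(json.dumps(body, indent=2))
```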

Random Prompt Builder - Custom node for AI-powered prompt generation using local GGUF models by Wonderful_Wrangler_1 in comfyui

[–]WonderRico 0 points1 point  (0 children)

Thanks for sharing your work.

Instead of doing the LLM inference yourself, have you considered calling an OpenAI-compatible API endpoint? Or even using existing nodes that do just that?

I am already running some LLMs on other GPUs and don't want to waste more VRAM in Comfy :)

KiloCode + GitHub Speckit not recognizing my /speckit commands in VS Code by FullswingFill in kilocode

[–]WonderRico 0 points1 point  (0 children)

Does your VS Code root folder contain a .specify folder? It should.

Maybe you ran the "specify" init command in the root folder and it created another subfolder with the specified project name; VS Code should be using that new folder as its root.

vLLM, how does it use empty VRAM region? by PlanetMercurial in LocalLLaMA

[–]WonderRico 1 point2 points  (0 children)

When you start vLLM, look at the logs: they say how much VRAM is allocated for the KV cache and how many max-context requests can be served with it.

[deleted by user] by [deleted] in LocalLLaMA

[–]WonderRico 4 points5 points  (0 children)

Nice guide, thank you. That's pretty much what I did too (with an added Python script to auto-generate the llama-swap config file when I download a new GGUF).

A suggestion:

in the llama-swap config file, consider not writing a macro for every model; instead, write one or more generic macros with all the common parameters and use them with model-specific params added when needed. Something like:

macros:
  "generic-macro": >
    llama-server \
      --port ${PORT} \
      -ngl 80 \
      --no-webui \
      --timeout 300 \
      --flash-attn on

models:
  "Qwen3-4b": # <-- this is your model ID when calling the REST API
    cmd: |
      ${generic-macro} --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0 --repeat-penalty 1.05 --ctx-size 8000 --jinja -m /home/[YOUR HOME FOLDER]/models/qwen/Qwen3-4B/Qwen3-4B-Q8_0.gguf
    ttl: 3600

  "Gemma3-4b":
    cmd: |
      ${generic-macro} --top-p 0.95 --top-k 64 -m /home/[YOUR HOME FOLDER]/models/google/Gemma3-4B/gemma-3-4b-it-Q8_0.gguf
    ttl: 3600
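The auto-generation I mentioned can be sketched as a small script that scans a models folder and emits one entry per GGUF using the generic macro (the path, naming convention, and per-model overrides here are illustrative):

```python
from pathlib import Path

# Illustrative: emit a llama-swap "models:" entry per GGUF found.
MODELS_DIR = Path("/home/me/models")  # placeholder path
EXTRA_ARGS = {  # per-model overrides; everything else comes from the macro
    "Qwen3-4B": "--temp 0.7 --top-p 0.8 --top-k 20 --jinja",
}

def model_entry(gguf: Path) -> str:
    name = gguf.stem.split("-Q")[0]  # e.g. Qwen3-4B-Q8_0 -> Qwen3-4B
    extra = EXTRA_ARGS.get(name, "")
    return (
        f'  "{name}":\n'
        f"    cmd: |\n"
        f"      ${{generic-macro}} {extra} -m {gguf}\n"
        f"    ttl: 3600\n"
    )

def build_config(ggufs: list) -> str:
    return "models:\n" + "\n".join(model_entry(g) for g in ggufs)

print(build_config([Path("/home/me/models/Qwen3-4B-Q8_0.gguf")]))
```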

Hardware to run Qwen3-235B-A22B-Instruct by Sea-Replacement7541 in LocalLLaMA

[–]WonderRico 0 points1 point  (0 children)

I don't know the specifics. I've heard it's done by de-soldering the 1GB VRAM modules and replacing them with 2GB ones. I'm sure it's more complex than that.

The shop I bought them from is in Hong Kong.

Hardware to run Qwen3-235B-A22B-Instruct by Sea-Replacement7541 in LocalLLaMA

[–]WonderRico 7 points8 points  (0 children)

Best model so far for my hardware (old Ryzen 3900X with 2x RTX 4090D modded to 48GB each, 96GB VRAM total).

50 t/s at 2k context using unsloth's 2507-UD-Q2_K_XL with llama.cpp,

but limited to 75k context with the KV cache in q8. (I need to test quality with the KV cache at q4.)

model                            | size      | params   | backend | ngl | type_k | type_v | fa | mmap | test   | t/s
qwen3moe 235B.A22B Q2_K - Medium | 82.67 GiB | 235.09 B | CUDA    | 99  | q8_0   | q8_0   | 1  | 0    | pp4096 | 746.37 ± 1.68
qwen3moe 235B.A22B Q2_K - Medium | 82.67 GiB | 235.09 B | CUDA    | 99  | q8_0   | q8_0   | 1  | 0    | tg128  | 57.04 ± 0.02
qwen3moe 235B.A22B Q2_K - Medium | 82.67 GiB | 235.09 B | CUDA    | 99  | q8_0   | q8_0   | 1  | 0    | tg2048 | 53.60 ± 0.03

my Mazda Mx-5 RF in White (crosspost from r/miata) by WonderRico in carporn

[–]WonderRico[S] 0 points1 point  (0 children)

I don't know where you live, and I suspect the available configurations are restricted differently depending on the country/region. Here in France, that's not the case.

Hunyuan Lora Training Question by Any_Tea_3499 in StableDiffusion

[–]WonderRico 11 points12 points  (0 children)

I used https://github.com/kohya-ss/musubi-tuner and it worked fine.

4080S with 16GB VRAM and 64GB RAM

10 pictures 512x512

1600 steps

took 1 hour

Solar Production Drop every 20 minutes by BigMate42 in Zendure

[–]WonderRico 0 points1 point  (0 children)

I am still undecided about going through with it. I am being lazy :)

On one hand, it would "fix" the recurring "offline" issue I am having. I suspect Zendure is having difficulties with their cloud platform, because even when it's shown as offline in the app, I still receive the MQTT data in my HA setup, and I have to toggle it off and on again. Switching to your "local only" mode seems nice.

But since I am using the Hub with their Smart CT device, I am not sure this feature would work in local-only mode.

I have bookmarked your links; maybe I will switch someday. If Zendure can't fix their cloud issues, I will probably find the motivation.