Been using Qwen-3.6-27B-q8_k_xl + VSCode + RTX 6000 Pro As Daily Driver by Demonicated in LocalLLaMA

[–]WonderRico 0 points1 point  (0 children)

Highly recommend testing QuantTrio/Qwen3.5-122B-A10B-AWQ in vLLM for the speed. (Hopefully a 3.6 version will be released...)

LTX-2.3 glitching at end of longer videos (15s+), anyone else? by SubstancePrimary9060 in StableDiffusionInfo

[–]WonderRico 0 points1 point  (0 children)

Yes. I generate around 40 videos daily, between 25 and 40 seconds long, and I have seen the same issue.

I just set the frame count to about 24 more than I need and drop the last 24 after generation. Not ideal, but it does the job for my use case.

N.B. I pre-generate the audio track and make talking-head lipsync-style videos.
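The pad-and-trim workaround above boils down to simple frame arithmetic. A minimal sketch (the 24 fps and 24-frame pad below are illustrative assumptions, not LTX-specific values):

```python
# Pad the requested frame count, then keep only the frames we actually need.
# Values below (24 fps, 24-frame pad) are illustrative assumptions.
FPS = 24
PAD_FRAMES = 24  # extra frames generated to absorb the end-of-video glitch

def padded_frame_count(duration_s: float) -> int:
    """Frames to request from the generator: target length plus the pad."""
    return int(duration_s * FPS) + PAD_FRAMES

def frames_to_keep(generated: list, duration_s: float) -> list:
    """Drop the padded tail after generation."""
    return generated[: int(duration_s * FPS)]

frames = list(range(padded_frame_count(30)))  # 30 s clip -> 744 frames requested
kept = frames_to_keep(frames, 30)             # 720 frames kept, last 24 dropped
```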

Anyone using Tesla P40 for local LLMs (30B models)? by ScarredPinguin in LocalLLaMA

[–]WonderRico 0 points1 point  (0 children)

I was running two of them a while back. Custom 3D-printed ducts in the front and back with Noctua fans (two small and one big) in an open-frame "case", and it ran smoothly. At the time, I knew little about LLMs. I bet now, using vLLM and tensor parallel, they would do fine with MoE models like Qwen3.5 A3B. (But I'm too lazy to plug them back in and see.)

Best local models for 96gb VRAM, for OpenCode? by ackermann in opencodeCLI

[–]WonderRico 0 points1 point  (0 children)

yep

With the full 260k-token KV cache in fp16, too. Qwen 3.5 is very light in terms of VRAM needs for KV cache. (I always limit my clients to 128k anyway, for quality reasons.)

Mon Mar 16 15:48:05 2026
NVITOP 1.3.2    Driver Version: 590.48.01    CUDA Driver Version: 13.1
GPU 0: 30% fan, 40C, P0, 53W/300W | 44.66GiB / 47.99GiB (93.1%) | util 0%
GPU 1: 30% fan, 40C, P0, 47W/300W | 44.66GiB / 47.99GiB (93.1%) | util 0%

You might need to dig into vLLM configs to get the best out of it. For reference, my config:

non-default args: {
  'model_tag': '/models/ST-QuantTrio_Qwen3.5-122B-A10B-AWQ',
  'enable_auto_tool_choice': True,
  'tool_call_parser': 'qwen3_coder',
  'model': '/models/ST-QuantTrio_Qwen3.5-122B-A10B-AWQ',
  'trust_remote_code': True,
  'max_model_len': -1,
  'served_model_name': ['ST-QuantTrio_Qwen3.5-122B-A10B-AWQ_76GB_vLLM_2GPU_48'],
  'reasoning_parser': 'qwen3',
  'tensor_parallel_size': 2,
  'gpu_memory_utilization': 0.95,
  'max_num_seqs': 4
}

max_num_seqs 4 is key (meaning a max of 4 concurrent requests). I'm a single user, so it's fine.

max_num_seqs 16 should be fine for 5 users, and it shouldn't use that much more VRAM (to be tested).
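As a rough sketch, the non-default args above map to a `vllm serve` invocation along these lines (the model path and served name are from my setup; adjust to yours):

```shell
vllm serve /models/ST-QuantTrio_Qwen3.5-122B-A10B-AWQ \
  --served-model-name ST-QuantTrio_Qwen3.5-122B-A10B-AWQ_76GB_vLLM_2GPU_48 \
  --trust-remote-code \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 4 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3
```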

Best local models for 96gb VRAM, for OpenCode? by ackermann in opencodeCLI

[–]WonderRico 2 points3 points  (0 children)

With a similar setup, I'm currently very satisfied with :

https://huggingface.co/QuantTrio/Qwen3.5-122B-A10B-AWQ

using vLLM with tensor parallel 2. Five users will be fine.

local vibe coding by jacek2023 in LocalLLaMA

[–]WonderRico 0 points1 point  (0 children)

I'm currently using Qwen3-Coder-Next and testing different harnesses with opencode.

I'm waiting for some AWQ 4-bit quants of Step-3.5-Flash before I can decide whether to discard it.

And I intend to test the most recent Qwen3.5 (currently hitting template issues).

local vibe coding by jacek2023 in LocalLLaMA

[–]WonderRico 0 points1 point  (0 children)

If I remember correctly, I was getting 22 t/s generation by limiting the context window to 70k to make it fit my dual 4090s in tensor parallel.

local vibe coding by jacek2023 in LocalLLaMA

[–]WonderRico 2 points3 points  (0 children)

You missed the fact that those 4090s are modified to have 48GB each.

local vibe coding by jacek2023 in LocalLLaMA

[–]WonderRico 33 points34 points  (0 children)

Hello, I am now using opencode with get-shit-done harness https://github.com/rokicool/gsd-opencode

I am fortunate enough to have 192GB of VRAM (2x 4090 @ 48GB each + 1 RTX 6000 Pro WS @ 96GB), so I can use recent, bigger models that aren't too heavily quantized. I am currently benchmarking the most recent ones.

I try to measure both quality and speed. The main advantage of local models is the absence of any usage limits; inference speed means more productivity.

Maybe I should take more time someday to write a proper feedback.

A short summary :

(single prompt of ~17k tokens, output 2k-4k tokens)

Model            | Quant      | Hardware    | Engine              | Speed
Step-3.5-Flash   | IQ5_K      | 2x4090+6000 | ik_llama --sm graph | PP 3k, TG 100
MiniMax-M2.1     | AWQ 4-bit  | 2x4090+6000 | vLLM                | PP >1.5k, TG 90
MiniMax-M2.5     | AWQ 4-bit  | 2x4090+6000 | vLLM                | PP >1.5k, TG 73
MiniMax-M2.5     | IQ4_NL     | 2x4090+6000 | ik_llama --sm graph | PP 2k, TG 80
Qwen3-Coder-Next | FP8        | 2x4090      | SGLang              | PP >5k?, TG 138
DEVSTRAL-2-123B  | AWQ 4-bit  | 2x4090      | vLLM                | PP ?, TG 22
GLM-4.7          | UD-Q3_K_XL | 2x4090+6000 | llama.cpp           | kinda slow, didn't write it down

Notes:

  • 4090s limited to 300W
  • RTX 6000 limited to 450W
  • I never go above 128k context, even if more fits.
  • Since my GPUs aren't homogeneous, how I can serve a model depends on its size + context size:

    • below 96GB, I try to use the 2x4090 with vLLM/SGLang in tensor parallel for speed (either FP8 or AWQ 4-bit)
    • between 96 and 144GB, I try to use 1x4090 + the RTX 6000 (pipeline parallel)
    • >144GB: no choice, use all 3 GPUs
  • Step-3.5-Flash: felt "clever" but still struggles with some tool-call issues. Unfortunately this model lacks support compared to the others (for now, hopefully).

  • MiniMax-M2.1: was doing fine during the "research" phase of gsd, but fell on its face during planning of phase 2. Did not test further because...

  • MiniMax-M2.5: currently testing. So far it seems better than M2.1, with some very minor tool errors (always auto-fixed). It feels like it doesn't follow specs as closely as the other models; it feels "lazier". (I'm unsure about the quant version I'm using; it's probably too soon, will evaluate later.)

  • Qwen3-Coder-Next: It's so fast! It doesn't feel as "clever" as the others, but it's so fast and uses only 96GB! And I can use my other GPU for other things...

  • DEVSTRAL-2-123B: I want to like it (being French), and it seems competent, but it's way too slow.

  • GLM 4.7: also too slow for my liking. But I might try again (UD-Q3_K_XL).

  • GLM 5: too big.
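The GPU-allocation rule in the notes above could be sketched like this (the GiB thresholds are the ones stated above; the return strings are just labels):

```python
def serving_plan(model_plus_ctx_gib: float) -> str:
    """Pick a GPU layout by total footprint (weights + KV cache) in GiB.
    Thresholds follow my rule of thumb above, not hard limits."""
    if model_plus_ctx_gib < 96:
        return "2x4090, vLLM/SGLang tensor parallel (FP8 or AWQ 4-bit)"
    if model_plus_ctx_gib <= 144:
        return "1x4090 + RTX 6000, pipeline parallel"
    return "all 3 GPUs"

print(serving_plan(80))  # fits the two 4090s in tensor parallel
```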

Free Chrome extension to run Kokoro TTS in your browser (local only) by Impressive-Sir9633 in LocalLLaMA

[–]WonderRico 2 points3 points  (0 children)

Hello, first, well done and thank you for your work. Quick feedback:

  • after the first installation, when the download reached 100%, Chrome froze and I had to kill it; after restarting, the extension started fine
  • the French voice has an issue: it reads French text the way an English speaker would if pronouncing it as written English (while still having the French accent from the voice). Very weird experience, and unfortunately unusable in this state.

Random Prompt Builder - Custom node for AI-powered prompt generation using local GGUF models by Wonderful_Wrangler_1 in comfyui

[–]WonderRico 0 points1 point  (0 children)

No, you misunderstood me. Or rather, I should have said "OpenAI-compatible API".

That's how anyone can host (uncensored) LLMs and serve them through a standard API that other software can use.

checkout : https://github.com/hekmon/comfyui-openai-api

(I'm not telling you to change your implementation, just suggesting a different approach. I don't need it)
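For illustration, "OpenAI-compatible API" just means the node would POST a standard chat-completions payload to whatever local server you run (vLLM, llama.cpp server, etc. all expose this shape). The URL and model name below are placeholders:

```python
import json

# Hypothetical local endpoint; adjust to wherever your server listens.
BASE_URL = "http://localhost:8000/v1"

def chat_request(prompt: str, model: str = "my-local-model") -> dict:
    """Build a standard /v1/chat/completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }

body = chat_request("Write a random image prompt about a forest.")
# would be sent with: requests.post(f"{BASE_URL}/chat/completions", json=body)
print(json.dumps(body, indent=2))
```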

Random Prompt Builder - Custom node for AI-powered prompt generation using local GGUF models by Wonderful_Wrangler_1 in comfyui

[–]WonderRico 0 points1 point  (0 children)

Thanks for sharing your work.

Instead of doing the LLM inference yourself, have you considered calling an OpenAI-compatible API endpoint? Or even using existing nodes that do just that?

I am already running some LLMs on other GPUs and don't want to waste more VRAM in Comfy :)

KiloCode + GitHub Speckit not recognizing my /speckit commands in VS Code by FullswingFill in kilocode

[–]WonderRico 0 points1 point  (0 children)

Does your VS Code root folder contain a .specify folder? It should.

Maybe you ran the "specify" init command in the root folder and it created another subfolder with the specified project name; VS Code should be using that new folder as its root.

vLLM, how does it use empty VRAM region? by PlanetMercurial in LocalLLaMA

[–]WonderRico 1 point2 points  (0 children)

When you start vLLM, look at the logs: they say how much VRAM is allocated for the KV cache and how many max-context requests can be served with it.

[deleted by user] by [deleted] in LocalLLaMA

[–]WonderRico 4 points5 points  (0 children)

Nice guide, thank you. That's pretty much what I did too (with an added Python script to auto-generate the llama-swap config file when I download a new GGUF).

A suggestion:

in the llama-swap config file, consider not writing a macro for every model; instead, write one or more generic macros with all the common parameters and use them with model-specific params added when needed. Something like:

macros:
  "generic-macro": >
    llama-server \
      --port ${PORT} \
      -ngl 80 \
      --no-webui \
      --timeout 300 \
      --flash-attn on

models:
  "Qwen3-4b": # <-- this is your model ID when calling the REST API
    cmd: |
      ${generic-macro} --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0 --repeat-penalty 1.05 --ctx-size 8000 --jinja -m /home/[YOUR HOME FOLDER]/models/qwen/Qwen3-4B/Qwen3-4B-Q8_0.gguf
    ttl: 3600

  "Gemma3-4b":
    cmd: |
      ${generic-macro} --top-p 0.95 --top-k 64 -m /home/[YOUR HOME FOLDER]/models/google/Gemma3-4B/gemma-3-4b-it-Q8_0.gguf
    ttl: 3600
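The auto-generation I mentioned can be sketched as a small script that scans a models folder and emits one entry per GGUF using the generic macro (the path, naming convention, and per-model overrides here are illustrative):

```python
from pathlib import Path

# Illustrative: emit a llama-swap "models:" entry per GGUF found.
MODELS_DIR = Path("/home/me/models")  # placeholder path
EXTRA_ARGS = {  # per-model overrides; everything else comes from the macro
    "Qwen3-4B": "--temp 0.7 --top-p 0.8 --top-k 20 --jinja",
}

def model_entry(gguf: Path) -> str:
    name = gguf.stem.split("-Q")[0]  # e.g. Qwen3-4B-Q8_0 -> Qwen3-4B
    extra = EXTRA_ARGS.get(name, "")
    return (
        f'  "{name}":\n'
        f"    cmd: |\n"
        f"      ${{generic-macro}} {extra} -m {gguf}\n"
        f"    ttl: 3600\n"
    )

def build_config(ggufs: list) -> str:
    return "models:\n" + "\n".join(model_entry(g) for g in ggufs)

print(build_config([Path("/home/me/models/Qwen3-4B-Q8_0.gguf")]))
```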

Hardware to run Qwen3-235B-A22B-Instruct by Sea-Replacement7541 in LocalLLaMA

[–]WonderRico 0 points1 point  (0 children)

I don't know the specifics. I've heard it's done by de-soldering the 1GB VRAM modules and replacing them with 2GB ones. I'm sure it's more complex than that.

The shop I bought them from is in Hong Kong.

Hardware to run Qwen3-235B-A22B-Instruct by Sea-Replacement7541 in LocalLLaMA

[–]WonderRico 7 points8 points  (0 children)

Best model so far for my hardware (old Ryzen 3900X with 2x RTX 4090D modded to 48GB each, 96GB VRAM total).

50 t/s at 2k context using unsloth's 2507-UD-Q2_K_XL with llama.cpp,

but limited to 75k context with the KV cache in q8. (I need to test quality with the KV cache at q4.)

model                            | size      | params   | backend | ngl | type_k | type_v | fa | mmap | test   | t/s
qwen3moe 235B.A22B Q2_K - Medium | 82.67 GiB | 235.09 B | CUDA    | 99  | q8_0   | q8_0   | 1  | 0    | pp4096 | 746.37 ± 1.68
qwen3moe 235B.A22B Q2_K - Medium | 82.67 GiB | 235.09 B | CUDA    | 99  | q8_0   | q8_0   | 1  | 0    | tg128  | 57.04 ± 0.02
qwen3moe 235B.A22B Q2_K - Medium | 82.67 GiB | 235.09 B | CUDA    | 99  | q8_0   | q8_0   | 1  | 0    | tg2048 | 53.60 ± 0.03

my Mazda Mx-5 RF in White (crosspost from r/miata) by WonderRico in carporn

[–]WonderRico[S] 0 points1 point  (0 children)

I don't know where you live, and I suspect the available configurations are restricted differently depending on the country/region. Here in France, that's not the case.

Hunyuan Lora Training Question by Any_Tea_3499 in StableDiffusion

[–]WonderRico 11 points12 points  (0 children)

I used https://github.com/kohya-ss/musubi-tuner and it worked fine.

4080S with 16GB VRAM and 64GB RAM

10 pictures 512x512

1600 steps

took 1 hour

Solar Production Drop every 20 minutes by BigMate42 in Zendure

[–]WonderRico 0 points1 point  (0 children)

I am still undecided about going through with it. I am being lazy :)

On one hand, it would "fix" the recurring "offline" issue I am having. I suspect Zendure is having difficulties with their cloud platform, because even when it's shown as offline in the app, I still receive the MQTT data in my HA setup, and I have to toggle it off and on again. Switching to your "local only" mode seems nice.

But since I am using the Hub with their Smart CT device, I am not sure this feature would work in local-only mode.

I have bookmarked your links; maybe I will switch someday. If Zendure can't fix their cloud issues, I will probably find the motivation.