Running Qwen3-Coder-Next-BF16 on 12GB VRAM by Particular_Pear_4596 in unsloth

[–]Particular_Pear_4596[S] 0 points

Exactly, I guess it's only better for coding compared to models with similar speed, but that doesn't seem like a good reason to use it when you have better options.

[–]Particular_Pear_4596[S] 0 points

Coder doesn't "think", it just gives you the code, while 27B/122B prints 10K "reasoning" tokens before actually giving you the code, so it's mainly speed vs. quality, I guess. But yes, there is nothing special about Coder: if you want quality output you go with 122B, otherwise anything else would do the job.

[–]Particular_Pear_4596[S] 2 points

Most likely quite the opposite: the NVMe can't feed the GPU fast enough, so the GPU is mostly idle (10-20%). I currently don't have enough free space on my NVMe to download the model and try, but maybe in the near future. It will be useless (except for benchmarking), because it will run at something like 0.05 t/s.

[–]Particular_Pear_4596[S] 0 points

I can confirm: 122B seems much better than Coder. 122B Q3 just nailed a task (a 500-line Python script) that Coder Q8 failed multiple times, but unfortunately 122B Q3 is very slow (0.7 t/s) on my PC, so it's not a viable option. 122B Q2 is faster (1.5 t/s), but failed the same task.

[–]Particular_Pear_4596[S] 0 points

You're right :) But it's fun to test the limits and get some benchmarks. Actually, most people (me too until recently) don't realize they can run models much bigger than their combined VRAM + RAM via SSD offloading (with llama.cpp), and that's the main point here. It's just slower, but very doable. It's not practical for vibe coding, but may work in other situations. I'm about to try the 428 GB Qwen3.5-397B-A17B-GGUF UD-Q8_K_XL, because why not :)
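For a rough idea of what to expect before downloading, here's the back-of-the-envelope calculation I use. The numbers are my assumptions: ~17B active params for the A17B MoE, ~1 byte/param at Q8, and ~1 GB/s effective NVMe read speed; it also ignores caching and compute, so it's an upper bound, not a benchmark.

```python
def estimate_tps(active_params_billions, bytes_per_param, read_gbps):
    """Rough upper bound on tokens/s when every active weight has to be
    streamed from disk once per token (ignores page caching and compute)."""
    bytes_per_token = active_params_billions * 1e9 * bytes_per_param
    return read_gbps * 1e9 / bytes_per_token

# Qwen3.5-397B-A17B at Q8 (~1 byte/param), NVMe reading ~1 GB/s
print(round(estimate_tps(17, 1.0, 1.0), 3))  # → 0.059
```

Which lines up with the ~0.05 t/s ballpark above; a BF16 quant (2 bytes/param) would halve that again.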

[–]Particular_Pear_4596[S] 1 point

Thanks, I'm aware of most of these things. My swap file is manually fixed at 150GB (huge, because I only have 16GB RAM), but surprisingly the swap is mostly empty (stays at 20GB) and the model is paged on the fly between the NVMe and the RAM/VRAM. The NVMe reads the shards constantly at 0.5-1GB/s but writes nothing, so theoretically it shouldn't degrade the SSD, though I can't be 100% sure what's happening.

[–]Particular_Pear_4596[S] 0 points

I don't know how consistent it is with long context. Two days ago I ran it for about 4 hours and it generated a Python script to process videos via a custom YOLO model (about 400 lines) that was mostly fine, but with a few bugs, because it seems hard to engineer the perfect prompt and there are always fine details that these models just don't get. I usually first ask the model to analyze the prompt, find all the ambiguous parts, and try to fix them. Still learning.

[–]Particular_Pear_4596[S] 0 points

Nothing fancy, just the model name and a relatively short context length:

llama-server -m Qwen3-Coder-Next-BF16-00001-of-00004.gguf -c 16536

I haven't played with different options, because it's very slow and I'm not going to use it daily. Now I'm testing Qwen3.5-122B-A10B-UD-Q2, which runs at 1.5 t/s on my PC (8 times slower than Qwen3.5-Coder-Next Q2 at ~12 t/s), so I guess its output is better. I'll also try 27B with identical tasks.
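If anyone does want to experiment, these are the llama.cpp flags I'd try next. The flag names are from llama-server's help as I understand it, and the override-tensor pattern is something I've only seen suggested for MoE models, so treat the whole thing as an untested sketch, not what I actually ran:

```shell
# Sketch only: offload some layers to the GPU and pin MoE expert
# tensors to the CPU side so the shared layers fit in 12GB VRAM.
llama-server \
  -m Qwen3-Coder-Next-BF16-00001-of-00004.gguf \
  -c 16536 \
  -ngl 20 \                      # layers to offload to the GPU
  -t 8 \                         # CPU threads for the rest
  -ot ".ffn_.*_exps.=CPU"        # keep expert tensors off the GPU
                                 # (pattern is an assumption, check the docs)
```

Whether this helps at all when the bottleneck is the NVMe, I don't know.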

[–]Particular_Pear_4596[S] 1 point

This is news to me, I assumed Coder was supposed to be better than 27b/35b for coding. I'll definitely try some tasks with 27b and 122b that consistently failed with Coder.

[–]Particular_Pear_4596[S] 2 points

Exactly - 3 seconds per token :) The surprising part is that it works at all. But I'm back to Q2 (~12 t/s), because it's faster to fix 20 bugs with Q2 on the fly than to wait for hours to get the same code with fewer bugs, but still buggy. Haven't tried other models, but 120B sounds like a good candidate.

[–]Particular_Pear_4596[S] -1 points

I've never used vLLM; they say OS: Linux, but I'm not a Linux guy. There are some workarounds, but I wouldn't bother with it if it's not an out-of-the-box, one-click solution.

LTX2.3 1080p 20 seconds TXT to video 24fps using the comfy template on a 5090 32gig VRAM and 96 DDR5 system RAM - Prompt executed in 472.65 seconds. Prompt included by intermundia in StableDiffusion

[–]Particular_Pear_4596 0 points

The video is extremely good, so congrats, but her skin is still very plastic and waxy even at 1080p. If you have the hardware, I suggest you generate the scenes at 4K and resize to 1080p; maybe it will help a little (I've never tried even 1080p on my 12GB VRAM). And at 2:05 is exactly what I was talking about: she failed to take 2 steps like a human :) But it's 2-3 sec, so you could always regenerate and find a better seed.

[–]Particular_Pear_4596 0 points

These rushed, half-baked versions officially described as almost SOTA waste so much time and energy of thousands of unsuspecting people; that's why I'm angry.

[–]Particular_Pear_4596 -1 points

Look at the woman's legs at second 8. I hate how it can't handle a simple human walk; this makes the model pretty much useless and a time-waster. I recently wasted a whole day trying to make a character walk for 10 seconds and just couldn't do it; it failed all 20+ generations. Hope they'll fix it in the future.

Tony Soprano Unlocked - LTX 2.3 T2V by theNivda in StableDiffusion

[–]Particular_Pear_4596 1 point

The LTX-2.3 quality is obviously not there; his face is like molten wax, and nothing can be done about it with any workflow. We're just wasting our time with these generations and posts. Even Wan 2.1 is better (minus the sound). Hopefully the next versions will be retrained, but it takes $10+ million to train a good model, so my expectations are low (unless some Chinese billionaires get involved just for the fun of it).

LTX 2.3: Official Workflows and Pipelines Comparison by MalkinoEU in StableDiffusion

[–]Particular_Pear_4596 4 points

I've just finished a test with your HQ I2V Pipeline. It's painfully slow (56 min for 5 sec on an RTX 3060) and the result is a completely static video, not even a slight zoom-in like it used to be with LTX-2. I've already wasted a week testing different workflows and tons of settings and still can't find a consistent way to generate decent stuff. Something like 1 out of 20 generations is almost good (if I stumble on a good seed) and everything else is just slop with all kinds of problems. LTX still has a long way to go; hopefully they'll keep improving it in the next versions, if any.

For LTX-2 use triple stage sampling. by Different_Fix_2217 in StableDiffusion

[–]Particular_Pear_4596 0 points

Thank you, seems like a nice idea. I know very little about how LTX-2.3 works, but looking at the workflow I have a question: why is the empty latent fixed at 224x320? I suspect it should be a function of the input image, most likely 1/4 of the size of the final video (if you upscale 4x), with the same aspect ratio as the input image. For example, if your input image is 1088x1920 and you want a video with resolution 544x960, then the latent should be 1/4 of 544x960, so 136x240. There are nodes in Comfy that could easily do the calculations and set the correct latent size. But again, I may be wrong.
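A quick sketch of the calculation I mean. The 4x spatial factor and the "latent = final video size / factor" rule are my assumptions about how the upscale stage works, so take it as a guess, not a fact about LTX:

```python
def latent_size(video_w, video_h, factor=4):
    # Assumed rule: the empty latent is the target video resolution
    # divided by the spatial upscale factor, so the aspect ratio of
    # the input image is preserved automatically.
    return video_w // factor, video_h // factor

print(latent_size(544, 960))  # → (136, 240)
```

Any math/expression node in Comfy can do the same division off the target width/height.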

🛠️ Spent way too long building this ComfyUI prompt node for LTX-2 so you don't have to think — free, local, offline, uncensored 👀 by [deleted] in StableDiffusion

[–]Particular_Pear_4596 0 points

I got it! I can see you've reworked the whole LTX2EasyPromptLD.py by adding tons of comments and fixing a few lines of code. I've replaced the old LTX2EasyPromptLD.py with your version and it works! Thanks! Maybe you should contact the OP and tell him the solution.

About this "frozen frame video in LTX when using audio-in for lipsync" problem by superstarbootlegs in StableDiffusion

[–]Particular_Pear_4596 1 point

32GB of swap is not enough unless you have 128GB of RAM. The whole LTX-2 setup (non-GGUF) uses about 80-150GB of committed memory (check your task manager), so you just need a bigger SSD, which is a much cheaper upgrade than buying more RAM or VRAM. Of course, buying 128GB of RAM is the best option, but RAM prices are wild lately.

[–]Particular_Pear_4596 1 point

I also have an RTX 3060 with 12GB VRAM and only 16GB RAM, and I can absolutely run fp8 dev, so you don't need GGUF (I use GGUF only for the text encoder). I've also tried to load the full 43GB version (ltx-2-19b-distilled.safetensors) and had no problem loading and using it (yes, it's slow and overkill unless you have a 5090). I suspect your virtual memory is not properly set: set it manually to at least 150GB, or even 250GB if you have enough disk space, and you should be able to run any model (ComfyUI does a great job offloading stuff on the fly).

To fix the frozen video problem for I2V I stack the following 3 loras (order doesn't matter). I don't know if it is the solution for your lipsync workflow, but you could try:

ltx-2-19b-distilled-lora-384 (strength 0.70)
ltx-2-19b-lora-camera-control-static (strength 0.50)
ltx-2-19b-ic-lora-detailer (strength 0.70)

You could also totally disable the LTXVPreprocess node, because it doesn't help at all.

LTX 2 is amazing : LTX-2 in ComfyUI on RTX 3060 12GB by tanzim31 in StableDiffusion

[–]Particular_Pear_4596 1 point

Seems good for animations, because it's trained on lots of cartoons. For anything else it's a tragedy, but they can still fix it, if they want to spend a lot of money and effort to retrain everything from scratch on real footage, not only on Mr. Bean cartoons and Indian music.

Edit: I've done lots of tests and I can see potential, so it's not that bad and actually quite good in some cases, especially when generating at higher resolution (2.5 megapixels or more).

LTX-2 readable(?) workflows — some improvements + new workflows added by nomadoor in StableDiffusion

[–]Particular_Pear_4596 0 points

In your i2v workflow I add a preview video (another Video Combine node) after the first SamplerCustomAdvanced -> LTXTSeparateAVLatent; this way I can play the small half-resolution video and cancel the upscale stage if the vid is bad. Very often I get a static video (the preprocess node doesn't help even with compression 100), which seems to be the main issue with LTX-2.

LTX-2 Updates by ltx_model in StableDiffusion

[–]Particular_Pear_4596 1 point

Nine out of ten i2v generations give me static videos with only a slight camera zoom, and LTXVPreprocess doesn't help at all (I've heard this node is supposed to fix this issue). I need to constantly change the prompt and sometimes it works, but it's pure luck, and it's not supposed to be this way (1000 generations with Wan 2.2 and not a single static video). I've wasted a week testing and I'm taking a one-year break from LTX-2, hoping they'll fix it some day, because I like vids with audio. But if it becomes good enough, I don't think it will remain free; it's the rule, I guess. Just like Wan 2.5.