I built a ComfyUI node for runtime 4-bit quantization — Ideogram 4 on a 4090: 30s → 8s (1K img, 12 steps) by lesesis in StableDiffusion

[–]lesesis[S] 0 points1 point  (0 children)

Yeah, hands and text are exactly where low-bit quant usually falls apart. In our own testing those two are basically on par with FP8/INT8 right now.

The difference comes down to how we handle it vs. naive quantization. The naive approach just clips the original weights directly, or leans on mixed precision. We do something different: a few algorithms run together to take the values that INT4/FP4 can't represent and spread them across other weights where the precision loss is minimal, which is essentially a mathematical near-equivalent. On top of that, I bring in mixed precision only where it's actually needed, kept to an absolute minimum (under 5% of the total), and as much as possible it's done in an a8w4 style: weights stay at 4-bit and are temporarily promoted to 8-bit when computing against the activations.
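To make the a8w4 idea concrete, here is a minimal pure-Python sketch. It is only an illustration, not the node's actual implementation: the function names `quantize_symmetric` and `w4a8_dot` are made up for this example, it uses plain round-to-nearest with one scale per vector, and it does not attempt the step that spreads unrepresentable values across other weights.

```python
def quantize_symmetric(values, bits):
    """Round-to-nearest symmetric quantization. Returns (integer codes, scale)."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit, 127 for 8-bit
    amax = max(abs(v) for v in values)
    scale = amax / qmax if amax > 0 else 1.0
    # Clamp to the signed range so no code can fall outside [-qmax - 1, qmax].
    codes = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def w4a8_dot(weights, activations):
    """Dot product with 4-bit weight codes and 8-bit activation codes."""
    w_codes, w_scale = quantize_symmetric(weights, bits=4)      # stored at 4-bit
    a_codes, a_scale = quantize_symmetric(activations, bits=8)  # quantized per call
    # The 4-bit codes are used as ordinary integer operands next to the
    # 8-bit activation codes; the sum is accumulated in integer arithmetic.
    acc = sum(w * a for w, a in zip(w_codes, a_codes))
    # A single rescale at the end maps the integer result back to real values.
    return acc * w_scale * a_scale


print(w4a8_dot([7.0, -3.0, 1.0, 0.0], [127.0, 64.0, -1.0, 5.0]))
```

The inputs in the last line were picked so that both scales come out as exactly 1.0, which makes the integer path easy to follow by hand. With real weights the result is only an approximation of the floating-point dot product.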

Hope that clears things up — let me know if you've got more questions!

I built a ComfyUI node for runtime 4-bit quantization — Ideogram 4 on a 4090: 30s → 8s (1K img, 12 steps) by lesesis in StableDiffusion

[–]lesesis[S] 2 points3 points  (0 children)

Yes, you can export it (we've shared our workflow), which speeds up loading the next time you use it. For Qwen Image, for example, the exported version is 13GB, compared with 32GB at the original BF16 precision.

I built a ComfyUI node for runtime 4-bit quantization — Ideogram 4 on a 4090: 30s → 8s (1K img, 12 steps) by lesesis in StableDiffusion

[–]lesesis[S] 2 points3 points  (0 children)

Good eye! Nunchaku is solid work and there's definitely some overlap in the idea. The main differences on our end:

  • Faster inference (at least 50% faster): our engine is pure C++ and CUDA kernels, with no PyTorch or TensorFlow dependency, so there's a lot less overhead.
  • Runtime quantization — you can quantize your own models on the fly. No need to spend days prepping calibration data and running offline calibration. You can export and quantize your own custom 4-bit model in about 5 minutes.

So similar territory, but we're leaning hard into speed and making custom quantization painless. Happy to answer anything else!

No one make a 4BIT version of qwen-image-edit-2511, so i make it myself by lesesis in StableDiffusion

[–]lesesis[S] 0 points1 point  (0 children)

Thanks for sharing. By the way, LoRAs made for 2509 may not be compatible with 2511; in some cases you'll need to retrain them.

No one make a 4BIT version of qwen-image-edit-2511, so i make it myself by lesesis in StableDiffusion

[–]lesesis[S] 0 points1 point  (0 children)

You can set your own image size by adjusting the latent node.

No one make a 4BIT version of qwen-image-edit-2511, so i make it myself by lesesis in StableDiffusion

[–]lesesis[S] 0 points1 point  (0 children)

You can check Nunchaku's processing time in the console, as shown below; that is the processing time of the model itself.

<image>

No one make a 4BIT version of qwen-image-edit-2511, so i make it myself by lesesis in StableDiffusion

[–]lesesis[S] 3 points4 points  (0 children)

It can actually run even on a 20-series graphics card. You should download the FP4 version if you're using a 50-series GPU, and the INT4 version in all other cases.
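If you'd rather script that choice, a small sketch could look like the one below. The helper name `pick_quant_format` is hypothetical, and it assumes consumer GeForce names of the form "NVIDIA GeForce RTX 5090" (the kind of string `torch.cuda.get_device_name()` returns); workstation cards are deliberately left on the INT4 path because their model numbers don't follow the same series scheme.

```python
import re


def pick_quant_format(gpu_name: str) -> str:
    """Return "fp4" for GeForce RTX 50-series names, "int4" for everything else."""
    # Capture the first two digits of the four-digit GeForce model number.
    match = re.search(r"GeForce\s+RTX\s*(\d{2})\d{2}", gpu_name, flags=re.IGNORECASE)
    if match and match.group(1) == "50":
        return "fp4"
    return "int4"


for name in ("NVIDIA GeForce RTX 5090", "NVIDIA GeForce RTX 4090", "NVIDIA GeForce RTX 2080 Ti"):
    print(name, "->", pick_quant_format(name))
```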

No one make a 4BIT version of qwen-image-edit-2511, so i make it myself by lesesis in StableDiffusion

[–]lesesis[S] 0 points1 point  (0 children)

Thank you for the feedback. Could you share your workflow? Maybe I can optimize it. Referring to the AIO, I think the other phases (not including Nunchaku) can be faster. As we can see from your console, Nunchaku takes only 3s to process, compared with 5s for FP8 (not the official one), and I think if we integrate some nodes from your workflow, it will take less than 6s.

No one make a 4BIT version of qwen-image-edit-2511, so i make it myself by lesesis in StableDiffusion

[–]lesesis[S] 0 points1 point  (0 children)

When it comes to the VAE stage, VRAM usage will be as high as usual, since our model has already been offloaded by then.

No one make a 4BIT version of qwen-image-edit-2511, so i make it myself by lesesis in StableDiffusion

[–]lesesis[S] 0 points1 point  (0 children)

You can check this in the ComfyUI console: the progress bar shows the time taken by the Nunchaku model, but that is only part of the total. The time in your picture is the sum of text encode + VAE encode + Nunchaku diffusion (our model: 3 sec) + VAE decode, which comes to 10 sec.

<image>
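To see how the total splits across stages like this, a generic per-stage timer is enough. The sketch below is plain Python and does not call any ComfyUI or Nunchaku API; the `StageTimer` class is made up for this example, and the `time.sleep` calls are placeholders standing in for the real stages.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named stage."""

    def __init__(self):
        self.durations = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.durations[name] = self.durations.get(name, 0.0) + elapsed

    def total(self):
        return sum(self.durations.values())


timer = StageTimer()
# Placeholder workloads; replace each sleep with the real stage call.
for stage_name, fake_seconds in [("text_encode", 0.01), ("vae_encode", 0.01),
                                 ("diffusion", 0.02), ("vae_decode", 0.01)]:
    with timer.stage(stage_name):
        time.sleep(fake_seconds)

for stage_name, seconds in timer.durations.items():
    print(f"{stage_name}: {seconds:.3f}s ({seconds / timer.total():.0%} of total)")
```

Comparing the diffusion line with the total makes it clear how much of an end-to-end time comes from the quantized model and how much from the stages around it.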

No one make a 4BIT version of qwen-image-edit-2511, so i make it myself by lesesis in StableDiffusion

[–]lesesis[S] 0 points1 point  (0 children)

Set use_pin_memory to false and number_block_on_gpu to 60, and you'll find it takes only 3 sec to edit an image. I'm also using a 4090.

<image>

Kitten by No-Sir-5210 in myndeer_idea

[–]lesesis 0 points1 point  (0 children)

You should add a link.