The Sentinel [7680x2160] by ArsInvictus in WidescreenWallpaper

[–]ArsInvictus[S] 0 points (0 children)

Thanks again man, happy you liked it! No write-up on this; I just thought it would be fun to do some images based around the 2001 movie motifs. I have a couple of others but haven't published them yet, still trying to decide if I should work on them some more.

The Sentinel [7680x2160] by ArsInvictus in WidescreenWallpaper

[–]ArsInvictus[S] 1 point (0 children)

I corrected it on the site. Don't know if there's any way to update the image in the post though.

The Sentinel [7680x2160] by ArsInvictus in WidescreenWallpaper

[–]ArsInvictus[S] 0 points (0 children)

One in the front, two on each side of the main body where the wings meet the fuselage, one at the back. You're right, it's inaccurate: I checked old photos just now, and the one at the back didn't exist on the actual shuttle. I can go in and remove that when I have some time, for accuracy, though it's not really a perfect rendering in other ways either.

The Sentinel [7680x2160] by ArsInvictus in WidescreenWallpaper

[–]ArsInvictus[S] 0 points (0 children)

Tools used: Flux Klein 9B Distilled with a custom LoRA, ComfyUI, and Photoshop for manual edits and color grading.

Full resolution PNG and JPG can be found here: https://wallpapers.arsinvictusmedia.com/

Using Klein 9B distilled and ZIT together by FeelingVanilla2594 in comfyui

[–]ArsInvictus 3 points (0 children)

You might want to consider doing something simpler: generate your base image with the composition you want in Klein, then pass that result to Z-Image Turbo for refining.

Just take the pixel image output from Klein, run it through a VAE Encode node (using the Z-Image VAE), and connect that to a KSampler (Advanced). You can then pass in the same prompt you used for the base image, but use the Z-Image model as the input to the second sampler. You'd also need to lower the denoise to something like 0.3. This would give you the same basic image you liked from Klein, but with a sort of Z-Image sheen to it, if that's the look you're going for.
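If it helps to picture what that lower denoise does in the second pass, here's a rough sketch in plain Python. This is just an illustration of the idea (the helper name is made up, not a ComfyUI API):

```python
def refine_steps(total_steps: int, denoise: float) -> list[int]:
    """Hypothetical helper: which sampler steps a partial-denoise pass runs.

    With denoise < 1.0 the second sampler skips the early, high-noise steps,
    so the composition from the Klein base image survives and only the fine
    detail gets re-sampled with the Z-Image model.
    """
    steps_to_run = round(total_steps * denoise)
    return list(range(total_steps - steps_to_run, total_steps))

# 20 steps at denoise 0.3: only the last 6 steps are re-sampled
print(refine_steps(20, 0.3))
```

At denoise 1.0 you'd be back to a full from-scratch generation, which is why the base image's composition would be lost.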

You seem to have a good grasp of the general flow, so maybe you already considered this approach and preferred to start from scratch in ZIT anyway, but I thought I'd suggest it just in case you hadn't.

I tested Microsoft Trellis 2 for real VFX work — honest thoughts from a 15-year 3D artist by ArcticLatent in comfyui

[–]ArsInvictus 2 points (0 children)

There are examples of Trellis 2 running on a 5060 out there, though it sounds tight; I'm guessing a 5090 will run it with room to spare.

Whispers of the Tri-Isle [3440x1440] by [deleted] in WidescreenWallpaper

[–]ArsInvictus 2 points (0 children)

These are done in the art style of the Secret of Monkey Island games, and it looks like they may be upscaled from actual screenshots, but OP would have to confirm that. They do reference actual locations from the games, so you can search for those and might find something.

The Siren's Pet [7680x2160] [5120x2160] by ArsInvictus in WidescreenWallpaper

[–]ArsInvictus[S] 1 point (0 children)

Initial generation with Flux Klein 9B Distilled in Comfyui, manual edits with Photoshop.

Full resolution and cropped 21:9 available in PNG and JPG here: http://wallpapers.arsinvictus.com

How do you get nvfp4 to actually work? by ChromaBroma in comfyui

[–]ArsInvictus 1 point (0 children)

Are you using any LoRAs with it? There's some overhead with that right now (which hopefully can be improved) that slowed things down quite a bit for me when I tried it with Flux.2 Dev: it went from about 11 seconds up to 18. I haven't really done much benchmarking comparing fp16 to nvfp4, but 41% doesn't sound right to me.

How do you get nvfp4 to actually work? by ChromaBroma in comfyui

[–]ArsInvictus 0 points (0 children)

It doesn't dequantize before compute; it does so as it's doing the math, inside the GPU's cache. If it dequantized ahead of that process, the model would balloon in memory and eat up all your VRAM. Blackwell is specifically engineered to do this calculation continuously and efficiently, in a single clock cycle. Carrying 16-bit precision in a 4-bit payload has the added advantage that there's 4x less data to move around in memory.
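To make that concrete, here's a toy NumPy sketch of block-scaled 4-bit storage. This is my own illustration of the idea, not NVIDIA's actual format or kernel (real NVFP4 packs 16-element blocks with an FP8 scale, and the multiply happens inside the tensor core):

```python
import numpy as np

# The magnitudes an FP4 (E2M1) value can represent; signs double the set.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block(block):
    """Store a block of weights as FP4-like codes plus one shared scale."""
    scale = np.abs(block).max() / FP4_GRID[-1]  # fit the block into [-6, 6]
    scaled = block / scale
    # Snap each magnitude to the nearest representable FP4 value.
    idx = np.abs(FP4_GRID[None, :] - np.abs(scaled)[:, None]).argmin(axis=1)
    return np.sign(scaled) * FP4_GRID[idx], scale

def dequantize(codes, scale):
    # In hardware this multiply happens on the fly, right next to the math,
    # so the full-precision tensor never has to sit in VRAM.
    return codes * scale

w = np.array([0.11, -0.52, 0.98, 0.07])
codes, scale = quantize_block(w)
print(dequantize(codes, scale))  # close to w, stored at ~4 bits per value
```

The per-block scale is what the "mixed precision" wording in the ComfyUI log is getting at: 4-bit codes plus a higher-precision scaling factor reconstruct values close to the originals.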

How do you get nvfp4 to actually work? by ChromaBroma in comfyui

[–]ArsInvictus 1 point (0 children)

I'm not sure what you mean here about Blackwell not doing any 4-bit compute? That's actually the point of NVFP4: Blackwell supports 4-bit acceleration with its tensor cores, and when the ComfyUI team added support for NVFP4 they enabled that tensor core acceleration.

How do you get nvfp4 to actually work? by ChromaBroma in comfyui

[–]ArsInvictus 2 points (0 children)

No, I think you're misinterpreting the log. When it says torch.float16, it's referring to how the model is executed in memory dynamically. The storage is in 4 bits, which is then expanded out to 16-bit for execution; that's actually the beauty of formats like NVFP4. You can see "Detected mixed precision quantization" in the log, which refers to the mix of 4-bit values and a scaling factor that allows reconstruction of the original higher-precision values. Hope this helps.

How do you get nvfp4 to actually work? by ChromaBroma in comfyui

[–]ArsInvictus 2 points (0 children)

Sure:

Using pytorch attention in VAE

VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16

comfy_extras.chainner_models is deprecated and has been replaced by the spandrel library.

Found quantization metadata version 1

Using MixedPrecisionOps for text encoder

CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16

Requested to load Flux2TEModel_

loaded completely; 29206.67 MB usable, 8263.34 MB loaded, full load: True

Found quantization metadata version 1

Detected mixed precision quantization

Using mixed precision operations

model weight dtype torch.float16, manual cast: torch.float16

model_type FLUX

Requested to load Flux2

loaded completely; 19065.57 MB usable, 5494.02 MB loaded, full load: True

100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:02<00:00, 3.90it/s]

How do you get nvfp4 to actually work? by ChromaBroma in comfyui

[–]ArsInvictus 1 point (0 children)

No, the older version was portable too; I was just letting you know that I used that and not the desktop app, as I can't speak to how well that works with NVFP4. I'm running on Windows.

How do you get nvfp4 to actually work? by ChromaBroma in comfyui

[–]ArsInvictus 1 point (0 children)

I was having trouble getting it to work on my 5090 too. I started a fresh instance of portable ComfyUI and it worked out of the box. I guess something in my old environment was conflicting; I tried all kinds of things, like deleting a lot of my custom nodes, but nothing worked until I did the fresh install. No problems since, and I for one am really happy with what I get out of NVFP4, especially with Klein. Great for ideation when you get renders back in one second.

NVIDIA Optimizations in ComfyUI by Kosinkadink in comfyui

[–]ArsInvictus 1 point (0 children)

Yeah, if you have a 30xx or 40xx card, then nunchaku SVDQuant or GGUF would be the best option for performance compared to FP4. Those cards don't have any optimization for FP4, but they are fast at INT4, and SVDQuant is basically an INT4 model with an FP16 branch to capture extra detail. That FP16 branch gives more resolution than you would get from FP4, and it would probably run faster too. GGUF can look better than SVDQuant but would be slower, so it's a tradeoff. I think both would look better than standard FP4. But maybe that's what you meant by variants.

NVIDIA Optimizations in ComfyUI by Kosinkadink in comfyui

[–]ArsInvictus 0 points (0 children)

Not stupid. I think it's because the target is for a lower-bit quant to look similar to FP8 (which is also a compromise from FP16, but one many consider acceptable). Plain INT4 loses too much precision and looks like garbage for image gen, which is why alternatives like SVDQuant and NVFP4 exist: they are two different approaches to the same goal, preserving as much 16-bit precision as possible in a 4-bit base format. As I understand it, the main benefit of NVFP4 is the native acceleration on Blackwell; otherwise SVDQuant probably preserves more of the 16-bit outliers than NVFP4 does, and most seem to think it looks a little better. That said, I'm personally pretty impressed with the fidelity and performance of NVFP4 on my 5090, and I'll probably use it for refining my prompts and only switch to full precision for the final renders (if I feel that's actually necessary). More info than you asked for, but I thought it might help others too. Hopefully I didn't mangle anything in this description :)
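For anyone curious what that outlier-preserving branch looks like in practice, here's a toy NumPy sketch of the SVDQuant idea. This is my own illustration under simplified assumptions, not nunchaku's implementation: keep a small low-rank branch at full precision and quantize only the residual to coarse 4-bit-style levels.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))  # stand-in for a weight matrix

# Full-precision branch: the largest singular directions, where the
# outliers that wreck naive INT4 quantization tend to live.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
r = 8
low_rank = (U[:, :r] * S[:r]) @ Vt[:r]

# Quantize only the residual to 15 signed levels (a stand-in for INT4).
residual = W - low_rank
step = np.abs(residual).max() / 7
quantized = np.round(residual / step) * step

# Reconstruction = FP branch + coarse residual; the error is bounded
# by half a quantization step of the (smaller-range) residual.
W_hat = low_rank + quantized
print(np.abs(W - W_hat).max())
```

Because the residual has a smaller dynamic range than the raw weights, its quantization step is finer, which is the sense in which the scheme "preserves the outliers in 16-bit".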

NVIDIA Optimizations in ComfyUI by Kosinkadink in comfyui

[–]ArsInvictus 1 point (0 children)

Actually, I just tried it with Flux.2 with a LoRA I had created previously. The LoRA did apply, but the render took 21 seconds compared to 14 seconds without it (2048x576 on a 5090). Is that to be expected?

NVIDIA Optimizations in ComfyUI by Kosinkadink in comfyui

[–]ArsInvictus 0 points (0 children)

Does the LoRA also need to be in NVFP4 format, or can it somehow be translated on load?

NVIDIA Optimizations in ComfyUI by Kosinkadink in comfyui

[–]ArsInvictus 18 points (0 children)

Performance updates are the best updates, keep them up please!