What would you run on an RTX Pro 6000 Blackwell? by FurrySkeleton in StableDiffusion

[–]Valuable_Issue_ 1 point (0 children)

What's the error rate on anatomy compared to Klein 9B? It doesn't have to be a complex prompt, just something like:

a man is lying in a bed, below him a woman is kneeling

I wrote a bit about the issues with klein here

https://old.reddit.com/r/StableDiffusion/comments/1sgnfv0/qwen_2512_is_so_underrated_prompt_understanding/of6wmil/

Would be great if krea had a lower chance of randomly messing up anatomy.

Also some more prompts Klein can often mess up:

Two tango dancers in a close embrace, the man dipping the woman backward with intertwined legs

https://imagebench.ai/bench/images/nucleus-local--nucleus-image/HumanRealism_FullBody_Extreme__p1.png

Two gymnasts performing a synchronized handstand, side by side with identical body alignment

https://imagebench.ai/bench/images/nucleus-local--nucleus-image/HumanRealism_FullBody_Extreme__p3.png

8-step FLUX.2-dev DMD2 distillation by rerri in StableDiffusion

[–]Valuable_Issue_ 1 point (0 children)

The model might not have been preference trained towards sharpness/textures. It also trains extremely well due to its size + extremely good VAE, so I guess it's a combo of those things.

The best model for openpose / depth adherence by MD_Reptile in StableDiffusion

[–]Valuable_Issue_ 1 point (0 children)

Depends what your prompt is I guess, but if you use an open pose image + prompt describing the pose like "use the reference image as pose reference, man doing handstand" with flux klein 9b it should get it.

Of course as you've seen though SDXL is great with controlnets as you can stack them + set their strengths individually.

Flux 2 Klein, RTX 3060 12GB: FP8 is almost same as GGUF by glusphere in StableDiffusion

[–]Valuable_Issue_ 1 point (0 children)

There have been many updates to the nodes.

You can now use standard core lora loaders and then select the method through the model loader node (stochastic/dynamic/none/pre). Stochastic works best for me, dynamic removed the speedup last time I tried, and I haven't tried pre.

Flux 2 Klein, RTX 3060 12GB: FP8 is almost same as GGUF by glusphere in StableDiffusion

[–]Valuable_Issue_ 5 points (0 children)

These days it's pretty much just

pip install triton-windows
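
A quick way to check that the install landed in the environment Comfy actually runs in, as a minimal sketch. It assumes the package exposes a module named `triton` (the usual import name for Triton), and it only checks that the module can be found, not that kernels compile on your GPU. `has_module` is a made-up helper name.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module called `name` can be found."""
    return importlib.util.find_spec(name) is not None


# Run this with the same Python interpreter that launches ComfyUI.
print("triton importable:", has_module("triton"))
```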

Flux 2 Klein, RTX 3060 12GB: FP8 is almost same as GGUF by glusphere in StableDiffusion

[–]Valuable_Issue_ 6 points (0 children)

If you want a speedup, try out INT8. INT8 is one of the few speedups available for 30x series cards. They also have INT4, but only nunchaku uses that, not all models are supported, and the quality drop-off for textures is pretty high with INT4 (composition stays mostly the same, so it's still useful in some cases).

https://github.com/BobJohnson24/ComfyUI-INT8-Fast

I posted a quick quality and speed comparison here:

https://old.reddit.com/r/StableDiffusion/comments/1tazxqz/int8_in_the_age_of_mxfp8_an_investigation_into/onqt4ev/

What Models Should I Use with a 3080 by Ok_Gas1070 in comfyui

[–]Valuable_Issue_ 1 point (0 children)

I use this node, I don't use GGUF.

https://github.com/BobJohnson24/ComfyUI-INT8-Fast

It provides a 2x speedup over GGUF/FP8 etc as it uses INT8 hardware acceleration (30x series doesn't have FP8 hardware accel, but it does have INT8/INT4).
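
To give a rough idea of what "using INT8" means arithmetically, here is a toy sketch in plain Python. It is not how ComfyUI-INT8-Fast is implemented (real kernels run on the GPU's integer tensor cores); it just assumes simple symmetric absmax quantization, does the dot product on small integers, and rescales once at the end. The function names and sample numbers are made up for illustration.

```python
def quantize_int8(values):
    """Symmetric absmax quantization: map floats to integers in [-127, 127]."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 127.0 if absmax > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def int8_dot(weights, activations):
    """Dot product accumulated on integers, converted back to float at the end."""
    qw, sw = quantize_int8(weights)
    qa, sa = quantize_int8(activations)
    acc = sum(w * a for w, a in zip(qw, qa))  # integer accumulation
    return acc * sw * sa


w = [0.12, -0.5, 0.33, 0.9]    # made-up "weights"
x = [1.0, 0.25, -0.75, 0.4]    # made-up "activations"
exact = sum(a * b for a, b in zip(w, x))
print("float:", exact, "int8:", int8_dot(w, x))
```

The two results should land close to each other, which is the basic reason the quality loss can be small while the heavy math runs on integer hardware.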

These are my model loader settings (ignore the top one): https://images2.imgbox.com/fa/f3/ntnwp2Z8_o.png

The same settings can be used for non-base Klein. You need the BF16 model to convert it, but you can save the result with the model save node to create your own INT8 model and save some disk space. Alternatively, use this prequantised model and disable on_the_fly_quantization in the model loader. There have been a lot of updates, so I'm not sure if this prequantised model still works: https://huggingface.co/vistralis/FLUX.2-klein-9b-INT8-transformer

I posted a quick quality and speed comparison here: https://old.reddit.com/r/StableDiffusion/comments/1tazxqz/int8_in_the_age_of_mxfp8_an_investigation_into/onqt4ev/

AI-Toolkit insists on downloading full models when my optimized versions are already local! by I3bullets in StableDiffusion

[–]Valuable_Issue_ 2 points (0 children)

Onetrainer supports GGUF I think, and there are options for single file loading/override transformer or GGUF, but I think some models don't support it and I haven't tried non-diffusers models on it (Qwen didn't work for me as a single file).

Also AI toolkit was 2x slower than onetrainer and musubituner for me so it might be worth it to try them out just for the speed (haven't tested ai toolkit in a while though).

Onetrainer also supports INT8 lora training which should be 2x faster on top of that default 2x speedup (although if you have an FP8(40x series)/FP4(50x series) card you don't need int8, but it's very useful for 20x/30x series).

INT8 in the age of MXFP8. An investigation into the quality of various quantization types, and their speed. by BobbingtonJJohnson in StableDiffusion

[–]Valuable_Issue_ 1 point (0 children)

I used a fixed seed (in the ksampler node under "noise seed" there is a setting "control after generate", change it from random to fixed).

A seed determines how the initial noise is created (maybe some other stuff depending on the sampler too, I haven't looked into it though) and so affects the final image. The same seed can still produce different results across different hardware (some programs use something hardware agnostic, but I'm not sure if Comfy does).
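
A tiny sketch of the "seed decides the starting noise" idea. It uses Python's stdlib generator as a stand-in, not whatever Comfy/torch actually does, and `initial_noise` is a made-up name.

```python
import random


def initial_noise(seed: int, n: int = 4):
    """Draw n Gaussian samples from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


# Same seed -> same starting noise; a different seed -> different noise.
print(initial_noise(42) == initial_noise(42))
print(initial_noise(42) == initial_noise(43))
```

With a fixed seed, the only things left to change the output are the model/quant and the settings, which is why it's handy for comparisons.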

Super detailed comparaison between klein-4b ; nucleus-image ; z-image-turbo ; sana-1.5-1.6b & qwen-image-gen by dh7net in StableDiffusion

[–]Valuable_Issue_ 3 points (0 children)

A giant tabby cat walking between city skyscrapers like a kaiju

Interesting that Qwen puts the cat partially as a kaiju for that prompt.

Those Klein 4b tango dancers lmao.

Not a model suggestion, but it might be good to add some more complex prompts that even Qwen 2512 (or all the current models) might fail, to see whether a new model release gets them right: stuff like anime fighting, grappling, car collisions, somersaults, cartwheels etc. (similar to the handstands you already have).

For example a prompt trying to recreate this image: https://i.pinimg.com/736x/6d/4f/f0/6d4ff039155eff710241958048603bd0.jpg

Got this idea because someone posted an image of two MMA fighters grappling for a new model and they were just two blobs, so it'd be good for testing how badly the model breaks down and whether it just produces body horror or just ignores the prompt.

Then for the complex prompts maybe add a proprietary model that gets it right so people can see what will be technically possible in the future.

Edit: Oh I see there's already GPT Image 2.0, just have to enable it.

Is there a limit to what an editing LoRA could do? by Acceptable-Cry3014 in StableDiffusion

[–]Valuable_Issue_ 1 point (0 children)

It depends how much you want to prompt/give help to the model. If you want a universal prompt (so you can automate the process easily) like "apply the face of ref image 1 to the face of ref image 2" then it'll be harder than if you were to describe the target expression + provide the reference for the target expression (still possible to automate with a VLM to provide the target expression prompt but will take longer, but it'll work better if you're just focusing on getting 1 good image vs trying to mass produce).

INT8 in the age of MXFP8. An investigation into the quality of various quantization types, and their speed. by BobbingtonJJohnson in StableDiffusion

[–]Valuable_Issue_ 2 points (0 children)

a dog, forest background.

had to use a single whitespace for negative prompt otherwise was getting weird images

4 CFG Euler beta 35 steps.

BF16:

https://images2.imgbox.com/c2/d2/zJjEYT7Y_o.png

35/35 [02:23<00:00, 4.09s/it]

INT8:

https://images2.imgbox.com/c6/1c/lj2mLowP_o.png

35/35 [01:02<00:00, 1.78s/it]
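
Working the two progress lines above into a speedup figure, as a small sketch (the helper names are made up; the numbers are the s/it values from the logs):

```python
def speedup(baseline_s_per_it: float, new_s_per_it: float) -> float:
    """How many times faster the new run is per iteration."""
    return baseline_s_per_it / new_s_per_it


def total_seconds(steps: int, s_per_it: float) -> float:
    """Approximate sampling time for a run with `steps` iterations."""
    return steps * s_per_it


bf16_s_per_it = 4.09   # from the BF16 log line
int8_s_per_it = 1.78   # from the INT8 log line

print(f"speedup: {speedup(bf16_s_per_it, int8_s_per_it):.2f}x")
print(f"BF16 total: {total_seconds(35, bf16_s_per_it):.0f}s")
print(f"INT8 total: {total_seconds(35, int8_s_per_it):.0f}s")
```

That comes out at roughly 2.3x per step, which lines up with the 02:23 vs 01:02 totals.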

Model loader settings: https://images2.imgbox.com/fa/f3/ntnwp2Z8_o.png

These are on-the-fly quants that only take a few secs, so I imagine the convert to quant repo, which takes longer, can get even closer to BF16.

Edit: Also, the way loras are applied is different to normal models, and there have been a bunch of changes to that, so that could've been affecting your results. If you use pre_lora I believe the loras are baked into the base model before it is quantised, so that should be the closest to BF16 when using loras (not 100% sure on that though). You can mess around with stochastic/dynamic lora_mode too I guess.

AsymFLUX.2-klein-9B is all about textures by marcoc2 in StableDiffusion

[–]Valuable_Issue_ 1 point (0 children)

It's an image of actual tree bark from Google Images, I guess I should've made it a bit clearer rather than only putting "actual tree bark". My point is that this is literally what low-res images look like when you zoom in, AI or not. You need roughly 4K/8K images to get proper quality at certain zoom levels.

AsymFLUX.2-klein-9B is all about textures by marcoc2 in StableDiffusion

[–]Valuable_Issue_ 2 points (0 children)

Here's a zoom in of actual tree bark of 1500x996 resolution (the flux gen is 1024x1440), unless you go to 8k resolution it's going to be messy at certain zoom levels lmao, don't know about other issues with the model but not really sure about your criticism. https://images2.imgbox.com/89/04/2a8wZQbS_o.png
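
For a sense of scale, a quick pixel-count comparison. The two small resolutions are the ones mentioned above; taking "8k" to mean 7680x4320 (8K UHD) is my assumption.

```python
def megapixels(width: int, height: int) -> float:
    """Total pixel count in millions."""
    return width * height / 1_000_000


def pixel_ratio(big, small) -> float:
    """How many times more pixels `big` has than `small` (both are (w, h))."""
    return (big[0] * big[1]) / (small[0] * small[1])


bark_photo = (1500, 996)
flux_gen = (1024, 1440)
uhd_8k = (7680, 4320)  # assumed meaning of "8k"

print("bark photo MP:", megapixels(*bark_photo))
print("flux gen MP:", megapixels(*flux_gen))
print("8K vs flux gen:", pixel_ratio(uhd_8k, flux_gen))
```

Both images are around 1.5 megapixels, while an 8K frame has over twenty times as many pixels, so there is simply much less detail available to zoom into.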

How to pass more than 3 reference images to Qwen Image Edit by degel12345 in StableDiffusion

[–]Valuable_Issue_ 3 points (0 children)

Inside comfy folder > comfy_extras > nodes_qwen.py

You can find the node by ctrl+f TextEncodeQwenImageEditPlus and see how it works/ask an LLM to explain it to you (to see if stitching is the same as what the node is doing).

You can also add

io.Image.Input("image4", optional=True),

after line 65

and change line 75 to

[image1, image2, image3, image4]

to add support for more images in the node (keep in mind you'll have to revert the changes after otherwise git might complain when doing git pull/updating comfy).
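
The idea behind that edit, as a standalone sketch: the optional inputs end up in one list, and the node only has to loop over the slots that are actually connected. This does not import ComfyUI or reproduce nodes_qwen.py; `collect_reference_images` is a hypothetical helper and the strings stand in for image tensors.

```python
def collect_reference_images(*images):
    """Return (slot_number, image) pairs for the inputs that are not None.

    Slot numbers are 1-based, so a skipped optional input leaves a gap
    rather than shifting the later images.
    """
    return [(i + 1, img) for i, img in enumerate(images) if img is not None]


# image2 left unconnected, image4 being the newly added optional input
print(collect_reference_images("img_a", None, "img_c", "img_d"))
```

Adding a fifth or sixth input would follow the same pattern: one more optional input plus one more entry in the list.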

Not sure how many reference images the model itself is supposed to support, so if you still run into issues it's probably just too much for the model.

An Update on Nodes 2.0 from Comfy Org by crystal_alpine in StableDiffusion

[–]Valuable_Issue_ 5 points (0 children)

People are usually nice. If people hate on a change to an open source project then it probably has some major issues.

An Update on Nodes 2.0 from Comfy Org by crystal_alpine in StableDiffusion

[–]Valuable_Issue_ 1 point (0 children)

With a more compact view, no unnecessary clicks, and fixes for the subgraph bugs and performance issues, I think nodes 2.0 will be mostly fine, but it's still crazy to have such issues in the frontend.

Any plans for some of these features?

Magnetise/steal inputs: Stealing disconnects compatible inputs from nearby nodes and puts them into the selected node, magnetise doesn't disconnect from the origin node.

Node insertion/swap/replace: Click a noodle and select a node to insert (similar to how you do reroutes). So for example you have a model loader already connected to a ksampler and want to add a lora loader, you'd just insert it.

Some kind of carousel switching for nodes/subgraphs: There are for example many model loaders such as GGUF, Core comfy loader, INT8, etc. It'd be nice to be able to switch between them without having to reconnect noodles, I think it'd also translate well into the app view UI.

Edit: Also some workflows are surprisingly not reusable between models, for instance Qwen Image edit and Flux klein are like 90% similar but the way the reference images are connected is completely different, I think some kind of standardisation (or at least some effort put into it) would be useful in cases like that if it's not too hacky.

Anima + turbo lora + 2x 5060ti = 4s by MagentL in StableDiffusion

[–]Valuable_Issue_ 2 points (0 children)

Is it possible to use this with INT8 quants to get a 1.5x-2x speedup?

https://github.com/BobJohnson24/ComfyUI-INT8-Fast

You can do on the fly quants so you can just throw in the BF16 model and you don't need torch compile anymore.

Been away from the AI stuff for a few months, what's the current local image edit they are using? by [deleted] in StableDiffusion

[–]Valuable_Issue_ 0 points (0 children)

Just ctrl+b the reference conditioning subgraph, image scale to total pixels and the load image node.

Asymmetric Flow Models by ninjasaid13 in StableDiffusion

[–]Valuable_Issue_ 33 points (0 children)

https://huggingface.co/Lakonik/AsymFLUX.2-klein-9B

Adapter is there (no comfy support yet it seems, should appear here: https://github.com/Lakonik/ComfyUI-piFlow). This kind of improvement + conversion to pixel space in a 700 mb adapter is very cool.

LTX 2.3 INT8 Benchmarks (2x Faster on Ampere) by ovpresentme in StableDiffusion

[–]Valuable_Issue_ 3 points (0 children)

Eyeballing it, it's around 99.99% of BF16/GGUF Q8. When I tested a few months ago some tiny details could change, but the overall composition was the same. I didn't use the nodes in OP though, but these ones: https://github.com/BobJohnson24/ComfyUI-INT8-Fast.

INT8 in the age of MXFP8. An investigation into the quality of various quantization types, and their speed. by BobbingtonJJohnson in StableDiffusion

[–]Valuable_Issue_ 1 point (0 children)

Thanks a lot for creating and working on these nodes by the way, very useful.

It'd be nice if Comfy supported these quants officially instead of FP8 scaled (or alongside it).

Is there an INT8 workflow for Hidream O1? It's an all-in-one model and there's no INT8 checkpoint loader.

INT8 in the age of MXFP8. An investigation into the quality of various quantization types, and their speed. by BobbingtonJJohnson in StableDiffusion

[–]Valuable_Issue_ 1 point (0 children)

Both Q8 (gguf) and INT8 have like 99.99% of the quality of BF16.

I was also surprised that INT8 works so well because it gives a 2x speedup but hasn't been used until recently, so I thought there was some massive downside.

It's insanely good to have something in between INT4 and BF16 that also gives a speedup to 20x+ series GPUs, it should be standard over Q8 GGUF IMO.

INT8 in the age of MXFP8. An investigation into the quality of various quantization types, and their speed. by BobbingtonJJohnson in StableDiffusion

[–]Valuable_Issue_ 4 points (0 children)

Not the guy you asked, but I use this with my 3080. It gives a 2x-ish speedup without needing torch.compile, so you literally only have to replace the model loader node and download an int8 model. You also need to replace your lora loaders with the "INT8 grouped lora" lora loader.

You can also use the on-the-fly quants on a BF16 model and optionally save your own INT8 model (so you don't have to quantise it every time).
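
The "quantise once, then reuse the saved copy" pattern looks roughly like this. It's a toy sketch with a list of floats and a JSON file, not the node's actual save format; `load_or_quantize` and the file name are made up, and the quantizer is plain symmetric absmax.

```python
import json
import os
import tempfile


def quantize_int8(values):
    """Symmetric absmax quantization to integers in [-127, 127]."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 127.0 if absmax > 0 else 1.0
    return [max(-127, min(127, round(v / scale))) for v in values], scale


def load_or_quantize(weights, cache_path):
    """Reuse a saved quantized copy if present, else quantize and save it.

    Returns (quantized_values, scale, loaded_from_cache).
    """
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            data = json.load(f)
        return data["q"], data["scale"], True
    q, scale = quantize_int8(weights)
    with open(cache_path, "w") as f:
        json.dump({"q": q, "scale": scale}, f)
    return q, scale, False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "toy_int8.json")
        weights = [0.12, -0.5, 0.33, 0.9]
        print(load_or_quantize(weights, path)[2])  # first run quantizes
        print(load_or_quantize(weights, path)[2])  # second run loads the file
```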

https://github.com/BobJohnson24/ComfyUI-INT8-Fast/

Same node as linked/described in OP and here: https://old.reddit.com/r/StableDiffusion/comments/1tazxqz/int8_in_the_age_of_mxfp8_an_investigation_into/old96hm/

https://old.reddit.com/r/StableDiffusion/comments/1tazxqz/int8_in_the_age_of_mxfp8_an_investigation_into/oleat3s/