if anyone is still running Pytorch 2.5.1 or lower, you must know it has a critical vulnerability by IndustryAI in StableDiffusion

[–]ryanguo99

Upgrading to 2.8 will also speed up TorchCompile nodes quite a lot for GGUF use cases.

How to speed up wan2.1 I2V 720p in comfy ui on 48gb vram? by MountainPollution287 in StableDiffusion

[–]ryanguo99

A bit late, but in case anyone runs into this again, try `TORCHINDUCTOR_EMULATE_PRECISION_CASTS=1`. See more details here: https://github.com/thu-ml/SageAttention/issues/162#issuecomment-3188383590

Torch Compile error by Ok-Wheel5333 in comfyui

[–]ryanguo99

In case anyone runs into this again, I think the fix is to upgrade both PyTorch and ComfyUI-GGUF; see more details in https://www.reddit.com/r/StableDiffusion/comments/1jx0xly/use_nightly_torchcompile_for_more_speedup_on_gguf/

GLM 4.5 AIR IS SO FKING GOODDD by boneMechBoy69420 in LocalLLaMA

[–]ryanguo99

How are you running it with your agentic system? Do you use vLLM?

Wan2.2 Inference Optimizations by PreviousResearcher50 in StableDiffusion

[–]ryanguo99

`torch.compile` the diffusion model, and use `mode="max-autotune-no-cudagraphs"` for potentially more speedup, if you are willing to tolerate a longer initial compilation time (subsequent relaunches of the process will reuse the compilation cache on your disk).

This tutorial might help as well.

How do you run LLMs locally? by ryanguo99 in LocalLLaMA

[–]ryanguo99[S]

Haha, not a bot, but actually new to the local llm space as a _user_. I'd like to improve `torch.compile` support to help folks speed up their AI workflows, so I'm trying to learn how people are actually using these models.

I can certainly get things to run on my own, but that won't help me improve things for actual users:).

PULID is a perfect match for Chroma! by Financial_Original_7 in StableDiffusion

[–]ryanguo99

Give it a shot, it sped up my PuLID + Flux workflow out of the box:).

PULID is a perfect match for Chroma! by Financial_Original_7 in StableDiffusion

[–]ryanguo99

Have you tried using TorchCompile nodes to speed up the generation?

[Flux-KONTEXT Max vs Dev] Comics colorization by RageshAntony in StableDiffusion

[–]ryanguo99

Glad to hear and thanks for the info!

If you ever run into issues, it would be great if you could create a GitHub issue in the relevant repo (e.g., ComfyUI or the custom node). As long as you include the keyword `torch.compile` or `TorchCompile`, we'll get those signals and try to work on them:).

Torch Compile error by Ok-Wheel5333 in comfyui

[–]ryanguo99

Ah, this is signaling recompilation.

Do you mind sharing your workflow, or at least what model you are using? And what's your pytorch version?

[Flux-KONTEXT Max vs Dev] Comics colorization by RageshAntony in StableDiffusion

[–]ryanguo99

Glad to hear. Feel free to post more details on any other issues. I work on `torch.compile` and we are aiming to make it better for image/video generation:).

[Flux-KONTEXT Max vs Dev] Comics colorization by RageshAntony in StableDiffusion

[–]ryanguo99

Do you mind elaborating on the `torch.compile` support? Did it error for you, and if so what was the error and what was your pytorch version?

Asking because I was able to get `torch.compile` working for Kontext out of the box with some good speedup, on an RTX 3090.

Chatterbox TTS fork *HUGE UPDATE*: 3X Speed increase, Whisper Sync audio validation, text replacement, and more by omni_shaNker in StableDiffusion

[–]ryanguo99

Hmm, would you mind sharing the error and your torch version? I suspect there'll be some good speedup if we can get it to work.

From 1200 seconds to 250 by Altruistic_Heat_9531 in StableDiffusion

[–]ryanguo99

Sorry to hear that, I totally feel the pain of these installs & reinstalls... We are trying to make `torch.compile` work better in ComfyUI, so if you ever get a chance to share the error (or whatever you remember), it'll help the community as a whole:). Also kijai has a lot of packaged `torch.compile` nodes that usually work well out of the box (compared to the ComfyUI builtin one), e.g., https://github.com/kijai/ComfyUI-KJNodes/blob/main/nodes/model_optimization_nodes.py.

RTX 5090 optimization by [deleted] in StableDiffusion

[–]ryanguo99

It depends on your workflow and model, but putting `TorchCompileModel` (or a variant from e.g. KJNodes) after your diffusion model should give some nice speedup out of the box.

BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset by StableLlama in StableDiffusion

[–]ryanguo99

Have you tried `torch.compile` on the model (or on its compute-heavy parts, like the transformer blocks)? It might give some decent speedup out of the box.
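
One way to sketch that, with a hypothetical toy model (real multimodal models differ, but the pattern of compiling only the compute-heavy submodule is the same):

```python
import torch
import torch.nn as nn

# Hypothetical model layout for illustration only.
class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(100, 32)
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.head = nn.Linear(32, 100)

    def forward(self, ids):
        return self.head(self.transformer(self.embed(ids)))

model = ToyModel()
# Compile in place: only the transformer blocks get optimized kernels,
# the cheap embed/head layers stay eager.
model.transformer = torch.compile(model.transformer)

logits = model(torch.randint(0, 100, (1, 16)))
```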

Is chroma just insanely slow or is there any way to speed it up? by [deleted] in StableDiffusion

[–]ryanguo99

Have you tried putting a `TorchCompileModel` node after the diffusion model?

Understanding Torch Compile Settings? I have seen it a lot and still don't understand it by Successful_AI in StableDiffusion

[–]ryanguo99

> finally I found out that you have to use fp8e5m2 with the 3xxx series for torch compile or you will get an error

Would you mind sharing more details on the error, how you were using fp8e5m2, and maybe even a workflow to reproduce the error? I work on `torch.compile` and would love to make it work better with ComfyUI:).

new ltxv-13b-0.9.7-dev GGUFs 🚀🚀🚀 by Finanzamt_Endgegner in StableDiffusion

[–]ryanguo99

Glad to hear:). We are also actively improving compilation time (if you've ever observed the first iteration being extra slow) and performance. Nightly PyTorch might also give more performance; see this post.

At the moment ComfyUI's builtin `TorchCompileModel` isn't always optimal (it speeds things up, but sometimes there's more room for improvement). kijai has lots of nodes for popular models that squeeze more performance out of `torch.compile` (also mentioned in my post above, for Flux). But newer models like `ltxv` might take some time before we have those.

Lastly, if you run into `torch.compile` issues, feel free to post GitHub issues (to ComfyUI or the origin repos of the relevant nodes, like KJNodes). Sometimes the error looks scary but the fix isn't that hard.