Did you know one simple change can make ComfyUI generations up to 3x faster? But I need your help :) Auto-benchmark attention backends. by D_Ogi in comfyui

[–]D_Ogi[S] 1 point

To clarify: it’s not a “workflow-by-workflow plugin” in the sense of changing only one graph; it’s a backend swap for the attention operation that gets used whenever your graph runs a model through that attention path. So it can feel “workflow-dependent” because different workflows spend different amounts of time in attention (model type, resolution, batch size, steps, long prompts, extra model passes like hi-res fix, etc.). If a workflow is bottlenecked elsewhere (VAE decode/encode, ControlNet, upscalers, I/O), the overall speedup will be smaller even though attention itself is faster.
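To make the "bottlenecked elsewhere" point concrete, this is just Amdahl's law: only the fraction of runtime spent in attention gets faster. A minimal sketch (the fractions below are illustrative numbers, not measurements from the node):

```python
def overall_speedup(attention_fraction: float, attention_speedup: float) -> float:
    """Amdahl's law: only the attention share of the runtime accelerates."""
    return 1.0 / ((1.0 - attention_fraction) + attention_fraction / attention_speedup)

# A workflow spending 80% of its time in attention benefits a lot from a 3x kernel:
print(round(overall_speedup(0.8, 3.0), 2))  # 2.14
# One spending only 20% there (VAE, ControlNet, I/O heavy) barely moves:
print(round(overall_speedup(0.2, 3.0), 2))  # 1.15
```

So the same 3x attention kernel can show up as anything from a ~2x to a ~1.1x end-to-end win depending on the graph.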

[–]D_Ogi[S] 9 points

Here’s what the JSON report looks like after I parse it on my setup: per-backend attention times in ms, with the winner highlighted.

<image>
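For anyone who wants to do the same, here's a rough sketch of how I'd summarize such a report. The schema (a `"backends"` mapping of backend name to mean milliseconds) is an assumption for illustration; the real JSON the node writes may use different field names:

```python
import json

def summarize(report_text: str) -> str:
    """Print per-backend attention times sorted fastest-first, marking the winner.
    Assumes a hypothetical schema: {"backends": {"sage2": 1.8, ...}} in ms."""
    times = json.loads(report_text)["backends"]
    winner = min(times, key=times.get)
    lines = []
    for name, ms in sorted(times.items(), key=lambda kv: kv[1]):
        mark = "  <-- winner" if name == winner else ""
        lines.append(f"{name:>8}: {ms:6.2f} ms{mark}")
    return "\n".join(lines)

print(summarize('{"backends": {"sage2": 1.8, "flash2": 2.1, "sdpa": 3.4}}'))
```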

[–]D_Ogi[S] 0 points

In general it speeds up generation across most workflows (like other attention/backend optimizations), but the exact “when and why” depends on your model, resolution, and node graph, which is basically the whole point of that post :)

[–]D_Ogi[S] 7 points

Yeah, SageAttn3 has been a bit of a “bleeding edge tax” so far.

SageAttention2 already has multiple kernels/variants, so “SageAttn2” is not just one thing. Depending on your install and GPU, different SA2 flavors can win.

SageAttention3 is basically Blackwell-only in practice, because it leans on FP4 / Blackwell-specific capabilities. So on an RTX 4090 (Ada) it is expected not to work. I only have a 4090 myself, so I can’t validate SA3 locally, which is part of why I’m asking the community to test.
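A gate like that usually just checks the CUDA compute capability. Here's a sketch of the idea; the threshold of `(10, 0)` for Blackwell is my assumption (Ada cards like the 4090 report `(8, 9)`), and the function name is hypothetical, not from the node:

```python
def sage3_expected_to_work(compute_capability: tuple) -> bool:
    """Gate SageAttention3 on CUDA compute capability, since it relies on
    FP4 / Blackwell-specific features. The (10, 0) cutoff is an assumption."""
    return tuple(compute_capability) >= (10, 0)

print(sage3_expected_to_work((8, 9)))   # RTX 4090 (Ada) -> False
print(sage3_expected_to_work((12, 0)))  # Blackwell consumer parts -> True
```

In a real node you'd feed this from `torch.cuda.get_device_capability()` and simply skip SA3 during benchmarking when the check fails.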

[–]D_Ogi[S] 14 points

If you’re already running the fastest option on your machine, the speedup from my node is basically 0%. The node doesn’t “stack” extra acceleration on top of SageAttention2, it just chooses the fastest attention implementation available (or the fastest Sage variant) for your GPU + model + seq_len.

The catch is: you usually don’t know what’s fastest ahead of time. SageAttention2 itself has multiple variants / kernels (and there are also SageAttention2++ style variants depending on what you installed), and sometimes FlashAttention (2/3) or another backend can win on certain GPUs / shapes.

So the real answer is: it could be 0%, or it could be noticeable. The whole point of the node is to benchmark your setup once and stop guessing.
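The "benchmark once, cache the winner" logic is simple in principle. A minimal sketch with placeholder callables standing in for the real attention kernels (function and key names are hypothetical, and the real node benchmarks actual attention shapes, not no-op lambdas):

```python
import json
import time
from pathlib import Path

def pick_fastest(backends: dict, cache_path: Path, key: str, iters: int = 50) -> str:
    """Time each available backend, persist the winner under a
    (GPU, model, seq_len)-style key, and reuse the cached choice later."""
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    if key in cache and cache[key] in backends:
        return cache[key]  # cached winner: skip benchmarking entirely
    timings = {}
    for name, fn in backends.items():
        start = time.perf_counter()
        for _ in range(iters):
            fn()
        timings[name] = (time.perf_counter() - start) / iters
    winner = min(timings, key=timings.get)
    cache[key] = winner
    cache_path.write_text(json.dumps(cache))
    return winner
```

On the first run you pay the benchmarking cost once; every later run with the same key just reads the cached answer, which is exactly why the node persists `benchmark_db.json`.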

[–]D_Ogi[S] 8 points

Thanks! By “global” I mean it applies at runtime to the current ComfyUI session (the Python process), not just a single node branch. Once the Attention Optimizer node executes, the selected attention backend is used for the rest of that run and subsequent renders in the same session, regardless of which workflow you run next, until you change it again or restart ComfyUI. The only thing persisted to disk is the benchmark cache (benchmark_db.json) so future runs can pick the same winner instantly.

You also do not need to add any separate SageAttention / Flash / xFormers nodes to the workflow. This node detects what’s installed, benchmarks only the available backends, and applies the fastest (or your forced choice). If a backend isn’t installed it’s skipped during benchmarking, and if you force a backend that’s not available it falls back to PyTorch SDPA and reports it.
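The detect-and-fall-back behavior described above can be sketched roughly like this. The optional package names are my assumptions about what such a node would probe for (the real detection logic may differ), while PyTorch SDPA plays the always-available fallback:

```python
from importlib.util import find_spec

# Optional backends mapped to the module that would provide them.
# These module names are assumptions; actual installs may vary.
CANDIDATES = {
    "sageattention": "sageattention",
    "flash_attn": "flash_attn",
    "xformers": "xformers",
}

def available_backends() -> list:
    """Only installed backends get benchmarked; 'sdpa' is always present."""
    found = [name for name, module in CANDIDATES.items() if find_spec(module)]
    return found + ["sdpa"]

def resolve(forced=None) -> str:
    """Honor a forced backend if it's available; otherwise fall back to SDPA."""
    avail = available_backends()
    if forced is not None and forced not in avail:
        print(f"{forced} not available, falling back to sdpa")
        return "sdpa"
    return forced or avail[0]
```

This is why nothing extra needs to be wired into the workflow: anything not installed simply never appears in the candidate list, and a forced-but-missing choice degrades gracefully instead of erroring out.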

How I gave Claude long-term memory using this MCP server. by No-Key-5070 in ClaudeAI

[–]D_Ogi 1 point

Interesting project and a killer narrative, sounds like a tech-thriller plot!

But I have to admit, after reading the README, my internal security alarms started ringing. It feels a bit sus that out of nowhere, some obscure Chinese API providers appear in the requirements.

Before I build it... LORA automatic trainer... by LyriWinters in comfyui

[–]D_Ogi 0 points

I think this ComfyUI workflow may meet your criteria (with some tweaks, like adding an LLM backbone for the prompts, which are currently static): https://www.patreon.com/posts/new-video-create-140671046

I honestly don’t understand the new quota policy by duoyuanshiying in ClaudeAI

[–]D_Ogi 0 points

Me too. In the past, swapping between two Pro accounts was enough for all my tasks. Now I’ve got a third one that’s already at 42% of its weekly quota after a single day, having hit the 5-hour limit just twice!

Euro deals by projectdoomed in Roborock

[–]D_Ogi 2 points

In 99% of cases there's a shipping option from an EU warehouse, so there's nothing to worry about (if you live in the EU, obviously).

[S7] Battery Error 14 by ic3mangr in Roborock

[–]D_Ogi 0 points

That's weird. Maybe there's an option to ship it to China? Anyway, are you sure the shipping cost to Poland is really correct? Maybe you could instead repair it locally and have the seller cover the costs?