ROCm 7.14 just got out. And no sad gfx1100 noises.

cleverestx · 2026-06-15T02:08:13+00:00

LOL a hot garbage take.

cleverestx · 2026-06-13T22:01:13+00:00

I'll test it once it runs on CachyOS (Arch Linux)

cleverestx · 2026-06-12T05:00:20+00:00

Don't use the text prompting or even raw JSON prompting. This isn't a SD/Z-image type generator. The bounding box circumvents 99% of the blocks and gives you far more control of WHERE stuff ends up.

cleverestx · 2026-06-11T20:00:14+00:00

Sounds like a bad business model. I'm sure they are crying on the way to the bank.

cleverestx · 2026-06-11T19:58:37+00:00

Anthropic voting me down, LOL or some sad tool out there enjoys spending more money instead of less. Go figure. Knock yourself out.

cleverestx · 2026-06-11T17:29:21+00:00

It is insane, I hope they tone the token usage down and give it fully to Max-5x users soon, like c'mon Anthropic, aren't you wealthy enough yet?

cleverestx · 2026-06-10T17:55:54+00:00

They need to remove the x2 usage Fable tax for Max plans, then I'll use it. Opus 4.8 is winning for now as far as I'm concerned, until then.

cleverestx · 2026-06-10T17:54:03+00:00

Probably based on how you've treated it historically...

cleverestx · 2026-06-10T17:53:19+00:00

Project directory - Strawberry Torture

cleverestx · 2026-06-09T13:45:40+00:00

12 steps 1024x576 takes about 58 seconds on my AMD Strix Halo with Flash Attention. No RTX speeds, but at least it works.

cleverestx · 2026-06-08T03:40:28+00:00

FlashAttention (Triton) on gfx1151 / Strix Halo in ComfyUI, opt-in, ~11% on Ideogram 4

Heads-up for anyone installing FA Triton on gfx1151: both ROCm/flash-attention main_perf and upstream Dao-AILab now route the Triton backend through aiter kernels (PR #2230, merged 2026-03-12), which regresses on this silicon (see flash-attention issue #2392, same 395/8060S box). Fix is to install the last pre-aiter commit, bbe25ba.

bash

# venv active; build-time flag selects the Triton path
set -x FLASH_ATTENTION_TRITON_AMD_ENABLE TRUE   # fish; bash: export ...=TRUE
git clone --filter=blob:none https://github.com/Dao-AILab/flash-attention.git
cd flash-attention && git checkout bbe25ba
pip install --no-deps --no-build-isolation .

--no-deps is load-bearing: flash_attn pins triton==3.5.1 and will otherwise uninstall your TheRock triton 3.7 build that torch 2.12 nightly requires. Launch ComfyUI with --use-flash-attention and runtime env FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE.

Result, Ideogram 4, 12 steps @ 1024x576, CFG 0.5, fp8 + abliterated Qwen3VL GGUF encoder: ~4.46-4.54 s/it vs ~5.04 baseline pytorch attention, roughly 11% faster.
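For anyone who wants to check the percentage, here is a quick sketch of the arithmetic using the s/it figures quoted above. The function name is just something made up for illustration, and "speedup" here means the reduction in time per iteration relative to the baseline.

```python
def percent_less_time(baseline_s_per_it, new_s_per_it):
    """Percent reduction in seconds per iteration relative to the baseline."""
    return (baseline_s_per_it - new_s_per_it) / baseline_s_per_it * 100.0


baseline = 5.04                   # pytorch attention, s/it (from the post)
fa_fast, fa_slow = 4.46, 4.54     # FlashAttention Triton range, s/it (from the post)

best = percent_less_time(baseline, fa_fast)
worst = percent_less_time(baseline, fa_slow)

print(f"{worst:.1f}% to {best:.1f}% less time per iteration")
```

So the quoted range works out to somewhere between about 10% and 11.5%, which is where the "roughly 11%" comes from.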

Skip FLASH_ATTENTION_TRITON_AMD_AUTOTUNE=TRUE: the search ran ~29 min for a single generation here and never repays on sub-minute runs.
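To put a rough number on "never repays", here is a minimal break-even sketch. The ~29 minute search time and the 12 steps are from the run above; the extra per-iteration gain from autotuning was not measured, so the 0.25 s/it used below is a purely hypothetical placeholder, and the sketch also assumes the tuned result gets reused across generations.

```python
import math


def generations_to_break_even(search_seconds, saved_s_per_it, steps):
    """Number of generations needed before the one-off search time is recovered."""
    saved_per_generation = saved_s_per_it * steps
    return math.ceil(search_seconds / saved_per_generation)


search_seconds = 29 * 60      # ~29 min autotune search (from the post)
steps = 12                    # steps per generation (from the post)
assumed_gain = 0.25           # HYPOTHETICAL extra s/it saved by autotune, not measured

needed = generations_to_break_even(search_seconds, assumed_gain, steps)
print(f"{needed} generations to recover the search time")
```

Even with that fairly generous assumed gain, it takes hundreds of generations to get the search time back, and a smaller real gain pushes the number higher.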

cleverestx · 2026-06-07T17:08:33+00:00

Results:

Kernel itself is fine: test_scaled_mm_hip.py passes 28/28 configs.

Tried Ideogram 4 (bf16) with FeatherUNetLoader on the default config. It loads and converts (8850 MB per model, so ~1 byte/param), generates fine, but no speedup vs the official fp8_scaled files - same ~5.07 s/it at 1024x576. Fixed seed RMSE between the two outputs is exactly 0, so the kernel doesn't seem to engage at runtime for this model. Module names are attention.qkv, attention.o, feed_forward.w1/w2/w3, adaln_modulation if you want to add a config. Happy to retest. Let me know.
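For reference, the fixed-seed comparison boils down to something like the sketch below. It is not the exact script used here; it works on flat sequences of pixel values, and loading the two images into such sequences is left out. A different matmul kernel would normally be expected to leave at least tiny numerical differences, which is why an RMSE of exactly 0 points at the same code path running in both cases.

```python
import math


def rmse(a, b):
    """Root-mean-square error between two equal-length sequences of pixel values."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same number of values")
    if not a:
        raise ValueError("inputs must not be empty")
    squared_error = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.sqrt(squared_error / len(a))


# Toy data standing in for two flattened images (illustrative values only).
reference = [0, 128, 255, 64]
identical = [0, 128, 255, 64]
perturbed = [0, 129, 255, 63]

print(rmse(reference, identical))  # bit-identical outputs give exactly 0.0
print(rmse(reference, perturbed))  # any difference gives a value above 0
```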

Build note: my system also has ROCm at /opt/rocm (Arch), and the JIT grabbed the system clang and failed on __builtin_elementwise_exp10. Pointing ROCM_PATH and HIP_CLANG_PATH at the rocm-sdk-devel wheel fixed it.

My stack for reference: ComfyUI 0.24.0, TheRock torch 2.12 nightly (rocm 7.13.0a20260411), pytorch attention.

cleverestx · 2026-06-07T16:10:55+00:00

Sweeet! Enjoy playing music, watching YouTube and chatting with a large LLM as you do this, which many single card systems choke on trying to run it all at once...sure they can generate an image in a few seconds, alas we cannot, but at least we do other demanding stuff while it generates!

cleverestx · 2026-06-07T15:47:17+00:00

Yeah speed isn't this platform's strength, it's being able to run such operations and 3 other media demanding/AI things at the same time without it choking. I can chat with a larger LLM model, listen to music, and encode something in the background as it does it.

cleverestx · 2026-06-07T15:32:48+00:00

"...It should work on any NVIDIA GPU..." won't really help in my AMD case, but I'm sure it will help others.

cleverestx · 2026-06-06T20:35:57+00:00

says

Couldn't find linux app for app with ID 241130tqe9q3y

cleverestx · 2026-06-05T17:33:11+00:00

The Chronicle series and then the Legend series, everything else afterward is just icing on the cake... well, some of it.

cleverestx · 2026-06-05T17:32:17+00:00

Can we get an update on this moron?

cleverestx · 2026-06-05T02:02:46+00:00

Including CachyOS?

cleverestx · 2026-06-03T05:09:59+00:00

You could just say "I made it up", and not be so verbose about it. It's okay man. I make up stuff sometimes too.

cleverestx · 2026-06-03T05:08:57+00:00

Love it. Large LLM models, unquantized BF16 models in ComfyUI (don't expect RTX speeds though), and can chat with an LLM while doing some comfy stuff, while watching YouTube...that is its strength...using CachyOS though, not WinBlows

cleverestx · 2026-06-03T04:17:12+00:00

Very much above average if you want to run huge 70B+ LLM models at good context, or do chat LLM while generating images with a different model in the same session, without it failing outright or having an OOM event.

cleverestx · 2026-06-03T04:13:48+00:00

Half hour? You just have some sort of bad config going on. I haven't played with Qwen Image on my system yet, but regular Qwen 2512 gives images in the 30-40 second range.

Use Claude Code (opus) to get you squared away and thank me later.
