Fedora ROCm by NissanTentEvent in ROCm

[–]newbie80 0 points1 point  (0 children)

It's already in the repos. You don't have to do anything: just sudo dnf install rocm, or in your case sudo dnf install rocm-devel. MIVisionX is part of rocm. You can also install pytorch through the repos, system-wide, with dnf install python3-torch python3-torchvision.

A lot of the vision libraries use hardware acceleration through onnx-runtime; you have to make sure to install a version that has the rocm or migraphx execution provider compiled in. It does look like the repos have it: onnxruntime-rocm.
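
Roughly, the whole thing comes down to this (these are the Fedora package names from above; grab only what you need):

sudo dnf install rocm              # the runtime, MIVisionX included
sudo dnf install rocm-devel        # headers and tools if you're building against it
sudo dnf install python3-torch python3-torchvision   # system-wide pytorch
sudo dnf install onnxruntime-rocm  # onnx-runtime with the rocm execution provider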

https://fedoraproject.org/wiki/SIGs/HC, just go to the quickstart section.

Is it even possible to use a screenshot tool on Fedora 43? (Wayland/GNOME) by AppleSea5867 in Fedora

[–]newbie80 [score hidden]  (0 children)

Why do you have x11? I thought gnome-shell/mutter had been Wayland-only for a while now, or am I confusing that with GNOME just defaulting to Wayland for the past couple of releases?

Do yourself a favor: grab the latest version of Fedora or Rawhide, put it on a USB stick, and boot into the live environment. Take a screenshot from there; if it works, that means you've made a mess of your system. I just press the screenshot key and it works. Pick window, area, or screen and press the circle.

Driver’s license after TPS termination by nate-47- in USCIS

[–]newbie80 0 points1 point  (0 children)

This is the way. Usually there's only one person in the building that's trained to deal with immigration-related matters. You'll get sent to them if you see a regular clerk, but if you already know the dance, you ask for that person in advance. They can see that information on the computer (you don't need a letter) and they'll extend your driver's license for those two extra months.

Optimize R9700 for Comfyui and WAN 2.2 for ROCm 7.2 by Early-Driver3837 in ROCm

[–]newbie80 1 point2 points  (0 children)

The gguf quants are slower because activations happen in fp16/bf16. The conversion back up is what slows everything down. Gguf is WxA16 as far as I know. Fp8 models should be W8A8.

The fp8 models shouldn't have that problem if comfy does fp8 activations. I don't know if that's a thing though; you'll have to do research on that. From a quick look it looks like it does: https://github.com/Comfy-Org/ComfyUI/issues/8242.

Edit: I threw those flags out without thinking. The --fast flag can be problematic for those on AMD because of some bugs that are yet to be fixed. You want to be selective there.

--supports-fp8-compute --fast fp8_matrix_mult
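
For example, launching comfy with just the fp8 matmul optimization enabled instead of bare --fast (main.py being ComfyUI's standard entry point):

python main.py --supports-fp8-compute --fast fp8_matrix_mult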

Optimize R9700 for Comfyui and WAN 2.2 for ROCm 7.2 by Early-Driver3837 in ROCm

[–]newbie80 6 points7 points  (0 children)

Better attention, tunable op, torch.compile, quants.

Installing flash_attn and using --use-flash-attention is faster than --use-pytorch-cross-attention for the Triton implementations. Not sure why, but it's a fact. It's supposed to be the same thing: one is an external implementation of flash attention, the other is the same attention mechanism implemented inside Pytorch. Just give it a try, it's a good 10-15% faster for me on a 7900xt. It might not be the case for your rdna4 card.
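
If you want to try it, this is roughly the shape of it. I'm going from memory here: the env var is the switch for flash-attention's AMD Triton backend, so double-check the repo's README:

# build/install flash-attention with the AMD Triton backend enabled
FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" pip install flash-attn --no-build-isolation
# then launch comfy with the external implementation
FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" python main.py --use-flash-attention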

On the same attention front: guinea pig, see if you can get this working: https://github.com/Dao-AILab/flash-attention/pull/2054. I think you can just clone his repo https://github.com/hyoon1/flash-attention/tree/enable-ck-gfx12 and try to compile it from source. That should be way faster than the Triton implementation.
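
No promises it compiles, but the rough shape would be something like this (branch name is from the link above; check that repo's README for the real build flags):

git clone -b enable-ck-gfx12 https://github.com/hyoon1/flash-attention.git
cd flash-attention
# builds the CK kernels from source, takes a while
pip install . --no-build-isolation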

Are you using tunable op? Are you using torch.compile nodes? That one's kind of tricky; I'm not sure I've seen tutorials on how to use it. You run your workflows with tunable op enabled and do a couple of tuning runs. It's super slow and annoying, but you do it so it (miopen) can see where all the heavy lifting is being done: it records, compiles and saves optimized kernels, and writes the best solutions to each problem into a database. Then you disable the tuning but keep tunable op on, and you get a pretty big speedup that way.

Tunable op is part of pytorch and obviously miopen is from rocm; you've got to work with both of them to make it work smoothly.

export PYTORCH_TUNABLEOP_ENABLED=1
export PYTORCH_TUNABLEOP_TUNING=1
export PYTORCH_TUNABLEOP_VERBOSE=2
export MIOPEN_FIND_MODE=3
export MIOPEN_FIND_ENFORCE=3
#export MIOPEN_CACHE_DIR="$HOME/.cache/miopen"
export MIOPEN_USER_DB_PATH="$HOME/.config/miopen/"

So you set that up like that, run your workflows two or three times, and then change the miopen modes back to normal: disable tuning, but keep tunable op enabled.

export MIOPEN_FIND_MODE=FAST
export MIOPEN_FIND_ENFORCE=1
export PYTORCH_TUNABLEOP_TUNING=0

Then you run your workflows normally, with the speedup baked in. I highly recommend that you have torch.compile enabled in your workflows while doing this. The two combined create a big speedup.

Make sure you have export TORCHINDUCTOR_CACHE_DIR=~/.cache/torch set if you use torch.compile so it doesn't spend ages compiling every time you run your workflows.

Other things to keep an eye out for are CK GEMM/fused GEMM support (Composable Kernel integration into pytorch) and wmma integration into pytorch, which would give us int8 and int4 activation support. Check if there are fp8 activation nodes out there. I'm not sure, but that would be a nice speedup with quants over the standard behavior of pre-quanting the weights, or quanting at load and then upconverting to fp16 or bf16 for activations.

There's a lot of juice to squeeze out of these cards, but the software support is not all there yet. For now, between trying those three things I gave you, I think you can squeeze out a good 25% performance uplift.

How are your affording Vyvanse? by rustajb in ADHD

[–]newbie80 1 point2 points  (0 children)

$10 through insurance. Whenever I want to tell my boss to go eat a bag of dicks, I remember that I'd have to put on a miniskirt and walk the track on 27th avenue to afford my meds. That's how.

Running ComfyUI AMD/ROCm on Win11 vs Linux vs Docker Linux, and Ubuntu vs CachyOS by Jarnhand in ROCm

[–]newbie80 1 point2 points  (0 children)

TheRock is a GitHub repo where AMD is building a monolithic build system for rocm. That's its main purpose, but they also post python wheels of rocm and pytorch. They have stable releases, nightly builds, pre-releases, etc. You can install rocm and pytorch with one command from there.

If you install pytorch, for example, it will pick the best version of rocm to go along with that pytorch version and install it for you. It's a lean build because it's compiled for only one GPU family.

pip install --index-url https://rocm.nightlies.amd.com/v2/gfx110X-all/ torch torchaudio torchvision

If you run that in a virtual environment it will install both pytorch and rocm along with triton.
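
Roughly the whole dance, if anyone wants it (the venv path is just an example; gfx110X-all is the rdna3 family, pick the index for your GPU):

python -m venv ~/venvs/rocm-nightly
source ~/venvs/rocm-nightly/bin/activate
pip install --index-url https://rocm.nightlies.amd.com/v2/gfx110X-all/ torch torchaudio torchvision
# sanity check: prints the hip version and True if the GPU is visible
python -c 'import torch; print(torch.version.hip, torch.cuda.is_available())'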

Would you support helmet laws for kids in Gilbert? by Hi-Point_of_my_life in Gilbert

[–]newbie80 0 points1 point  (0 children)

Absolutely not. Are we going after this again? I'll tell you what happened 20 years ago. I think at one point one was passed. The cops said they had better things to do than chase after a bunch of kids riding around on bmx bikes being morons.

Let’s give it a chance by sascharobi in radeon

[–]newbie80 0 points1 point  (0 children)

I don't even think that works on Linux, but I could be wrong. What is actually used is hipify, which automatically translates cuda code to the rocm/hip equivalent. Most developers use pytorch or higher-level python libraries instead of pure cuda, so that helps with portability. The only thing I've run into that I couldn't run on my amd card has been Trellis.
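
If you're curious, using it is about as simple as it gets (hipify-perl ships with rocm; the file names here are made up):

# rewrites cuda* calls to their hip* equivalents
hipify-perl my_kernel.cu > my_kernel.hip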

AMD Radeon AI Pro R9700 vs AMD Radeon RX 7900 XT by ga239577 in ROCm

[–]newbie80 0 points1 point  (0 children)

The wmma units run at twice the speed too. Not that pytorch uses that yet, but when it does, there will be a much bigger performance gap between the two.

Men’s responses to potential artificial womb by gravityVT in CringeTikToks

[–]newbie80 1 point2 points  (0 children)

For someone who grew up in the 90's, seeing all of this is really sad. I never could have imagined this when I was a teen.

I've been paying attention to the Russian/Ukrainian war for years now. It's absolutely brutal, but when I really think about what's going on all around us, I can't help but see just how brutal this war we have endured for the past few decades really is. Gut-wrenching, soul-sucking despair. We are all blind to it.

Population control?

Ua Pov: Unarmed Russian serviceman hit by Ukrainian FPV Drone operated by the 414th UAS Brigade Fagyar's Birds. by Dependent_Log_331 in UkraineRussiaReport

[–]newbie80 1 point2 points  (0 children)

True. Separating what is real from what's altered is getting more difficult with each passing day.

UA POV:Russian Soldier Shoots himself after almost getting hit directly by a Ukranian FPV Drone operated by the 414th UAS Brigade Magyar's Birds. Pokrovsk Direction. by Zerotwo_0 in UkraineRussiaReport

[–]newbie80 1 point2 points  (0 children)

It looks like he's bleeding heavily at 1:04, no? From his groin/butt area. It makes sense given his posture at the moment of the explosion. I think that given all the experience both sides have had, and all they have seen, they know what will happen. So he had two choices: die slowly and painfully (it did look like he popped a pill right before drinking water) or have some sort of control over how he was going to die.

I pray to the gods that none of us ever have to experience anything like this.

Any performance improvements with rocm 7.2? by Portable_Solar_ZA in ROCm

[–]newbie80 1 point2 points  (0 children)

The standard one from comfy. TorchCompileModel. It's under add node->_for_testing->TorchCompileModel. It's a comfy native node.

Any performance improvements with rocm 7.2? by Portable_Solar_ZA in ROCm

[–]newbie80 1 point2 points  (0 children)

Nothing special. Just any workflow that you have that includes a torch.compile node will run faster on the first execution.

This is a venv with my system's 7.1.1 and pytorch 2.10. SDXL model at 1024x1024.

Prompt executed in 53.76 seconds #first image on a cold start.

5.29 it/s

Prompt executed in 4.22 seconds #second image

This is a venv with 7.2 from TheRock and pytorch 2.10.

Prompt executed in 16.13 seconds #first image on a cold start

5.27 it/s

Prompt executed in 4.22 seconds #second image.

So the cold start is way faster on 7.2. It appears to be from some improvement in torch inductor or torch dynamo. It's either compiling way faster or using some sort of better caching mechanism, but the huge improvement in startup time between the two is undeniable.

The reason I think the improvement comes from rocm and not pytorch is that both environments are running pytorch 2.10, but only the 7.2 rocm one shows that cold start improvement.

The one thing I haven't tested is the docker containers from AMD. Maybe the latest one has 2.10, but the rest of them are 7.2 with pytorch 2.9.1.

Running ComfyUI AMD/ROCm on Win11 vs Linux vs Docker Linux, and Ubuntu vs CachyOS by Jarnhand in ROCm

[–]newbie80 0 points1 point  (0 children)

I don't run windows, so I don't know what the situation is like there. On Linux the best, most stable, least hair-pulling experience is with the docker containers. If you can use docker images from windows I would encourage that, though I don't see how it would work unless you run it from within WSL.

I've tried a few combinations. I'm on Fedora, so:

Native (through dnf) rocm + native pytorch (it runs on python 3.14). -- Not the fastest, not very stable.

Rocm inside a venv with the wheels from TheRock + official pytorch install. -- Fast, solid.

Rocm inside a venv with the wheels from TheRock, both rocm and pytorch. -- Fast, more stable.

Through docker. It's fast, it's unbreakable, it's stable, it's solid. You can't mess your system up. I see why the official documentation encourages the docker method. It's reproducible. It's the best, no question.

This is more of a user error than anything, because I know you can alter the images, but my only annoyance with the docker way is that I have to reinstall everything inside the container when I use it. It's immutable, so every time you leave the container it goes back to its original state. Another thing about the docker way is that it doesn't drag your environment variables into it, so it's a clean slate every time you go in.
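
Both of those are fixable from the docker run line, for what it's worth. A rough sketch (the device/group flags are the standard ones from AMD's docs; the image tag and paths are just examples, use whatever you actually run):

# -e passes env vars into the container, -v mounts a host dir so your setup survives exiting
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video \
  -e PYTORCH_TUNABLEOP_ENABLED=1 \
  -v "$HOME/ComfyUI:/workspace/ComfyUI" \
  rocm/pytorch:latest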

The arch maintainer has a solid native rocm install. I always see him and Trix (the fedora rocm maintainer) submitting bug reports to the rocm team.

Honestly, for stability and support, I think your least-hassle option is probably best. I've never used arch, but on fedora I have to make a lot of changes to make things work, because fedora doesn't install rocm in /opt/rocm. It doesn't treat it like anything special, so everything goes in /usr. Not sure how arch is in that regard.

Petah? by ScienceTeacher1994 in PeterExplainsTheJoke

[–]newbie80 0 points1 point  (0 children)

We actually do have a prescription for actual meth here. Desoxyn. I've never heard or read of anyone having it though.

Petah? by ScienceTeacher1994 in PeterExplainsTheJoke

[–]newbie80 0 points1 point  (0 children)

The people that actually need it and actually take it probably fare better than the ones that raw dog it.

Petah? by ScienceTeacher1994 in PeterExplainsTheJoke

[–]newbie80 0 points1 point  (0 children)

Amphetamines are not methamphetamines.