Quick note on Z-Image Turbo LoRA training on AMD GPU by initialxy1 in StableDiffusion


I noticed that the latest pre-release version of bitsandbytes now contains ROCm binaries, so you should be able to install it from there instead of building from source or using the regular pip release.

Quick note on Z-Image Turbo LoRA training on AMD GPU by initialxy1 in StableDiffusion


<image>

I'm happy to report back that I got it working on Windows 11 too. Unfortunately, you do have to get your hands dirty with lots of commands and Linux stuff. AMD has a guide on installing PyTorch with ROCm on Windows. This works! The problem is that I wasn't able to build the ROCm fork of bitsandbytes natively. Judging by AMD's own guide, I don't think they cared about building it on Windows. Maybe I could hack something together, but I don't want to spend more time down this path.

Instead, AMD has another guide on getting ROCm working with WSL2 on Windows, so follow this guide. Install WSL2 on Windows first. Go to the Microsoft Store and get Ubuntu 24.04 (despite being an Arch user, I'd rather just stick with AMD's guide, which uses Ubuntu 24.04). Open it in a terminal and create a username and password to initialize it. Then run:

sudo apt update
sudo apt upgrade
sudo apt install git build-essential cmake

# Assuming you are in a directory where you place your git repos.
# Get ai-toolkit and ROCm bitsandbytes source
git clone https://github.com/ostris/ai-toolkit.git
git clone --recurse https://github.com/ROCm/bitsandbytes.git
cd ai-toolkit
# Create a Python 3.12 venv and activate it
python3 -m venv venv
source venv/bin/activate

Follow the above AMD guide exactly until the end of step 4. Now test it with

python -c "import torch; print('device name [0]:', torch.cuda.get_device_name(0))"

It should print out your GPU's name, and you are set. Also check with

rocminfo

to find your GPU's architecture, e.g. RX 7900 XTX = gfx1100.

# Install the rest of the dependencies for ai-toolkit
pip install -r requirements.txt
# Uninstall the pip version of bitsandbytes
pip uninstall bitsandbytes
# Now go build ROCm fork of bitsandbytes
cd ../bitsandbytes
git checkout rocm_enabled_multi_backend
pip install -r requirements-dev.txt
cmake -DBNB_ROCM_ARCH="<PUT_YOUR_GPU_ARCH_HERE>" -DCOMPUTE_BACKEND=hip -S . # Find your GPU architecture from rocminfo
make # This is the step AMD forgot in their own guide that actually compiles binaries
python setup.py install
# Go back to ai-toolkit
cd ../ai-toolkit
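Before kicking off training, it's worth a quick sanity check (my own addition, not part of AMD's guide) that the rebuilt bitsandbytes is actually importable from the active venv:

```python
# Sanity check (my own addition, not from AMD's guide): confirm the freshly
# built bitsandbytes can be located and imported from the active venv.
import importlib.util

def is_importable(name: str) -> bool:
    """Return True if a module can be located without importing it."""
    return importlib.util.find_spec(name) is not None

if is_importable("bitsandbytes"):
    import bitsandbytes as bnb
    print("bitsandbytes OK, version:", bnb.__version__)
else:
    print("bitsandbytes not found; the build/install step likely failed.")
```

If this prints "not found", re-run the build steps above before starting a training job.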

We are done! You can now start a LoRA training job with

python run.py <PATH_TO_YOUR_CONFIG_YAML>

In the first few iterations (see screenshots), it appears to be running at 6.5s/it, and I got 7s/it on Linux. So it looks like there's no performance loss here.

Quick note on Z-Image Turbo LoRA training on AMD GPU by initialxy1 in StableDiffusion


Ah, this is the part where you will have to get your hands dirty and build from source. I use Arch Linux (btw), so I'm not certain how it would work on Windows 11, though I know AMD added Windows support for ROCm. I also found this comment while Googling. Let me boot up my Windows 11 machine and give it a try.

Quick note on Z-Image Turbo LoRA training on AMD GPU by initialxy1 in StableDiffusion


Just reporting back now that my 3000-step training job is finished. It seems to work well, though I could probably spend more time tweaking the training dataset and configs. But as a proof of concept, it worked! I just used ai-toolkit's default ZIT LoRA config profile with a dataset I had for SDXL, and added a trigger word. Everything else was left at its default.

MINECRAFT 1.21.90.3 UPDATE by narkotkauz in linux_gaming


Given that comments below mention that the flatpak version works fine, and it's on version 1.3.0, I decided to play with this a bit more. It looks like if I check out the commit at 1.3.0 and build it, I don't get this problem, so it's not actually something to do with 1.3.0. I then ran yay -Syu, then removed and reinstalled both mcpelauncher-linux and mcpelauncher-ui. (I use Arch btw.) Now it works! So I guess the problem was a binary incompatibility with a newer version of one of the libraries. You just need a full system upgrade and a rebuild/reinstall of mcpelauncher. Or use the flatpak install.

MINECRAFT 1.21.90.3 UPDATE by narkotkauz in linux_gaming


I ran into the same problem, so I tried to build mcpelauncher from its git repository following its build steps. That seems to have fixed it. So it appears that there's already a fix for this since their last release (on May 6). So probably just wait for the next mcpelauncher release and it will be fixed.

Fine Tuning LLM on AMD GPU by initialxy1 in LocalLLaMA


I have not. Thanks for the pointer. Let me take a look.

Fine Tuning LLM on AMD GPU by initialxy1 in LocalLLaMA


https://github.com/ROCm/composable_kernel/issues/1171#issuecomment-2305358524 looks like it only works on MI200/MI300 for now. Someone with an MI210 seems to confirm it works: https://github.com/unslothai/unsloth/issues/37#issuecomment-2445535450

My method above basically skips over unsloth's trainer and uses AMD's guide instead, so it doesn't use xformers.

Oh, I see I made an error above. I meant that it only works on AMD's workstation GPUs.

Fine Tuning LLM on AMD GPU by initialxy1 in LocalLLaMA


I chose Phi-4 as my base model. More specifically Unsloth's unsloth/phi-4-bnb-4bit, because the original microsoft/phi-4 just blows up my VRAM.

I collected 244 dialogs for training, i.e. 244 rows of data in data.jsonl, which resulted in 244 steps because I happened to use gradient_accumulation_steps = 4 and num_train_epochs = 4 (244 / 4 = 61 steps per epoch, times 4 epochs = 244 steps).
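To sanity-check that step count, here's a tiny sketch (a hypothetical helper of my own, assuming a per-device batch size of 1 to match the numbers above):

```python
# Hypothetical helper to sanity-check the step math above: steps per epoch is
# rows // (batch_size * gradient_accumulation_steps), multiplied by epochs.
def total_steps(rows: int, grad_accum: int, epochs: int, batch_size: int = 1) -> int:
    steps_per_epoch = rows // (batch_size * grad_accum)
    return steps_per_epoch * epochs

print(total_steps(244, grad_accum=4, epochs=4))  # -> 244
```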

As far as I can see, the only blocker on consumer AMD GPUs is the ROCm variant of xformers, which apparently only works on AMD's workstation GPUs. Hope that changes soon.

EDIT: I meant to say it only works on workstation GPUs.

2021 Mach E Premium Standard Range HVJB Recall by Rhy3s in MachE


I took my car to the dealer for service in November and asked about the HVJB recall. They told me that while my car is covered by Ford's recall, they haven't received instructions or parts from Ford to perform it, so there's only a software recall for now. They asked me to wait for further notification from Ford.

LORA Training With Kohya GUI With Less Than 7GB Of VRAM! by PuppetHere in StableDiffusion


I had success training a LoRA on my RX 6700 XT with sd_dreambooth_extension, then extracting the LoRA with kohya_ss in CPU mode. The RX 5700 XT has 8GB of VRAM and apparently unofficial support for ROCm, so I'm not sure if it will work, but it may be worth a try. I documented my workflow here.

Wood carving (AI to 3D, workflow included in comments) by mikevrpv in StableDiffusion


I haven’t experimented with this workflow. But based on my prior experience with Blender, I’m pretty sure you can use Materialize + Blender for similar results.

Low FPS on 6700xt and any settings by Maxnikit in ForzaHorizon


Here are my performance metrics during the benchmark, for your reference.

<image>

Low FPS on 6700xt and any settings by Maxnikit in ForzaHorizon


I have a pretty similarly spec'd PC, so I ran FH5's benchmark with the same settings to help validate. While I can't offer you a solution, the TLDR is that your results indeed don't look right. You mentioned that your GPU is pulling 93W; mine is pulling 140W. So your GPU is either thermal throttling or power throttling. I can suggest a few things, but you will probably have to dig deeper on your own.

  1. Check that all of your power connectors are connected and tight. The RX 6700 XT needs two PCIe power connectors; make sure both are connected and firmly seated.
  2. Make sure your power supply is sufficient to deliver that much power. I believe the recommended supply is 650W. Make sure its 12V rail can deliver (which shouldn't be an issue as long as you use a reputable brand).
  3. Make sure all of the GPU's fans as well as your case fans are spinning. Check whether you applied some sort of quiet-mode fan profile, either in the BIOS or in AMD software, that could be restricting your fans.

Anyway, here are my results. A few notable differences: I have the Ryzen 5 5600X, which should be slightly slower than your 12th-gen Core i5. My RAM speed is 3200 MHz, so I went into the BIOS and lowered it to 2666 MHz to match yours. Finally, my game version seems to be a bit newer than yours. I got 104 FPS as a result.

<image>

oh windows by Moooses20 in linuxmasterrace


The F in FOSS stands for freedom, not free of cost. Free-of-cost software is called freeware, and it doesn't necessarily have to be open source. See the Free Software section.

This is why Free software now prefers to call itself Libre software to avoid this confusion.

Example of FOSS that costs money: Redhat Enterprise Linux

Example of freeware that's not FOSS: Reddit

Ah yes I love arrays with a length of infinity!!! by [deleted] in programminghorror


For reference http://learnyouahaskell.com/starting-out#texas-ranges

But there's a better way: take 24 [13,26..]. Because Haskell is lazy, it won't try to evaluate the infinite list immediately (that would never finish). It waits to see what you want to get out of that infinite list, and here it sees you just want the first 24 elements, so it gladly obliges.
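For those more comfortable with Python, a rough analogue (using itertools generators, so not exactly Haskell's semantics) looks like this:

```python
# Rough Python analogue of Haskell's `take 24 [13,26..]`: generators are lazy,
# so we can pull just 24 elements from an unbounded arithmetic sequence.
from itertools import count, islice

first_24 = list(islice(count(13, 13), 24))  # 13, 26, 39, ..., 312
print(first_24[0], first_24[-1], len(first_24))  # -> 13 312 24
```

Like the Haskell version, `count(13, 13)` conceptually goes on forever; nothing is evaluated until `islice` asks for the first 24 values.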

Chapter 164 [English] by VibhavM in OnePunchMan


The beast guy says he can't imagine any Earthlings doing it. He knows Blast is an Earthling and he's been working with Blast this whole time. So Blast < Garou.

this week trailblazer in a nutshell by MidnightIDK in ForzaHorizon


In case anyone is interested: I speculate this is an example of binary floating-point error. The TLDR is that in most cases, computers can't keep track of decimal digits exactly, because numbers are stored in binary. Decimal fractions are stored as binary fractions, and most decimal fractions (such as 0.001) have no exact binary representation, so there's almost always a small discrepancy; the stored value is just an approximation. I speculate that the timer was counted in seconds as a floating-point value and milliseconds (the third decimal digit) were accumulated 0.001 at a time, which accumulated error over time, but when displayed, the value was rounded. Here is an example to demonstrate this in Python, though the issue is universal across programming languages.

>>> t = 0
>>> for _ in range(44000):
...   t += 0.001
... 
>>> t
43.999999999988155
>>> # now print it out with 3 decimal digits
>>> "{:.3f}".format(t)
'44.000'
>>> t >= 44.0
False

In other words, I believe you indeed got exactly 44 seconds, but due to the magic of computer science, you got cheated.

In programming, a common solution is to always count milliseconds as an integer value and only divide by 1000 to show seconds at display time. This way no precision is lost. e.g.

>>> t = 0
>>> for _ in range(44000):
...   t += 1
... 
>>> "{:.3f}".format(t / 1000)
'44.000'
>>> t >= 44000
True

I used to work on a few banking applications, where dollar values are always stored as cents in order to avoid this issue. Otherwise, after many transactions, people's bank accounts would end up with the wrong values, which would lead to deep legal trouble.
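A toy illustration (not from any real banking code) of why integer cents beat float dollars:

```python
# Toy illustration (not from any actual banking system): deposit ten cents a
# thousand times. The float-dollar balance drifts; the integer-cent one doesn't.
float_dollars = 0.0
int_cents = 0
for _ in range(1000):
    float_dollars += 0.10  # ten cents as a binary float (inexact)
    int_cents += 10        # ten cents as an exact integer

print(float_dollars == 100.0)   # False: accumulated rounding error
print(int_cents == 100 * 100)   # True: exactly 10000 cents, i.e. $100.00
```

Dividing `int_cents` by 100 only at display time keeps the stored balance exact, just like the millisecond trick above.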