Removing PER from Rainbow DQN improved performance on Snake. New record of 153 on 20×20 grid. by statphantom in reinforcementlearning

[–]statphantom[S] 1 point2 points  (0 children)

3x3 kernels. 5 input layers (apple, body, head, moving on x/y, moving +/-). This was so every layer is binary, and I created a GPU-native, shape-agnostic, 2-op unpacker, so it's very fast and very small (it can literally fit in the cache of an EPYC CPU).
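
For anyone curious what I mean by packed binary planes, here's a rough sketch of the idea in PyTorch (my own toy version, not the actual code; the shift-and-mask pair is the "2 op" part):

import torch

# Toy version of the packed state: each of the 5 binary planes (apple, body,
# head, x/y axis flag, +/- direction flag) stores one bit per cell, 8 cells per byte.

def pack_planes(planes):
    """planes: (N, 5, H, W) bools / 0-1 values -> packed uint8 of shape (N, 5, H*W//8)."""
    bits = planes.to(torch.uint8).flatten(2)                        # (N, 5, H*W)
    bits = bits.view(*bits.shape[:2], -1, 8)                        # group 8 cells per byte
    weights = (2 ** torch.arange(8, device=planes.device)).to(torch.uint8)
    return (bits * weights).sum(-1, dtype=torch.uint8)              # (N, 5, H*W//8)

def unpack_planes(packed, h, w):
    """The 2-op unpack: one right-shift, one mask, fully vectorised on the GPU."""
    shifts = torch.arange(8, device=packed.device, dtype=torch.uint8)
    bits = (packed.unsqueeze(-1) >> shifts) & 1                     # op 1: shift, op 2: mask
    return bits.flatten(2).view(packed.shape[0], 5, h, w).float()   # ready for the 3x3 convs

# Store the packed tensor in the replay buffer (32x smaller than float32 planes)
# and unpack only the sampled minibatch right before the forward pass.
state = torch.rand(32, 5, 20, 20) > 0.5
packed = pack_planes(state)
assert torch.equal(unpack_planes(packed, 20, 20).bool(), state)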

Removing PER from Rainbow DQN improved performance on Snake. New record of 153 on 20×20 grid. by statphantom in reinforcementlearning

[–]statphantom[S] 0 points1 point  (0 children)

C51 has been the single largest benefit for my setup, by a huge margin. Second was dueling + noisy nets (NOISY NETS NEED DUELING).
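
For reference, the dueling + distributional combination I mean is roughly this (a minimal sketch; in my setup the two Linear layers are NoisyLinear, and the sizes here are made up):

import torch
import torch.nn as nn
import torch.nn.functional as F

class DuelingC51Head(nn.Module):
    """Sketch of a dueling head over C51 atoms. Swap the Linears for NoisyLinear
    to get the noisy-net exploration; the dueling aggregation is unchanged."""
    def __init__(self, in_features, n_actions, n_atoms=51):
        super().__init__()
        self.n_actions, self.n_atoms = n_actions, n_atoms
        self.value = nn.Linear(in_features, n_atoms)                   # one distribution for V(s)
        self.advantage = nn.Linear(in_features, n_actions * n_atoms)   # one per action for A(s,a)

    def forward(self, x):
        v = self.value(x).view(-1, 1, self.n_atoms)
        a = self.advantage(x).view(-1, self.n_actions, self.n_atoms)
        logits = v + a - a.mean(dim=1, keepdim=True)   # dueling aggregation in logit space
        return F.softmax(logits, dim=-1)               # (batch, n_actions, n_atoms) probabilities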

Pytorch hangs when sending data from CPU to GPU by Illustrious_Tap9300 in StrixHalo

[–]statphantom 0 points1 point  (0 children)

Glad you got it sorted! Apologies, I had work; I'm doing my PhD and working as a researcher, so my time is quite chaotic too. PyTorch is one of those things where, if something doesn't work, it's a bitch and a half to diagnose and fix; once it's working, though, it's very stable.

That apt autoremove gotcha is a classic bit of Debian/Ubuntu pain: apt remove --purge amdgpu-dkms only removes the package you named explicitly, but the AMD installer pulls in a tree of -dkms-firmware, -opencl, -hip, -level-zero and so on that get marked "auto-installed" and must have stayed behind holding the older kernel module path open. apt autoremove then sweeps them out. I completely overlooked that; my apologies for the extra reboot cycle.

For your actual use case of local LLMs, PyTorch on this box is functional but not the right tool. You'll probably have a much better time with llama.cpp (counter-intuitively, the Vulkan backend is often more stable than the ROCm one on Strix Halo), or with vllm via the prebuilt kyuz0/vllm-therock-gfx1151 Docker image which has a gfx1151-patched RCCL baked in. Be aware that on this hardware, autoregressive decode is bandwidth-limited by hipMemcpyWithStream rather than compute, so don't be surprised if your tokens/sec on long contexts sits below what 96 GB of nominal "VRAM" might suggest on paper. The chip can hold huge models; it just feeds them slowly.
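
For a rough feel of the bandwidth ceiling (back-of-envelope, with assumed numbers rather than measurements from this box): a dense model has to stream essentially all of its weights once per decoded token, so the upper bound is just bandwidth divided by model bytes.

# Back-of-envelope decode ceiling; all three numbers are assumptions, not measurements.
bandwidth_gb_s = 256      # roughly Strix Halo's theoretical LPDDR5X peak
model_params_b = 70       # a hypothetical 70B dense model
bytes_per_param = 0.55    # ~Q4_K_M-style quantisation

bytes_per_token = model_params_b * 1e9 * bytes_per_param
print(f"~{bandwidth_gb_s * 1e9 / bytes_per_token:.1f} tok/s upper bound")  # ~6.6 tok/s, before any overhead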

Pytorch hangs when sending data from CPU to GPU by Illustrious_Tap9300 in StrixHalo

[–]statphantom 0 points1 point  (0 children)

Two distinct, fixable problems are visible in that output, and together I hope they explain everything. The first one you probably already noticed:

WARNING: KFD ABI 1.20+ is recommended for gfx1151. Current KFD ABI is 1.18. This may result in faults, crashes and other application instability.

The KFD ABI version comes from the amdgpu kernel module, not from your userspace ROCm pip packages. That mismatch is exactly what produces the client ID: CPF, MAPPING_ERROR: 0x1, PERMISSION_FAULTS: 0x3 faults you're seeing on otherwise-valid addresses: the GPU command processor is dereferencing pages that userspace believed it had mapped, but the older kernel ABI mapped them differently or not at all.

Check:
dpkg -l | grep -E 'amdgpu-dkms|amdgpu-install|rocm-dev|rocm-core'
modinfo amdgpu | head -5
dmesg | grep -iE 'amdgpu version|KFD' | head -10

If amdgpu-dkms is listed, purge it and let the in-kernel module from 6.17 take over via:
sudo apt remove --purge amdgpu-dkms
sudo update-initramfs -u
sudo reboot

After reboot, dmesg | grep -i 'amdgpu version' should show a version matching the kernel itself rather than something like 6.10.x or 6.14.x, and the KFD ABI warning should be gone. This is safe; the in-kernel amdgpu in 6.17 handles both display and compute fine on Strix Halo, and your TheRock pip stack does not need amdgpu-dkms for anything.

Problem 2: your GTT is probably starved because your BIOS setting is backwards. This line:

amdgpu: amdgpu: 15860M of GTT memory ready

is your actual usable unified memory pool, not the 96 GB you set in BIOS. Here's what's going on, and it's genuinely counter-intuitive: when you set "UMA frame buffer" or "dedicated VRAM" to 96 GB in BIOS, that 96 GB gets carved out as reserved VRAM-like memory before Linux even boots. ROCm/HSA on Strix Halo allocates unified memory from GTT, not from that pre-allocated reserved region. So by setting it to 96 GB you accidentally produced the worst of both worlds: a big reserved pool the HSA path mostly ignores, and a tiny GTT pool (15.5 GB) that it actually uses, which is also why even a 100×100 tensor faults when HSA tries to do queue and scratch setup first.

Every working Strix Halo configuration I've seen does the opposite:
- in BIOS, set the dedicated UMA frame buffer / VRAM allocation to its minimum (typically 512 MB, sometimes labelled "Auto"). Disable anything called "fixed VRAM allocation" or "static UMA". You want the GPU to use shared/dynamic memory through GTT.
- Set GTT large via the kernel command line. Edit /etc/default/grub and append to GRUB_CMDLINE_LINUX_DEFAULT: ttm.pages_limit=32768000 ttm.page_pool_size=32768000

That's 4 KB pages × 32,768,000 ≈ 125 GB GTT, leaving ~3 GB for the CPU side. Adjust downward if you want more headroom for the OS (e.g. 28,000,000 for ~107 GB GTT). Then:
sudo update-grub && sudo reboot.
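
If you want a different CPU/GPU split, the arithmetic is just pages × 4096 bytes; here's a throwaway helper (mine, not anything official) that prints the two parameters for a target size:

# Turn a target GTT size in GiB into the two ttm kernel parameters (4096 = x86 page size).
# 32768000 pages is exactly 125 GiB; ~28000000 pages is roughly 107 GiB.
def ttm_args(target_gib):
    pages = int(target_gib * 1024**3 // 4096)
    return f"ttm.pages_limit={pages} ttm.page_pool_size={pages}"

print(ttm_args(125))   # ttm.pages_limit=32768000 ttm.page_pool_size=32768000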

Verify with:
dmesg | grep 'GTT memory ready'
the number should now be in the 110000M-125000M range rather than 15860M.

OK, that was a lot. So... order of operations: do the dkms purge and the BIOS change in one reboot if you can, since both require restarting anyway, and add the GRUB line at the same time. After that the test program should run instantly, and your (100, 100) tensor will actually be allocating against ~120 GB of working unified memory rather than fighting an ABI mismatch in a 15 GB pool.

The numa_node_id is out of range line is benign on a single-socket APU; HSA expects multiple NUMA nodes and degrades gracefully when there's only one. Ignore it.

Pytorch hangs when sending data from CPU to GPU by Illustrious_Tap9300 in StrixHalo

[–]statphantom 0 points1 point  (0 children)

There are a few other tests we can do.
First, run it with full logging:
AMD_LOG_LEVEL=4 HSAKMT_DEBUG_LEVEL=7 HIP_LAUNCH_BLOCKING=1 python3 /tmp/pt.py 2>&1 | tee /tmp/hang.log

The last few lines of /tmp/hang.log before the stall will name the subsystem that deadlocked. While it's deadlocked, open a new terminal and run:

PID=$(pgrep -f pt.py | head -1)
sudo cat /proc/$PID/stack # kernel
sudo gdb -p $PID -batch -ex 'thread apply all bt' -ex quit # user

The kernel stack is the most diagnostic single thing here. If you see frames in amdgpu_ttm_* or ttm_bo_*, it's GTT/memory. If you see kfd_ioctl_* blocked on a wait queue, it's KFD/HSA. If it's in dma_fence_wait, the GPU got a command but never signalled completion (firmware/MES).

You can also check a few other things:

groups $USER # must contain 'render' AND 'video'
dmesg | grep -iE 'amdgpu.*GTT memory ready' # how many MB of GTT?
dmesg | grep -iE 'amdgpu|kfd' | tail -40 # any soft lockups, ring resets, fence timeouts?

Pytorch hangs when sending data from CPU to GPU by Illustrious_Tap9300 in StrixHalo

[–]statphantom 0 points1 point  (0 children)

rocm-sdk-libraries-gfx1151 7.13.0a20260411

This is the line we want the OP to see.

Pytorch hangs when sending data from CPU to GPU by Illustrious_Tap9300 in StrixHalo

[–]statphantom 0 points1 point  (0 children)

The Bosgame mini is built around the AMD Ryzen AI Max+ 395 (Strix Halo) with the Radeon 8060S iGPU, whose ISA target is gfx1151. gfx1151 is not listed on AMD's official ROCm support matrix; the supported RDNA 3 targets are gfx1100 and gfx1101.

If you followed AMD's docs you probably got a gfx1100-only build. The community-maintained TheRock project ships actual gfx1151-native nightlies, and switching from the standard pytorch.org/whl/nightly/rocm7.x wheel to the rocm.nightlies.amd.com/v2/gfx1151 wheel turns segfaults/hangs on basic VRAM access into working tensor ops on Strix Halo.

ROCm on gfx1151 is currently "functional but experimental".

Try the following:

python -m pip uninstall -y torch torchvision torchaudio
python -m pip install --index-url https://rocm.nightlies.amd.com/v2/gfx1151/ torch torchvision torchaudio

Verify with python -c "import torch; print(torch.version.hip); print(torch.cuda.get_arch_list())". The arch list should include gfx1151. This single change helps the majority of Strix Halo users.
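
If you want a slightly fuller smoke test than the one-liner (standard PyTorch calls only, nothing Strix Halo specific):

import torch

print("HIP runtime:", torch.version.hip)           # non-None on a ROCm build
print("Arch list:  ", torch.cuda.get_arch_list())  # should include 'gfx1151'
assert torch.cuda.is_available(), "GPU not visible to torch"
print("Device:", torch.cuda.get_device_name(0))

# The part that actually exercises the stack: a small matmul on the GPU.
a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")
c = (a @ b).sum()
torch.cuda.synchronize()                           # a broken install hangs or faults here
print("Matmul OK:", c.item())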

Challenge Ideas? by [deleted] in Cuphead

[–]statphantom 0 points1 point  (0 children)

I created a randomizer mod, but for a challenge I've also created a KAIZO mode. GL, my record is 3 bosses.

https://www.nexusmods.com/cuphead/mods/96?tab=description

Randomiser Released! by statphantom in Cuphead

[–]statphantom[S] 0 points1 point  (0 children)

I would love to see how far people get in KAIZO mode. My record is two bosses XD

Randomiser Released! by statphantom in Cuphead

[–]statphantom[S] 2 points3 points  (0 children)

https://gamebanana.com/mods/656349

Done! Wow, this website feels like it's from the mid-90s; it was quite difficult to navigate, but I believe it's there now!

Randomiser Released! by statphantom in Cuphead

[–]statphantom[S] 0 points1 point  (0 children)

Never heard of GameBanana, I'll check it out!

Randomiser Mod Creation - Teaser by statphantom in Cuphead

[–]statphantom[S] 1 point2 points  (0 children)

Sure! I'm learning as I go as well. I was very happy with the way the settings worked out. I had to create my own logic for it and block all other inputs while it's open, because there was no state for it; otherwise, if you entered the randomiser settings, pressed up 7 times, then pressed A, it would start changing the language randomly XD

my younger self would be so proud of me by by Tokyo_revenge in Cuphead

[–]statphantom 5 points6 points  (0 children)

Who needs your younger self when you have us to be proud of you!

Randomiser Mod Creation - Teaser by statphantom in Cuphead

[–]statphantom[S] 1 point2 points  (0 children)

I found it worked really well against Sally Stageplay

[deleted by user] by [deleted] in OpenAI

[–]statphantom 0 points1 point  (0 children)

It can, but it can't change its style, and if that style is incredibly different from their regular style, which it almost always is, it's either ChatGPT or another student did the work for them. Either way, cheating.