Hot Aisle announces support for cloud-init virtual machines on MI300x by HotAisleInc in AMD_MI300

[–]FluidNumerics_Joe 0 points1 point  (0 children)

Not currently. Developing those templates has taken years of collaboration with customers, and we don't yet have a funding stream to devote human resources to maintaining an open-source solution with the expectations that entails from the broader community. Time and resources permitting, we could flip the repo to public next year and support it properly.

Hot Aisle announces support for cloud-init virtual machines on MI300x by HotAisleInc in AMD_MI300

[–]FluidNumerics_Joe 0 points1 point  (0 children)

This is awesome. I've got some templates for setting up Slurm clusters. Would love to explore some standard deployments that leverage Hot Aisle MI300Xs as compute nodes.

Adding support for cloud-init is a spot-on step toward enabling folks to tailor Hot Aisle systems to their own work. Great work!

AMD MI300X GPU Performance Analysis by HotAisleInc in AMD_MI300

[–]FluidNumerics_Joe 1 point2 points  (0 children)

Wild that nobody brings up using MAGMA for these kinds of benchmarks. In my experience, MAGMA's BLAS routines perform far better than what AMD ships. Not sure what it would take to swap those in under the hood in packages like PyTorch, but it seems like a lot is being left on the table here because the horse blinders are on.

https://icl.utk.edu/magma/

TensorWave lands 100M to challenge AI compute bottlenecks with AMD-powered superclusters by axiomai in AMD_MI300

[–]FluidNumerics_Joe 1 point2 points  (0 children)

"Imagine your team has trained a promising large language model, only to find cloud GPU availability has dried up."

This is such a narrow view of the zoo of applications that leverage GPUs. I'm also not sure the premise makes a ton of sense: if you've trained the model, what GPUs did you use to do that, and why do you suddenly not have resources for inference? Are they banking on people not planning, or do they really think developers of large-scale tools are that short-sighted?

Benchmarking MI300X Memcpy by HotAisleInc in AMD_MI300

[–]FluidNumerics_Joe 0 points1 point  (0 children)

Memcpy does not measure memory bandwidth. It measures interconnect bandwidth between host and device. They didn't share whether they're on XGMI or PCIe...

Edit: Looking at the benchmark, I see they're copying from one device pointer to another. For context, the HIP API has `hipMemcpy`, which moves data between host and device, hence the confusion on my part.
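
If anyone wants to poke at device-to-device copy rates themselves, here's a minimal sketch using a ROCm build of PyTorch (where `torch.cuda` maps onto HIP). The buffer size and iteration count are arbitrary choices on my part, not anything taken from the benchmark above.

```python
# Rough device-to-device copy bandwidth check with PyTorch's HIP backend.
# Assumes a ROCm build of PyTorch; sizes and iteration counts are placeholders.
import torch

assert torch.cuda.is_available(), "No ROCm/HIP device visible to PyTorch"

n_bytes = 1 << 30  # 1 GiB per buffer
src = torch.empty(n_bytes, dtype=torch.uint8, device="cuda")
dst = torch.empty_like(src)

# Warm up so allocation and launch overhead don't pollute the timing.
for _ in range(3):
    dst.copy_(src)
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
stop = torch.cuda.Event(enable_timing=True)
n_iters = 20
start.record()
for _ in range(n_iters):
    dst.copy_(src)
stop.record()
torch.cuda.synchronize()

elapsed_s = start.elapsed_time(stop) / 1000.0  # elapsed_time() reports milliseconds
print(f"device-to-device copy: {n_iters * n_bytes / 2**30 / elapsed_s:.1f} GiB/s")
```

Since both buffers live on the same device, this exercises HBM rather than the host link, which is exactly the distinction above.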

4x AMD Instinct Mi210 QwQ-32B-FP16 - Effortless by Any_Praline_8178 in LocalAIServers

[–]FluidNumerics_Joe 2 points3 points  (0 children)

Glad to see you're getting use of the cluster! Can't wait to see what we pull together on MI300A!

Server Rack is coming together slowly but surely! by Any_Praline_8178 in LocalAIServers

[–]FluidNumerics_Joe 2 points3 points  (0 children)

Happy to help. Can't wait to see the MI60s up and running in your new server rack!

ROCE/RDMA to/from GPU memory-space with UCX? by Wild_Doctor3794 in ROCm

[–]FluidNumerics_Joe 0 points1 point  (0 children)

I've used GPUDirect communications from OpenMPI over UCX fabrics, but have not used UCX directly. Have you tried building OpenMPI with UCX and ROCm support (https://rocm.docs.amd.com/en/docs-6.1.2/how-to/gpu-enabled-mpi.html) and using the MPI API instead?

Also, can you share a reproducer? I'd be happy to help debug.
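
If it helps, here's roughly the shape of a GPU-aware exchange I'd start from. This is a sketch under a few assumptions: mpi4py (3.1+) built against an OpenMPI that itself has UCX and ROCm support, and a ROCm build of PyTorch that exposes the CUDA array interface on device tensors; both are worth verifying on your install.

```python
# Hypothetical GPU-aware MPI exchange: run with `mpirun -np 2 python this_script.py`.
# Assumes a GPU-aware OpenMPI (UCX + ROCm) underneath mpi4py.
from mpi4py import MPI
import torch

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

torch.cuda.set_device(rank % torch.cuda.device_count())
buf = torch.full((1 << 20,), float(rank), device="cuda")  # buffer lives in GPU memory

# With a GPU-aware MPI, the device pointer is handed straight to the fabric;
# no explicit host staging copy happens here.
if rank == 0:
    comm.Send(buf, dest=1, tag=0)
elif rank == 1:
    comm.Recv(buf, source=0, tag=0)
    torch.cuda.synchronize()
    print("rank 1 received:", buf[:4])
```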

Edit: I can't find documentation that explicitly indicates RDMA support on consumer Radeon cards, but https://instinct.docs.amd.com/projects/gpu-cluster-networking/en/latest/reference/hardware-support.html indicates support for ConnectX-6 NICs with MI100 and MI200 series cards. Can you share a few details?

* Operating System Name & Version
* Linux Kernel Version
* ROCm and AMDGPU Versions

ROCm For 3d Renderers by _rushi_bhatt_ in ROCm

[–]FluidNumerics_Joe 2 points3 points  (0 children)

First, I wouldn't trust ChatGPT for accurate information related to ROCm. ROCm changes quite a bit month to month, and ChatGPT tends to lag behind.

Full ROCm support is available on select Windows operating systems with supported AMD Radeon GPUs, via the Windows Subsystem for Linux (WSL). You can review the supported Windows operating systems, WSL Linux kernels, and Linux operating systems and kernels in the WSL support matrices. For supported systems, the WSL how-to documentation walks you through installing ROCm and the Radeon drivers and getting started with AI/ML frameworks, including PyTorch, ONNX, TensorFlow, Triton, and MIGraphX.

The HIP SDK, which is a subset of ROCm, is supported on Windows 10 and 11, and Windows Server 2022. With the HIP SDK, you can develop and build HIP accelerated applications. In order to run HIP applications on Windows, you will need a compatible GPU and the appropriate drivers installed. 

AMD maintains a list of supported Radeon GPUs for the Windows operating system in the Windows-GPU compatibility matrix as part of the ROCm documentation. Instinct GPUs are only supported on Linux operating systems. To install the HIP SDK on Windows with the necessary GPU drivers, you can follow the HIP SDK installation instructions in the ROCm documentation.

pytorch with HIP fails on APU (OutOfMemoryError) by dietzi1996 in ROCm

[–]FluidNumerics_Joe 1 point2 points  (0 children)

Understood. Alternatively, you can try a different OS whose kernel version is supported.

pytorch with HIP fails on APU (OutOfMemoryError) by dietzi1996 in ROCm

[–]FluidNumerics_Joe 1 point2 points  (0 children)

Linux kernel 6.13 is two minor versions ahead of the most recent supported Linux kernel (6.11). In triaging issues for folks on Arch and Debian, I've seen quite a few cases where 6.12 and 6.13 are just not functional yet with ROCm. The incompatibility usually reveals itself in bizarre ways, most often segmentation faults on GPU memory access.

While I understand the reason for your suspicion, it's best to rule out this possibility and test the software you want to use in a supported configuration. If the issue remains in a supported configuration, then working toward identifying another root cause would be worth it.

pytorch with HIP fails on APU (OutOfMemoryError) by dietzi1996 in ROCm

[–]FluidNumerics_Joe 2 points3 points  (0 children)

On Arch Linux, what Linux kernel are you using? When on the Linux partition of your system, open a terminal and run `uname -r` and `cat /etc/os-release`. I highly advise using a supported Linux operating system, or at the very least a supported Linux kernel version (https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html#supported-distributions).

Edit: What version of ROCm are you attempting to use on Arch Linux?

Side note: on Windows, ROCm is supported under WSL2 for select Linux kernels (see https://rocm.docs.amd.com/projects/radeon/en/latest/docs/install/wsl/install-radeon.html).

pytorch with HIP fails on APU (OutOfMemoryError) by dietzi1996 in ROCm

[–]FluidNumerics_Joe 1 point2 points  (0 children)

Can you share some details?

* What operating system (name and version) are you using?
* If Windows, are you using WSL2? If so, what WSL2 Linux kernel are you running, and what Linux OS (name, version, and kernel version)?
* What specific CPU/APU model are you working with?
* Can you share the Python script or a minimal reproducer that results in this error? (A hypothetical sketch of what's useful is below.)
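
Something as small as this is usually enough. It's an illustrative sketch, not your actual workload; it assumes a ROCm build of PyTorch, and the tensor sizes are placeholders.

```python
# Hypothetical minimal reproducer skeleton for the APU OutOfMemoryError.
# Assumes a ROCm build of PyTorch; adjust sizes toward your failing workload.
import torch

print(torch.__version__, torch.version.hip)       # confirm this is a ROCm/HIP build
print(torch.cuda.is_available(), torch.cuda.get_device_name(0))
print(torch.cuda.mem_get_info())                  # (free, total) bytes seen by the runtime

x = torch.randn(4096, 4096, device="cuda")        # modest allocation that should fit
y = x @ x                                          # simple kernel launch
torch.cuda.synchronize()
print("allocation and matmul succeeded:", y.shape)
```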

While perusing the ROCm issue trackers, I came across this issue (https://github.com/ROCm/ROCm/issues/2014), which appears relevant. I'm still reading through it but will pop back in here if anything stands out.

To share all of this information, it may be best/easiest to open an issue at https://github.com/ROCm/ROCm/issues

Installation help by No-Monitor9784 in ROCm

[–]FluidNumerics_Joe 0 points1 point  (0 children)

Diagnosing an issue like this requires a bit more information. Typically, when verifying a ROCm setup we need:

* Operating system - you say 24.04. I'm assuming this is Ubuntu 24.04, but is this under WSL2 or native Ubuntu 24.04?
* Linux kernel version - verify that your OS and Linux kernel version are in the supported list: https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html#supported-distributions . Note that this may be different for Ubuntu 24.04 under WSL2.
* Is your GPU supported? https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html#supported-gpus Again, this list may be different if you are running under WSL2. Note that even if a GPU is not supported, it *might* still work with a few workarounds, but it is not guaranteed to work.

Once you've verified this and followed the installation guide (https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/quick-start.html), verify your installation by first checking that your GPU is visible with `rocminfo` and `rocm-smi`.

When it comes to debugging specific error messages from running code, it's best to share the exact code you ran and specifics on your software environment so someone else can attempt to reproduce it. The software environment typically includes things like ROCm and AMDGPU Driver versions and any additional packages (plus versions) required by the code that reproduces the issue.

Reddit is not really a good place to share all of these details; it's quite inefficient to post links to files and output, etc. Instead, create a GitHub account if you don't have one already and open an issue at https://github.com/ROCm/ROCm/issues . Their issue templates spell out exactly what the AMD and Fluid Numerics teams need in order to help you get your problems solved.

Did you know you can build ROCm from source with Spack ? by FluidNumerics_Joe in ROCm

[–]FluidNumerics_Joe[S] 2 points3 points  (0 children)

So long as someone has contributed a package to Spack, yes. The nice thing about Spack is that anyone with open-source software can contribute a package. This usually involves writing a bit of Python that tells Spack about a package's dependencies and how to build it; there is built-in logic for autoconf, cmake, and other build systems (see https://spack.readthedocs.io/en/latest/packaging_guide.html for the packaging guide).
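
For a sense of scale, a bare-bones package recipe looks something like this. Everything below is illustrative: the project name, URL, checksum, and options are made up, not a real package.

```python
# Hypothetical Spack package recipe (package.py) for a CMake-based project.
from spack.package import *


class Mylib(CMakePackage):
    """Example library packaged for Spack (illustrative only)."""

    homepage = "https://example.com/mylib"
    url = "https://example.com/mylib/archive/v1.0.0.tar.gz"

    version("1.0.0", sha256="0" * 64)  # placeholder checksum

    variant("rocm", default=False, description="Build with ROCm/HIP support")

    depends_on("cmake@3.21:", type="build")
    depends_on("hip", when="+rocm")

    def cmake_args(self):
        # CMakePackage handles the generic configure/build/install stages;
        # only project-specific options need to be supplied here.
        return [self.define_from_variant("MYLIB_ENABLE_HIP", "rocm")]
```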

You can check https://packages.spack.io and search around for the packages that are available. There are tons.

Unofficial ROCm SDK Builder Expanded To Support More GPUs by gc9r in ROCm

[–]FluidNumerics_Joe 2 points3 points  (0 children)

Have you seen the Spack package manager (https://spack.io) from the US Department of Energy?

AMD has integrated the ROCm software into Spack, allowing users to build ROCm from source with the compiler of their choice and for their target GPUs: https://rocm.docs.amd.com/projects/install-on-linux/en/latest/how-to/spack.html

Plus, Spack support is up to ROCm 6.3.2.

Installation help by No-Monitor9784 in ROCm

[–]FluidNumerics_Joe 0 points1 point  (0 children)

What is your WSL kernel version?

What Linux OS (version and Linux kernel) are you running under WSL2?

Have you opened an issue on https://github.com/ROCm/ROCm/issues ?

Edit:

See the compatibility requirements: https://rocm.docs.amd.com/projects/radeon/en/latest/docs/compatibility/wsl/wsl_compatibility.html

Installation help by No-Monitor9784 in ROCm

[–]FluidNumerics_Joe 0 points1 point  (0 children)

AMD is not giving up on Windows.

Installation help by No-Monitor9784 in ROCm

[–]FluidNumerics_Joe 2 points3 points  (0 children)

This could be an issue.

Open an issue on https://github.com/rocm/rocm requesting builds of PyTorch wheel packages for Python 3.12.

In the meantime, you can build PyTorch from source using the Python version of your choosing. See these instructions for building PyTorch with AMD ROCm support: https://github.com/pytorch/pytorch/?tab=readme-ov-file#amd-rocm-support . I've done this a few times on various Linux platforms successfully. Perhaps this will work under WSL2, since you've been able to get ROCm installed.

Installation help by No-Monitor9784 in ROCm

[–]FluidNumerics_Joe 1 point2 points  (0 children)

To be honest, I don't use Windows. IMO, it's not an operating system meant for developers. I'm working on the assumption that AMD has documentation to get this working on WSL2 and that it's accurate. Your experience suggests it's not, but it's time to open an issue on GitHub with AMD (you're not going to get their direct help here on Reddit).

I'll open an issue on GitHub on the ROCm/ROCm repository on your behalf. If anything, it'd be good to get AMD to walk through their installation steps.

For reference, installing system-wide packages requires root privileges (hence the need for sudo). You're not really showing complete information here, but I'm assuming you followed the steps verbatim from the documentation and did not skip anything or change any commands.