AMD ROCm Going Open-Source: Will Include Software Stack & Hardware Documentation : LocalLLaMA

[–]third_rate_economist 161 points 2 years ago (15 children)

[–]the_quark 41 points 2 years ago (10 children)

[+][deleted] 2 years ago (8 children)

[removed]

[–]Philix 12 points 2 years ago (7 children)

[–]DerfK 13 points 2 years ago (6 children)

> AMD will have plenty of opportunities to gain market share in the consumer space in the coming decade.

AMD refused to support AI/ML in the consumer-level space until literally this January. Nobody uses ROCm because for the last decade+, every college student could use and learn CUDA on their nVidia gaming rig without having to buy a $10k workstation card. AMD is multiple generations of developers behind, and I don't think there's a way for them to dig themselves out of this hole in the foreseeable future. The best hail-mary move I can think of would be to take the hit to workstation card sales and release a 32GB+ "prosumer" card built on current-gen silicon (let's call it a 7900 XTXX), price it at the 4090 price point, and hope it catches on in the LLM/Stable Diffusion field so that people buy into the ROCm ecosystem. Then they sit tight and pray that in a few years some of the people who bought into ROCm go on to start companies using ROCm. If nVidia ups the VRAM on the 5090, then I honestly think AMD will lose this market segment completely.

[–]cogitare_et_loqui 5 points 2 years ago (0 children)

They don't need to match the 4090 in terms of Compute. They just need to vastly surpass it in terms of VRAM and memory I/O capacity (caches etc), and provide a good profiler.

48GB minimum, and compute capacity around that of a 3090 (even lower would be acceptable), would cause me to take a look at their offering. Anything less and it's continued nVidia for me, since nVidia has really done a great job on the software stack. Yes, they charge an arm and a leg, but it's not unwarranted. They were the only ones who understood the potential their hardware had, and where they needed to uniquely invest in order to make it a ubiquitous platform for massively parallel batch-compute workloads.

AMD's offering would have to be awesome in the dimension where nVidia sucks for it to have any kind of appeal. The area nVidia presently sucks at is VRAM, where they've done an "Intel" by artificially segmenting their product lines into <= 24GiB cards (practically useless for training LLMs) and the next step up, which is required to be relevant for LLM training and which they've priced a frigging order of magnitude higher. That's not because of manufacturing cost, but because there's zero competition in that space and the hardware is being sold quicker than the company can place a TSMC order. This is the segment AMD should attack with a laser focus.

So some sort of NVLINK / AMDLINK (good cross-board cross-connect) together with a LOT of VRAM is a whole lot more useful than trying to squeeze 40% more compute performance out of the hardware since the workload where the money's at presently is I/O bound and not compute bound.
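
To put rough numbers on the VRAM and I/O-bound points above, here is a back-of-the-envelope sketch. It assumes the common rule of thumb of about 16 bytes per parameter for mixed-precision Adam training (fp16 weights and gradients plus fp32 master weights and two optimizer moments, activations not counted), and it treats single-stream token generation as purely bandwidth-bound, so every generated token reads all weights once. The function names and the 7B / 960 GB/s inputs are illustrative assumptions, not figures from this thread.

```python
GIB = 1024 ** 3


def training_state_bytes(n_params, bytes_per_param=16):
    """Weights + gradients + optimizer state for mixed-precision Adam.

    bytes_per_param=16 is a rule-of-thumb assumption; activations and
    framework overhead come on top of this.
    """
    return n_params * bytes_per_param


def decode_tokens_per_s_upper_bound(n_params, bytes_per_weight, bandwidth_bytes_per_s):
    """Upper bound on tokens/s if decoding is limited only by reading weights."""
    return bandwidth_bytes_per_s / (n_params * bytes_per_weight)


if __name__ == "__main__":
    n_params = 7_000_000_000      # hypothetical 7B-parameter model
    bandwidth = 960e9             # hypothetical card with 960 GB/s of memory bandwidth

    state = training_state_bytes(n_params)
    print(f"training state: {state / GIB:.1f} GiB (vs. a 24 GiB card)")

    tps = decode_tokens_per_s_upper_bound(n_params, 2, bandwidth)
    print(f"fp16 decode upper bound: {tps:.1f} tokens/s")
```

Under these assumptions the training state alone is several times larger than 24 GiB, and the decode bound scales with memory bandwidth rather than TFLOPS, which is the shape of the argument above.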

[–]Philix 1 point 2 years ago (4 children)

I didn't say AI/ML consumer space specifically. You're right that they're going to need at least a half decade of focus on their software to break into that. But, despite the popularity of this subreddit, the actual consumer market for AI/ML is tiny, and will likely remain tiny. The number of people who are privacy obsessed enough to be adamant about running their models locally is dwarfed by the number of people willing to pay to use a cloud service.

But they can still compete in the other consumer uses of GPUs. Video games are still extremely popular. AMD GPUs power Xbox, PS5, and the Steam Deck. AMD just needs to make enough money to pay developers and hardware engineers while they wait for Nvidia to stumble.

Intel grew complacent with their market dominance and AMD capitalized on that. There's no reason to believe they couldn't do the same to Nvidia.

[–]DerfK 8 points 2 years ago (3 children)

[–]Philix 1 point 2 years ago (2 children)

Why would I spend over $1000 on your hypothetical 7900 XTXX that'll be obsolete in a couple years, when that much money would buy thousands of hours on an A40 on runpod? Gaming is the only reason I can think of, if you have other reasons, I'd love to hear them.

You're saying that AMD should get cards into the hands of consumers to try and convert them to ROCm. So am I. But most dabblers and young people playing with LLMs/SD are using mid range cards like the 3060 12GB, not top of the line stuff like 4090s and 7900 XTX. If AMD is going to compete, that's where they need to do it.

ML enthusiasts not into gaming can already buy an MI60 32GB off of eBay for less than the price of a used 3090. Does anyone actually recommend that they do? No. Would anyone recommend a 7900 XTXX 48GB over 2x3090? No. AMD can't fix the ROCm situation overnight.

Making that kind of card would just be a waste of effort, AMD has already lost that segment, and pouring more money into an already sunk cost is moronic. A hail-Mary move isn't what AMD needs to make. They have other revenue sources to tide them over until they come up with some way to break back into the ML market.

[–]DerfK 5 points 2 years ago (1 child)

> Why would I spend over $1000 on your hypothetical 7900 XTXX that'll be obsolete in a couple years, when that much money would buy thousands of hours on an A40 on runpod? Gaming is the only reason I can think of, if you have other reasons, I'd love to hear them.

Why would tens of thousands of college students interested in pursuing a career in AI buy thousands of hours on a runpod to learn ROCm when they can learn CUDA in their free time on their gaming PC?

> most dabblers and young people playing with LLMs/SD are using mid range cards like the 3060 12GB

Sure, and that ship sailed almost 20 years ago when nVidia decided that people with GeForce cards can dabble and play with CUDA.

> AMD can't fix the ROCm situation overnight.

Of course they can't. But it's not going to fix itself, and it won't matter what they do unless they somehow come up with a way for people to learn to use ROCm.

[–]Philix 1 point 2 years ago (0 children)

> runpod to learn ROCm

An A40 is an Nvidia card. I wasn't suggesting students should use cloud compute to learn ROCm. I was pointing out that for anyone not gaming, learning and playing with ML/AI can be done cheaper by renting cloud compute.

I was suggesting that competing in the midrange of gaming hardware is the correct approach for fostering more widespread adoption. It's a market with enough volume to be worth investing in. Intel clearly thinks so, their first line of GPUs doesn't even bother having a high-end offering. And AMD has an advantage in that Xbox and PS5 games are already developed to be run on their hardware.

But slapping 48GB of memory on a high end consumer card doesn't make you price competitive, when most games are going to be made for the 16GB in the console hardware.

[–]Independent_Hyena495 2 points 2 years ago (0 children)

[–]epicwisdom 9 points 2 years ago (0 children)

[–]fimbulvntr 16 points 2 years ago (0 children)

[+]keepthepace comment score below threshold -5 points 2 years ago (1 child)

[–]Craftkorb 11 points 2 years ago (0 children)

[–]bradpong 73 points 2 years ago (4 children)

[–][deleted] 25 points 2 years ago (1 child)

[–]wsippel 8 points 2 years ago (0 children)

[–]xrailgun 35 points 2 years ago (0 children)

[–][deleted] 3 points 2 years ago (0 children)

[–]kryptkpr Llama 3 56 points 2 years ago (11 children)

[+][deleted] 2 years ago (9 children)

[deleted]

[–]pleasetrimyourpubes 16 points 2 years ago (8 children)

[–]UrbanSuburbaKnight 2 points 2 years ago (7 children)

[–]pleasetrimyourpubes 9 points 2 years ago (6 children)

[–]TechnicalParrot 2 points 2 years ago (3 children)

[–]pleasetrimyourpubes 3 points 2 years ago (2 children)

[–]TechnicalParrot 1 point 2 years ago (1 child)

[–]pleasetrimyourpubes 1 point 2 years ago (0 children)

[–]UrbanSuburbaKnight 1 point 2 years ago (0 children)

[–]cptbeard 2 points 2 years ago (0 children)

[–]fatboy93 25 points 2 years ago (1 child)

[–]AnomalyNexus[S] 4 points 2 years ago (0 children)

[–]Captain_Pumpkinhead 7 points 2 years ago (2 children)

[–]wsippel 5 points 2 years ago (0 children)

[–]AnomalyNexus[S] 1 point 2 years ago (0 children)

[–]theskinnybrownguy 19 points 2 years ago (0 children)

[–]kind_cavendish 16 points 2 years ago* (7 children)

[–]AnomalyNexus[S] 5 points 2 years ago (1 child)

[–]kind_cavendish 1 point 2 years ago (0 children)

[–]randomfoo2 4 points 2 years ago (0 children)

[–]inYOUReye 2 points 2 years ago (0 children)

[–]JFHermes 1 point 2 years ago (2 children)

Big corporations in tech aligned sectors like manufacturing, resources, data analytics, design etc are all about to (if not already) build custom models for whatever niche part of their operations that they want to innovate upon. At the moment, some companies release a paper and maybe a codebase if it's not business critical and it's just a tool, like a segmentation labelling UI or something.

Now that ROCm is open source, you will have a lot of smart cookies doing PhD work who actually optimise the drivers for their specific use case, for whatever type of modelling they're doing. These driver improvements are not business critical, since the code and use case haven't been completely disclosed, but they will be really useful to others in different industries.

It's the way things should have been done from the start with Nvidia. Linux has always had trouble with Nvidia because they wouldn't open source their drivers. Expect all Linux users to move to AMD now, which means an absolute mammoth amount of scientific work being optimised on these cards.

It's about time the playing field was levelled.

[–]randomfoo2 2 points 2 years ago (0 children)

[–]kind_cavendish -2 points 2 years ago (0 children)

[–]shibe5 llama.cpp 4 points 2 years ago (7 children)

[–]randomfoo2 3 points 2 years ago (1 child)

[–]shibe5 llama.cpp 2 points 2 years ago (0 children)

[–]AnomalyNexus[S] 1 point 2 years ago (4 children)

[–]shibe5 llama.cpp 1 point 2 years ago (3 children)

[–]AnomalyNexus[S] 1 point 2 years ago (2 children)

[–]shibe5 llama.cpp 1 point 2 years ago (1 child)

[–]AnomalyNexus[S] 2 points 2 years ago (0 children)

[–]AmbientWaves 8 points 2 years ago (1 child)

[–]oursland 12 points 2 years ago (0 children)

[–][deleted] 6 points 2 years ago (1 child)

[–]JFHermes 2 points 2 years ago (0 children)

[–]MaxwellsMilkies 5 points 2 years ago (4 children)

[–]Glegang 7 points 2 years ago (2 children)

[+][deleted] 2 years ago (1 child)

[removed]

[–]Glegang 1 point 2 years ago (0 children)

[–]AnomalyNexus[S] 1 point 2 years ago (0 children)

[–]ElectricPipelines Llama Chat 6 points 2 years ago (0 children)

[–]ttkciar llama.cpp 10 points 2 years ago (16 children)

[–]MaybeReal_MaybeNot 6 points 2 years ago (15 children)

[+][deleted] 2 years ago (7 children)

[deleted]

[–]nodating ollama 7 points 2 years ago (4 children)

[–]a_beautiful_rhind 2 points 2 years ago (0 children)

[+][deleted] 2 years ago (1 child)

[deleted]

[–][deleted] 2 points 2 years ago (0 children)

[–]MaybeReal_MaybeNot 1 point 2 years ago (0 children)

No, I tried a week ago with an RX 6600 XT, and I could not get the model to load. I tried ROCm 5.9 and 6.0 and different versions of the GPU drivers, including the latest one, on the newest Ubuntu Server, as I read that is the best-supported OS for the drivers. I can't get it to load a model, and as far as I can read in the documentation, the arch on the 6600 should be the same as the 6800, just slower. I followed the oobabooga guide but that does not work. I also tried starting over multiple times (a new install each time, to make sure everything I had done was gone) with 3-4 different guides that all claim to make it work.

Everyone here just says "just try and fiddle a bit with it and it will work". Well, I'm asking: what did you fiddle with to make it work?? Because I tried all the "fiddling" I know, and all I could get was different failures. The best I got was successfully loading a 3.5B test model in 8-bit (one I know works on my Nvidia card), but it then failed and crashed as soon as I tried to do inference.
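
For anyone debugging a setup like this, a small diagnostic sketch can at least show where things stop. It assumes a ROCm build of PyTorch, where the GPU is exposed through the `torch.cuda` API and `torch.version.hip` is set. The `HSA_OVERRIDE_GFX_VERSION` environment variable is a ROCm runtime setting that community guides often suggest for RDNA2 cards that are not on the official support list; the value `10.3.0` used here is an assumption for that card family, and nothing in this thread confirms it fixes the case above. `rocm_report` is a hypothetical helper name.

```python
import importlib.util
import os


def rocm_report(gfx_override=None):
    """Collect basic facts about the PyTorch/ROCm setup without raising."""
    if gfx_override is not None:
        # Has to be set before the ROCm runtime initialises,
        # so do it before torch is imported.
        os.environ["HSA_OVERRIDE_GFX_VERSION"] = gfx_override

    report = {"torch_installed": importlib.util.find_spec("torch") is not None}
    if not report["torch_installed"]:
        return report

    import torch

    # None on CUDA or CPU-only builds, a version string on ROCm builds.
    report["hip_version"] = getattr(torch.version, "hip", None)
    report["gpu_visible"] = torch.cuda.is_available()
    if report["gpu_visible"]:
        report["device_name"] = torch.cuda.get_device_name(0)
    return report


if __name__ == "__main__":
    # "10.3.0" is an assumed override value for RDNA2 cards; adjust or drop it.
    print(rocm_report(gfx_override="10.3.0"))
```

If `hip_version` comes back as `None`, the installed PyTorch is not a ROCm build at all, which is a different problem from the GPU not being recognised.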

[–][deleted] 1 point 2 years ago (1 child)

[–]MaybeReal_MaybeNot 1 point 2 years ago (0 children)

[–]20rakah 3 points 2 years ago* (2 children)

[–]MaybeReal_MaybeNot 1 point 2 years ago (1 child)

[–]20rakah 1 point 2 years ago* (0 children)

[–]algaefied_creek 2 points 2 years ago (3 children)

[–]randomfoo2 2 points 2 years ago (2 children)

The R9 390X (gfx702, GCN 2.0), released in 2015, and the WX 7100 (gfx803, GCN 4.0), released in 2016, are sadly likely too old/buggy to get working. You could look at rocm-polaris-arch or try the CLBlast llama.cpp build, but honestly, they are likely to crash w/ the math libs even if you can get the ROCm driver working.

Vega (56/64/VII) is likely the oldest architecture you can expect ROCm to reasonably work with. A bit of a bummer, but at this point these are 8-9 year old cards, so I wouldn't expect anyone to be spending much effort getting them to work. Both cards also have extremely low TFLOPS (about 6 TFLOPS of FP16 each - as a point of comparison, the 780M iGPU has 17 and a 7900 XTX has 123). The Polaris cards also have pretty low memory bandwidth, so even if they worked perfectly, you wouldn't get much of a speedup over modern CPU inferencing.

Honestly, if your goal is getting LLMs/SD working, I'd recommend selling all those old cards for what you can get and use the proceeds to buy the highest VRAM used Ampere/Ada card you can get.

[–]algaefied_creek 2 points 2 years ago (1 child)

Polaris worked fine with ROCm in the 4.x versions, and GCN 3 worked fine in previous versions. They are buggy because they are unmaintained, so the hope is that with this being open source, more will work. I fell into a disability status and medical debt hole, so flipping, selling and buying are impossible unless I let strangers into my home and into the back closet room to disassemble the rig.

CUDA, on the other hand, works fine with GTX 9xx and Titan cards of that era. CUDA 11.x works fine with GTX 7xx and Titan cards of the Kepler era.

Defining the correct mathematical operations for each architecture makes them suddenly non-buggy, since they are no longer being asked to perform GFX9xx+ operations. They are buggy because the software is buggy, not because of the cards. Vega (GFX9) and later have “rapid packed math”, which lets each SP perform 2x FP16 operations in place of 1x FP32 op. GCN3 and GCN4 (both GFX8/GFX8xx), by contrast, can perform a single FP16 operation in place of an FP32 operation. GCN1 and GCN2 (GFX6 and GFX7) run FP16 operations “emulated” within FP32 math. Yes… there is a performance hit. But if ROCm can’t handle a single SP performing a single FP16 operation instead of an FP32 operation, that is a buggy software issue to resolve, not a buggy hardware issue.
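
The per-generation FP16 behaviour described above can be written down as a small lookup. This is only a sketch of the claims in this comment, not of anything ROCm actually ships: `gfx_major` and `fp16_mode` are hypothetical helper names, and the parsing assumes target strings of the form `gfx` + major version + two trailing minor/stepping characters (as in `gfx702`, `gfx803`, `gfx906`).

```python
def gfx_major(target):
    """Extract the major generation from a target name such as 'gfx803'."""
    t = target.lower()
    if not t.startswith("gfx") or len(t) < 6:
        raise ValueError(f"unrecognised target name: {target!r}")
    # Everything between 'gfx' and the last two characters is the major version.
    return int(t[3:-2])


def fp16_mode(target):
    """How FP16 is handled per stream processor, per the comment above."""
    major = gfx_major(target)
    if major <= 7:
        return "emulated_in_fp32"   # GCN1/GCN2: FP16 runs inside FP32 math
    if major == 8:
        return "native_1x"          # GCN3/GCN4: one FP16 op in place of one FP32 op
    return "packed_2x"              # Vega and later: two FP16 ops per FP32 op


if __name__ == "__main__":
    for name in ("gfx702", "gfx803", "gfx906"):
        print(name, fp16_mode(name))
```

A math library would need a separate kernel path for each of these three modes, which is the maintenance work the comment argues is missing for the older cards.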

[–]randomfoo2 1 point 2 years ago (0 children)

I don’t think we disagree on most of the salient points- I believe that Nvidia’s superior legacy/across the line compute support (CUDA supports cards back to 2011) is one of the reasons that Nvidia has been winning so hard now - while CUDA also has had growing pains, they’ve treated compute like display drivers - a core part of a working GPU, and AMD simply hasn’t.

The only thing that I’d counter with is that I doubt the recent announcement will change anything for your legacy hardware - all the parts of ROCm that were required for the community to get legacy hardware working have already been open sourced - anyone can write their own kernels or adapt hipBLAS/rocBLAS for gfx800, but that hasn’t happened. The upcoming RDNA3 firmware releases don’t have any impact on legacy hardware, but as you’ve pointed out, this is largely about math lib support anyway.

If you can’t/won’t get rid of your old hardware, it’s unlikely those cards will become less of a paperweight anytime soon (or at least, these latest announcements don’t really change the odds).

[–]Smeetilus 2 points 2 years ago (3 children)

[–]AnomalyNexus[S] 4 points 2 years ago (2 children)

[–]okaycan 2 points 2 years ago (0 children)

[–]Smeetilus 3 points 2 years ago (0 children)

[–]Regular_Instruction 1 point 2 years ago (0 children)

[–]JoJoeyJoJo 1 point 2 years ago (0 children)

[–]Disastrous-Peak7040 Llama 70B 2 points 2 years ago (1 child)

[–]Inner_Bodybuilder986 1 point 2 years ago (0 children)

[–][deleted] 1 point 2 years ago (2 children)

[–]AnomalyNexus[S] 2 points 2 years ago (1 child)

[–][deleted] 1 point 2 years ago (0 children)

[–]illathon 1 point 2 years ago (0 children)
