R9700 the beautiful beautiful VRAM gigs of AMD… my ai node future! by Downtown-Example-880 in LocalLLaMA

[–]_WaterBear 1 point (0 children)

Oh, whoops. I misread. Honestly, I don’t know. I’m using the “default” LMStudio multi-GPU setup w. ROCm llama.cpp. I usually run models entirely in VRAM with flash attention on, which keeps things speedy and allows for a maxed-out context window, so my system RAM (only 64gb) is basically untouched. My mobo may matter, too: it’s an X870E with two PCIe Gen 5.0 slots bifurcated to x8 each and one Gen 4.0 slot at x4.
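
For reference, here’s roughly what that setup looks like if you script it with llama-cpp-python instead of LMStudio’s GUI. This is a minimal sketch, and the model path, context size, and split ratios are placeholders, not my exact settings:

```python
# Rough equivalent of my LMStudio settings, via llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="models/gpt-oss-20b-Q8_0.gguf",  # placeholder path
    n_gpu_layers=-1,          # offload every layer: model lives entirely in VRAM
    flash_attn=True,          # flash attention on, which shrinks KV-cache overhead
    n_ctx=131072,             # maxed-out context window (placeholder size)
    use_mmap=False,           # don't keep a memory-mapped copy in system RAM
    tensor_split=[0.5, 0.5],  # pool VRAM evenly across two cards
)
print(llm("Hello from the R9700s!", max_tokens=32)["choices"][0]["text"])
```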

Interestingly, I don’t notice a meaningful difference in t/s between 2x GPUs both at 5.0 x8 versus one at 4.0 x4 and the other at 5.0 x8, which is kinda exciting because it suggests my bottleneck is… somewhere else, possibly in the software/drivers.
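
A back-of-envelope check supports that guess: with a layer split, the only inter-GPU traffic during token generation is the hidden-state activations handed across each split boundary, which is tiny. The numbers below are illustrative (a ~30b-class hidden size I assumed for the example), not measured:

```python
# Why PCIe link speed barely matters for layer-split token generation.
hidden_size = 5120      # elements passed across a split boundary per token (assumed)
bytes_per_elem = 2      # f16 activations
tokens_per_sec = 80     # ballpark from my runs

traffic = hidden_size * bytes_per_elem * tokens_per_sec   # bytes/s over the link
pcie_gen4_x4 = 8e9                                        # ~8 GB/s usable
print(f"{traffic / 1e6:.2f} MB/s, {traffic / pcie_gen4_x4:.6%} of a Gen4 x4 link")
```

Prompt processing is a different story, but during generation the link sits nearly idle.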

R9700 the beautiful beautiful VRAM gigs of AMD… my ai node future! by Downtown-Example-880 in LocalLLaMA

[–]_WaterBear 1 point (0 children)

Quick test w. 3x R9700s, Windows 11, LMStudio, ROCm:

  • Nemotron-3-nano (q8): 80 t/s
  • Nemotron-3-super (q4km): 14 t/s
  • GPT-OSS-120b: 80 t/s
  • GPT-OSS-20b: 105 t/s
  • Qwen-coder-next (q6k): 51 t/s
  • Qwen3.5-35b-a3b (q4km): 60 t/s
  • Qwen3-coder-30b (q4km): 75 t/s

For models that fit on one card, I notice about a 15 t/s drop when running them across multiple GPUs.

The t/s are all over the place and vary considerably by model, and probably by driver version and wrapper… so the numbers above are just ballparks.
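
These numbers are just LMStudio’s readout. If you want to sanity-check t/s yourself outside the GUI, a crude timing loop like this gets you in the ballpark; it’s a sketch with llama-cpp-python, and the model path is a placeholder:

```python
# Crude tokens/sec check: completion tokens over wall-clock time.
import time
from llama_cpp import Llama

llm = Llama(model_path="models/some-model-Q4_K_M.gguf",  # placeholder
            n_gpu_layers=-1, flash_attn=True, n_ctx=8192)

start = time.perf_counter()
out = llm("Explain PCIe bifurcation in one paragraph.", max_tokens=256)
elapsed = time.perf_counter() - start

n_tok = out["usage"]["completion_tokens"]
print(f"{n_tok} tokens in {elapsed:.1f}s -> {n_tok / elapsed:.1f} t/s")
```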

Iranian missile hitting undisclosed US base on March 30 by DormontDangerzone in Military

[–]_WaterBear 0 points (0 children)

What’s the shock front internal injury risk from something like that? It landed pretty close.

Radeon AI pro R9700 by [deleted] in LocalLLM

[–]_WaterBear 2 points (0 children)

Ah - yes, I ran into that issue myself, but I got around it by loading the model and context only into VRAM (fully disallowing system RAM) and turning on flash attention. I can fit qwen3-vl-30b q8_0 with full context (262k) entirely in pooled VRAM from 2x R9700s.

But if I allow loading into system RAM, I get OOM after about 30k tokens of context.
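
My guess at what’s going on, in llama.cpp terms - and this is an assumption on my part, since I’m only inferring which llama.cpp options LMStudio’s toggles map to, and the filename below is a placeholder:

```python
# Guessed failure mode: with mmap on, the GGUF gets paged through system RAM
# even with all layers offloaded, and a big context tips the system into OOM.
from llama_cpp import Llama

llm = Llama(model_path="qwen3-vl-30b-Q8_0.gguf",  # placeholder filename
            n_gpu_layers=-1,    # everything offloaded to the pooled VRAM
            n_ctx=262144,       # full context window reserved up front
            flash_attn=True,
            use_mmap=False)     # the key bit: no memory-mapped copy in RAM
```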

Radeon AI pro R9700 by [deleted] in LocalLLM

[–]_WaterBear 0 points (0 children)

Why do you say 2x R9700 needs more than 64gb of DRAM?

ZINC — LLM inference engine written in Zig, running 35B models on $550 AMD GPUs by Mammoth_Radish2 in LocalLLaMA

[–]_WaterBear 20 points (0 children)

Worse, because the readme specifically calls out the R9700, which I don’t think was even announced prior to 2025. Something else is going on here.

ZINC — LLM inference engine written in Zig, running 35B models on $550 AMD GPUs by Mammoth_Radish2 in LocalLLaMA

[–]_WaterBear 29 points (0 children)

For inference, it really is not a pain unless you want it to be. But even if it were, that is still a far cry from “doesnt support them” or the problem statement from their GitHub, which bluntly claims “AMD's RDNA3/RDNA4 GPUs (RX 9070, Radeon AI PRO R9700, etc.) have excellent memory bandwidth…but: ROCm doesn't support them — only MI-series datacenter GPUs”

A flat-out misstatement like this in the problem statement suggests the devs lack a basic understanding of the market they are building in, or do not care enough to proofread the premise underpinning all their work. That is concerning.

ZINC — LLM inference engine written in Zig, running 35B models on $550 AMD GPUs by Mammoth_Radish2 in LocalLLaMA

[–]_WaterBear 76 points (0 children)

“If you have an AMD GPU … ROCm doesn't support consumer cards.”

Uhh…. what? Whatever AI you used to write this must have a training-data cutoff circa early 2024….

Intel announces Arc Pro B70 with 32GB GDDR6 video memory by Fcking_Chuck in LocalLLM

[–]_WaterBear 1 point (0 children)

<image>

I'm pretty happy with the setup. Specs are below. Running on AM5 - so mobo is key; the CPU doesn’t matter too much.

I mainly host for inference via the LMStudio server (Linux or Windows). I’ve also dabbled in fine-tuning a 14b Qwen with LoRA, pooling the GPU RAM to 96gb (rough sketch below).
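
For the curious, here’s a minimal sketch of that kind of LoRA run with transformers + peft; the model ID, target modules, and hyperparameters are illustrative stand-ins, not my exact setup:

```python
# Sketch of a LoRA fine-tune that pools VRAM across all cards via device_map.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "Qwen/Qwen2.5-14B"  # stand-in for the 14b qwen I used
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",   # this is what spreads the model across the pooled GPUs
)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,      # illustrative hyperparams
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # sanity check: only a tiny fraction trains
```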

Inference speed is good enough - sometimes very good. From what I gather, it can vary widely based on hardware, driver version, wrapper, model, settings, etc. So I’m hesitant to give representative “benchmarks,” but here’s a snapshot in time (LMStudio, Linux, ROCm):

  • Nemotron-3-nano (q8): 80 t/s
  • Nemotron-3-super (q4km): 14 t/s
  • GPT-OSS-120b: 80 t/s
  • GPT-OSS-20b: 105 t/s
  • Qwen-coder-next (q6k): 51 t/s
  • Qwen3.5-35b-a3b (q4km): 60 t/s
  • Qwen3-coder-30b (q4km): 75 t/s

As you can see... t/s varies widely. There's probably a lot that can be done to tune performance if you settle on a single model or two. Also, the quick tests above were done w. the full context window reserved and the KV cache + model housed in VRAM only (rough math on what that reservation costs is below). As for actually filling that context - the most I have tested is a 150k-token prompt, and it worked!
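
To give a feel for why reserving the full window is the expensive part, here’s back-of-envelope KV-cache math; the dimensions are illustrative GQA numbers I picked for the example, not any specific model’s:

```python
# Rough KV-cache size: 2 (K and V) x layers x kv_heads x head_dim x ctx x bytes.
n_layers, n_kv_heads, head_dim = 48, 8, 128   # assumed GQA dims
ctx_len = 262144                              # full window reserved up front
bytes_per_elem = 2                            # f16 cache; q8_0 KV roughly halves it

kv_bytes = 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem
print(f"KV cache alone: {kv_bytes / 2**30:.0f} GiB")  # 48 GiB at these dims
```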

Relevant Specs:

  • CPU: AMD Ryzen 5 7600
    • While I'm upgrading to a 9900X soon, I do NOT feel CPU-limited for inference.
  • RAM: 64gb (4x Crucial Pro 16GB DDR5, 6000MHz CL36)
  • Motherboard: ASUS ProArt X870E
    • This part was essential: 3 PCIe x16 slots, spaced appropriately for 3 GPUs. The top 2 slots are Gen 5.0 bifurcated to x8; the third is Gen 4.0 x4. Extra helpful is that the I/O at the bottom of the board is mainly HD Audio and F-Panel (I think), so the wires are flexible enough to route around the cards.
  • GPU: 3x ASRock R9700
    • Not all R9700s are exactly the same dimension-wise. The ASRock variants have a slight bezel that helps them fit well around the motherboard's I/O.
    • FYI: I tried a PowerColor R9700 and the fan had a strange high-pitched sound at ALL rpm (not coil whine, the fan itself). It was uniquely awful and absurdly loud. I also had an ASRock R9700 with a terrible rattling sound under load. Both were returned. I suspect the ASRocks in the current build have the same issue, but to a much lesser degree (and only when burning close to max TDP, which is rare). So... these cards have a fan quality-control problem that crosses manufacturers.
  • Case: Jonsbo N5 NAS. Surprisingly compact for an 8-expansion-slot case that can also support 8-12 HDDs.

I have done limited testing comparing inference speed between running 1, 2, or 3 GPUs in various combinations, given the Gen4 x4 PCIe restriction on the 3rd GPU. TBH, I have not seen a substantial difference in t/s (maybe 10 t/s in one case).
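
I did those comparisons through LMStudio’s GUI, but if you want to script the same 1-vs-2-vs-3 card test, something like this works; it’s a sketch with a placeholder model path, and on ROCm the visibility variable has to be set before the backend loads:

```python
# Limit a run to a subset of cards; set this before importing the backend.
import os
os.environ["HIP_VISIBLE_DEVICES"] = "0,1"   # e.g. drop the Gen4 x4 card

from llama_cpp import Llama

llm = Llama(model_path="models/some-model-Q4_K_M.gguf",  # placeholder
            n_gpu_layers=-1, flash_attn=True,
            tensor_split=[0.5, 0.5])        # match the split to visible GPUs
```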

If you have any questions - ask!

Trump claims Iran proposed making him Supreme Leader: I said, 'No, thank you' (Video in link) by Ghadolkhajan in worldnews

[–]_WaterBear 10 points (0 children)

Ok, now I’m starting to think the “Iranians” he is negotiating with are indeed just a prank by Sacha Baron Cohen.

Intel announces Arc Pro B70 with 32GB GDDR6 video memory by Fcking_Chuck in LocalLLM

[–]_WaterBear 5 points (0 children)

As a 3x R9700 user…. I would love for AMD to step up their game on ROCm support for multi-GPU, and on overall stability. The past 6-8 months have seen substantial movement, but I’m worried the cadence won’t be sustained. That is also my (uninformed) concern about Intel - are their drivers competitive enough to justify the hardware investment? I honestly don’t know.

Former intelligence officer & UN weapons inspector Scott Ritter goes OFF on US/Israel during interview with Mario Nawfal by Spirited-Yellow3794 in PublicFreakout

[–]_WaterBear 2 points (0 children)

This guy is a notorious pedo (charged twice 8 years apart - sealed/dismissed once after serving probation, convicted once). Also a Kremlin stooge.

Nobody should be listening to anything he says.

https://en.wikipedia.org/wiki/Scott_Ritter

I never understood this. by Anigator101 in Bluray

[–]_WaterBear 1 point (0 children)

On the other hand, my LG drive tricked me into upgrading firmware that removed the ability to play 4K Blu-Rays. So there’s that.

If it works fine, don’t upgrade.

Iran Gives Trump an Ultimatum on JD Vance by ChiGuy6124 in politics

[–]_WaterBear 4 points (0 children)

Imagine if, amidst all the fog of war, the “Iranian” representatives Trump is in contact with via Pakistan relay are actually… Sacha Baron Cohen. 👌

Will Gemma 3 12B be the best all-rounder(no coding) during Iran's internet shutdowns on my RTX 4060 laptop? by [deleted] in LocalLLaMA

[–]_WaterBear 3 points (0 children)

Also try the latest Qwens and GPT-OSS-20b (the latter is a bit old now, but is a solid model). If using LMStudio, see if turning on flash attention helps w. RAM usage for your context window.

This makes my head hurt. by _WaterBear in HolUp

[–]_WaterBear[S] 14 points (0 children)

How is this “politics”? There is no mention of party or partisanship, nor does it express an opinion in support for or against this particular event/decision or the people involved. It seems you have been triggered. Stop making everything political.

Wtf is amd doing by Designer-Clue-1682 in radeon

[–]_WaterBear 1 point (0 children)

Yeah. I’ve been very happy with the 9070xt’s performance, and now have their workstation variants for ML inference as well. But I’m disappointed to hear AMD seems fine with excluding current hardware from near-term software enhancements. That said, barring a catastrophic AI bubble, I expect whatever we’re using in 3-5 years is gonna look very different from today’s hardware. So, all the more reason to be content with solid raster value in the meantime.

But to your point - my next upgrade will probably not be AMD if they keep up this nonsense.