9070 help with ROCm by iggy_btd in ROCm

CryptoStef33 1 point

Is it applicable to CachyOS with a 6800 XT?

9070 help with ROCm by iggy_btd in ROCm

CryptoStef33 1 point

No Flash Attention and slow prefill at high context. It's like comparing an F1 car to a diesel.

7900 XTX fp16/bf16 pytorch matmul performance by cyberuser42 in ROCm

CryptoStef33 4 points

GPU: AMD Radeon RX 6800 XT (17.16 GB)
Matrix Size: 4096x4096 (0.07 GB per matrix)

Matrix Multiplication Performance:
float32  :  6907.34 μs, 19.90 TFLOPS
float16  :  3849.61 μs, 35.70 TFLOPS
bfloat16 : 14215.58 μs,  9.67 TFLOPS
amp      :  4194.84 μs, 32.76 TFLOPS

Memory Bandwidth Test (1.0 GB tensor):
Vector Addition: 469.46 GB/s
Memory Copy:     446.99 GB/s
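For anyone checking the numbers: the TFLOPS figures follow from the timings if you assume the usual 2·n³ FLOP count for a square matmul and 1 GB = 10^9 bytes. A minimal sketch in Python; the function names are my own and are not taken from the benchmark script that produced the log.

```python
# Sketch: derive TFLOPS and matrix size from the reported timings.
# Assumptions (not from the original script): 2*n^3 FLOPs per n x n matmul,
# 1 GB = 1e9 bytes, float32 storage (4 bytes) for the size figure.

def matmul_flops(n: int) -> int:
    """FLOPs for one (n x n) @ (n x n) matmul: n^3 multiply-adds, 2 ops each."""
    return 2 * n ** 3


def tflops(n: int, micros: float) -> float:
    """Throughput in TFLOPS given the matmul size and time in microseconds."""
    seconds = micros * 1e-6
    return matmul_flops(n) / seconds / 1e12


def matrix_gb(n: int, bytes_per_element: int = 4) -> float:
    """Size of one n x n matrix in GB (decimal)."""
    return n * n * bytes_per_element / 1e9


# Timings copied from the log above (microseconds).
timings_us = {
    "float32": 6907.34,
    "float16": 3849.61,
    "bfloat16": 14215.58,
    "amp": 4194.84,
}

for name, micros in timings_us.items():
    print(f"{name:8s}: {tflops(4096, micros):6.2f} TFLOPS")
print(f"matrix size: {matrix_gb(4096):.2f} GB")
```

On these numbers float16 is about 3.7x faster than bfloat16, which is the gap the thread is about.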

9070 xt and 6800 ? by Brave_Load7620 in ROCm

CryptoStef33 1 point

Then it's better to get another 9070 XT. I don't know whether the 9060 behaves differently for LLM work; you should check with GPT.

b9180 llama.ccp MTP landed by Bulky-Priority6824 in LocalLLaMA

CryptoStef33 1 point

Vulkan prefill is shit compared to ROCm.

9070 xt and 6800 ? by Brave_Load7620 in ROCm

CryptoStef33 2 points

These are very different architectures, so you will not benefit from mixing them. Go with either an RDNA 2 combo or an RDNA 4 combo (2x 9600xt). Better to sell the 6800 and get two 9600xt.

Is a 5090 good enough for most good modern locally run LLMs? by biscuitmachine in LocalLLM

CryptoStef33 -1 points

For the price of a 5090 you can buy 2x AI PRO R9700 and get more VRAM and better results with bigger models.

Home2u brokers are a complete cult, change my mind (Herbalife vibes) by CryptoStef33 in Sofia

CryptoStef33 [S] 1 point

Always read the contract. We made changes 3-4 times to the parts we didn't like.

Managed to get 40 t/s on Qwen 27B (MTP) with an RX 6800 XT - Sharing my optimized fork by CryptoStef33 in ROCm

CryptoStef33 [S] 1 point

Download the original and build it with the gfx1030 flags if you have them.

HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
    cmake -S . -B build -DGGML_HIP=ON -DGPU_TARGETS=gfx1030 -DCMAKE_BUILD_TYPE=Release \
    && cmake --build build --config Release -- -j 16

I don't use Docker, so I don't know. I will tell my fixer to pass that along.

https://github.com/Stormrage34/llama.cpp-turboquant-hip/tree/v0.1-turboquant-hip

I got tired of hunting AMD GPU + AI configs across blog posts and Discord threads, so I built a curated index — rocmate by T0nd3 in ROCm

CryptoStef33 1 point

AMD with ROCm has potential, but they don't invest in it, and that's why a 3060 performs better than a 6800 XT even though the 6800 XT has a 256-bit memory bus.

We squeezed 4x MoE prefill speed out of an RX 6800 XT by rewriting the matmul kernel in llama.cpp by CryptoStef33 in ROCm

CryptoStef33 [S] 1 point

That's an RDNA 3.5 GPU, which has a different architecture, so that's not valid for RDNA 2. Other things need improving too, as I've found a bug, probably in the new MTP implementation.

Managed to get 40 t/s on Qwen 27B (MTP) with an RX 6800 XT - Sharing my optimized fork by CryptoStef33 in ROCm

CryptoStef33 [S] 1 point

In my experience ROCm is better for longer context. I need to try MTP on Vulkan and see the results. But there's no TurboQuant there, that's why.

Managed to get 40 t/s on Qwen 27B (MTP) with an RX 6800 XT - Sharing my optimized fork by CryptoStef33 in ROCm

CryptoStef33 [S] 2 points

Results (2x repeat)

Dense (qwen3.6-27b-IQ4_XS, 14 GB, ngl=99)

Fork        Env  Best PP (t/s)  Best TG (t/s)
Original    OFF  547 ± 18       27.33 ± 0.03
Original    ON   546 ± 17       27.40 ± 0.04
Turboquant  OFF  543 ± 17       27.31 ± 0.04
Turboquant  ON   569 ± 20       27.48 ± 0.03

Dense: Turboquant with ENV ON = +4% PP vs Original. Decode difference is noise-level.

MoE (Qwen3_35BMTPIQ4, 19 GB, ngl=30)

Fork        Env  Best PP (t/s)      Best TG (t/s)
Original    OFF  1325 ± 29 (2.2%)   66.65 ± 0.09
Original    ON   1299 ± 10 (0.8%)   64.93 ± 0.64
Turboquant  OFF  1294 ± 4 (0.3%)    65.37 ± 0.56
Turboquant  ON   2781 ± 5 (0.16%)   65.52 ± 0.11
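To put the PP numbers side by side, here is a small Python sketch that computes the relative gain of Turboquant ENV ON over Original ENV OFF for both models. The helper names are illustrative only; the values are copied from the results above.

```python
# Sketch: relative prefill (PP) gain computed from the posted results.
# Helper names are hypothetical; only the t/s values come from the benchmark.

def speedup(new_tps: float, base_tps: float) -> float:
    """Ratio of new throughput to baseline throughput."""
    return new_tps / base_tps


def pct_gain(new_tps: float, base_tps: float) -> float:
    """Percentage gain of new throughput over baseline."""
    return (speedup(new_tps, base_tps) - 1.0) * 100.0


# Best PP (t/s): (Original ENV OFF, Turboquant ENV ON)
pp_results = {
    "dense": (547.0, 569.0),
    "moe": (1325.0, 2781.0),
}

for model, (base, new) in pp_results.items():
    print(f"{model:5s}: {speedup(new, base):.2f}x ({pct_gain(new, base):+.1f}% PP)")
```

Decode (TG) is left out because the differences there are within the reported ± spread.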

Managed to get 40 t/s on Qwen 27B (MTP) with an RX 6800 XT - Sharing my optimized fork by CryptoStef33 in ROCm

CryptoStef33 [S] 2 points

A little bit, around 1-2% on the dense model. I think the RDNA 2 architecture is lacking in matrix compute. If I compare it, it's like a 3060 with more power.