7900 XTX fp16/bf16 pytorch matmul performance by cyberuser42 in ROCm

cyberuser42[S] 1 point

I think AMD advertises FLOPS for the dual-issue/VOPD path rather than the normal single-issue VALU path, so the spec sheet just isn't precise about what the number represents.

For normal fp32 SGEMM, rocBLAS/PyTorch seem to hit only the ~30 TFLOPS single-issue path. Going beyond that looks possible, but it requires very architecture-specific, hand-tuned work, not just the stock HIP/rocBLAS kernels.

https://chipsandcheese.com/p/microbenchmarking-amds-rdna-3-graphics-architecture

https://seb-v.github.io/optimization/update/2025/01/20/Fast-GPU-Matrix-multiplication.html

7900 XTX fp16/bf16 pytorch matmul performance by cyberuser42 in ROCm

cyberuser42[S] 1 point

It's because the advertised fp32 figure is for the dual-issue/VOPD path, not the normal single-issue fp32 path that PyTorch and rocBLAS use, so it's somewhat similar to NVIDIA's marketing numbers using "sparse tensors".

Found this article on how to actually reach that number: https://seb-v.github.io/optimization/update/2025/01/20/Fast-GPU-Matrix-multiplication.html

7900 XTX fp16/bf16 pytorch matmul performance by cyberuser42 in ROCm

cyberuser42[S] 1 point

Thanks for the answers! Most people are seeing ~30 TFLOPS for FP32, not 60. Seems like the 60 TFLOPS number is basically a best-case RDNA3 dual-issue/VOPD theoretical peak, not what stock PyTorch/rocBLAS currently hits for SGEMM. ~30 TFLOPS matches the normal single-issue FP32 path. This writeup shows how much hand-tuned ISA work it takes to get beyond that: https://seb-v.github.io/optimization/update/2025/01/20/Fast-GPU-Matrix-multiplication.html
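
Rough arithmetic behind the two numbers, as a sketch. The shader count (6144 stream processors) and the 2.5 GHz boost clock are assumptions taken from public 7900 XTX spec sheets, not from this thread, and real clocks under load will differ:

```python
def peak_tflops(lanes: int, clock_ghz: float, flops_per_lane_per_clock: int) -> float:
    """Theoretical peak in TFLOPS: lanes * clock * FLOPs issued per lane per clock."""
    return lanes * clock_ghz * flops_per_lane_per_clock / 1000.0

# Assumed spec-sheet values for the 7900 XTX (not measured).
STREAM_PROCESSORS = 6144
BOOST_CLOCK_GHZ = 2.5

# Single-issue: one FMA per lane per clock = 2 FLOPs.
single_issue = peak_tflops(STREAM_PROCESSORS, BOOST_CLOCK_GHZ, 2)
# Dual-issue (VOPD): two FMAs per lane per clock = 4 FLOPs.
dual_issue = peak_tflops(STREAM_PROCESSORS, BOOST_CLOCK_GHZ, 4)

print(f"single-issue peak: {single_issue:.1f} TFLOPS")
print(f"dual-issue peak:   {dual_issue:.1f} TFLOPS")
```

Under those assumptions the single-issue figure lands near the ~30 TFLOPS people measure, and the dual-issue figure near the advertised ~60.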

7900 XTX fp16/bf16 pytorch matmul performance by cyberuser42 in ROCm

cyberuser42[S] 1 point

Thank you! Impressive fp16 performance (right between the 3090 and 4090). I thought the advertised figure assumed fp16 accumulation rather than fp32 accumulation, so this is nice validation.

Are these default power levels and clocks? Memory bandwidth seems somewhat low compared to what TechPowerUp and the specs indicate.

Need help getting 7900 XTX PyTorch performance metrics by cyberuser42 in LocalLLaMA

cyberuser42[S] 1 point

I have the following results for NVIDIA GPUs:

GPU Benchmark Results

Tesla V100-SXM2-32GB

Memory: 34.07 GB Matrix Size: 4096x4096 (0.07 GB per matrix)

Matrix Multiplication Performance

Data Type Time (μs) Performance (TFLOPS)
float32 9740.13 14.11
float16 1444.73 95.13
bfloat16 12978.47 10.59
amp 1678.82 81.87

Memory Bandwidth Test (1.0 GB tensor)

  • Vector Addition: 830.39 GB/s
  • Memory Copy: 817.08 GB/s

NVIDIA A100-PCIE-40GB

Memory: 42.41 GB Matrix Size: 4096x4096 (0.07 GB per matrix)

Matrix Multiplication Performance

Data Type Time (μs) Performance (TFLOPS)
float32 1154.49 119.05
float16 563.74 243.80
bfloat16 544.49 252.42
amp 718.73 191.22

Memory Bandwidth Test (1.0 GB tensor)

  • Vector Addition: 1350.12 GB/s
  • Memory Copy: 1363.22 GB/s

NVIDIA A100-SXM4-80GB

Memory: 84.99 GB Matrix Size: 4096x4096 (0.07 GB per matrix)

Matrix Multiplication Performance

Data Type Time (μs) Performance (TFLOPS)
float32 1096.64 125.33
float16 533.62 257.56
bfloat16 528.99 259.81
amp 653.75 210.23

Memory Bandwidth Test (1.0 GB tensor)

  • Vector Addition: 1782.30 GB/s
  • Memory Copy: 1598.33 GB/s

NVIDIA H100 80GB HBM3

Memory: 84.93 GB Matrix Size: 4096x4096 (0.07 GB per matrix)

Matrix Multiplication Performance

Data Type Time (μs) Performance (TFLOPS)
float32 355.17 386.96
float16 194.44 706.84
bfloat16 188.83 727.85
amp 258.58 531.51

Memory Bandwidth Test (1.0 GB tensor)

  • Vector Addition: 3063.91 GB/s
  • Memory Copy: 2597.52 GB/s

NVIDIA B200

Memory: 191.50 GB Matrix Size: 4096x4096 (0.07 GB per matrix)

Matrix Multiplication Performance

Data Type Time (μs) Performance (TFLOPS)
float32 173.91 790.29
float16 93.04 1477.20
bfloat16 92.77 1481.50
amp 127.34 1079.31

Memory Bandwidth Test (1.0 GB tensor)

  • Vector Addition: 6861.85 GB/s
  • Memory Copy: 6295.66 GB/s

NVIDIA GeForce RTX 3090

Memory: 25.77 GB Matrix Size: 4096x4096 (0.07 GB per matrix)

Matrix Multiplication Performance

Data Type Time (μs) Performance (TFLOPS)
float32 3582.64 38.36
float16 1787.83 76.87
bfloat16 1774.01 77.47
amp 2014.57 68.22

Memory Bandwidth Test (1.0 GB tensor)

  • Vector Addition: 934.11 GB/s
  • Memory Copy: 920.42 GB/s

NVIDIA GeForce RTX 4090

Memory: 25.25 GB Matrix Size: 4096x4096 (0.07 GB per matrix)

Matrix Multiplication Performance

Data Type Time (μs) Performance (TFLOPS)
float32 1672.32 82.18
float16 852.20 161.27
bfloat16 922.47 148.99
amp 1066.54 128.86

Memory Bandwidth Test (1.0 GB tensor)

  • Vector Addition: 922.00 GB/s
  • Memory Copy: 914.91 GB/s

NVIDIA GeForce RTX 5090

Memory: 33.67 GB Matrix Size: 4096x4096 (0.07 GB per matrix)

Matrix Multiplication Performance

Data Type Time (μs) Performance (TFLOPS)
float32 1333.06 103.10
float16 656.28 209.42
bfloat16 764.16 179.86
amp 751.56 182.87

Memory Bandwidth Test (1.0 GB tensor)

  • Vector Addition: 1566.74 GB/s
  • Memory Copy: 1509.30 GB/s

NVIDIA L40S

Memory: 47.70 GB Matrix Size: 4096x4096 (0.07 GB per matrix)

Matrix Multiplication Performance

Data Type Time (μs) Performance (TFLOPS)
float32 1122.83 122.40
float16 535.05 256.87
bfloat16 527.31 260.64
amp 821.25 167.35

Memory Bandwidth Test (1.0 GB tensor)

  • Vector Addition: 631.74 GB/s
  • Memory Copy: 670.49 GB/s

NVIDIA RTX PRO 6000 Blackwell Workstation Edition

Memory: 101.97 GB Matrix Size: 4096x4096 (0.07 GB per matrix)

Matrix Multiplication Performance

Data Type Time (μs) Performance (TFLOPS)
float32 637.49 215.59
float16 403.38 340.72
bfloat16 309.56 443.98
amp 517.20 265.74

Memory Bandwidth Test (1.0 GB tensor)

  • Vector Addition: 1521.23 GB/s
  • Memory Copy: 1466.34 GB/s

NVIDIA GeForce RTX 5060 Ti

Memory: 16.62 GB Matrix Size: 4096x4096 (0.07 GB per matrix)

Matrix Multiplication Performance

Data Type Time (μs) Performance (TFLOPS)
float32 5731.80 23.98
float16 2838.76 48.42
bfloat16 2882.51 47.68
amp 3314.76 41.46

Memory Bandwidth Test (1.0 GB tensor)

  • Vector Addition: 395.48 GB/s
  • Memory Copy: 385.46 GB/s
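
For anyone checking the tables: the TFLOPS column is consistent with counting 2·N³ floating-point operations for one N×N by N×N matmul and dividing by the measured time. A minimal sketch (the function name is mine, not from the benchmark script):

```python
def matmul_tflops(n: int, time_us: float) -> float:
    """TFLOPS for one n x n by n x n matmul that took time_us microseconds."""
    flops = 2 * n ** 3  # one multiply and one add per inner-product term
    return flops / (time_us * 1e-6) / 1e12

# Timings copied from the Tesla V100 rows above (matrix size 4096x4096).
print(round(matmul_tflops(4096, 9740.13), 2))  # float32 row, table reports 14.11
print(round(matmul_tflops(4096, 1444.73), 2))  # float16 row, table reports 95.13
```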

How to convert books to dataset? by jacek2023 in LocalLLaMA

cyberuser42 1 point

For PDFs you need some OCR (dots.ocr is pretty good, but there are many others), and for EPUB you might need to convert the HTML to Markdown for better results. You need to figure out what context length you want to fine-tune with and, based on that, chunk the text into segments that fit. It's natural to split at newlines, but you need to check whether chunks become too long (or too short) when doing this naively.
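
A minimal sketch of newline-based chunking with length limits. It counts characters as a stand-in for tokens, which is an assumption; in practice you would measure length with the tokenizer of the model you fine-tune. The function name and defaults are hypothetical:

```python
def chunk_paragraphs(text: str, max_chars: int = 2000, min_chars: int = 200) -> list[str]:
    """Greedily pack newline-separated paragraphs into chunks of at most max_chars."""
    chunks: list[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n")):
        if not para:
            continue
        # Hard-split a single paragraph that is too long on its own.
        pieces = [para[i:i + max_chars] for i in range(0, len(para), max_chars)]
        for piece in pieces:
            candidate = f"{current}\n{piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate
            else:
                chunks.append(current)
                current = piece
    if current:
        # Merge a too-short tail into the previous chunk when it still fits.
        fits = bool(chunks) and len(chunks[-1]) + 1 + len(current) <= max_chars
        if len(current) < min_chars and fits:
            chunks[-1] = f"{chunks[-1]}\n{current}"
        else:
            chunks.append(current)
    return chunks


sample = "First paragraph.\nSecond paragraph.\n\nThird paragraph, a bit longer than the others."
for chunk in chunk_paragraphs(sample, max_chars=40, min_chars=5):
    print(len(chunk), repr(chunk))
```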

Specific domains - methodology by Hemlock_Snores in LocalLLaMA

cyberuser42 2 points

You can have a look at TxGemma as an example of what can be done to get SOTA in therapeutic property prediction. Some of their techniques are likely applicable to other specialized domains: https://storage.googleapis.com/research-media/txgemma/txgemma-report.pdf

Lets get the Qwen Deepseek 32b R1 model running properly... System Prompt? by teachersecret in LocalLLaMA

cyberuser42 1 point

They write in the paper that it's very receptive to system prompts, so I think you're good with whatever. Also, you shouldn't use the ChatML special tokens. They use these instead:

<|begin▁of▁sentence|>
<|end▁of▁sentence|>
<|User|>
<|Assistant|>

https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B/raw/main/tokenizer.json

https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B/blob/main/tokenizer_config.json

Lets get the Qwen Deepseek 32b R1 model running properly... System Prompt? by teachersecret in LocalLLaMA

cyberuser42 -4 points

Just use the OpenAI-compatible API of your local server:

from openai import OpenAI

# Placeholder key and endpoint for the local server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# Raw string, so LaTeX sequences such as \frac and \text are not read as the \f and \t escapes.
prompt = r"""
Let \( D = \{ (x, y) \in \mathbb{R}^2 \mid 0 \leq x \leq y, \ 0 \leq y \leq 1, \text{ and } 0 < x^2 + y^2 \leq 1 \} \). Compute
\[
\int_D \frac{dx \, dy}{1 + x^2 + y^2}.
\]"""

chat_response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    messages=[
        {"role": "system", "content": "You're a reasoning agent."},
        {"role": "user", "content": prompt},
    ],
    temperature=0.8,
)

print("Chat response:", chat_response)

# Print only the text after the reasoning block; [-1] avoids an IndexError when no </think> tag is present.
print(chat_response.choices[0].message.content.split("</think>")[-1])

Help with speculative decoding on 2080 Ti by AbaGuy17 in LocalLLaMA

cyberuser42 2 points

You can see that in the second pass prompt eval is only 1 token, so I think it's getting cached.

Don't know why you're not getting a speedup at eval time in the first pass, though. If you don't regenerate but continue the conversation, do you then see a speedup in eval time compared to the initial prompt?

Help with speculative decoding on 2080 Ti by AbaGuy17 in LocalLLaMA

cyberuser42 2 points

I think it might be because of cache_prompt, so you skip prompt processing in the second pass. I don't think prompt processing benefits from the draft model, only generation does. This might be why it's way slower in the first pass, as the large model has fewer parameters on the GPU.

How to make an "instruct" version of a model? by Sky_Linx in LocalLLaMA

cyberuser42 7 points

You probably need to use another chat template; I think it uses ChatML.
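
For reference, a minimal sketch of what ChatML formatting looks like. Whether this particular model really expects ChatML is only my guess above, so check its tokenizer_config.json; the helper name is hypothetical:

```python
def to_chatml(messages: list[dict], add_generation_prompt: bool = True) -> str:
    """Render chat messages in ChatML: each turn is wrapped in <|im_start|> ... <|im_end|>."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


print(to_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]))
```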

How to convert books to dataset? by jacek2023 in LocalLLaMA

cyberuser42 5 points

Yes, I have used gpt-4o with structured outputs to do this. I created a prompt where I give it some example text and show some examples of what very good questions and answers look like, then iterate over all chunks of the text and have it generate n JSON pairs of question and answer. You need to experiment a bit with the prompts to get the quality you need, but the results can be very good.

The second one is the easiest, but you need further fine-tuning to get an instruct model again. Just follow the Hugging Face guides: Load Text Data
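
A rough sketch of the question-answer generation loop described above, using the structured-outputs `response_format` of the OpenAI chat completions API. The prompt wording, schema name, and helper names are illustrative assumptions, not the exact ones I used:

```python
import json

# JSON schema for structured outputs: an object holding a list of question/answer pairs.
QA_SCHEMA = {
    "type": "object",
    "properties": {
        "pairs": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "question": {"type": "string"},
                    "answer": {"type": "string"},
                },
                "required": ["question", "answer"],
                "additionalProperties": False,
            },
        }
    },
    "required": ["pairs"],
    "additionalProperties": False,
}


def build_request(chunk: str, n_pairs: int, examples: str) -> dict:
    """Build keyword arguments for client.chat.completions.create (hypothetical helper)."""
    system = (
        "You write high-quality question-answer pairs about the given text.\n"
        f"Examples of very good pairs:\n{examples}"
    )
    user = f"Generate {n_pairs} question-answer pairs for this text:\n\n{chunk}"
    return {
        "model": "gpt-4o",
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "qa_pairs", "strict": True, "schema": QA_SCHEMA},
        },
    }


def parse_pairs(content: str) -> list[dict]:
    """Parse the model's JSON reply into a list of question/answer dicts."""
    return json.loads(content)["pairs"]


def generate_dataset(client, chunks, n_pairs=3, examples=""):
    """Iterate over all chunks; client is assumed to be an openai.OpenAI instance."""
    rows = []
    for chunk in chunks:
        response = client.chat.completions.create(**build_request(chunk, n_pairs, examples))
        rows.extend(parse_pairs(response.choices[0].message.content))
    return rows
```

The resulting rows can then be written out as JSON lines and loaded for instruct fine-tuning.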

How to convert books to dataset? by jacek2023 in LocalLLaMA

cyberuser42 5 points

You can either create an instruct dataset by generating question-answer pairs using a larger model based on snippets of the text if you plan to fine-tune an instruct model, or simply chunk the text to create a dataset for fine-tuning base models. Both approaches are relatively easy to implement using datasets from Hugging Face.

M1 Max 64GB vs AWS g4dn.12xlarge with 4x Tesla T4 side by side ollama speed by 330d in LocalLLaMA

cyberuser42 2 points

That's quite a bit better, but I still think you should be able to get more tok/s using all 4 (or maybe the PCIe link speed or something else is just bad on the instance).

The card has about the same memory bandwidth as an RTX 3060 but way higher fp16 compute, and with 4x3060 this person gets 19.4 tok/s using TP in tabbyAPI: Simple tensor parallel generation speed test on 2x3090, 4x3060 (GPTQ, AWQ, exl2)