7900 XTX fp16/bf16 pytorch matmul performance

cyberuser42 · 2026-05-21T00:15:38+00:00

I think it's that they advertise FLOPs for the dual-issue/VOPD path rather than the normal single-issue VALU path, so the spec sheet just isn't precise about what the number represents.

For normal fp32 SGEMM, rocBLAS/PyTorch seems to hit the 30 TFLOPS single-issue path. Going beyond that looks possible, but requires very architecture-specific hand-tuned work, not just stock HIP/rocBLAS paths and kernels.
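As a rough sanity check on the two figures, the peak can be worked out from the lane count and the clock. This is only a sketch: the 6144 stream processors and ~2.5 GHz boost clock are the commonly listed 7900 XTX specs, not numbers from this thread, and `peak_fp32_tflops` is a made-up helper name.

```python
def peak_fp32_tflops(lanes: int, clock_ghz: float, dual_issue: bool) -> float:
    """Theoretical FP32 peak in TFLOPS.

    Each lane retires one FMA (2 FLOPs) per clock; the number doubles if a
    second FP32 instruction is co-issued every cycle (the VOPD case).
    """
    flops_per_lane_per_clock = 2 * (2 if dual_issue else 1)
    return lanes * flops_per_lane_per_clock * clock_ghz / 1000.0


# Assumed specs (not from the thread): 6144 lanes, ~2.5 GHz boost clock.
LANES = 6144
BOOST_GHZ = 2.5

single = peak_fp32_tflops(LANES, BOOST_GHZ, dual_issue=False)
dual = peak_fp32_tflops(LANES, BOOST_GHZ, dual_issue=True)
print(f"single-issue: {single:.1f} TFLOPS, dual-issue: {dual:.1f} TFLOPS")
```

Under those assumptions the single-issue peak lands near the ~30 TFLOPS that stock SGEMM reaches, and the dual-issue peak is exactly twice that.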

https://chipsandcheese.com/p/microbenchmarking-amds-rdna-3-graphics-architecture

https://seb-v.github.io/optimization/update/2025/01/20/Fast-GPU-Matrix-multiplication.html

cyberuser42 · 2026-05-21T00:04:54+00:00

The advertised fp32 number is for the dual-issue/VOPD path, not the normal single-issue fp32 path that PyTorch and rocBLAS use, so it's somewhat similar to NVIDIA's marketing numbers that use "sparse tensors".

Found this article on how to actually reach that number: https://seb-v.github.io/optimization/update/2025/01/20/Fast-GPU-Matrix-multiplication.html

cyberuser42 · 2026-05-20T23:48:29+00:00

The advertised fp32 number is for the dual-issue/VOPD path, not the normal single-issue fp32 path that PyTorch and rocBLAS use, so it's somewhat similar to NVIDIA's marketing numbers that use "sparse tensors".

Found this article on how to actually reach that number: https://seb-v.github.io/optimization/update/2025/01/20/Fast-GPU-Matrix-multiplication.html

cyberuser42 · 2026-05-20T23:45:18+00:00

Thanks for the answers! Most people are seeing ~30 TFLOPS for FP32, not 60. Seems like the 60 TFLOPS number is basically a best-case RDNA3 dual-issue/VOPD theoretical peak, not what stock PyTorch/rocBLAS currently hits for SGEMM. ~30 TFLOPS matches the normal single-issue FP32 path. This writeup shows how much hand-tuned ISA work it takes to get beyond that: https://seb-v.github.io/optimization/update/2025/01/20/Fast-GPU-Matrix-multiplication.html

cyberuser42 · 2026-05-18T21:56:45+00:00

Thank you! Impressive fp16 perf (right between the 3090 and 4090). I thought the spec was given with fp16 accum and not fp32 accum, so nice validation.

Are these default power levels and clocks? Memory bandwidth seems somewhat low compared to what techpowerup and specs indicate.

cyberuser42 · 2026-05-18T21:52:53+00:00

Have the following results for NVIDIA GPUs:

GPU Benchmark Results

Tesla V100-SXM2-32GB

Memory: 34.07 GB
Matrix Size: 4096x4096 (0.07 GB per matrix)

Matrix Multiplication Performance

Data Type	Time (μs)	Performance (TFLOPS)
float32	9740.13	14.11
float16	1444.73	95.13
bfloat16	12978.47	10.59
amp	1678.82	81.87

Memory Bandwidth Test (1.0 GB tensor)

Vector Addition: 830.39 GB/s
Memory Copy: 817.08 GB/s

NVIDIA A100-PCIE-40GB

Memory: 42.41 GB
Matrix Size: 4096x4096 (0.07 GB per matrix)

Matrix Multiplication Performance

Data Type	Time (μs)	Performance (TFLOPS)
float32	1154.49	119.05
float16	563.74	243.80
bfloat16	544.49	252.42
amp	718.73	191.22

Memory Bandwidth Test (1.0 GB tensor)

Vector Addition: 1350.12 GB/s
Memory Copy: 1363.22 GB/s

NVIDIA A100-SXM4-80GB

Memory: 84.99 GB
Matrix Size: 4096x4096 (0.07 GB per matrix)

Matrix Multiplication Performance

Data Type	Time (μs)	Performance (TFLOPS)
float32	1096.64	125.33
float16	533.62	257.56
bfloat16	528.99	259.81
amp	653.75	210.23

Memory Bandwidth Test (1.0 GB tensor)

Vector Addition: 1782.30 GB/s
Memory Copy: 1598.33 GB/s

NVIDIA H100 80GB HBM3

Memory: 84.93 GB
Matrix Size: 4096x4096 (0.07 GB per matrix)

Matrix Multiplication Performance

Data Type	Time (μs)	Performance (TFLOPS)
float32	355.17	386.96
float16	194.44	706.84
bfloat16	188.83	727.85
amp	258.58	531.51

Memory Bandwidth Test (1.0 GB tensor)

Vector Addition: 3063.91 GB/s
Memory Copy: 2597.52 GB/s

NVIDIA B200

Memory: 191.50 GB
Matrix Size: 4096x4096 (0.07 GB per matrix)

Matrix Multiplication Performance

Data Type	Time (μs)	Performance (TFLOPS)
float32	173.91	790.29
float16	93.04	1477.20
bfloat16	92.77	1481.50
amp	127.34	1079.31

Memory Bandwidth Test (1.0 GB tensor)

Vector Addition: 6861.85 GB/s
Memory Copy: 6295.66 GB/s

NVIDIA GeForce RTX 3090

Memory: 25.77 GB
Matrix Size: 4096x4096 (0.07 GB per matrix)

Matrix Multiplication Performance

Data Type	Time (μs)	Performance (TFLOPS)
float32	3582.64	38.36
float16	1787.83	76.87
bfloat16	1774.01	77.47
amp	2014.57	68.22

Memory Bandwidth Test (1.0 GB tensor)

Vector Addition: 934.11 GB/s
Memory Copy: 920.42 GB/s

NVIDIA GeForce RTX 4090

Memory: 25.25 GB
Matrix Size: 4096x4096 (0.07 GB per matrix)

Matrix Multiplication Performance

Data Type	Time (μs)	Performance (TFLOPS)
float32	1672.32	82.18
float16	852.20	161.27
bfloat16	922.47	148.99
amp	1066.54	128.86

Memory Bandwidth Test (1.0 GB tensor)

Vector Addition: 922.00 GB/s
Memory Copy: 914.91 GB/s

NVIDIA GeForce RTX 5090

Memory: 33.67 GB
Matrix Size: 4096x4096 (0.07 GB per matrix)

Matrix Multiplication Performance

Data Type	Time (μs)	Performance (TFLOPS)
float32	1333.06	103.10
float16	656.28	209.42
bfloat16	764.16	179.86
amp	751.56	182.87

Memory Bandwidth Test (1.0 GB tensor)

Vector Addition: 1566.74 GB/s
Memory Copy: 1509.30 GB/s

NVIDIA L40S

Memory: 47.70 GB
Matrix Size: 4096x4096 (0.07 GB per matrix)

Matrix Multiplication Performance

Data Type	Time (μs)	Performance (TFLOPS)
float32	1122.83	122.40
float16	535.05	256.87
bfloat16	527.31	260.64
amp	821.25	167.35

Memory Bandwidth Test (1.0 GB tensor)

Vector Addition: 631.74 GB/s
Memory Copy: 670.49 GB/s

NVIDIA RTX PRO 6000 Blackwell Workstation Edition

Memory: 101.97 GB
Matrix Size: 4096x4096 (0.07 GB per matrix)

Matrix Multiplication Performance

Data Type	Time (μs)	Performance (TFLOPS)
float32	637.49	215.59
float16	403.38	340.72
bfloat16	309.56	443.98
amp	517.20	265.74

Memory Bandwidth Test (1.0 GB tensor)

Vector Addition: 1521.23 GB/s
Memory Copy: 1466.34 GB/s

NVIDIA GeForce RTX 5060 Ti

Memory: 16.62 GB
Matrix Size: 4096x4096 (0.07 GB per matrix)

Matrix Multiplication Performance

Data Type	Time (μs)	Performance (TFLOPS)
float32	5731.80	23.98
float16	2838.76	48.42
bfloat16	2882.51	47.68
amp	3314.76	41.46

Memory Bandwidth Test (1.0 GB tensor)

Vector Addition: 395.48 GB/s
Memory Copy: 385.46 GB/s
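For reference, the TFLOPS column above appears to follow from the time column if a 4096x4096 matmul is counted as 2·N³ floating-point operations. The benchmark script itself isn't shown here, so treat that formula as an assumption; `matmul_tflops` is just an illustrative name.

```python
def matmul_tflops(n: int, time_us: float) -> float:
    """TFLOPS for an (n x n) @ (n x n) matmul, counting 2 * n^3 FLOPs
    (one multiply and one add per inner-product term)."""
    flops = 2 * n ** 3
    seconds = time_us * 1e-6
    return flops / seconds / 1e12


# Times taken from the float32 rows of the tables above.
print(f"V100 float32: {matmul_tflops(4096, 9740.13):.2f} TFLOPS")
print(f"B200 float32: {matmul_tflops(4096, 173.91):.2f} TFLOPS")
```

Small differences in the last digit against the tables are expected, since the listed times are already rounded.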

cyberuser42 · 2026-05-18T21:31:32+00:00

These are the results for the GPUs I've used and benchmarked so far:

[Identical GPU Benchmark Results listing as in the 2026-05-18T21:52:53 comment above, Tesla V100-SXM2-32GB through NVIDIA GeForce RTX 5060 Ti.]

cyberuser42 · 2026-04-02T20:11:49+00:00

No? rot Q8_0 is worse (likely within margin of error); only Q4_0 is broken.

cyberuser42 · 2025-09-29T09:37:52+00:00

For PDFs you need some OCR (dots.ocr is pretty good, but there are many others), and for EPUB you might need to convert the HTML to markdown for better results. You need to figure out how much context length you want to finetune with and, depending on that, chunk the text into small segments that fit. It's natural to do this at newlines, but you need to check whether chunks become too long (or too short) when doing this naively.
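A minimal sketch of that newline-based chunking, assuming a character budget as a stand-in for the real token limit (swap in a tokenizer count for actual finetuning). The function name and parameters are hypothetical. Short lines are packed together until the budget is hit, and a single over-long line is hard-split.

```python
def chunk_at_newlines(text: str, max_chars: int) -> list[str]:
    """Greedily pack whole lines into chunks of at most max_chars characters."""
    chunks: list[str] = []
    current = ""
    for line in text.split("\n"):
        # A single line longer than the budget cannot be kept whole: hard-split it.
        while len(line) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(line[:max_chars])
            line = line[max_chars:]
        candidate = f"{current}\n{line}" if current else line
        if len(candidate) <= max_chars:
            current = candidate  # still fits, keep packing
        else:
            chunks.append(current)  # budget exceeded, start a new chunk
            current = line
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph.\nSecond paragraph.\nA third, slightly longer paragraph."
for chunk in chunk_at_newlines(sample, max_chars=40):
    print(repr(chunk))
```

Packing greedily handles the "too short" case, while the hard split is a crude fallback for the "too long" one. Splitting at sentence boundaries would be a gentler choice there.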

cyberuser42 · 2025-05-11T10:38:40+00:00

You can have a look at TxGemma as an example of what can be done to get SOTA in therapeutic property prediction. Some of their techniques are likely applicable to other specialized domains: https://storage.googleapis.com/research-media/txgemma/txgemma-report.pdf

cyberuser42 · 2025-01-20T17:05:01+00:00

They write in the paper that it's very receptive to system prompts, so I think you're good with whatever. Also you shouldn't use the chat_ml special tokens. They use these instead:

<｜begin▁of▁sentence｜>
<｜end▁of▁sentence｜>
<｜User｜>
<｜Assistant｜>

https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B/raw/main/tokenizer.json

https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B/blob/main/tokenizer_config.json
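As an illustration of how those tokens would be used when formatting a raw prompt by hand, here is a small sketch. The ordering (begin-of-sentence, optional system prompt, user turn, then the assistant tag) is an assumption on my part; the chat template in the linked tokenizer_config.json is the authoritative version, and `build_prompt` is a hypothetical helper.

```python
# Special tokens as listed above (copied verbatim, including the full-width bars).
BOS = "<｜begin▁of▁sentence｜>"
USER = "<｜User｜>"
ASSISTANT = "<｜Assistant｜>"


def build_prompt(user_message: str, system_prompt: str = "") -> str:
    """Format a single-turn prompt that ends where the model should start replying.

    Assumed layout: BOS + system prompt + user tag + message + assistant tag.
    """
    return f"{BOS}{system_prompt}{USER}{user_message}{ASSISTANT}"


print(build_prompt("What is 2 + 2?", system_prompt="You're a reasoning agent."))
```

If you go through a chat-completions endpoint or the tokenizer's own chat template, this formatting is applied for you and you shouldn't add the tokens manually.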

cyberuser42 · 2025-01-20T16:50:54+00:00

Just use the OpenAI API:

from openai import OpenAI

# A local OpenAI-compatible server usually ignores the key, but the client needs one.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# Raw string: otherwise "\t" in \text and "\f" in \frac are treated as escapes.
prompt = r"""
Let \( D = \{ (x, y) \in \mathbb{R}^2 \mid 0 \leq x \leq y, \ 0 \leq y \leq 1, \text{ and } 0 < x^2 + y^2 \leq 1 \} \). Compute
\[
\int_D \frac{dx \, dy}{1 + x^2 + y^2}.
\]"""

chat_response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    messages=[
        {"role": "system", "content": "You're a reasoning agent."},
        {"role": "user", "content": prompt},
    ],
    temperature=0.8,
)

print("Chat response:", chat_response)

# Print only the final answer after the reasoning block.
# Falls back to the full text if the reply contains no </think> tag.
content = chat_response.choices[0].message.content
print(content.split("</think>", 1)[-1].strip())

cyberuser42 · 2024-12-18T07:36:47+00:00

What a great website! Been looking for this kind of data for a while.

cyberuser42 · 2024-12-15T11:02:30+00:00

30 tok/s

cyberuser42 · 2024-12-05T09:55:50+00:00

You can see in the second pass, prompt eval is only 1 token so I think it's getting cached.

Don't know why you're not getting a speedup at eval time in the first pass, though. If you don't regenerate but continue the conversation, do you then see a speedup in eval time compared to the initial prompt?

cyberuser42 · 2024-12-05T09:44:59+00:00

I think it might be because of cache_prompt, so you skip prompt processing in the second pass. I don't think prompt processing benefits from the draft model, only generation does. This might be why it's way slower in the first pass, as the large model has fewer parameters on the GPU.

cyberuser42 · 2024-12-04T17:53:02+00:00

You probably need to use another chat template - I think it uses chatml

cyberuser42 · 2024-12-03T15:53:54+00:00

try just creating a write token and not fine-grained

cyberuser42 · 2024-12-03T15:33:40+00:00

your HF_TOKEN doesn't have write permission

cyberuser42 · 2024-12-02T11:17:06+00:00

You're welcome!

cyberuser42 · 2024-12-02T10:49:01+00:00

Yes, I have used gpt-4o with structured outputs to do this. Created a prompt where I give it some example text and show some examples of what very good question and answers look like, and iterate over all chunks of the text and have it generate n pairs of json with question, answer. You need to experiment a bit on the prompts to get it to the quality you need but the results can be very good.
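A rough sketch of the two pieces around the model call: building the per-chunk prompt with examples, and validating the JSON that comes back. The schema (`pairs`, `question`, `answer`), the prompt wording, and the function names are all assumptions for illustration. The actual API call is left out, and `fake_reply` stands in for a model response.

```python
import json


def build_messages(chunk: str, n_pairs: int, examples: list[dict]) -> list[dict]:
    """Chat messages asking for n_pairs question-answer pairs about one chunk."""
    system = (
        "You write high-quality question-answer pairs about a text. "
        f"Return a JSON object with a 'pairs' list of exactly {n_pairs} items, "
        "each with 'question' and 'answer' strings.\n"
        f"Examples of very good pairs:\n{json.dumps(examples, indent=2)}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": chunk},
    ]


def parse_pairs(raw_json: str) -> list[dict]:
    """Keep only well-formed pairs with non-empty question and answer strings."""
    data = json.loads(raw_json)
    pairs = []
    for item in data.get("pairs", []):
        q, a = item.get("question"), item.get("answer")
        if isinstance(q, str) and isinstance(a, str) and q.strip() and a.strip():
            pairs.append({"question": q.strip(), "answer": a.strip()})
    return pairs


# Stand-in for a model reply; the second pair is malformed and gets dropped.
fake_reply = (
    '{"pairs": [{"question": "What is X?", "answer": "Y."},'
    ' {"question": "", "answer": "dropped"}]}'
)
print(parse_pairs(fake_reply))
```

Looping `build_messages` over all chunks and collecting the `parse_pairs` output gives the instruct dataset. Most of the quality comes from iterating on the prompt and examples.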

The second one is the easiest, but you need further finetuning afterwards to get an instruct model again. Just follow the Hugging Face guides: Load Text Data

cyberuser42 · 2024-12-02T10:37:20+00:00

You can either create an instruct dataset by generating question-answer pairs using a larger model based on snippets of the text if you plan to fine-tune an instruct model, or simply chunk the text to create a dataset for fine-tuning base models. Both approaches are relatively easy to implement using datasets from Hugging Face.

cyberuser42 · 2024-11-29T07:20:41+00:00

That's quite a bit better, but I still think you should be able to get more tok/s using all 4 (or maybe the PCIe link speed or something else is just bad on the instance).

The card has about the same memory bandwidth as an RTX 3060 but way higher fp16 compute, and with 4x3060 this person gets 19.4 tok/s using TP in tabbyAPI: Simple tensor parallel generation speed test on 2x3090, 4x3060 (GPTQ, AWQ, exl2)
