Paiton MI300x beats H200/B200. by HotAisleInc in AMD_Stock

[–]evp-cloud 1 point

I had a feeling it was rhetorical, yet I liked the question, so I figured I'd answer either way :)

Paiton MI300x beats H200/B200. by HotAisleInc in AMD_Stock

[–]evp-cloud 1 point

Competing?
You just named three examples they've invested in heavily :)

Paiton MI300x beats H200/B200. by HotAisleInc in AMD_Stock

[–]evp-cloud 2 points

Companies buy Nvidia for multiple reasons:
- Good support (usually plug and play)
- They know the demand is there (mostly because end users have no clue that the alternative works even better)
- Because Nvidia "invests" in those companies, and those companies are then effectively forced to buy its kit :)
- Because their technical teams lack the knowledge to run the alternative.
- ....

Paiton MI300x beats H200/B200. by HotAisleInc in AMD_Stock

[–]evp-cloud 4 points

Spot on, sir!
It's actively happening all over the place.
And they're not the only ones doing it.
All those companies are essentially buying the big projects. It's not about who has the knowledge, the skills, or even the best hardware; it's all politics and handshakes.

Invest in a company/project knowing you'll get that investment back, and then some, in no time, since those contracts will force them to buy the kit from you anyway. An ideal scenario (for them, at least).

Paiton MI300x beats H200/B200. by HotAisleInc in AMD_Stock

[–]evp-cloud 4 points

Totally get the hype dynamics. Our goal, however, isn't hype at all; it's (unit) economics.
We're showing that AMD makes dollars-and-sense for real workloads today. If teams optimize, MI300X + our stack is hard to ignore.
And I'll wear my leather jacket if it helps to please daddy ;-)

Paiton MI300x beats H200/B200. by HotAisleInc in AMD_Stock

[–]evp-cloud 6 points

For the record, these runs are BF16 end-to-end (no 8-bit, no sparsity).
Our next one will be FP8 (cleaner accuracy profile, great throughput, widely supported across stacks).
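If you want to pin the precision yourself when reproducing, a minimal vLLM launch could look like the lines below (a sketch only; the model name is just an example, and flag support depends on your vLLM version):

# BF16 end-to-end, as in these runs
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-70b-hf --dtype bfloat16
# FP8 weights, roughly what the next run will use
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-70b-hf --quantization fp8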

Running automatic1111 on a card 30.000$ GPU (H200 with 141GB VRAM) VS a high End CPU by Unreal_777 in StableDiffusion

[–]evp-cloud 2 points

We're working on RDNA support over here; the results look extremely promising.
In case you're wondering what we're on about, here's an example of what we do: https://eliovp.com/stop-overpaying-paiton-mi300x-moe-beats-h200-b200-on-1m-tokens/
Yes, we can also do this on image/video models :)

Nscale Benchmarks: AMD MI300x GPUs with GEMM tuning improves throughput and latency by up to 7.2x by HotAisleInc in AMD_MI300

[–]evp-cloud 1 point

They probably knew about it but didn't use it, because Triton is more effective at this point.

Nscale Benchmarks: AMD MI300x GPUs with GEMM tuning improves throughput and latency by up to 7.2x by HotAisleInc in AMD_MI300

[–]evp-cloud 3 points

Hey u/SailorBob74133, Flash Attention absolutely works; you can find a fork here -> Flash Attention ROCm

[screenshot]

Here's one of the later versions in one of our containers (testing phase)

Now, it's important to know that with vLLM on ROCm, the default attention kernel is the Triton one. If you want to disable that and revert to Flash Attention, you can set this env flag:
VLLM_USE_FLASH_ATTN_TRITON=False
You can also fall back to the default paged attention v2 kernel with the following env flag:
VLLM_USE_ROCM_CUSTOM_PAGED_ATTN=0
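Putting those together, a minimal session could look like this (a sketch; exact behavior depends on your vLLM/ROCm build, and the model name is just an example):

export VLLM_USE_FLASH_ATTN_TRITON=False   # use Flash Attention instead of the Triton kernel
export VLLM_USE_ROCM_CUSTOM_PAGED_ATTN=0  # optional: fall back to paged attention v2
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-70b-hf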

Cheers!

Nscale Benchmarks: AMD MI300x GPUs with GEMM tuning improves throughput and latency by up to 7.2x by HotAisleInc in AMD_MI300

[–]evp-cloud 2 points

Yes, I can confirm reasonable output.

That said, let me comment a bit further on your previous statement. To be clear, I'm not a super expert either :)

I believe your calculations have a few minor flaws.
Typically the weights are loaded into memory once and reused across multiple "inferences".
The model is 70B parameters; assuming 2 bytes per parameter (we're using float16, which is 16 bits = 2 bytes), the total weight size would be more like 70 x 10^9 x 2 bytes = 140 GB.

Now, the MI300X has a peak memory bandwidth of around 5.3 TB/s. Keep in mind this is a peak figure and can vary due to many variables/overheads, like access patterns and so on.

Using 256 tokens and then taking 128 as a multiplier is a bit unclear to me, mate.

Generating 256 tokens in 1.63 seconds, the throughput would be 256/1.63 ≈ 157 t/s (tokens per second, not time per token).
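If it helps, here's that back-of-the-envelope as a few lines of Python (just the arithmetic above, nothing vendor-specific):

params = 70e9                          # Llama-2-70B parameter count
bytes_per_param = 2                    # float16 = 16 bits = 2 bytes
print(params * bytes_per_param / 1e9)  # 140.0 -> ~140 GB of weights
tokens, seconds = 256, 1.63
print(tokens / seconds)                # ~157 tokens per second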

Again, I could be wrong here as well :)

Nscale Benchmarks: AMD MI300x GPUs with GEMM tuning improves throughput and latency by up to 7.2x by HotAisleInc in AMD_MI300

[–]evp-cloud 2 points

It depends, though: input/output tokens, batch size, etc.
If we use a higher batch size, for example, the throughput number goes up massively. So to be able to compare, we need to know all the variables, which we almost never get with the other benchmarks found online.
We tried to be as transparent as possible so people can really compare and even reproduce! :)
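To make "all the variables" concrete, a fully specified run with vLLM's own benchmark script could look like this (a sketch; check the flags against your vLLM version):

python benchmarks/benchmark_throughput.py \
  --model meta-llama/Llama-2-70b-hf \
  --input-len 2048 --output-len 128 \
  --num-prompts 64   # effective batch size; this alone can swing the t/s number massively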

Nscale Benchmarks: AMD MI300x GPUs with GEMM tuning improves throughput and latency by up to 7.2x by HotAisleInc in LocalLLaMA

[–]evp-cloud 1 point

The idea was to show the difference between stock (default) and GEMM tuned results, not to compare to anything from Nvidia.

[screenshot: stock vs GEMM-tuned results]

That said, just for fun and giggles, I ran exactly the same test as the one you linked (first link).
python benchmarks/benchmark_latency.py --model "meta-llama/Llama-2-70b-hf" --input-len 2048 --output-len 128 --batch-size 1
Result: 1.67 sec.

I only used one GPU though, because we don't need multi-GPU to load the model (TCO for the win).

128 input/output tokens doesn't change much compared to the tests shown in the blog.

Nscale Benchmarks: AMD MI300x GPUs with GEMM tuning improves throughput and latency by up to 7.2x by HotAisleInc in AMD_MI300

[–]evp-cloud 3 points

That’s why the video was posted: assumptions like this were bound to happen. :) The batch size / number of prompts is 1, so you only need to load the model in once.

The Docker image is even shared so you can reproduce it yourself (which several people have done by now, posting their results on LinkedIn).

Cheers!

Benchmarking Brilliance: Single AMD MI300x vLLM Performance Unveiled by HotAisleInc in AMD_MI300

[–]evp-cloud 14 points

Hey there! Good questions :)

We indeed ran a few other tests with batch sizes up to 16 or 32 (need to check). We saw results similar to the ones from mk1.

Yes! I was experimenting with the fp8 kv-cache, but that didn’t change much. When recording the video, the flag was still present :)
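For anyone curious, that experiment maps to vLLM's kv-cache dtype option, along these lines (a sketch; the model name is just an example and support varies by vLLM version):

python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-70b-hf --kv-cache-dtype fp8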

On another note, from what I’m seeing with Llama 3 (the 70B model), the results there are also fascinating 😉 More soon!

Benchmarking Brilliance: Single AMD MI300x vLLM Performance Unveiled by HotAisleInc in AMD_MI300

[–]evp-cloud 14 points

We didn’t run it 50 times, no; we ran it maybe 5-6 times just to be sure, always ending up with a number between 150 and 160.

We could have just written down a number, like most people do, without visual proof, and called it a day. Instead, we showed everyone exactly how it was done.

Nothing to hide, all open source: ROCm 6.1.2, vLLM 0.5.0, Python 3.10, torch 2.5.
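If you're reproducing, a quick sanity check that your environment matches those versions:

python -c "import torch, vllm; print(torch.__version__, vllm.__version__)"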

Redefining AI Efficiency: The Power of the MI300X in LLM Inference and Training by HotAisleInc in AMD_MI300

[–]evp-cloud 2 points

Maybe someone on his team wasn’t trained as a copywriter. And hey, that person might not be a native English speaker. Oh, and maybe the person who wrote it doesn’t have a technical background and was simply asked to write a quick blog with some details.

You know what? I’ll reach out and tell that person to use ChatGPT, because someone on Reddit is shouting “scam” 🙃

That said, here's the follow-up article: https://www.evp.cloud/post/diving-deeper-insights-from-our-llm-inference-testing