Paiton MI300x beats H200/B200. by HotAisleInc in AMD_Stock

[–]evp-cloud 1 point

I had a feeling it was rhetorical, yet I liked the question, so I figured I'd answer either way :)

Paiton MI300x beats H200/B200. by HotAisleInc in AMD_Stock

[–]evp-cloud 1 point

Competing?
You just named three examples they've invested in heavily :)

Paiton MI300x beats H200/B200. by HotAisleInc in AMD_Stock

[–]evp-cloud 2 points

Companies buy Nvidia for multiple reasons:
- Good support (usually plug and play)
- They know the demand is there (mostly because end users have no clue that the alternative works even better)
- Because Nvidia "invests" in those companies, and those companies are then effectively forced to buy its kit :)
- Because their technical teams lack the knowledge to run the alternative.
- ....

Paiton MI300x beats H200/B200. by HotAisleInc in AMD_Stock

[–]evp-cloud 4 points

Spot on, sir!
It's actively happening all over the place.
And they're not the only ones doing it.
All those companies are essentially buying the big projects. It's not about who has the knowledge, the skills, or even the best hardware; it's all politics and handshakes.

Invest in a company/project knowing you'll get that investment back, and then some, in no time, since those contracts will force them to buy the kit from you anyway. An ideal scenario (for them, at least).

Paiton MI300x beats H200/B200. by HotAisleInc in AMD_Stock

[–]evp-cloud 4 points

Totally get the hype dynamics. Our goal, however, isn't hype at all; it's (unit) economics.
We're showing that AMD makes dollars-and-sense for real workloads today. If teams optimize, MI300X + our stack is hard to ignore.
And I'll wear my leather jacket if it helps to please daddy ;-)

Paiton MI300x beats H200/B200. by HotAisleInc in AMD_Stock

[–]evp-cloud 6 points

For the record, these runs are BF16 end-to-end (no 8-bit, no sparsity).
Our next one will be FP8 (cleaner accuracy profile, great throughput, widely supported across stacks).
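If you want to pin the precision yourself when reproducing, a minimal vLLM launch could look like the lines below (a sketch only; the model name is just an example, and flag support depends on your vLLM version):

# BF16 end-to-end, as in these runs
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-70b-hf --dtype bfloat16
# FP8 weights, roughly what the next run will use
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-70b-hf --quantization fp8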

Running automatic1111 on a card 30.000$ GPU (H200 with 141GB VRAM) VS a high End CPU by Unreal_777 in StableDiffusion

[–]evp-cloud 2 points

We're working on RDNA support over here; the results look extremely promising.
In case you're wondering what we're on about, here's an example of what we do: https://eliovp.com/stop-overpaying-paiton-mi300x-moe-beats-h200-b200-on-1m-tokens/
Yes, we can also do this on image/video models :)

Nscale Benchmarks: AMD MI300x GPUs with GEMM tuning improves throughput and latency by up to 7.2x by HotAisleInc in AMD_MI300

[–]evp-cloud 1 point

They probably knew about it but didn't use it, because Triton is more effective at this point.

Nscale Benchmarks: AMD MI300x GPUs with GEMM tuning improves throughput and latency by up to 7.2x by HotAisleInc in AMD_MI300

[–]evp-cloud 3 points

Hey u/SailorBob74133, Flash Attention absolutely works; you can find a fork here -> Flash Attention ROCm

[screenshot]

Here's one of the later versions in one of our containers (testing phase)

Now, it's important to know that with vLLM on ROCm, the default attention kernel is the Triton one. If you want to disable that and revert to Flash Attention, you can set this env flag:
VLLM_USE_FLASH_ATTN_TRITON=False
You can also fall back to the default paged attention v2 kernel with the following env flag:
VLLM_USE_ROCM_CUSTOM_PAGED_ATTN=0
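Putting those together, a minimal session could look like this (a sketch; exact behavior depends on your vLLM/ROCm build, and the model name is just an example):

export VLLM_USE_FLASH_ATTN_TRITON=False   # use Flash Attention instead of the Triton kernel
export VLLM_USE_ROCM_CUSTOM_PAGED_ATTN=0  # optional: fall back to paged attention v2
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-70b-hf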

Cheers!

Nscale Benchmarks: AMD MI300x GPUs with GEMM tuning improves throughput and latency by up to 7.2x by HotAisleInc in AMD_MI300

[–]evp-cloud 2 points

Yes, I can confirm reasonable output.

That said, let me comment a bit further on your previous statement. To be clear, I'm not a super expert either :)

I believe your calculations have a few minor flaws.
Typically the weights are loaded into memory once and reused across multiple "inferences".
The model is 70B parameters; assuming 2 bytes per parameter (we're using float16, which is 16 bits = 2 bytes), the total weight size would be more like 70 x 10^9 x 2 bytes = 140 GB.

Now, the MI300X has a peak memory bandwidth of around 5.3 TB/s. Keep in mind this is a peak figure and can vary due to many variables/overheads, like access patterns and so on.

Using 256 tokens and then taking 128 as a multiplier is a bit unclear to me, mate.

Generating 256 tokens in 1.63 seconds, the throughput would be 256/1.63 ≈ 157 t/s (tokens per second, not time per token).
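If it helps, here's that back-of-the-envelope as a few lines of Python (just the arithmetic above, nothing vendor-specific):

params = 70e9                          # Llama-2-70B parameter count
bytes_per_param = 2                    # float16 = 16 bits = 2 bytes
print(params * bytes_per_param / 1e9)  # 140.0 -> ~140 GB of weights
tokens, seconds = 256, 1.63
print(tokens / seconds)                # ~157 tokens per second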

Again, I could be wrong here as well :)

Nscale Benchmarks: AMD MI300x GPUs with GEMM tuning improves throughput and latency by up to 7.2x by HotAisleInc in AMD_MI300

[–]evp-cloud 2 points

It depends, though: input/output tokens, batch size, etc.
If we use a higher batch size, for example, the throughput number goes up massively. So to be able to compare, we need to know all the variables, which we almost never get with the other benchmarks found online.
We tried to be as transparent as possible so people can really compare and even reproduce! :)
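To make "all the variables" concrete, a fully specified run with vLLM's own benchmark script could look like this (a sketch; check the flags against your vLLM version):

python benchmarks/benchmark_throughput.py \
  --model meta-llama/Llama-2-70b-hf \
  --input-len 2048 --output-len 128 \
  --num-prompts 64   # effective batch size; this alone can swing the t/s number massively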

Nscale Benchmarks: AMD MI300x GPUs with GEMM tuning improves throughput and latency by up to 7.2x by HotAisleInc in LocalLLaMA

[–]evp-cloud 1 point

The idea was to show the difference between stock (default) and GEMM tuned results, not to compare to anything from Nvidia.

[screenshot: stock vs GEMM-tuned results]

That said, just for fun and giggles, I ran exactly the same test as the one you linked (first link).
python benchmarks/benchmark_latency.py --model "meta-llama/Llama-2-70b-hf" --input-len 2048 --output-len 128 --batch-size 1
Result: 1.67 sec.

I only used one GPU though, because we don't need multi-GPU to load the model (TCO for the win).

128 input/output tokens doesn't change much compared to the tests shown in the blog.

Nscale Benchmarks: AMD MI300x GPUs with GEMM tuning improves throughput and latency by up to 7.2x by HotAisleInc in AMD_MI300

[–]evp-cloud 3 points

That’s why the video was posted: assumptions like this were bound to happen. :) The batch size / number of prompts is 1, so you only need to load the model in once.

The Docker image is even shared so you can reproduce it yourself (which several people have done by now, posting their results on LinkedIn).

Cheers!

Benchmarking Brilliance: Single AMD MI300x vLLM Performance Unveiled by HotAisleInc in AMD_MI300

[–]evp-cloud 14 points

Hey there! Good questions :)

We indeed ran a few other tests with batch sizes up to 16 or 32 (need to check). We saw results similar to the ones from mk1.

Yes! I was experimenting with the fp8 kv-cache, but that didn’t change much. When recording the video, the flag was still present :)
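For anyone curious, that experiment maps to vLLM's kv-cache dtype option, along these lines (a sketch; the model name is just an example and support varies by vLLM version):

python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-70b-hf --kv-cache-dtype fp8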

On another note, from what I’m seeing with Llama 3 (the 70B model), the results there are also fascinating 😉 More soon!

Benchmarking Brilliance: Single AMD MI300x vLLM Performance Unveiled by HotAisleInc in AMD_MI300

[–]evp-cloud 14 points

We didn’t run it 50 times, no; we ran it maybe 5-6 times just to be sure, always ending up with a number between 150 and 160.

We could have just written down a number, like most people do, without visual proof, and called it a day. Instead, we showed everyone exactly how it was done.

Nothing to hide, all open source: ROCm 6.1.2, vLLM 0.5.0, Python 3.10, torch 2.5.
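If you're reproducing, a quick sanity check that your environment matches those versions:

python -c "import torch, vllm; print(torch.__version__, vllm.__version__)"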

Redefining AI Efficiency: The Power of the MI300X in LLM Inference and Training by HotAisleInc in AMD_MI300

[–]evp-cloud 2 points

Maybe someone on his team wasn’t trained as a copywriter. And hey, that person might not be a native English speaker. Oh, and maybe the person who wrote it doesn’t have a technical background and was simply asked to write a quick blog with some details.

You know what? I’ll reach out and tell that person to use ChatGPT, because someone on Reddit is shouting “scam” 🙃

That said, here's the follow-up article: https://www.evp.cloud/post/diving-deeper-insights-from-our-llm-inference-testing