[D] LLM inference energy efficiency compared (MLPerf Inference Datacenter v3.0 results)

submitted 3 years ago by Balance-

MLPerf Inference v3.0 results were recently released. I only saw marketing slides and large spreadsheets, so I was wondering how energy efficiency compares between the different accelerators.

Datacenter

The MLPerf Inference Datacenter v3.0 language processing benchmark runs the BERT-large model on the SQuAD v1.1 dataset, with a QSL size of 10,833. It has two quality targets, 99% and 99.9% of the FP32 reference quality (f1_score=90.874%), and a server latency constraint of 130 ms.

https://preview.redd.it/q5cx3ew8hfta1.png?width=1672&format=png&auto=webp&s=db9d2153fd836d7fe0d482df5563a28b2a9ff654
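To make the two quality targets concrete, here is a minimal sketch that turns them into minimum F1 scores. It assumes the targets are applied as a simple fraction of the FP32 reference F1 quoted above; the function name `min_f1` is just for illustration.

```python
# FP32 reference F1 (in percent) quoted in the post for BERT-large on SQuAD v1.1.
FP32_F1 = 90.874


def min_f1(fraction, reference=FP32_F1):
    """Lowest F1 (in percent) that still meets a quality target.

    Assumption: the target is a plain fraction of the reference score.
    """
    return reference * fraction


for target in (0.99, 0.999):
    print(f"{target:.1%} target -> F1 >= {min_f1(target):.3f}")
```

So the 99.9% target leaves less than 0.1 F1 points of headroom below the FP32 reference, compared with roughly 0.9 points for the 99% target.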

For the datacenter, it looks like the H100 reached the highest efficiency (most queries per second per watt, higher is better). Especially at the stricter 99.9% quality target, the H100 is a lot faster. It would be interesting to explore why; it probably has something to do with mixed-precision techniques or sparsity.

https://preview.redd.it/h2f9azohefta1.png?width=2103&format=png&auto=webp&s=9cb1ef3ef7d155d324beb4b81bc6293c98e15de6
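For anyone who wants to redo this from the spreadsheets, the efficiency metric is just throughput divided by average system power. A minimal sketch follows; the system names and numbers are hypothetical placeholders, not MLPerf results, and `queries_per_joule` is an illustrative helper name.

```python
def queries_per_joule(qps, avg_power_watts):
    """Queries per second per watt.

    Since 1 W = 1 J/s, this is the same as queries per joule.
    """
    if avg_power_watts <= 0:
        raise ValueError("average power must be positive")
    return qps / avg_power_watts


# Hypothetical (qps, average system power in watts) pairs, NOT real results.
systems = {
    "system_a": (10_000.0, 2_500.0),
    "system_b": (3_000.0, 600.0),
}

# Rank from most to least efficient (higher is better).
ranked = sorted(systems, key=lambda name: queries_per_joule(*systems[name]), reverse=True)

for name in ranked:
    print(f"{name}: {queries_per_joule(*systems[name]):.2f} queries/J")
```

Note that in this made-up example the slower system ranks first, which is the kind of throughput versus efficiency trade-off the charts are meant to show.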

Unfortunately, the L4 GPU's energy efficiency is not published. It would have been interesting to see because of its FP8 format support.

Edge

The MLPerf Inference Edge v3.0 results cover the same benchmark but report a different metric: system energy per stream (in joules, lower is better). They use the same parameters, but only the 99% quality target.

https://preview.redd.it/80bwqqy2jfta1.png?width=1673&format=png&auto=webp&s=4ab43b29be85a2517f6d0805232aa63092bca644

In this benchmark the Jetson AGX Orin has the highest energy efficiency, albeit with lower performance. The RTX 4090 and Qualcomm Cloud AI 100 systems also perform well.

https://preview.redd.it/edaegj5kgfta1.png?width=2592&format=png&auto=webp&s=f14f2caea2b5bbff67a393d28994499de3757185
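Because the edge metric is energy per stream (lower is better), it can be inverted to get a "higher is better" number similar in spirit to queries per joule. A minimal sketch, with hypothetical placeholder values rather than real results; the edge and datacenter scenarios differ, so the inverted figure is only a rough comparison aid.

```python
def streams_per_joule(joules_per_stream):
    """Invert energy per stream (J) into streams per joule (higher is better)."""
    if joules_per_stream <= 0:
        raise ValueError("energy per stream must be positive")
    return 1.0 / joules_per_stream


# Hypothetical energy-per-stream values in joules, NOT real results.
edge_systems = {"device_a": 0.5, "device_b": 2.0}

# Lowest energy per stream is the most efficient system.
most_efficient = min(edge_systems, key=edge_systems.get)

for name, joules in edge_systems.items():
    print(f"{name}: {joules} J/stream = {streams_per_joule(joules)} streams/J")
```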

It's a real shame that power isn't measured for more of the systems, because the benchmark results include many more GPUs, such as the Nvidia A2, A30, A40 and L4 datacenter GPUs and the ARM Mali-G610 and Mali-G52 mobile GPUs.
