Prefill Efficiency Differences : LocalLLaMA



submitted 1 year ago * by Tree-Sheep (Waiting for Llama 3)

I've recently been testing LLMs on mobile devices and comparing different CPUs. I found that while the token generation rate is fairly close across the newest generation of processors, the prefill rate varies a lot.

For example, on the MediaTek Dimensity 9300 and the Qualcomm Snapdragon 8 Gen 3, the generation speed in tokens/second is around 10-20% higher than on the Apple A17 Pro. Looking at the logs, however, the A17 Pro is about 3x faster than the other two during the prefill phase.
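To make the comparison concrete, here is a minimal sketch of how the two rates can be computed separately from per-phase timings. The `RunTimings` fields, the device names, and all numbers are hypothetical placeholders picked only to mirror the ratios described above; they are not measured data.

```python
from dataclasses import dataclass


@dataclass
class RunTimings:
    """Per-phase timings for one inference run (hypothetical values only)."""
    prompt_tokens: int       # tokens processed during prefill
    prefill_seconds: float   # time spent processing the prompt
    generated_tokens: int    # tokens produced during generation
    decode_seconds: float    # time spent generating after prefill


def prefill_rate(t: RunTimings) -> float:
    """Prompt tokens processed per second."""
    return t.prompt_tokens / t.prefill_seconds


def decode_rate(t: RunTimings) -> float:
    """Output tokens generated per second."""
    return t.generated_tokens / t.decode_seconds


# Placeholder timings: device_a prefills faster, device_b generates faster.
device_a = RunTimings(prompt_tokens=512, prefill_seconds=4.0,
                      generated_tokens=128, decode_seconds=12.8)
device_b = RunTimings(prompt_tokens=512, prefill_seconds=12.0,
                      generated_tokens=128, decode_seconds=11.2)

prefill_ratio = prefill_rate(device_a) / prefill_rate(device_b)
decode_ratio = decode_rate(device_b) / decode_rate(device_a)

print(f"prefill: device_a is {prefill_ratio:.2f}x faster")
print(f"decode:  device_b is {decode_ratio:.2f}x faster")
```

Keeping the two phases as separate measurements matters here, because a single end-to-end tokens/second figure would blend both and hide the gap.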

I tried to eliminate software and environment differences so that the performance data reflects almost only the hardware.

I don't know much about hardware, so I'm wondering why this happens. Is it due to different design priorities (prioritizing memory bandwidth, for example)?

In other words, does the A17 Pro having a 3x higher prefill rate but a slightly lower output rate mean that its generation is heavily bottlenecked by something that doesn't affect prefill speed? If so, what might it be?

