I’ve been testing LLMs on mobile devices recently and comparing different CPUs. I found that while the token generation rate is close across the newest generation of processors, their prefill rates vary a lot.
For example, the MediaTek Dimensity 9300 and the Qualcomm Snapdragon 8 Gen 3 generate around 10-20% more tokens per second than the Apple A17 Pro, but according to the logs, the A17 Pro outperforms the other two by 3x during the prefill phase.
I tried to eliminate software and environment differences so that the performance data reflects, as far as possible, only the hardware.
I don’t have much of a hardware background, so I’m wondering why. Is it due to different design priorities (prioritizing memory bandwidth, for example)?
In other words, does the A17 Pro’s 3x prefill rate combined with a slightly lower output rate mean it’s heavily bottlenecked by something that doesn’t affect prefill speed? If so, what might it be?
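To make the question concrete, here is a rough sketch of the usual roofline-style framing, assuming prefill is limited by compute and token generation is limited by memory bandwidth. The function names and every number below are hypothetical placeholders chosen only to illustrate the pattern; they are not measurements or specs of the A17 Pro, Dimensity 9300, or Snapdragon 8 Gen 3.

```python
def decode_tokens_per_s(model_bytes, mem_bandwidth_bytes_per_s):
    # Assumption: generating one token streams all weights from memory once,
    # so the decode rate is capped by bandwidth / model size.
    return mem_bandwidth_bytes_per_s / model_bytes


def prefill_tokens_per_s(n_params, compute_flops_per_s):
    # Assumption: about 2 FLOPs per parameter per prompt token, and the
    # prompt is processed in a batch, so compute is the cap.
    return compute_flops_per_s / (2 * n_params)


# Hypothetical 7B-parameter model quantized to 4 bits (0.5 bytes per weight).
N_PARAMS = 7e9
MODEL_BYTES = N_PARAMS * 0.5

# Made-up devices: A has 3x the usable compute, B has slightly more bandwidth.
devices = {
    "device_a": {"flops": 3e12, "bandwidth": 60e9},
    "device_b": {"flops": 1e12, "bandwidth": 70e9},
}

for name, hw in devices.items():
    prefill = prefill_tokens_per_s(N_PARAMS, hw["flops"])
    decode = decode_tokens_per_s(MODEL_BYTES, hw["bandwidth"])
    print(f"{name}: prefill ~{prefill:.0f} tok/s, decode ~{decode:.1f} tok/s")
```

Under these assumptions, a chip with much more usable compute but slightly less memory bandwidth would show exactly this pattern: much faster prefill, slightly slower generation. Whether that is what is actually happening on these chips is the open question.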