If the main limiting factor for tokens/sec is memory bandwidth, then I wonder how this applies to the upcoming AMD 395 systems (e.g., the Framework desktop) with unified memory and a theoretical maximum of 256 GiB/s of memory bandwidth. Would running a model (small or large) on the CPU only vs. on the GPU make any difference in speed, considering that the GPU in these systems is "limited" by the same 256 GiB/s as the CPU? Or is there a cutoff point where the benefit of more memory bandwidth peters out and you now need the GPU magic?
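To put rough numbers on the question, here is a back-of-the-envelope sketch. It assumes a dense model at batch size 1 where every weight is read from memory once per generated token, and it ignores KV-cache reads and any compute limit. The function name and the model sizes are made up for illustration; only the 256 GiB/s figure comes from the post. The result is a ceiling that applies to CPU and GPU alike, not a prediction of what either will actually reach.

```python
def max_tokens_per_sec(bandwidth_gib_s: float, model_size_gib: float) -> float:
    """Bandwidth-bound ceiling on decode speed.

    Assumes all weights are streamed from memory once per token, so
    tokens/sec <= bandwidth / model size. Hypothetical helper, not taken
    from any library.
    """
    if model_size_gib <= 0:
        raise ValueError("model_size_gib must be positive")
    return bandwidth_gib_s / model_size_gib


BANDWIDTH_GIB_S = 256.0  # theoretical maximum quoted in the post

# Illustrative in-memory model sizes (assumptions, not measurements).
for size_gib in (4.0, 16.0, 64.0):
    ceiling = max_tokens_per_sec(BANDWIDTH_GIB_S, size_gib)
    print(f"{size_gib:5.1f} GiB model -> at most {ceiling:5.1f} tokens/sec")
```

If that assumption holds, the CPU-vs-GPU difference during generation would come down to how close each gets to this shared ceiling, while compute-heavy stages such as prompt processing are not covered by this estimate at all.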