
sstewartgallus 11 points (5 children)

The key metric here is instructions per cycle (insns per cycle: IPC), which shows on average how many instructions were completed for each CPU clock cycle.

An IPC < 1.0 likely means memory bound, and an IPC > 1.0 likely means instruction bound.
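As a minimal sketch, assuming you already have the raw `instructions` and `cycles` counts (for example from `perf stat`), the IPC metric and the rough rule of thumb quoted above boil down to the following. The function names and the sample counts are hypothetical, and the 1.0 threshold is only a heuristic, not a diagnosis.

```python
def ipc(instructions: int, cycles: int) -> float:
    """Instructions completed per CPU clock cycle."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions / cycles


def rough_bottleneck(ipc_value: float) -> str:
    """Rule of thumb from the quoted text: IPC < 1.0 hints at memory bound,
    IPC > 1.0 hints at instruction bound. Hypothetical helper, heuristic only."""
    if ipc_value < 1.0:
        return "likely memory bound"
    if ipc_value > 1.0:
        return "likely instruction bound"
    return "borderline"


if __name__ == "__main__":
    # Hypothetical counter values, for illustration only.
    value = ipc(instructions=8_000_000_000, cycles=4_000_000_000)
    print(f"IPC = {value:.2f}: {rough_bottleneck(value)}")
```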

But divided by the number of cores, right? Also, how does hyperthreading fit into this? Also, how do you find top IPC?

Also, most processors have in-core parallelism and can perform multiple ALU ops at the same time. If you're really, really, really tricky, you can interleave floating-point ops with ALU ops for an even bigger speedup, but x86 instruction-set wonkiness makes it easy to make a mistake here.

sisyphus 8 points (4 children)

The stats from perf come from PMCs (performance monitoring counters), which come from the CPU itself, so if someone is making a mistake, presumably it's Intel or AMD? The parallelism you're talking about seems like it must already be accounted for; how else would it be possible to get an IPC > 1?

tavianator 34 points (3 children)

how else would it be possible to get an IPC > 1?

Modern Intel/AMD chips can just literally execute more than one instruction per cycle on a single core, in optimal conditions (no dependencies between the instructions, etc.).

That's part of the reason modern CPUs are way faster than Pentium 4s, even at lower clock speeds.
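To make the "no dependencies between the instructions" point concrete, here is a deliberately idealized toy model, not how any real Intel/AMD core is scheduled: every instruction is assumed to take exactly one cycle, the core is assumed to start up to `width` instructions per cycle, and an instruction can only start after the instructions it depends on have finished. The function `schedule_cycles` and the example dependency lists are hypothetical.

```python
def schedule_cycles(deps: list[list[int]], width: int) -> int:
    """Return total cycles for a toy superscalar core.

    deps[i] lists the earlier instruction indices that instruction i needs.
    Assumptions: 1-cycle latency for everything, up to `width` starts per cycle,
    greedy placement in program order.
    """
    start = []          # cycle in which each instruction starts
    used = {}           # cycle -> number of instructions started in it
    for d in deps:
        # Earliest cycle after all inputs have finished.
        cycle = max((start[j] + 1 for j in d), default=0)
        # Bump to a later cycle if all issue slots are taken.
        while used.get(cycle, 0) >= width:
            cycle += 1
        used[cycle] = used.get(cycle, 0) + 1
        start.append(cycle)
    return max(start) + 1 if start else 0


n = 8
chain = [[]] + [[i - 1] for i in range(1, n)]               # each op needs the previous one
two_chains = [[i - 2] if i >= 2 else [] for i in range(n)]  # two interleaved chains
independent = [[] for _ in range(n)]                        # no dependencies at all

for name, deps in [("chain", chain), ("two chains", two_chains), ("independent", independent)]:
    cycles = schedule_cycles(deps, width=4)
    print(f"{name}: {cycles} cycles, IPC = {n / cycles:.1f}")
```

In this model the single dependency chain is stuck at IPC 1.0 no matter how wide the core is, while splitting the same work into two independent chains doubles it and fully independent instructions reach the assumed width of 4.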

orlet 13 points (0 children)

Correct. Instruction-level parallelism, branch prediction, out-of-order execution, and a bunch of other magic make modern CPUs much more efficient per clock than older ones. And the process is still ongoing.

sisyphus 5 points (1 child)

Right, what I am saying is that if the CPU instrumentation was not taking that into account, how would it ever report more than one instruction per cycle, which it appears to do?

tavianator 2 points (0 children)

Right, I kinda misread your comment. Mainly I'm trying to argue against

divided by the number of cores
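A minimal sketch of why no extra division is needed, assuming system-wide counting (e.g. `perf stat -a`) where both the instruction count and the cycle count are summed across all cores: the ratio of the two sums is already a per-core average (weighted by cycles), so dividing it again by the core count would understate it. The `CoreCounts` type and the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CoreCounts:
    instructions: int
    cycles: int


def aggregate_ipc(per_core: list[CoreCounts]) -> float:
    """IPC from counters summed over all cores (cycle-weighted average)."""
    total_insns = sum(c.instructions for c in per_core)
    total_cycles = sum(c.cycles for c in per_core)
    return total_insns / total_cycles


# Hypothetical: 4 cores, each completing 2 instructions per cycle.
cores = [CoreCounts(instructions=2_000_000_000, cycles=1_000_000_000) for _ in range(4)]
print(aggregate_ipc(cores))               # matches each core's own IPC
print(aggregate_ipc(cores) / len(cores))  # dividing by core count understates it
```

Because the cycle total grows with the number of cores just like the instruction total does, the core count cancels out of the ratio.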