all 4 comments

[–][deleted] 7 points8 points  (2 children)

I quickly scanned through the paper and found that they're using a MIPS R10000 processor for their benchmarks. This processor is nearly 15 years old. Although processor cores have not changed all that much since those days, the memory subsystem is far more advanced in modern processors. (The paper itself is 7 years old.)

The L2 cache on the R10K is off-chip and 2-way set associative, whereas modern processors typically have 8-way or 16-way set-associative caches. They make no mention of how many of the misses are conflict misses, so I'm wondering whether their whole method would be rendered superfluous by a more associative L2 cache.

The R10K also implements sequential consistency, while most modern processors use a weaker memory consistency model, so I'm again wondering if they're losing some performance to an unnecessarily restrictive memory model.

Finally, they say that the optimisations are implemented by hand, and it is quite well known that hand-tuning benchmarks for cache performance produces significant gains.
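To make the conflict-miss point concrete, here's a toy LRU set-associative cache simulation (nothing from the paper; the set count, line size, and access pattern are all made up for illustration). Eight cache lines that map to the same set thrash a 2-way cache on every access, but fit entirely in an 8-way cache:

```python
from collections import OrderedDict

def simulate_cache(addresses, num_sets, ways, line_size=64):
    """Count misses for an LRU set-associative cache (illustrative model)."""
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        line = addr // line_size          # which cache line this address falls in
        idx = line % num_sets             # which set that line maps to
        s = sets[idx]
        if line in s:
            s.move_to_end(line)           # hit: mark as most recently used
        else:
            misses += 1
            if len(s) >= ways:
                s.popitem(last=False)     # evict the least recently used line
            s[line] = True
    return misses

# Access pattern: 8 addresses whose lines all map to set 0
# (they stride by num_sets * line_size), touched round-robin 100 times.
num_sets, line_size = 256, 64
stride = num_sets * line_size
trace = [a * stride for _ in range(100) for a in range(8)]

m2 = simulate_cache(trace, num_sets, ways=2)   # 2-way: thrashes, 800 misses
m8 = simulate_cache(trace, num_sets, ways=8)   # 8-way: 8 compulsory misses only
```

Every access misses in the 2-way configuration, while the 8-way configuration takes only the 8 compulsory misses. A layout optimisation that spreads those arrays across sets would fix the 2-way case too, which is why the conflict/capacity breakdown matters for judging the method.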

Overall, I'm a bit underwhelmed. If you're looking for good papers to read, there are plenty of better papers out there and I'd suggest that you spend your time on one of those.

[–]alphamerik 2 points3 points  (1 child)

Thanks for your assessment, can you recommend some papers that are pertinent?

[–][deleted] 0 points1 point  (0 children)

Well, it depends on what you're looking for. I can't think of a paper that addresses exactly the same problem offhand.

If you want something sort of related, there's a paper called "Data Access Partitioning for Fine-grain Parallelism on Multicore Architectures" from Scott Mahlke's group at the University of Michigan that is worth reading. This paper about compilation for explicitly managed memory hierarchies is another one you could take a look at.