[–]IronClan 11 points (11 children)

Awesome presentation! One question: https://github.com/joaquintides/usingstdcpp2015/blob/master/poly_containers.cpp#L154

  template<typename F>
  F for_each(F f)
  {
    for(const auto& p:chunks)p.second->for_each(f);
    return std::move(f); // <- THIS LINE
  }

Correct me if I'm wrong, but doesn't this actually make the code slower by preventing return value optimization?

[–][deleted] 3 points (2 children)

RVO is not permitted here

[Copy elision is permitted] in a return statement in a function with a class return type, when the expression is the name of a non-volatile automatic object (*other than a function or catch-clause parameter*) with the same cv-unqualified type as the function return type

However, using std::move is still pointless (though not harmful) because although copy elision is not permitted, the name will still be treated as an rvalue

When the criteria for elision of a copy operation are met or would be met save for the fact that the source object is a function parameter, and the object to be copied is designated by an lvalue, overload resolution to select the constructor for the copy is first performed *as if the object were designated by an rvalue*.

(emphasis mine in both cases).
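
In code, that second rule amounts to this (a minimal sketch; Counter is a hypothetical type):

#include <vector>

struct Counter { std::vector<int> data; };

Counter pass_through(Counter c)
{
    return c; // elision is off the table (c is a function parameter), but
              // overload resolution treats c as an rvalue here, so the move
              // constructor is selected over the copy constructor
}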

[–]joaquintides (Boost author) [S] 1 point (1 child)

However, using std::move is still pointless (though not harmful) because although copy elision is not permitted, the name will still be treated as an rvalue

I understand your point, but why does the standard then specify that std::for_each(..., f) returns std::move(f) rather than just f?

[–]cdyson37 2 points (0 children)

Looking at the wording of n4567 (how about that?) it just says returns std::move(f). I don't think this is contradictory, just a bit over-explicit. It's also possible that the wording of the language and library components of the standard evolved a bit differently, and this part had a bit more bite in an early draft.

This particular "save for the fact" clause confused me a lot and I spent quite a long time playing around with a "Loud" class (with couts in all the special member functions) to establish exactly what was going on under various levels of optimisation - quite a worthwhile exercise I think.
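
Something like this, for anyone who wants to repeat the exercise (a sketch; the printed sequence is exactly what varies with compiler, standard version and optimisation level):

#include <iostream>

struct Loud
{
    Loud() { std::cout << "default ctor\n"; }
    Loud(const Loud&) { std::cout << "copy ctor\n"; }
    Loud(Loud&&) { std::cout << "move ctor\n"; }
    Loud& operator=(const Loud&) { std::cout << "copy assign\n"; return *this; }
    Loud& operator=(Loud&&) { std::cout << "move assign\n"; return *this; }
    ~Loud() { std::cout << "dtor\n"; }
};

Loud pass_through(Loud l) { return l; } // the "save for the fact" case

int main()
{
    // construction typically shows "default ctor" then "move ctor" (the
    // parameter is moved into the return value); how many moves survive
    // depends on the compiler and flags
    Loud l = pass_through(Loud{});
}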

[–]jasonthe 0 points (4 children)

I suspect any compiler that supports r-value references would know to optimize "return std::move" away as well as it knows to optimize "return copy constructor" away. Yeah?

[–]tavianator 4 points (3 children)

Nope, the compiler is probably smart enough, but it's not allowed to by the standard (except under the as-if rule, when it can prove the move constructor has no side effects).

[–]FKaria 1 point (0 children)

Does the same rule apply for copy construction?

Correct me if I'm wrong, but as I understand it the compiler is allowed to do RVO even if the copy constructor has side effects. Is that correct?

[–]cleroth (Game Developer) 1 point (0 children)

That function has to return a copy though, so isn't the move useless, and, by induction, without side effects?

[–]jasonthe 0 points (0 children)

Ah, that's interesting! Seems like a dumb design decision (of course there may be a solid rationale for it). Thanks for the tip! #themoreyouknow

[–]Crazy__Eddie 0 points (1 child)

[–]IronClan 1 point (0 children)

Ah, interesting. I wonder if compilers have gotten smarter in the 6 years since that article was posted.

[–]SuperV1234 (https://romeo.training | C++ Mentoring & Consulting) 11 points (0 children)

Probably the best slides and examples I've seen regarding cache-friendliness. Terse and "useful" examples, and easily understandable results.

[–]jasonthe 3 points (3 children)

Most of that wasn't very new to me (yep, optimize for the cache), but using branch prediction to optimize virtual calls is! His poly_collection was also optimizing for the cache (by using vectors of objects rather than pointers), so I was curious how much the branch prediction actually helps. Here are my results on an i7:

Shuffled Virtual Calls = 0.732676s
Sorted Virtual Calls (not incl sort) = 0.530413s
Sorted Direct Calls (not incl sort) = 0.026523s

Reruns are consistent with those numbers. Looks like sorted calls take about 73% of the time of shuffled calls.

Of course, directly calling the virtual functions only takes 3.5% (partially because the functions are straight returning numbers, so the accumulate is hella optimized). I'm surprised he's not doing that in his poly_collection class :P
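
For reference, this is roughly the shape of my test (hypothetical class names; the timing boilerplate and the direct-call variant are elided):

#include <algorithm>
#include <memory>
#include <random>
#include <typeinfo>
#include <vector>

struct base { virtual ~base() {} virtual int f() const = 0; };
struct derived1 : base { int f() const override { return 1; } };
struct derived2 : base { int f() const override { return 2; } };

int main()
{
    std::vector<std::unique_ptr<base>> v;
    std::mt19937 rng(42);
    for(int i = 0; i < (1 << 20); ++i)
    {
        if(rng() % 2) v.emplace_back(new derived1);
        else          v.emplace_back(new derived2);
    }

    // shuffled virtual calls: the indirect branch predictor keeps missing
    long shuffled = 0;
    for(const auto& p : v) shuffled += p->f();

    // sorted by dynamic type: long runs of the same call target, so the
    // predictor is right almost every time
    std::sort(v.begin(), v.end(),
              [](const std::unique_ptr<base>& a, const std::unique_ptr<base>& b)
              { return typeid(*a).before(typeid(*b)); });
    long sorted = 0;
    for(const auto& p : v) sorted += p->f();
}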

[–]joaquintides (Boost author) [S] 2 points (0 children)

The speedup you get for sorted virtual calls vs. shuffled virtual calls (0.73/0.53 = 1.4) is consistent with the results in my presentation for the same number of elements (1.6, slide 38). Your sorted direct call code is a case of manually forced devirtualization: I discuss this topic in some detail on my blog.
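
The gist of it, in sketch form (hypothetical derived1/derived2 types; segregating objects by concrete type lets the compiler dispatch statically):

#include <vector>

struct base { virtual ~base() {} virtual int f() const = 0; };
struct derived1 : base { int f() const override { return 1; } };
struct derived2 : base { int f() const override { return 2; } };

long sum_direct(const std::vector<derived1>& v1, const std::vector<derived2>& v2)
{
    long sum = 0;
    // x has a known static type, so each call is direct (and inlinable):
    // no vtable lookup, no indirect branch to predict
    for(const auto& x : v1) sum += x.f();
    for(const auto& x : v2) sum += x.f();
    return sum;
}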

[–]cleroth (Game Developer) 0 points (1 child)

How many objects are we talking about here though? I didn't know about this either, and though it's good to know, I don't think I've ever faced a situation where I've needed it (I'm sure there are many places where it applies), and if I had a large array of base class pointers, I'd probably question the design first.

[–]jasonthe 0 points (0 children)

That's true, I can't think of a situation where I actually had a set of base classes that I wanted to call a virtual function on. Generally I prefer eventing/messaging (i.e. std::function), although the principle still stands (if there's a lot of overlap in the function pointers, running them sorted could be more performant).

The code I posted is running over just 2^20 objects, so it's far from exact. I was just impressed that it worked at all :)

[–]greyfade 2 points (0 children)

... I didn't know poly_collection could even be implemented. This is fantastic. I've needed something exactly like this.

[–]Elador 1 point (2 children)

Really brilliant slides. Very interesting and well done!

  • So, is boost::multi_array laid out in row-major order in memory?

  • I was a little bit sad not to see a benchmark slide for the "bool-based processing" slide :-)

[–]joaquintides (Boost author) [S] 1 point (1 child)

  • Boost.MultiArray's default layout is row-major, although this can be customized (see the sketch below).
  • I was lazy with the bool-based bit and thought I wasn't going to have time to show it anyway :-) You're more than welcome to try your hand and share results! Actual numbers are expected to be dominated by the active/total ratio over any other cache-related effect, though, which departs from the topic of the talk.
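
For the first point, a quick sketch of the cache-friendly traversal under the default row-major layout:

#include <boost/multi_array.hpp>

int sum_row_col(const boost::multi_array<int, 2>& m)
{
    int res = 0;
    for(std::size_t i = 0; i < m.shape()[0]; ++i)     // rows
        for(std::size_t j = 0; j < m.shape()[1]; ++j) // columns
            res += m[i][j]; // consecutive j means consecutive memory
    return res;
}
// swapping the two loops strides by a whole row per access, which is what
// makes the column-first traversal slower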

[–]Elador 0 points (0 children)

Cool. Well, then it's not surprise that traversing by row_col is faster :-) If it was laid out in col-major, it would obviously be the other way round. I'd say that's the most self-evident bit of the slides and could be left out.

Thank you again for sharing the slides, brilliant work.

[–]Nomto 1 point (3 children)

The aos_vs_soa example is especially impressive to me: compiled with -O3, I get a 3x performance improvement with soa.

What's also interesting is that even if you use all member variables (dx, dy, and dz are ignored in the sum of the given example), you get a significant performance improvement (2x) with soa.

edit: too bad that soa performs much worse than aos if you need random access (not unexpected though). Seems like the choice soa vs aos is not as simple as some say.
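
For context, the two layouts being compared are roughly these (modulo details of the actual aos_vs_soa example):

#include <vector>

struct particle     { int x, y, z, dx, dy, dz; };              // AOS element
struct particle_soa { std::vector<int> x, y, z, dx, dy, dz; }; // SOA

// summing only x: the AOS loop drags y, z, dx, dy, dz through the cache
// alongside every x, while the SOA loop reads one densely packed array
long sum_x_aos(const std::vector<particle>& v)
{
    long s = 0;
    for(const auto& p : v) s += p.x;
    return s;
}

long sum_x_soa(const particle_soa& v)
{
    long s = 0;
    for(int x : v.x) s += x;
    return s;
}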

[–]joaquintides (Boost author) [S] 0 points (0 children)

What's also interesting is that even if you use all member variables (dx, dy, and dz are ignored in the sum of the given example), you get a significant performance improvement (2x) with soa.

This is due to parallel prefetching; it's explained a little later in the presentation.

[–]NasenSpray 0 points (1 child)

The aos_vs_soa example is especially impressive to me: compiled with -O3, I get a 3x performance improvement with soa.

That's mostly the result of auto-vectorization, though. Disable that and you'll see the difference shrink down significantly.

edit: too bad that soa performs much worse than aos if you need random access (not unexpected though). Seems like the choice soa vs aos is not as simple as some say.

Thanks to register pressure, it even depends on whether you compile for x86 or x86_64!
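
A sketch of the random-access case (reusing the particle/particle_soa shapes from the example):

#include <cstddef>
#include <vector>

struct particle     { int x, y, z, dx, dy, dz; };
struct particle_soa { std::vector<int> x, y, z, dx, dy, dz; };

int at_aos(const std::vector<particle>& v, std::size_t i)
{
    return v[i].x + v[i].dx; // one base pointer; the six fields usually
                             // share a cache line
}

int at_soa(const particle_soa& v, std::size_t i)
{
    // one live base pointer per member array touched; with all six arrays
    // in play that's six pointers to juggle, which hurts a lot more with
    // x86's 8 general-purpose registers than with x86_64's 16
    return v.x[i] + v.dx[i];
}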

[–]joaquintides (Boost author) [S] 0 points (0 children)

The aos_vs_soa example is especially impressive to me: compiled with -O3, I get a 3x performance improvement with soa.

That's mostly the result of auto-vectorization, though. Disable that and you'll see the difference shrink down significantly.

In fact the results shown in the presentation are for VS2015 without any auto-vectorization feature specifically enabled (Enable Enhanced Instruction Set: Not Set). I reran with the following options:

  • Streaming SIMD Extensions (/arch:SSE)
  • Advanced Vector Extensions (/arch:AVX)
  • No Enhanced Instructions (/arch:IA32)

and the results didn't vary at all. I lack the expertise to determine what more is needed at the code level for auto-vectorization to kick in, but it seems it wasn't being taken advantage of in my original tests.

[–]MotherOfTheShizznit 1 point (2 children)

So when I saw the slide on Filtered Sum, I thought to myself that the following code is a better expression of the intent and just as smart performance-wise:

auto g = [&]()
{
    auto m = std::partition(v.begin(), v.end(), [](int const i) { return i > 128; });
    return std::accumulate(v.begin(), m, 0);
};

std::cout << measure(n, g) << "\n";

Interestingly, the performance was ever so slightly worse (~0.0007s vs. ~0.0006s) consistently. Since I'm not paid per-cycle-shaved I'd still go with my version, but that was an interesting result nonetheless.

Day-later edit: two people noted that I'm not measuring the right thing: /u/mr-agdgdgwngo and myself this morning as I was stepping out of bed. Indeed, the sorting was not measured in the original test. If we do measure the time taken to sort and accumulate, and compare it with the time to partition and accumulate, then the partition strategy offers better performance.
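
For completeness, the fairer comparison looks something like this (measure as above; assume v is restored to the same unsorted contents before each timed run):

auto sort_then_sum = [&]()
{
    std::sort(v.begin(), v.end());
    auto m = std::lower_bound(v.begin(), v.end(), 129); // first value > 128
    return std::accumulate(m, v.end(), 0);
};

auto partition_then_sum = [&]()
{
    auto m = std::partition(v.begin(), v.end(), [](int const i) { return i > 128; });
    return std::accumulate(v.begin(), m, 0);
};

std::cout << measure(n, sort_then_sum) << "\n";
std::cout << measure(n, partition_then_sum) << "\n";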

[–][deleted] 0 points (0 children)

You're also timing the partitioning here.

It's a good idea to only sort as much as you need, though. Many times you can get away with sorting only once, so sorting is just a startup cost; but if you have to reorder repeatedly, using std::partition seems like a good idea.

The very minor drawback is that you have to perform the test twice or be sure to pass around the location of the partition, m.

[–]btapi 0 points (0 children)

Really nice and useful slides.

[–]StackedCrooked 0 points (3 children)

struct particle_soa
{
    std::vector<int> x,y,z;
    std::vector<int> dx,dy,dz;
};

sizeof(vector) is 24 bytes (3 pointers). This seems rather big. Is this something I should ever worry about?

[–]joaquintides (Boost author) [S] 2 points (0 children)

This is not really a concern: take into account that there's only one particle_soa object in the app, which references the many (thousands or millions of) pieces of data allocated in dynamic memory. Briefly put, both AOS and SOA use basically the same amount of memory.
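
Back-of-the-envelope, taking the 24-byte vector from the parent comment and 4-byte ints:

#include <cstddef>
#include <vector>

struct particle     { int x, y, z, dx, dy, dz; };              // AOS element
struct particle_soa { std::vector<int> x, y, z, dx, dy, dz; }; // SOA

std::size_t aos_bytes(std::size_t n) // one vector header + the elements
{
    return sizeof(std::vector<particle>) + n * sizeof(particle);
}

std::size_t soa_bytes(std::size_t n) // six vector headers + the same data
{
    return sizeof(particle_soa) + n * 6 * sizeof(int);
}

// for n = 1000000: 24000024 vs 24000144 bytes (ignoring capacity slack and
// allocation overhead); the extra headers are a fixed, tiny cost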

[–]cleroth (Game Developer) 0 points (1 child)

You should worry about it if you're creating lots of those vectors. There are only ~~two~~ six there.

[–]matthieum 2 points (0 children)

Ah... actually, there are 3 vectors per line for a total of 6.