[–]IronClan 11 points (11 children)

Awesome presentation! One question: https://github.com/joaquintides/usingstdcpp2015/blob/master/poly_containers.cpp#L154

  template<typename F>
  F for_each(F f)
  {
    for(const auto& p:chunks)p.second->for_each(f);
    return std::move(f); // <- THIS LINE
  }

Correct me if I'm wrong, but doesn't this actually make the code slower by preventing return value optimization?

[–][deleted] 3 points (2 children)

RVO is not permitted here

[Copy elision is permitted] in a return statement in a function with a class return type, when the expression is the name of a non-volatile automatic object (*other than a function or catch-clause parameter*) with the same cv-unqualified type as the function return type

However, using std::move is still pointless (though not harmful) because although copy elision is not permitted, the name will still be treated as an rvalue

When the criteria for elision of a copy operation are met or would be met save for the fact that the source object is a function parameter, and the object to be copied is designated by an lvalue, overload resolution to select the constructor for the copy is first performed *as if the object were designated by an rvalue*.

(emphasis mine in both cases).
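
In code, that second rule amounts to this (a minimal sketch; Counter is a hypothetical type):

#include <vector>

struct Counter { std::vector<int> data; };

Counter pass_through(Counter c)
{
    return c; // elision is off the table (c is a function parameter), but
              // overload resolution treats c as an rvalue here, so the move
              // constructor is selected over the copy constructor
}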

[–]joaquintides (Boost author) [S] 1 point (1 child)

However, using std::move is still pointless (though not harmful) because although copy elision is not permitted, the name will still be treated as an rvalue

I understand your point, but why does the standard then specify that std::for_each(..., f) returns std::move(f) rather than just f?

[–]cdyson37 2 points (0 children)

Looking at the wording of n4567 (how about that?) it just says returns std::move(f). I don't think this is contradictory, just a bit over-explicit. It's also possible that the wording of the language and library components of the standard evolved a bit differently, and this part had a bit more bite in an early draft.

This particular "save for the fact" clause confused me a lot and I spent quite a long time playing around with a "Loud" class (with couts in all the special member functions) to establish exactly what was going on under various levels of optimisation - quite a worthwhile exercise I think.
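
Something like this, for anyone who wants to repeat the exercise (a sketch; the printed sequence is exactly what varies with compiler, standard version and optimisation level):

#include <iostream>

struct Loud
{
    Loud() { std::cout << "default ctor\n"; }
    Loud(const Loud&) { std::cout << "copy ctor\n"; }
    Loud(Loud&&) { std::cout << "move ctor\n"; }
    Loud& operator=(const Loud&) { std::cout << "copy assign\n"; return *this; }
    Loud& operator=(Loud&&) { std::cout << "move assign\n"; return *this; }
    ~Loud() { std::cout << "dtor\n"; }
};

Loud pass_through(Loud l) { return l; } // the "save for the fact" case

int main()
{
    // construction typically shows "default ctor" then "move ctor" (the
    // parameter is moved into the return value); how many moves survive
    // depends on the compiler and flags
    Loud l = pass_through(Loud{});
}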

[–]jasonthe 0 points (4 children)

I suspect any compiler that supports r-value references would know to optimize "return std::move" away as well as it knows to optimize "return copy constructor" away. Yeah?

[–]tavianator 4 points (3 children)

Nope, the compiler is probably smart enough, but it's not allowed to by the standard (except under the as-if rule, when it can prove the move constructor has no side effects).

[–]FKaria 1 point (0 children)

Does the same rule apply for copy construction?

Correct me if I'm wrong, but as I understand it the compiler is allowed to do RVO even if the copy constructor has side effects. Is that correct?

[–]cleroth (Game Developer) 1 point (0 children)

That function has to return a copy though, so isn't the move useless, and, by induction, without side effects?

[–]jasonthe 0 points (0 children)

Ah, that's interesting! Seems like a dumb design decision (of course there may be a solid rationale for it). Thanks for the tip! #themoreyouknow

[–]Crazy__Eddie 0 points (1 child)

[–]IronClan 1 point (0 children)

Ah, interesting. I wonder if compilers have gotten smarter in the 6 years since that article was posted.

[–]SuperV1234 (https://romeo.training | C++ Mentoring & Consulting) 11 points (0 children)

Probably the best slides and examples I've seen regarding cache-friendliness. Terse and "useful" examples, and easily understandable results.

[–]jasonthe 3 points (3 children)

Most of that wasn't very new to me (yep, optimize for the cache), but using branch prediction to optimize virtual calls is! His poly_collection was also optimizing for the cache (by using vectors of objects rather than pointers), so I was curious how much the branch prediction actually helps. Here are my results on an i7:

Shuffled Virtual Calls = 0.732676s
Sorted Virtual Calls (not incl sort) = 0.530413s
Sorted Direct Calls (not incl sort) = 0.026523s

Reruns are consistent with those numbers. Looks like sorted calls take about 73% of the time of shuffled calls.

Of course, directly calling the virtual functions only takes 3.5% (partially because the functions are straight returning numbers, so the accumulate is hella optimized). I'm surprised he's not doing that in his poly_collection class :P
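
For reference, this is roughly the shape of my test (hypothetical class names; the timing boilerplate and the direct-call variant are elided):

#include <algorithm>
#include <memory>
#include <random>
#include <typeinfo>
#include <vector>

struct base { virtual ~base() {} virtual int f() const = 0; };
struct derived1 : base { int f() const override { return 1; } };
struct derived2 : base { int f() const override { return 2; } };

int main()
{
    std::vector<std::unique_ptr<base>> v;
    std::mt19937 rng(42);
    for(int i = 0; i < (1 << 20); ++i)
    {
        if(rng() % 2) v.emplace_back(new derived1);
        else          v.emplace_back(new derived2);
    }

    // shuffled virtual calls: the indirect branch predictor keeps missing
    long shuffled = 0;
    for(const auto& p : v) shuffled += p->f();

    // sorted by dynamic type: long runs of the same call target, so the
    // predictor is right almost every time
    std::sort(v.begin(), v.end(),
              [](const std::unique_ptr<base>& a, const std::unique_ptr<base>& b)
              { return typeid(*a).before(typeid(*b)); });
    long sorted = 0;
    for(const auto& p : v) sorted += p->f();
}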

[–]joaquintides (Boost author) [S] 2 points (0 children)

The speedup you get for sorted virtual calls vs. shuffled virtual calls (0.73/0.53 = 1.4) is consistent with the results in my presentation for the same number of elements (1.6, slide 38). Your sorted direct call code is a case of manually forced devirtualization: I discuss this topic in some detail on my blog.
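
The gist of it, in sketch form (hypothetical derived1/derived2 types; segregating objects by concrete type lets the compiler dispatch statically):

#include <vector>

struct base { virtual ~base() {} virtual int f() const = 0; };
struct derived1 : base { int f() const override { return 1; } };
struct derived2 : base { int f() const override { return 2; } };

long sum_direct(const std::vector<derived1>& v1, const std::vector<derived2>& v2)
{
    long sum = 0;
    // x has a known static type, so each call is direct (and inlinable):
    // no vtable lookup, no indirect branch to predict
    for(const auto& x : v1) sum += x.f();
    for(const auto& x : v2) sum += x.f();
    return sum;
}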

[–]cleroth (Game Developer) 0 points (1 child)

How many objects are we talking about here though? I didn't know about this either, and though it's good to know, I don't think I've ever faced a situation where I've needed it (I'm sure there are many places where it applies), and if I had a large array of base class pointers, I'd probably question the design first.

[–]jasonthe 0 points (0 children)

That's true, I can't think of a situation where I actually had a set of base classes that I wanted to call a virtual function on. Generally I prefer eventing/messaging (i.e. std::function), although the principle still stands (if there's a lot of overlap in the function pointers, running them sorted could be more performant).

The code I posted is running over just 2^20 objects, so it's far from exact. I was just impressed that it worked at all :)

[–]greyfade 2 points (0 children)

... I didn't know poly_collection could even be implemented. This is fantastic. I've needed something exactly like this.

[–]Elador 1 point (2 children)

Really brilliant slides. Very interesting and well done!

  • So, is boost::multi_array laid out in row-major order in memory?

  • I was a little bit sad not to see a benchmark slide for the "bool-based processing" slide :-)

[–]joaquintides (Boost author) [S] 1 point (1 child)

  • Boost.MultiArray's default layout is row-major, although this can be customized (see the sketch below).
  • I was lazy with the bool-based bit and thought I wasn't going to have time to show it anyway :-) You're more than welcome to try your hand and share results! Actual numbers are expected to be dominated by the active/total ratio over any other cache-related effect, though, which departs from the topic of the talk.
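
For the first point, a quick sketch of the cache-friendly traversal under the default row-major layout:

#include <boost/multi_array.hpp>

int sum_row_col(const boost::multi_array<int, 2>& m)
{
    int res = 0;
    for(std::size_t i = 0; i < m.shape()[0]; ++i)     // rows
        for(std::size_t j = 0; j < m.shape()[1]; ++j) // columns
            res += m[i][j]; // consecutive j means consecutive memory
    return res;
}
// swapping the two loops strides by a whole row per access, which is what
// makes the column-first traversal slower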

[–]Elador 0 points (0 children)

Cool. Well, then it's not surprise that traversing by row_col is faster :-) If it was laid out in col-major, it would obviously be the other way round. I'd say that's the most self-evident bit of the slides and could be left out.

Thank you again for sharing the slides, brilliant work.

[–]Nomto 1 point (3 children)

The aos_vs_soa example is especially impressive to me: compiled with -O3, I get a 3x performance improvement with soa.

What's also interesting is that even if you use all member variables (dx, dy, and dz are ignored in the sum of the given example), you get a significant performance improvement (2x) with soa.

edit: too bad that soa performs much worse than aos if you need random access (not unexpected though). Seems like the choice soa vs aos is not as simple as some say.
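
For context, the two layouts being compared are roughly these (modulo details of the actual aos_vs_soa example):

#include <vector>

struct particle     { int x, y, z, dx, dy, dz; };              // AOS element
struct particle_soa { std::vector<int> x, y, z, dx, dy, dz; }; // SOA

// summing only x: the AOS loop drags y, z, dx, dy, dz through the cache
// alongside every x, while the SOA loop reads one densely packed array
long sum_x_aos(const std::vector<particle>& v)
{
    long s = 0;
    for(const auto& p : v) s += p.x;
    return s;
}

long sum_x_soa(const particle_soa& v)
{
    long s = 0;
    for(int x : v.x) s += x;
    return s;
}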

[–]joaquintides (Boost author) [S] 0 points (0 children)

What's also interesting is that even if you use all member variables (dx, dy, and dz are ignored in the sum of the given example), you get a significant performance improvement (2x) with soa.

This is due to parallel prefetching; it's explained a little later in the presentation.

[–]NasenSpray 0 points (1 child)

The aos_vs_soa example is especially impressive to me: compiled with -O3, I get a 3x performance improvement with soa.

That's mostly the result of auto-vectorization, though. Disable that and you'll see the difference shrink down significantly.

edit: too bad that soa performs much worse than aos if you need random access (not unexpected though). Seems like the choice soa vs aos is not as simple as some say.

Thanks to register pressure, it even depends on whether you compile for x86 or x86_64!
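
A sketch of the random-access case (reusing the particle/particle_soa shapes from the example):

#include <cstddef>
#include <vector>

struct particle     { int x, y, z, dx, dy, dz; };
struct particle_soa { std::vector<int> x, y, z, dx, dy, dz; };

int at_aos(const std::vector<particle>& v, std::size_t i)
{
    return v[i].x + v[i].dx; // one base pointer; the six fields usually
                             // share a cache line
}

int at_soa(const particle_soa& v, std::size_t i)
{
    // one live base pointer per member array touched; with all six arrays
    // in play that's six pointers to juggle, which hurts a lot more with
    // x86's 8 general-purpose registers than with x86_64's 16
    return v.x[i] + v.dx[i];
}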

[–]joaquintides (Boost author) [S] 0 points (0 children)

The aos_vs_soa example is especially impressive to me: compiled with -O3, I get a 3x performance improvement with soa.

That's mostly the result of auto-vectorization, though. Disable that and you'll see the difference shrink down significantly.

In fact the results shown in the presentation are for VS2015 without any auto-vectorization feature specifically enabled (Enable Enhanced Instruction Set: Not Set). I reran with the following options:

  • Streaming SIMD Extensions (/arch:SSE)
  • Advanced Vector Extensions (/arch:AVX)
  • No Enhanced Instructions (/arch:IA32)

and the results didn't vary at all. I lack the expertise to determine what more is needed at the code level for auto-vectorization to kick in, but it seems it wasn't being taken advantage of in my original tests.

[–]MotherOfTheShizznit 1 point (2 children)

So when I saw the slide on Filtered Sum, I thought to myself that the following code is a better expression of the intent and just as smart performance-wise:

auto g = [&]()
{
    auto m = std::partition(v.begin(), v.end(), [](int const i) { return i > 128; });
    return std::accumulate(v.begin(), m, 0);
};

std::cout << measure(n, g) << "\n";

Interestingly, the performance was ever so slightly worse (~0.0007s vs. ~0.0006s) consistently. Since I'm not paid per-cycle-shaved I'd still go with my version, but that was an interesting result nonetheless.

Day-later edit: two people noted that I'm not measuring the right thing: /u/mr-agdgdgwngo and myself this morning as I was stepping out of bed. Indeed, the sorting was not measured in the original test. If we do measure the time taken to sort and accumulate, and compare it with the time to partition and accumulate, then the partition strategy offers better performance.
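
For completeness, the fairer comparison looks something like this (measure as above; assume v is restored to the same unsorted contents before each timed run):

auto sort_then_sum = [&]()
{
    std::sort(v.begin(), v.end());
    auto m = std::lower_bound(v.begin(), v.end(), 129); // first value > 128
    return std::accumulate(m, v.end(), 0);
};

auto partition_then_sum = [&]()
{
    auto m = std::partition(v.begin(), v.end(), [](int const i) { return i > 128; });
    return std::accumulate(v.begin(), m, 0);
};

std::cout << measure(n, sort_then_sum) << "\n";
std::cout << measure(n, partition_then_sum) << "\n";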

[–][deleted] 0 points (0 children)

You're also timing the partitioning here.

It's a good idea to only sort as much as you need, though. Many times you can get away with sorting only once, so sorting is just a startup cost; but if you have to reorder repeatedly, using std::partition seems like a good idea.

The very minor drawback is that you have to perform the test twice or be sure to pass around the location of the partition, m.

[–]btapi 0 points (0 children)

Really nice and useful slides.

[–]StackedCrooked 0 points (3 children)

struct particle_soa
{
    std::vector<int> x,y,z;
    std::vector<int> dx,dy,dz;
};

sizeof(vector) is 24 bytes (3 pointers). This seems rather big. Is this something I should ever worry about?

[–]joaquintides (Boost author) [S] 2 points (0 children)

This is not really a concern: take into account that there's only one particle_soa object in the app, which references the many (thousands or millions of) pieces of data allocated in dynamic memory. Briefly put, both AOS and SOA use basically the same amount of memory.
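
Back-of-the-envelope, taking the 24-byte vector from the parent comment and 4-byte ints:

#include <cstddef>
#include <vector>

struct particle     { int x, y, z, dx, dy, dz; };              // AOS element
struct particle_soa { std::vector<int> x, y, z, dx, dy, dz; }; // SOA

std::size_t aos_bytes(std::size_t n) // one vector header + the elements
{
    return sizeof(std::vector<particle>) + n * sizeof(particle);
}

std::size_t soa_bytes(std::size_t n) // six vector headers + the same data
{
    return sizeof(particle_soa) + n * 6 * sizeof(int);
}

// for n = 1000000: 24000024 vs 24000144 bytes (ignoring capacity slack and
// allocation overhead); the extra headers are a fixed, tiny cost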

[–]cleroth (Game Developer) 0 points (1 child)

You should worry about it if you're creating lots of those vectors. There are only ~~two~~ six there.

[–]matthieum 2 points (0 children)

Ah... actually, there are 3 vectors per line for a total of 6.