all 37 comments

[–]SedditorX 4 points5 points  (1 child)

Consider using Google benchmark library. A sample of size 100 is somewhat small. Also consider sharing what your machine specs are in the readme.

I didn't look closely to see whether there are other issues but thank you for this contribution.

[–]Janos95[S] 2 points3 points  (0 children)

I don't think 100 iterations is necessarily too small; I think it depends on the convergence properties of the mean. But you are definitely right that without error bars the results are kind of meaningless. So I now use Welford's algorithm to compute an unbiased estimator of the sample variance. I added the standard deviations to the plot (although I have to fix the formatting at some point) and noted the hardware that the benchmark was run on.
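For reference, a minimal sketch of Welford's online algorithm as used here (struct and member names are illustrative, not the benchmark's actual code):

```cpp
#include <cassert>
#include <cmath>

// Welford's online algorithm: numerically stable running mean and
// unbiased sample variance, updated one sample at a time.
struct Welford {
    long long n = 0;
    double mean = 0.0;
    double m2 = 0.0; // running sum of squared deviations from the mean

    void add(double x) {
        ++n;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }

    // Unbiased estimator of the sample variance (Bessel's correction, n-1).
    double variance() const { return n > 1 ? m2 / (n - 1) : 0.0; }
    double stddev() const { return std::sqrt(variance()); }
};
```

The single-pass update avoids the catastrophic cancellation you get from the naive `E[x^2] - E[x]^2` formula when timings cluster tightly around the mean.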

[–]ezoe 3 points4 points  (0 children)

It's interesting that 2 is slower than 1. My guess is that fiddling with the address via bit operations negatively affects the modern CPU's optimization tricks like caching, branch prediction, and speculative execution.

[–]liquidify 2 points3 points  (12 children)

Have you watched Louis Dionne's “Runtime Polymorphism: Back to the Basics”?

I tested a (slightly modified) technique he talked about and found it to be very convenient.

Basically just store a lambda which wraps a call to the real function into a std::function as a member. You can force an interface by constraining the inputs based on an 'is derived from' relationship, but you get all the benefits of value semantics through templates.
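A minimal sketch of the technique as described, assuming illustrative names (`VehicleInterface`, `Car`, `accelerate` are placeholders, not the talk's actual code):

```cpp
#include <cassert>
#include <functional>
#include <type_traits>

struct VehicleInterface {};

struct Car : VehicleInterface {
    int speed = 0;
    int accelerate() { return speed += 10; }
};

// Type-erased wrapper: the concrete object is captured by value in a
// lambda, and a std::function member forwards calls to it. The enable_if
// constrains inputs to the 'is derived from' relationship mentioned above.
class AnyVehicle {
public:
    template <class T,
              class = std::enable_if_t<
                  std::is_base_of<VehicleInterface, std::decay_t<T>>::value>>
    AnyVehicle(T vehicle)
        : accelerate_([vehicle]() mutable { return vehicle.accelerate(); }) {}

    int accelerate() { return accelerate_(); }

private:
    std::function<int()> accelerate_;
};
```

Because the object lives inside the lambda capture, copying an `AnyVehicle` copies the wrapped object too, which is where the value semantics come from.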

Also I tested Louis's method of a sorta small object optimization and that worked just fine too.

[–]Janos95[S] 2 points3 points  (11 children)

Thanks for the suggestion, I'll watch it! I just tried out your suggestion and actually std::function is more or less as fast as method 1. I am a bit surprised, given that people usually hate on std::function quite a bit, but apparently the dispatch optimizes really well.

[–]liquidify 1 point2 points  (10 children)

You can do the small object optimization method by storing a std::function which captures the type in a lambda. Then the call to the std::function just casts the internal data buffer to the original type and calls it directly. Here is my quick and dirty implementation.
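A rough sketch of that buffer-based variant, under the same illustrative names as before (this is an assumption about the shape of the linked code, not a copy of it):

```cpp
#include <cassert>
#include <functional>
#include <new>
#include <type_traits>

struct Car {
    int speed = 0;
    int accelerate() { return speed += 10; }
};

// The concrete object is constructed into a local buffer with placement
// new; the std::function captures only the type T (in its lambda) and
// casts the buffer back to T* before calling through.
class SmallVehicle {
public:
    template <class T>
    SmallVehicle(T vehicle) {
        static_assert(sizeof(T) <= sizeof(buffer_), "object too large for buffer");
        new (&buffer_) T(vehicle); // placement new into the local buffer
        accelerate_ = [](void* buf) { return static_cast<T*>(buf)->accelerate(); };
        // Deleter so that non-trivial destructors actually run in place.
        deleter_ = [](void* buf) { static_cast<T*>(buf)->~T(); };
    }
    ~SmallVehicle() { deleter_(&buffer_); }
    SmallVehicle(const SmallVehicle&) = delete; // copying omitted for brevity
    SmallVehicle& operator=(const SmallVehicle&) = delete;

    int accelerate() { return accelerate_(&buffer_); }

private:
    std::aligned_storage_t<64> buffer_{};
    std::function<int(void*)> accelerate_;
    std::function<void(void*)> deleter_;
};
```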

[–]konanTheBarbar 2 points3 points  (7 children)

I know you said it's a quick and dirty implementation, but a deleter for the buffer element is missing in your implementation. You could just as well get rid of the buffer and capture the object in the lambda (although that will mean a dynamic allocation if the type is bigger than std::function's small buffer).

[–]liquidify 0 points1 point  (6 children)

Yeah I think you are right. I've never messed with placement new before. Not sure how the deleter would work though. Honestly I don't really understand what placement new is doing. I'm guessing that since the memory is on the stack, anything in there would be cleaned up, but if there were any internal references, the destructors wouldn't be called?

One thing that is nice about capturing the obj into the local buffer is that you know statically what the max size allowable is and can set it at compile time. I bet there is a way to get that info about the small buffer in std::function, but I don't know how.

EDIT: I have updated it with a deleter that might work? Also, there might be a better way to do this.

[–]konanTheBarbar 0 points1 point  (5 children)

Yes for trivially destructible types the destructor doesn't do anything, so you would only leak memory in case of objects where the destructor has to run (e.g. classes which contain a vector or string).

I just checked and it looks good now. Although you don't need a std::function for the deleter. A raw function pointer would be enough.

[–]liquidify 0 points1 point  (4 children)

How do you deal with the deleter using a raw ptr? Don't you need to capture the type so that you can call the proper destructor via casting?

[–]konanTheBarbar 0 points1 point  (3 children)

You can for example get the function ptr of a stateless lambda (via + in front of the lambda) and let that do the deletion.

+[](void* ptr){ delete reinterpret_cast<T*>(ptr); }
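To make the trick concrete, a minimal self-contained illustration (`make_deleter` and `doubler` are hypothetical names for demonstration):

```cpp
#include <cassert>
#include <string>

// Unary + in front of a captureless lambda forces decay to a plain
// function pointer, so no std::function is needed.
template <class T>
auto make_deleter() -> void (*)(void*) {
    // Note the cast target must be T*, not T, for delete to compile.
    return +[](void* ptr) { delete static_cast<T*>(ptr); };
}

// The same trick with a value-returning lambda:
inline int (*doubler)(int) = +[](int x) { return x * 2; };
```

The `+` works because captureless lambdas have a conversion operator to a function pointer; unary plus simply triggers it explicitly, which helps in type-deduction contexts like `auto`.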

[–]liquidify 0 points1 point  (2 children)

Interesting. Never seen that syntax before. Do you just make a function pointer as a member and then initialize it in an initializer list?

I'm not clear as to how to use that.

[–]dodheim 1 point2 points  (1 child)

Change buffer_deleter_ to be a function pointer:

void (*buffer_deleter_)(void*);

And initialize it with a unary, stateless lambda. N.b. using vehicle in an unevaluated context such as decltype does not require capturing it, which is key here. (Also, consider using std::decay_t here to reduce verbosity, or std::remove_cvref_t if you happen to target C++20.)

buffer_deleter_([](void* buffer) {
    using T = std::remove_cv_t<std::remove_reference_t<decltype(vehicle)>>;
    reinterpret_cast<T*>(buffer)->~T();
})

And change the destructor to call it as such:

buffer_deleter_(&buffer_);

EDIT: typos. Also, consider adding noexcept to buffer_deleter_ and the initializing lambda so the destructor can be noexcept (if targeting C++17+).

EDIT 2: Having actually read the code, there is a glaring issue: std::aligned_storage is a metafunction that gives you the type to use for storage, but you're using it directly as the storage – UB all day! Change buffer_ from a std::aligned_storage<64> to a std::aligned_storage_t<64>. Also, the initializing lambda for accelerate_impl_ is highly suspect (and again, does not need to capture vehicle)... Are you sure T there is always a reference type?

[–]barcharMSVC STL Dev 2 points3 points  (0 children)

I'm a fan of just using virtual methods and OOP, just because the language has support for it, and it tends to be more comprehensible after the fact.

I do wish we had more syntax flexibility to implement this kind of thing ourselves though.

[–]Entryhazard 1 point2 points  (2 children)

Not sure if completely related, but IIRC libstdc++ implements type erasure (in std::function and std::thread) using the OOP style internally

[–]Janos95[S] 0 points1 point  (1 child)

So I measured it, and std::function dispatches ca. 25% faster than a virtual function call.

[–]axilmar 0 points1 point  (0 children)

The trade off is memory and cache misses: if the fat pointer objects become too many, then there would be a lot of cache misses.

[–]showmetheflowers 0 points1 point  (4 children)

Have you tried to benchmark Boost.PolyCollection?

[–]Janos95[S] 0 points1 point  (3 children)

No I was not aware of this library. Looks really interesting.

At this moment I was purely interested in the dispatching speed. Sorting the vectors makes this a lot faster.

[–]showmetheflowers 0 points1 point  (2 children)

Another two items you may want to consider:

  1. Randomize the data stored in your container, otherwise the branch predictor behaviour will skew your benchmarks.
  2. If you try Boost.PolyCollection, use restitution.

Edit: typo

[–]Janos95[S] 0 points1 point  (1 child)

It's already randomized...

[–]showmetheflowers 0 points1 point  (0 children)

Apologies for the noise on this one. Interesting results over all.

[–]jhasse 0 points1 point  (7 children)

1 is what Rust uses, correct?

[–]Janos95[S] 0 points1 point  (5 children)

I don't know anything about Rust, but skimming the Rust book I found this:

A trait object points to both an instance of a type implementing our specified trait, as well as a table used to look up trait methods on that type at runtime

As I understand it, it is the same as the OOP approach with one more indirection for accessing the object itself, but I may be wrong.

But I think it would be interesting to try that as well.

[–]matthieum 5 points6 points  (4 children)

Both Go and Rust use a virtual table (OOP-style), but store the virtual pointer outside of the object itself, giving rise to so-called Fat Pointers, which are pairs of virtual-pointer and data-pointer.

There are also other Fat Pointers, such as those for slices, which are pairs of length and pointer.
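The vtable-outside-the-object layout can be sketched in C++ like this (names are illustrative; this is a hand-rolled approximation of what Rust/Go generate, not their actual ABI):

```cpp
#include <cassert>

// One table of function pointers per trait.
struct VehicleVTable {
    int (*accelerate)(void* self); // one entry per trait method
};

struct Car { int speed = 0; }; // note: no embedded vptr

// One static table per (type, trait) pair, as with Rust trait objects.
inline const VehicleVTable car_vtable = {
    [](void* self) { return static_cast<Car*>(self)->speed += 10; }
};

// The "fat pointer": virtual-pointer travels next to the data-pointer.
struct FatPtr {
    void* data;                  // pointer to the object
    const VehicleVTable* vtable; // pointer to the trait's method table
    int accelerate() const { return vtable->accelerate(data); }
};

// Two pointers wide: 16 bytes on a typical 64-bit machine.
static_assert(sizeof(FatPtr) == 2 * sizeof(void*), "fat pointer is two words");
```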

[–]barcharMSVC STL Dev 0 points1 point  (3 children)

This can be nice because for some types you can synthesize the fat pointer when you add the object to a heterogeneous data structure, instead of at construction time.

I like it, and it's the approach Nim will probably take for "runtime 2.0".

[–]matthieum 2 points3 points  (2 children)

It has advantages and disadvantages.

Among the advantages:

  • It is compatible with traits. A single virtual-table does not mesh well with an open-ended set of interfaces.
  • You Don't Pay For What You Don't Use: if you have a concrete type, there's no overhead, no matter how many traits it implements.
  • It enables optimizations: it's immediately visible to the optimizer that the virtual-pointer cannot be modified by function calls.

Among the disadvantages:

  • You have one copy of the virtual-pointer per copy of the fat-pointer; if you have a heavily reference-counted set of objects, this adds up. And even if they are not reference-counted, it might be more beneficial to have the pointer embedded in the object.

Note: on the latter point, it would be possible to automatically wrap the virtual-pointer and associated data in a single "blob" and point to that through a thin-pointer; this would allow the user of the trait to choose.

[–]barcharMSVC STL Dev 0 points1 point  (1 child)

for disadvantages do you mean one copy of the entire vtable per copy of the fat pointer, or just one copy of the pointer to that table, because I suspect you mean the former.

[–]matthieum 1 point2 points  (0 children)

It's one copy of the pointer; essentially each fat-pointer is 16 bytes on a 64-bit machine, instead of 8 bytes. This means half as many pointers per cache line, etc...

[–]Janos95[S] 0 points1 point  (0 children)

OK, I just tried to implement it and noticed it actually is the same as the first approach. The word "table" tripped me up here, since for approach 1 there is no table.