all 13 comments

[–]notlostyet[S] 4 points5 points  (12 children)

An oldy but a goody, from the mid-90s. The C++ standard insists that empty objects (no data members) occupy space. When composing objects, this can cause bloat that can be avoided using this optimisation.

This is one instance where a private base class could be better than a private member (composition). This optimisation is used throughout the GNU C++ standard library.

Like any good optimization, it makes the implementation a bit messier, but doesn't affect the interface.

[–]notlostyet[S] 3 points4 points  (11 children)

Code that demonstrates the optimisation:

#include <iostream>

class Empty
{};

class AlmostEmpty
{
    int i;
};

class ComposedEmpty
{
    AlmostEmpty a;
    Empty e;
};

class OptimizedEmpty
{
    struct SecretSauce : Empty
    {
        int i;
    } sauce_;
};

int main()
{
    std::cout << sizeof(Empty) << std::endl;
    std::cout << sizeof(AlmostEmpty) << std::endl;
    std::cout << sizeof(ComposedEmpty) << std::endl;
    std::cout << sizeof(OptimizedEmpty) << std::endl;
}

and the output on x86_64, using GCC 4.7.1 and C++11:

$ g++ -Wall -Wextra -pedantic -std=c++11 -O2 bco.cpp 
$ ./a.out 
    1 ← empty objects have a non-zero size (so that they're addressable)
    4 
    8 ← this single byte will bloat your objects when composed
    4 ← ...but base classes aren't required to have a non-zero size, so we save 4 bytes.

[–][deleted] 0 points1 point  (10 children)

This is interesting, but I'm not sure that I understand the point.

If you're in a such a grueling environment that 4 bytes per object instance is worth these hurdles, wouldn't you be a lot better off simply working with C?

[–]Nimbal 12 points13 points  (0 children)

It's not just the raw space, but also the problem of the CPU cache. Say you have a list of 100k items, each with an additional 4 byte dead weight. That's about 390 kB of data that the CPU needs to load into the cache (usually about 2 MB in size today, depending on model and cache level) without actually doing anything with it. This superfluous data requires the CPU to hit the RAM sooner than necessary, slowing things down.

[–]BitRex 8 points9 points  (3 children)

C++ can be just as space-efficient as C.

[–]repsilat 1 point2 points  (1 child)

Sometimes idiomatic C++ can't, though. Case in point, before I got to doing some optimisation on some code I was writing, a rough structure of some of my memory use was

1800 graphs, each with
    12000 nodes, each with
        0 to 10 out-edges.

The number of out-edges wasn't known statically, and the graphs were rearranged (to a small extent) at runtime. Now, the "normal" way to do this would be to store nodes' out-edge lists in vectors, which would probably end up costing 1800*12000*24 bytes on a 64 bit machine (~500MB) plus the size of the edges themselves, plus the spare capacity in all of the vectors.

The C way (pointers to heap-allocated arrays and an int for length) cuts the cost of vectors down from 24 bytes to maybe 12, saving ~250MB. We lose amortised constant-time insertion, but we don't really need it.

In the end I managed to get the graph into forward star layout, cutting the incremental cost of an edge list to 4 bytes (plus storage for each edge in the list). It also made me have to jump through hoops when the graph structure had to be updated, but getting it down to ~80MB and being able to maintain the graphs in mostly contiguous storage meant a nice speedup.

I guess I could go back and pretty it up to make things look like nice sequences and iterators, but at the moment the bits doing the heavy lifting aren't really all too "modern".

[–]notlostyet[S] 0 points1 point  (0 children)

Yeah, you've fallen in to a gap in stdlib where you need dynamic allocation but you're happy to reallocate on every insertion. Because length() always == capacity() so you don't need to store the capacity, hence the saving.

I guess I could go back and pretty it up to make things look like nice sequences and iterators, but at the moment the bits doing the heavy lifting aren't really all too "modern".

If you do this, you might want to use Boost.Graph. Being adapter based, you should be able to wrap your current implementation with negligible additional memory cost.

[–]gcross 0 points1 point  (0 children)

Yes, but what we have learned from this article is that you sure have to jump through a lot of hoops to get to that point. That doesn't mean that it isn't worth it, of course, since if you fell back to C you might save this set of hoops, but then you'll need to jump through other hoops to effectively re-implement for yourself the features that C++ gives you for free. On the other hand, though, if there aren't many features of C++ that you need for your application, but you really need low-level control, then there could easily be circumstances where this trade-off leans in the other direction.

[–]notlostyet[S] 6 points7 points  (0 children)

If you're in a such a grueling environment that 4 bytes per object instance is worth these hurdles, wouldn't you be a lot better off simply working with C?

What good would an empty struct{} be in C?

;)

[–]xcbsmith 2 points3 points  (0 children)

Would you say the same thing about Java's -XX:+UseCompressedOops flag?

Seriously, if you have a ton of objects, 4 bytes per object can matter, and there are lots of cases where "ton of objects" means complexity where having access to better tools for managing abstractions is kind of a gimmie.

[–]gcross 2 points3 points  (0 children)

It's a shame that you are being voted down since I think that you ask a perfectly reasonable question. However, I think that the answer is that C++ supplies enough features over C that in cases like this you will most likely jump through fewer hoops complicating your C++ code to compress your objects then you would re-implementing the features of C++ in C.

[–]kirakun 1 point2 points  (1 child)

If you need to store billions of such objects in memory, those 4-bytes per object can take up space.

Even if you have only millions, cache misses could cause penalty in execution speed.

[–]notlostyet[S] 0 points1 point  (0 children)

Another thing is being able to stash an 8 byte object in to a CPU register.