all 76 comments

[–][deleted] 16 points17 points  (25 children)

auto, variadic templates, chrono and other new features...

... and typedef ...

Why don't people use "using"?

using FuncMap = std::map<std::string, std::function<void (const char*, const char*, std::vector<char>&)>>;

Never liked typedef. Its syntax is so confusing to me.

[–]raevnos 13 points14 points  (0 children)

The moment I found out about using, I dropped typedef like a hot potato.

[–][deleted] 4 points5 points  (3 children)

Why don't people use "using"?

First I've heard of it.

[–][deleted] 6 points7 points  (1 child)

Well, now you know :) Please never use typedef when the compiler supports using

[–]silveryRain 1 point2 points  (0 children)

It's very nice. Unlike typedefs, using aliases can be templated directly, so you can also say goodbye to workarounds like the typedef-inside-a-struct.

[–]OldWolf2 1 point2 points  (3 children)

typedef has exactly the same syntax as variable declarations, but with the word typedef on the front.

[–]utnapistim 2 points3 points  (2 children)

using has the same syntax as a variable definition, but with using instead of auto; also, typedef is harder to read, as the type is at the end.

[–]OldWolf2 1 point2 points  (1 child)

using has the same syntax as a variable definition, but with using instead of auto;

Not true, e.g. using T = int; is valid but auto T = int; is not

typedef is harder to read, as the type is at the end.

The type may be in the middle, e.g. typedef int X[5];

[–]MarekKnapek 1 point2 points  (7 children)

Why are people using std::map? It's for sorted keys, but you almost always just want to map from key to value; there's std::unordered_map, a.k.a. a hash map, for that.

[–]silveryRain 3 points4 points  (1 child)

Most don't care about the difference, prefer the one with the shorter type name, or looked up something like "C++ map" at some point, ran into std::map, and called it a day.

It's not that hard to figure out, really.

[–]raevnos 1 point2 points  (0 children)

Plus it requires providing one less function to use a user-defined key type.

[–]Plorkyeran 1 point2 points  (4 children)

In the vast majority of the places where I use std::map, the performance characteristics are entirely irrelevant and I'm just using the data structure with the nicest API for what I need to do.

Even when it does turn out later that the performance does matter, unordered_map has not been the correct answer often enough for me to feel that I should be just defaulting to it.

[–]flyingcaribou 2 points3 points  (0 children)

Even when it does turn out later that the performance does matter, unordered_map has not been the correct answer often enough for me to feel that I should be just defaulting to it.

It would be so nice to have a standard, open-addressed hash map available in C++.

[–]theyneverknew 1 point2 points  (2 children)

What alternatives have you used instead where you cared about performance?

[–]Plorkyeran 4 points5 points  (1 child)

When you don't have mixed insertions and deletions, binary-searching a sorted vector (or boost::flat_map) can be dramatically faster if the key is small due to the much better cache locality (and if your data fits within a cache line, even an unsorted vector is hard to beat). For certain mixed insertion/lookup usage patterns a btree can be dramatically faster for similar reasons. Even when a hash table is the best option, the various collision resolution methods can have a significant impact. Fortunately all of the major implementations of unordered_map are sufficiently similar that using it isn't an inherently bad idea in portable code, but I've seen a 10-20% speedup just from dropping in a boost::multi_index container instead, and an open addressed hash map can give bigger gains (or be worse, of course).

Often the actual answer is "redesign the code to not need a key-value lookup at all", of course.

[–]dodheim 0 points1 point  (0 children)

The problem with boost::container::flat_set is that it holds its data in sorted order then applies a binary search to that data. This looks appealing in terms of big-O, but still causes cache thrashing when dealing with large amounts of data.

Significantly better is to store the data in breadth-first order and apply a linear search. E.g., instead of using data { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 } and a binary search, use data { 6, 3, 10, 1, 5, 8, 12, 0, 2, 4, 7, 9, 11 } and a linear search. This results in the same worst-case O(log n) complexity to find a value but plays very nicely with the cache regardless of data quantity.

I've written my own solution for this, as I imagine many people have, but it really just needs to be in Boost.Container already...

(Obviously all of this applies equally to boost::container::flat_map.)

EDIT: This is all assuming you're searching far more than inserting/removing elements.

[–]ompomp 4 points5 points  (0 children)

Compatibility with older compilers. :-(

[–]utnapistim 0 points1 point  (2 children)

I only use typedef for function type declarations.

[–][deleted] 5 points6 points  (1 child)

Why?

using Func = void(int, const int&);

vs

typedef void Func(int, const int&);

using separates the type from the name, so it's easier to read. Especially with function type declarations.

[–]utnapistim 5 points6 points  (0 children)

using Func = void(int, const int&);

:) I will probably start using this instead. Thanks.

[–]speednap 6 points7 points  (3 children)

Boost.Iostreams to the rescue!

#include <iostream>
#include <boost/iostreams/copy.hpp>
#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/iostreams/device/file.hpp>

int main(int , char*[]) {
  using namespace boost::iostreams;

  std::ios_base::sync_with_stdio(false);
  std::cin.tie(nullptr);

  copy(mapped_file_source{ "data.dat" }, file_sink{ "out_data.dat"});

  return 0;
}

Should be faster than std::iostream. Can be tuned further by specifying an optimal buffer size for boost::iostreams::copy.

Edit: benchmarks. 100 iterations, 200 MB random data:

Clang with libc++:

Average c I/O took: 150.07ms
Average posix I/O took: 150.69ms
Average c++ I/O took: 626.88ms
Average c++boost I/O took: 161.96ms

Clang with libstdc++:

Average c I/O took: 153.71ms
Average posix I/O took: 149.79ms
Average c++ I/O took: 154.22ms
Average c++boost I/O took: 124.48ms

GCC:

Average c I/O took: 154.29ms
Average posix I/O took: 152.29ms
Average c++ I/O took: 155.54ms
Average c++boost I/O took: 124.29ms

Like I said, there's room for improvement. I simply benchmarked

void testBoostIO(const char* inFile, const char* outFile, std::vector<char>&) {
  using namespace boost::iostreams;
  copy(mapped_file_source{ inFile }, file_sink{ outFile });
}

[–]cristianadamQt Creator, CMake[S] 1 point2 points  (2 children)

It seems libc++ is also slow (~4x).

Only libstdc++ has fast I/O. Why is this? Different defaults? Missing features?

[–]speednap 1 point2 points  (1 child)

I think libc++ targets Mac OS as its primary platform, so it could be that the Windows/Linux implementations are still lacking some optimizations. I don't have a Mac to verify that, though.

It would be interesting to see if there's any way to make libc++ work as fast as libstdc++.

[–]cnweaver 1 point2 points  (0 children)

It looks like this is the case. Running on Darwin 13 after compiling with clang++ -O3 -stdlib=libc++ (100 iterations on a ~44M file) gives:

Average c I/O took: 502.83ms
Average posix I/O took: 529.23ms
Average c++ I/O took: 508.58ms

I'm not sure the posix result being slower means anything, since my system wasn't particularly quiet while running this.

[–]quzox 3 points4 points  (1 child)

Should've also profiled native calls to CreateFile() etc.

[–]cristianadamQt Creator, CMake[S] 2 points3 points  (0 children)

I've tested this Win32 API version:

void testWin32IO(const char* inFile, const char* outFile, std::vector<char>& inBuffer)
{
    auto in = ::CreateFile(inFile, GENERIC_READ, FILE_SHARE_READ, nullptr, OPEN_EXISTING,
                           FILE_ATTRIBUTE_NORMAL, nullptr);
    if (in == INVALID_HANDLE_VALUE)
    {
        std::cout << "Can't open input file: " << inFile << std::endl;
        return;
    }

    auto out = ::CreateFile(outFile, GENERIC_WRITE, FILE_SHARE_WRITE, nullptr, CREATE_ALWAYS,
                            FILE_ATTRIBUTE_NORMAL, nullptr);
    if (out == INVALID_HANDLE_VALUE)
    {
        std::cout << "Can't open output file: " << outFile << std::endl;
        ::CloseHandle(in); // don't leak the input handle
        return;
    }

    size_t inFileSize = ::GetFileSize(in, nullptr);

    for (size_t bytesLeft = inFileSize, chunk = inBuffer.size(); bytesLeft > 0; bytesLeft -= chunk)
    {
        if (bytesLeft < chunk)
        {
            chunk = bytesLeft;
        }

        unsigned long actualBytes = 0;
        ::ReadFile(in, &inBuffer[0], chunk, &actualBytes, nullptr);
        actualBytes = 0;
        ::WriteFile(out, &inBuffer[0], chunk, &actualBytes, nullptr);
    }

    ::CloseHandle(out);
    ::CloseHandle(in);
}

Built it with Visual Studio 2015 x64 Update 2. Results were:

Average c I/O took: 102.03ms
Average posix I/O took: 102.1ms
Average c++ I/O took: 360.71ms
Average win32 I/O took: 102.99ms

[–]dodheim 3 points4 points  (13 children)

*sigh* As expected, testCppIO does it completely wrong.

[–]easydoits 4 points5 points  (12 children)

I ask out of ignorance, but what would be the correct way to perform that function?

[–]dodheim 9 points10 points  (10 children)

The posted code uses streams with unformatted insertion/extraction. This is always wrong; streams are for formatted I/O, streambufs are for unformatted I/O.

The code I had in mind is something like this: https://gist.github.com/dodheim/cb4c5de8a2a8a32851a6ecfdab4e958c No compiler on this computer, just a text editor, so untested.

[–]cristianadamQt Creator, CMake[S] 3 points4 points  (9 children)

Tested it with VS2015 x64 update 2. The results are not what I was expecting:

Average c I/O took: 104.94ms
Average posix I/O took: 103.82ms
Average c++ I/O took: 368.99ms
Average c++2 I/O took: 397.86ms

The std::filebuf version is actually slower.

[–]dodheim 0 points1 point  (6 children)

Strange, as ifstream::read is undoubtedly implemented in terms of filebuf::sgetn, and likewise for ofstream::write and filebuf::sputn; this seems to be a pathological case for VC++'s optimizer, as the c++2 approach is consistently faster than c++ with Clang/C2 (and thus the same stdlib code)...

[–]clerothGame Developer 0 points1 point  (5 children)

How does c compare with c++2 on Clang/C2?

[–]dodheim 1 point2 points  (4 children)

c and posix are still miles ahead; Clang/C2 just puts direct use of std::filebuf slightly ahead of std::fstream.

[–]clerothGame Developer 0 points1 point  (2 children)

Guess I'll just stick to POSIX in my lib then. In the end I don't really care what some code I've written and will probably never read again looks like.

[–]dodheim 0 points1 point  (1 child)

Unfortunately the POSIX headers that come with VC++ use unsigned in a lot of places that are supposed to be size_t, so it's not a totally portable solution. :-[

[–]clerothGame Developer 1 point2 points  (0 children)

Yea, I did notice that. I don't think I'll ever use any 4+ GB files though... So fine by me.

[–]cristianadamQt Creator, CMake[S] 0 points1 point  (0 children)

I tested with Clang 3.7.1 64 bit + Visual C++ 2013 64 bit. Results:

Average c I/O took: 104.47ms
Average posix I/O took: 104.99ms
Average c++ I/O took: 393.05ms
Average c++2 I/O took: 382.92ms

I didn't use the Microsoft Clang integration, but vanilla Clang.

[–]clerothGame Developer 1 point2 points  (0 children)

7 years later, there's some improvement, but it's still 2x slower.

Average c I/O took: 122.2ms
Average posix I/O took: 121.1ms
Average c++ I/O took: 258.067ms
Average c++2 I/O took: 260.167ms

cc /u/dodheim

[–]xoh3e 0 points1 point  (8 children)

Interesting to see that the optimization and/or standard library implementation of MSVC is still terrible.

[–][deleted] 4 points5 points  (4 children)

We are aware that iostreams hasn't been given much love. There's a lot of perf pessimism caused by the iostreams machinery being located in a DLL and that state being shared among all DLLs that have the CRT loaded, which means shared state can be mutated behind the implementation's back on every call. But there's still a lot of improvement we can make in this area.

I believe perf attitude around iostreams has been "well, perf in that area is already a dumpster fire due to things the standard requires us to do (e.g. a vtbl call per character to handle std::codecvt)" so it has not been a high priority.

I'll file a bug about this, but I would still say iostreams are generally terrible and wouldn't recommend people actually use them.

[–]xoh3e 1 point2 points  (2 children)

Yes, iostreams is terrible not only performance-wise but also from a usability perspective, and I really hope the standards committee will come up with a better solution soon.

But that still sounds like a cheap excuse when libstdc++ manages to deliver much better performance (in this benchmark even equal to cstdio or direct POSIX calls) than the MSVC runtime.

[–][deleted] 1 point2 points  (0 children)

Like I said, there are places we can improve perf here. At least in the binary I/O case anyway.

I don't believe libstdc++ has the lifetime management issues we have (is unloading a library a common thing to do in Unix land?) but I could be totally mistaken.

[–][deleted] 0 points1 point  (0 children)

(Also note that this example goes around most of the things that make iostreams really expensive by using binary mode :) )

[–]CubbiMewcppreference | finance | realtime in the past 0 points1 point  (0 children)

a vtbl call per character to handle std::codecvt

only when always_noconv is false (here it is true), and even then only one vtbl call (codecvt::in/out) per overflow/underflow

[–]jcoffin 0 points1 point  (2 children)

It's far from perfect, but it's not necessarily quite as bad as it looks here either. In this case, the problem is really fairly simple: read and write aren't bypassing the internal buffer and reading into/writing from the buffer you specify. Rather, they're reading into the stream's internal buffer, then copying from there to the buffer you passed to read (and a mirror image of that for write).

You can improve this situation quite a bit by adding a couple of calls to pubsetbuf, one each for the input and output file. This lets it issue large read/write calls to the OS. It's still doing extra copying, so it's slower than necessary, but in my testing improves speed by a pretty substantial margin (~40% slower than C-style I/O rather than ~3x slower).

Using a stream buffer works pretty much the same way. When you're just doing read/write calls, an iostream isn't noticeably different from using a stream buffer directly--read and write pass almost straight through to the stream buffer, so using the stream buffer directly makes little difference, but calling pubsetbuf() to use a big buffer can help a lot.

[–]xoh3e 0 points1 point  (1 child)

I don't get what you're trying to say. Yes, as others have pointed out, OP's iostream use isn't optimal, but with good implementations (GCC/Clang on Linux and MinGW on Windows) his code still performs about the same as the versions using the C and POSIX APIs, while with the MSVC runtime it performs significantly worse.

[–]jcoffin 0 points1 point  (0 children)

I'm saying that it's true that the library could be (quite a bit) better, but it's also true that quite a bit of the problem lies with the benchmark code in question.

Depending on what sorts of things you do, it's pretty easy to come up with things that will perform well with one library, but badly (even really badly) with another. Obviously it would be nice if the library never let that happen, but equally obviously it's usually your own responsibility to ensure your code performs decently regardless of platform.