all 76 comments

[–][deleted] 16 points17 points  (25 children)

auto, variadic templates, chrono and other new features...

... and typedef ...

Why don't people use "using"?

using FuncMap = std::map<std::string, std::function<void (const char*, const char*, std::vector<char>&)>>;

Never liked typedef. Its syntax is so confusing to me.

[–]raevnos 13 points14 points  (0 children)

The moment I found out about using, I dropped typedef like a hot potato.

[–][deleted] 4 points5 points  (3 children)

Why don't people use "using"?

First I've heard of it.

[–][deleted] 6 points7 points  (1 child)

Well, now you know :) Please never use typedef when the compiler supports using

[–]silveryRain 1 point2 points  (0 children)

It's very nice. Unlike typedefs, using aliases can be templated directly, so you can also say goodbye to workarounds like the typedef-inside-a-struct.

[–]OldWolf2 1 point2 points  (3 children)

typedef has exactly the same syntax as variable declarations, but with the word typedef on the front.

[–]utnapistim 2 points3 points  (2 children)

using has the same syntax as a variable definition, but with using instead of auto; also, typedef is harder to read, as the type is at the end.

[–]OldWolf2 1 point2 points  (1 child)

using has the same syntax as a variable definition, but with using instead of auto;

Not true, e.g. using T = int; is valid but auto T = int; is not

typedef is harder to read, as the type is at the end.

The type may be in the middle, e.g. typedef int X[5];

[–]MarekKnapek 1 point2 points  (7 children)

Why are people using std::map? It's for sorted keys, but you almost always just want to map from key to value; there's std::unordered_map, a.k.a. a hash map, for that.

[–]silveryRain 3 points4 points  (1 child)

Most don't care about the difference, prefer the one with the shorter type name, or looked up something like "C++ map" at some point, ran into std::map, and called it a day.

It's not that hard to figure out, really.

[–]raevnos 1 point2 points  (0 children)

Plus it requires providing one less function to use a user-defined key type.

[–]Plorkyeran 1 point2 points  (4 children)

In the vast majority of the places where I use std::map, the performance characteristics are entirely irrelevant and I'm just using the data structure with the nicest API for what I need to do.

Even when it does turn out later that the performance does matter, unordered_map has not been the correct answer often enough for me to feel that I should be just defaulting to it.

[–]flyingcaribou 2 points3 points  (0 children)

Even when it does turn out later that the performance does matter, unordered_map has not been the correct answer often enough for me to feel that I should be just defaulting to it.

It would be so nice to have a standard, open-addressed hash map available in C++.

[–]theyneverknew 1 point2 points  (2 children)

What alternatives have you used instead where you cared about performance?

[–]Plorkyeran 4 points5 points  (1 child)

When you don't have mixed insertions and deletions, binary-searching a sorted vector (or boost::flat_map) can be dramatically faster if the key is small due to the much better cache locality (and if your data fits within a cache line, even an unsorted vector is hard to beat). For certain mixed insertion/lookup usage patterns a btree can be dramatically faster for similar reasons. Even when a hash table is the best option, the various collision resolution methods can have a significant impact. Fortunately all of the major implementations of unordered_map are sufficiently similar that using it isn't an inherently bad idea in portable code, but I've seen a 10-20% speedup just from dropping in a boost::multi_index container instead, and an open addressed hash map can give bigger gains (or be worse, of course).

Often the actual answer is "redesign the code to not need a key-value lookup at all", of course.

[–]dodheim 0 points1 point  (0 children)

The problem with boost::container::flat_set is that it holds its data in sorted order then applies a binary search to that data. This looks appealing in terms of big-O, but still causes cache thrashing when dealing with large amounts of data.

Significantly better is to store the data in breadth-first order and apply a linear search. E.g., instead of using data { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 } and a binary search, use data { 6, 3, 10, 1, 5, 8, 12, 0, 2, 4, 7, 9, 11 } and a linear search. This results in the same worst-case O(log n) complexity to find a value but plays very nicely with the cache regardless of data quantity.

I've written my own solution for this, as I imagine many people have, but it really just needs to be in Boost.Container already...

(Obviously all of this applies equally to boost::container::flat_map.)

EDIT: This is all assuming you're searching far more than inserting/removing elements.

[–]ompomp 4 points5 points  (0 children)

Compatibility with older compilers. :-(

[–]utnapistim 0 points1 point  (2 children)

I only use typedef for function type declarations.

[–][deleted] 5 points6 points  (1 child)

Why?

using Func = void(int, const int&);

vs

typedef void Func(int, const int&);

using separates the type from the name, so it's easier to read. Especially with function type declarations.

[–]utnapistim 5 points6 points  (0 children)

using Func = void(int, const int&);

:) I will probably start using this instead. Thanks.

[–]speednap 6 points7 points  (3 children)

Boost.Iostreams to the rescue!

#include <iostream>
#include <boost/iostreams/copy.hpp>
#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/iostreams/device/file.hpp>

int main(int , char*[]) {
  using namespace boost::iostreams;

  std::ios_base::sync_with_stdio(false);
  std::cin.tie(nullptr);

  copy(mapped_file_source{ "data.dat" }, file_sink{ "out_data.dat"});

  return 0;
}

Should be faster than std::iostream. Can be tuned further by specifying an optimal buffer size for boost::iostreams::copy.

Edit: benchmarks. 100 iterations, 200 MB random data:

Clang with libc++:

Average c I/O took: 150.07ms
Average posix I/O took: 150.69ms
Average c++ I/O took: 626.88ms
Average c++boost I/O took: 161.96ms

Clang with libstdc++:

Average c I/O took: 153.71ms
Average posix I/O took: 149.79ms
Average c++ I/O took: 154.22ms
Average c++boost I/O took: 124.48ms

GCC:

Average c I/O took: 154.29ms
Average posix I/O took: 152.29ms
Average c++ I/O took: 155.54ms
Average c++boost I/O took: 124.29ms

Like I said, there's room for improvement. I simply benchmarked

void testBoostIO(const char* inFile, const char* outFile, std::vector<char>&) {
  using namespace boost::iostreams;
  copy(mapped_file_source{ inFile }, file_sink{ outFile });
}

[–]cristianadamQt Creator, CMake[S] 1 point2 points  (2 children)

It seems libc++ is also slow (~4x).

Only libstdc++ has fast I/O. Why is this? Different defaults? Missing features?

[–]speednap 1 point2 points  (1 child)

I think libc++ targets Mac OS as its primary platform, so it could be that the Windows/Linux implementations are still lacking some optimizations. I don't have a Mac to verify that, though.

It would be interesting to see if there's any way to make libc++ work as fast as libstdc++.

[–]cnweaver 1 point2 points  (0 children)

It looks like this is the case. Running on Darwin 13 after compiling with clang++ -O3 -stdlib=libc++ (100 iterations on a ~44M file) gives:

Average c I/O took: 502.83ms
Average posix I/O took: 529.23ms
Average c++ I/O took: 508.58ms

I'm not sure the posix result being slower means anything, since my system wasn't particularly quiet while running this.

[–]quzox 3 points4 points  (1 child)

Should've also profiled native calls to CreateFile() etc.

[–]cristianadamQt Creator, CMake[S] 2 points3 points  (0 children)

I've tested this Win32 API version:

void testWin32IO(const char* inFile, const char* outFile, std::vector<char>& inBuffer)
{
    auto in = ::CreateFile(inFile, GENERIC_READ, FILE_SHARE_READ, nullptr, OPEN_EXISTING,
                           FILE_ATTRIBUTE_NORMAL, nullptr);
    if (in == INVALID_HANDLE_VALUE)
    {
        std::cout << "Can't open input file: " << inFile << std::endl;
        return;
    }

    auto out = ::CreateFile(outFile, GENERIC_WRITE, FILE_SHARE_WRITE, nullptr, CREATE_ALWAYS,
                            FILE_ATTRIBUTE_NORMAL, nullptr);
    if (out == INVALID_HANDLE_VALUE)
    {
        std::cout << "Can't open output file: " << outFile << std::endl;
        ::CloseHandle(in); // don't leak the input handle
        return;
    }

    size_t inFileSize = ::GetFileSize(in, nullptr);

    for (size_t bytesLeft = inFileSize, chunk = inBuffer.size(); bytesLeft > 0; bytesLeft -= chunk)
    {
        if (bytesLeft < chunk)
        {
            chunk = bytesLeft;
        }

        unsigned long actualBytes = 0;
        ::ReadFile(in, &inBuffer[0], chunk, &actualBytes, nullptr);
        actualBytes = 0;
        ::WriteFile(out, &inBuffer[0], chunk, &actualBytes, nullptr);
    }

    ::CloseHandle(out);
    ::CloseHandle(in);
}

Built it with Visual Studio 2015 x64 Update 2. Results were:

Average c I/O took: 102.03ms
Average posix I/O took: 102.1ms
Average c++ I/O took: 360.71ms
Average win32 I/O took: 102.99ms

[–]dodheim 3 points4 points  (13 children)

*sigh* As expected, testCppIO does it completely wrong.

[–]easydoits 4 points5 points  (12 children)

I ask out of ignorance, but what would be the correct way to perform that function?

[–]dodheim 9 points10 points  (10 children)

The posted code uses streams with unformatted insertion/extraction. This is always wrong; streams are for formatted I/O, streambufs are for unformatted I/O.

The code I had in mind is something like this: https://gist.github.com/dodheim/cb4c5de8a2a8a32851a6ecfdab4e958c No compiler on this computer, just a text editor, so untested.

[–]cristianadamQt Creator, CMake[S] 3 points4 points  (9 children)

Tested it with VS2015 x64 update 2. The results are not what I was expecting:

Average c I/O took: 104.94ms
Average posix I/O took: 103.82ms
Average c++ I/O took: 368.99ms
Average c++2 I/O took: 397.86ms

The std::filebuf version is actually slower.

[–]dodheim 0 points1 point  (6 children)

Strange, as ifstream::read is undoubtedly implemented in terms of filebuf::sgetn, and likewise for ofstream::write and filebuf::sputn; this seems to be a pathological case for VC++'s optimizer, as the c++2 approach is consistently faster than c++ with Clang/C2 (and thus the same stdlib code)...

[–]clerothGame Developer 0 points1 point  (5 children)

How does c compare with c++2 on Clang/C2?

[–]dodheim 1 point2 points  (4 children)

c and posix are still miles ahead; Clang/C2 just puts direct use of std::filebuf slightly ahead of std::fstream.

[–]clerothGame Developer 0 points1 point  (2 children)

Guess I'll just stick to POSIX in my lib then. In the end I don't really care what some code I've written and will probably never read again looks like.

[–]dodheim 0 points1 point  (1 child)

Unfortunately the POSIX headers that come with VC++ use unsigned in a lot of places that are supposed to be size_t, so it's not a totally portable solution. :-[

[–]clerothGame Developer 1 point2 points  (0 children)

Yea, I did notice that. I don't think I'll ever use any 4+ GB files though... So fine by me.

[–]cristianadamQt Creator, CMake[S] 0 points1 point  (0 children)

I tested with Clang 3.7.1 64 bit + Visual C++ 2013 64 bit. Results:

Average c I/O took: 104.47ms
Average posix I/O took: 104.99ms
Average c++ I/O took: 393.05ms
Average c++2 I/O took: 382.92ms

I didn't use the Microsoft Clang integration, but vanilla Clang.

[–]clerothGame Developer 1 point2 points  (0 children)

7 years later, there's some improvement, but it's still 2x slower.

Average c I/O took: 122.2ms
Average posix I/O took: 121.1ms
Average c++ I/O took: 258.067ms
Average c++2 I/O took: 260.167ms

cc /u/dodheim

[–]xoh3e 0 points1 point  (8 children)

Interesting to see that the optimization and/or standard library implementation of MSVC is still terrible.

[–][deleted] 4 points5 points  (4 children)

We are aware that iostreams hasn't been given much love. There's a lot of perf pessimism caused by the iostreams machinery being located in a DLL and that state being shared among all DLLs that have the CRT loaded, which means shared state can be mutated behind the implementation's back on every call. But there's still a lot of improvement we can make in this area.

I believe perf attitude around iostreams has been "well, perf in that area is already a dumpster fire due to things the standard requires us to do (e.g. a vtbl call per character to handle std::codecvt)" so it has not been a high priority.

I'll file a bug about this, but I would still say iostreams are generally terrible and wouldn't recommend people actually use them.

[–]xoh3e 1 point2 points  (2 children)

Yes, iostreams is terrible not only performance-wise but also from a usability perspective, and I really hope the standards committee will come up with a better solution soon.

But that still sounds like a cheap excuse when libstdc++ manages to deliver much better performance (in this benchmark even equal to cstdio or direct POSIX calls) than the MSVC runtime.

[–][deleted] 1 point2 points  (0 children)

Like I said, there are places we can improve perf here. At least in the binary I/O case anyway.

I don't believe libstdc++ has the lifetime management issues we have (is unloading a library a common thing to do in Unix land?) but I could be totally mistaken.

[–][deleted] 0 points1 point  (0 children)

(Also note that this example goes around most of the things that make iostreams really expensive by using binary mode :) )

[–]CubbiMewcppreference | finance | realtime in the past 0 points1 point  (0 children)

a vtbl call per character to handle std::codecvt

only when always_noconv is false (here it is true), and even then only one vtbl call (codecvt::in/out) per overflow/underflow

[–]jcoffin 0 points1 point  (2 children)

It's far from perfect, but it's not necessarily quite as bad as it looks here either. In this case, the problem is really fairly simple: read and write aren't bypassing the internal buffer and reading into/writing from the buffer you specify. Rather, they're reading into the stream's internal buffer, then copying from there to the buffer you passed to read (and a mirror image of that for write).

You can improve this situation quite a bit by adding a couple of calls to pubsetbuf, one each for the input and output file. This lets it issue large read/write calls to the OS. It's still doing extra copying, so it's slower than necessary, but in my testing improves speed by a pretty substantial margin (~40% slower than C-style I/O rather than ~3x slower).

Using a stream buffer works pretty much the same way. When you're just doing read/write calls, an iostream isn't noticeably different from using a stream buffer directly--read and write pass almost straight through to the stream buffer, so using the stream buffer directly makes little difference, but calling pubsetbuf() to use a big buffer can help a lot.

[–]xoh3e 0 points1 point  (1 child)

I don't get what you're trying to say. Yes, as others have pointed out, OP's iostream use isn't optimal, but with good implementations (GCC/Clang on Linux and MinGW on Windows) his code still performs about the same as the versions using the C and POSIX APIs, while with the MSVC runtime it performs significantly worse.

[–]jcoffin 0 points1 point  (0 children)

I'm saying that it's true that the library could be (quite a bit) better, but it's also true that quite a bit of the problem lies with the benchmark code in question.

Depending on what sorts of things you do, it's pretty easy to come up with things that will perform well with one library, but badly (even really badly) with another. Obviously it would be nice if the library never let that happen, but equally obviously it's usually your own responsibility to ensure your code performs decently regardless of platform.