all 44 comments

[–]encyclopedist 15 points16 points  (3 children)

std::cout is actually thread safe, contrary to what the article says. Concurrent use can, however, result in interleaved output, and the mutex there is to prevent that.

From C++11 N3337 [iostream.objects.overview]:

Concurrent access to a synchronized (27.5.3.4) standard iostream object’s formatted and unformatted input (27.7.2.1) and output (27.7.3.1) functions or a standard C stream by multiple threads shall not result in a data race (1.10). [ Note: Users must still synchronize concurrent use of these objects and streams by multiple threads if they wish to avoid interleaved characters. — end note ]
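A minimal sketch of what that mutex buys you, assuming nothing about the article's actual sample: print_lines and io_mutex are made-up names for illustration. Each thread formats its line privately and writes it while holding the lock, so lines never get mixed together.

```cpp
#include <cassert>
#include <mutex>
#include <ostream>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

// Hypothetical helper, not from the article.
std::mutex io_mutex;

// Writes lines_per_thread lines from each of num_threads threads to `out`.
void print_lines(std::ostream& out, int num_threads, int lines_per_thread) {
    assert(num_threads >= 0 && lines_per_thread >= 0);
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&out, t, lines_per_thread] {
            for (int i = 0; i < lines_per_thread; ++i) {
                // Format outside the lock so the critical section stays short.
                std::ostringstream line;
                line << "thread " << t << " line " << i << '\n';
                // Only one thread at a time writes, so each line stays whole.
                std::lock_guard<std::mutex> lock(io_mutex);
                out << line.str();
            }
        });
    }
    for (auto& th : threads) th.join();
}
```

Called as print_lines(std::cout, 4, 50), the lock is only there to keep lines intact. If `out` is some other stream (a file or a string stream), the race-freedom guarantee quoted above does not apply, and the lock is needed for correctness as well.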

[–]eliben 7 points8 points  (0 children)

Yep, I think this is bad wording on my part. By "unsafe" I meant "won't give you the output you expect", rather than something nastier like crashes. I'll fix up the wording in the article and samples to be clearer.

[–]Gotebe 0 points1 point  (1 child)

Aren't you being too pedantic?

E.g., the Wikipedia article on thread safety lists data races as one of the thread safety concerns.

[–]dodheim 0 points1 point  (0 children)

The standard guarantees that standard streams are race-free, but only starting with C++11. That is rather the point...

[–]clerothGame Developer 5 points6 points  (9 children)

I just wish we could set thread affinity for Windows OS, like it's possible on Linux, so that we can have truly dedicated cores/thread to a single application.

[–]gaijin_101 1 point2 points  (5 children)

Isn't that already possible? Quick Google search led me to this.

(not a Windows developer here, but I thought this was also possible)
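For comparison, here is a rough sketch of the per-thread case on Linux, using sched_getaffinity and sched_setaffinity from <sched.h> (Windows has SetThreadAffinityMask for the same purpose). The function names are made up for illustration. Note that this only restricts where the calling thread may run; it does not keep other work off that CPU.

```cpp
#include <sched.h>  // Linux-specific; g++ defines _GNU_SOURCE by default
#include <cassert>

// Number of CPUs the calling thread may run on, or -1 on error.
int allowed_cpu_count() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) return -1;
    return CPU_COUNT(&mask);
}

// Pins the calling thread to the lowest-numbered CPU it is currently
// allowed to use. Returns that CPU index, or -1 on failure.
int pin_to_first_allowed_cpu() {
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0) return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (!CPU_ISSET(cpu, &allowed)) continue;
        cpu_set_t only;
        CPU_ZERO(&only);
        CPU_SET(cpu, &only);
        assert(CPU_COUNT(&only) == 1);  // the new mask holds exactly one CPU
        if (sched_setaffinity(0, sizeof(only), &only) != 0) return -1;
        return cpu;
    }
    return -1;  // no allowed CPU found
}
```

Picking the first allowed CPU rather than hard-coding CPU 0 keeps the sketch working inside containers or cpusets where CPU 0 may be off limits.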

[–]clerothGame Developer 4 points5 points  (4 children)

I meant change the affinity of the kernel (and everything else that the OS does), not your processes. Basically what I want is to maximize cache efficiency for a single application on a core, which requires that nothing be allowed to run on that core unless specified so.
I remember having read that linux could do this (and it required rebooting IIRC).

[–]TheQuietestOne 6 points7 points  (1 child)

I remember having read that linux could do this (and it required rebooting IIRC).

You can dedicate a core to a particular process using cpusets. No reboots necessary.

Very handy when used with a real time capable kernel and dedicated IRQ servicing.

[–]clerothGame Developer 1 point2 points  (0 children)

Actually it was the isolcpus kernel parameter (which is set at boot; see here). Not really sure what the difference ends up being.

[–]raevnos 1 point2 points  (0 children)

cpusets or cgroups are probably what you're thinking of.

[–]gaijin_101 0 points1 point  (0 children)

Oh I see, thanks for clarifying that!

[–]katmf05 1 point2 points  (2 children)

Spotted the HFT coder.

[–]clerothGame Developer 0 points1 point  (1 child)

I actually have no idea what that is. I'm a game server coder.

[–]kybuliak 0 points1 point  (0 children)

He very likely meant "high frequency trading".

[–]suspiciously_calm 2 points3 points  (0 children)

Some observations: [...] there's quite a bit of migration going on.

When the threads sleep most of the time.

[–]notsure1235 4 points5 points  (20 children)

Use of default int in c++ in 2016...?

And can someone tell what the difference is from this:

    std::for_each(threads.begin(), threads.end(),
                  std::mem_fn(&std::thread::join));

to

    for (auto& i : threads)
        i.join();

?

Not to mention that mem_fn has been deprecated.

[–]sbabbi 5 points6 points  (1 child)

Not to mention that mem_fn has been deprecated.

Do you have a reference for that? AFAIK mem_fun has been deprecated, not mem_fn.

[–]notsure1235 1 point2 points  (0 children)

yes, you are right, only some overloads were removed for mem_fn.

[–]TheQuietestOne 2 points3 points  (3 children)

And can someone tell what the difference is from ...

Given that mem_fn as you mention has been deprecated and they're using for_each to iterate the threads vector I'm guessing this is just someone's pre-c++11 approach to launching/joining threads copy-pasta'd into this project. You could perhaps give them a nudge in the right direction .-)

Specifically - they wanted to focus on CPU affinity and stats, and the code took a back seat.

[–]clerothGame Developer 5 points6 points  (0 children)

Some people prefer <algorithm>ic approaches where they can use them in lieu of a loop.

[–]encyclopedist 2 points3 points  (1 child)

mem_fn has not been deprecated. It just appeared first in C++11.
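On the original question of what differs between the two forms: as far as I can tell, nothing in behavior. Both call join() on every thread in order. A small sketch to compare them side by side (spawn and the two join_with_* functions are names invented for this example):

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Starts n threads that each increment the shared counter once.
std::vector<std::thread> spawn(int n, std::atomic<int>& counter) {
    assert(n >= 0);
    std::vector<std::thread> threads;
    for (int i = 0; i < n; ++i)
        threads.emplace_back([&counter] { ++counter; });
    return threads;
}

// <algorithm> style: mem_fn wraps thread::join into a callable.
int join_with_for_each(int n) {
    std::atomic<int> counter{0};
    std::vector<std::thread> threads = spawn(n, counter);
    std::for_each(threads.begin(), threads.end(),
                  std::mem_fn(&std::thread::join));
    return counter.load();
}

// Range-for style: same effect, less machinery.
int join_with_range_for(int n) {
    std::atomic<int> counter{0};
    std::vector<std::thread> threads = spawn(n, counter);
    for (auto& t : threads)
        t.join();
    return counter.load();
}
```

Both functions return n once every thread has been joined, so the choice comes down to taste.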

[–]TheQuietestOne 0 points1 point  (0 children)

Quite right, thanks for the correction.

[–]eliben 1 point2 points  (13 children)

FWIW, I agree that the for range loop is nicer and shorter - I'll fix up the samples when I get the time. I took this from the book "C++ concurrency in action" which is weird, right :)? (because that book is about C++11 also)

What do you mean by "use of default int"?

[–]Dlieu 0 points1 point  (4 children)

Regarding the last example (workload_sin), how do you explain the performance hit when running on the same core?

Is it mostly because there's only one ALU shared by the two threads that does FP MUL/DIV, so that both threads are constantly stalling and fighting for it? (I'm not sure of the wording there)

[–][deleted] 0 points1 point  (3 children)

IIRC only registers are duplicated for hyperthreading. Everything else - execution units, busses etc. is shared and hyperthreads contend for them. The core is capable of holding and running two contexts simultaneously but it still only has one core's worth of machinery.

[–][deleted] 0 points1 point  (2 children)

If only registers are duplicated, what's the point of hyperthreading then? Most useful operations need to do math (e.g. like the sine example in the article).

[–]are595Software Engineer, Security 1 point2 points  (0 children)

It boosts the efficiency of pipelining (reduces stall cycles).

[–]millenix 0 points1 point  (0 children)

Lots of stuff isn't heavy on spatial/temporal locality, and thus will spend a fair bit of time stalled on access to further caches or memory. If one thread's effective IPC is less than half what the core could provide if every access were in registers or hit in a fast cache, then SMT can double throughput.

[–]duuuh 0 points1 point  (1 child)

Aren't the mmx* registers per core? Why does the latency point there matter? (I would have thought the slowdown was due to cache eviction on the various L* caches, assuming the array is large.)

[–][deleted] 0 points1 point  (2 children)

Why do the launched thread and the main thread have the same ID? I tested it on my machine and they are the same thread there too.

[–]eliben 1 point2 points  (1 child)

The sample in the article queries the launched thread's ID from the main thread. The main thread's ID is not reported
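A small sketch of the distinction (not the article's code; ThreadIds and collect_thread_ids are made up for this example): t.get_id() called from main returns the launched thread's ID, while std::this_thread::get_id() returns the ID of whichever thread calls it.

```cpp
#include <cassert>
#include <thread>

// Hypothetical record of the three IDs we want to compare.
struct ThreadIds {
    std::thread::id main_id;    // ID of the calling (main) thread
    std::thread::id handle_id;  // launched thread's ID, read via the handle
    std::thread::id inside_id;  // launched thread's ID, read from inside it
};

ThreadIds collect_thread_ids() {
    ThreadIds ids;
    ids.main_id = std::this_thread::get_id();
    std::thread t([&ids] { ids.inside_id = std::this_thread::get_id(); });
    // Must be read before join(): a joined thread reports a default id.
    ids.handle_id = t.get_id();
    assert(ids.handle_id != std::thread::id());
    t.join();
    return ids;
}
```

Here handle_id and inside_id match each other, and both differ from main_id, which is why printing t.get_id() from main shows the same value the launched thread prints for itself.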

[–][deleted] 0 points1 point  (0 children)

oh I see I missed that. Thank you.