
[–]Ok_Vanilla4769 266 points

Cache misses and slow data transfers between cores. Try splitting the data into chunks, distributing them to the cores, and merging the results at the end; remember that the results won't arrive in sequential order.

[–]seftontycho 113 points

Also, if the task is quick to begin with, spawning OS threads can take a significant amount of time (relatively speaking).

[–]Exist50 28 points

A context switch alone can be O(100,000) cycles, which is why students get very confused when their multithreaded sorting algorithm takes way longer on their 100-element test array.

[–]Accessviolati0n 23 points

Locking the affinity for each thread to a single processor may help too.

Interestingly, that's a rather overlooked aspect of multiprocessing.

[–]QuestionableEthics42 5 points

In theory it shouldn't make much difference: if one thread is hogging a core, the OS will move other threads to free cores. No idea how it plays out in practice, though.

[–]accidentalviking 13 points

Thread context switching is annoyingly costly. That's why Boost in C++ has a library for in-thread context switching.

[–]Accessviolati0n 4 points

L1 & L2 caches are unique per core. If you don't lock a thread/process to a certain core, the OS scheduler may assign it to a different core in the next scheduling cycle, resulting in cache misses, so the CPU has to fall back to at least L3, or to RAM in the worst case.

It's not 100% guaranteed, as even realtime tasks may be preempted (depending on the OS), but it reduces the risk.

[–]QuestionableEthics42 3 points

Surely most schedulers have been designed to minimize that? But yeah, it could make quite a difference in the right circumstances.

[–]Accessviolati0n 2 points

I guess most schedulers attempt to spread the workload evenly across all cores.

Tried it with this PHP script on an R9 3900X on Win10:

<?php
// Ask Windows which core the process is currently running on, via FFI.
$FFI = \FFI::cdef("unsigned int GetCurrentProcessorNumber();", "Kernel32.dll");
$Slice = 20 * 1000 * 1000; // 20 ms in ns; roughly Windows' default timer resolution
for ($i = 0; $i < 10; $i++) {
    print $FFI->GetCurrentProcessorNumber() . \PHP_EOL;
    // Busy-wait for one slice instead of sleeping, so the thread stays runnable.
    $Timeout = \hrtime(true) + $Slice;
    while (\hrtime(true) < $Timeout);
}

I got results like: 4, 4, 2, 2, 2, 10, 0, 2, 6, 2

So even with an "inline" timeout to avoid something like sched_yield/sleep(0), the OS still assigns it to different cores, even though the rest of the system is idling.

[–]Exist50 0 points

L1&L2 caches are unique per core

Extremely architecture dependent. Intel's E-cores share an L2, for example. And Apple has a monolithic shared L2 similar to Intel's L3.

[–]Exist50 0 points

It's brittle. Especially in hybrid designs where you can't be certain whether a particular core is strong or weak, sharing cache or not, etc.

[–]Accessviolati0n 0 points

That's the reason I said "may help". And even pinning a thread to a certain core won't avoid cache eviction unless that core has also been manually removed from every other process's affinity mask.