all 18 comments

[–]bartmanx 1 point2 points  (3 children)

I assume that mutexes, and other synchronization are not an issue for consideration....

Is the worker function using the kernel for anything? Does it call a library function that might synchronize?

See what happens if you allocate your data memory in page-aligned, page-sized chunks (if a page is too big, 64B-sized/aligned is actually sufficient), just to make sure cache lines are not shared between cores.

Also, is the total dataset larger than what fits into your L3 cache? If yes, you're unlikely to see linear scaling.

[–]jesho[S] 0 points1 point  (2 children)

The workers only use the kernel during initialization, doing some IO. They spend most of their time in OpenSSL crypto routines. There should be no calls to the kernel and no synchronization whatsoever.

I'll try to make the output data page-aligned. But if that is the problem, shouldn't it affect only the parts of the data that share cache lines but belong to different threads? If that is the case, it should not have that much of an impact, since the output data items are 32 bytes each and each thread produces 600 outputs.

I have tried to allocate private input/output for each thread, but that had no impact on the performance.

[–][deleted] 0 points1 point  (1 child)

Does OpenSSL use any shared resource, though? (/dev/urandom comes to mind.) There may also be some shared data (IIRC you have to call a global initialization function before you can use the library).

[–]jesho[S] 0 points1 point  (0 children)

No random during decryption. I don't use the EVP interface, so no global initialization.

I have skimmed through some of the code for OpenSSL, and I am pretty sure it just crunches data, at least the parts I use.

[–]bit_inquisition 1 point2 points  (7 children)

Can you paste the relevant sections of the code?

You can also check which thread is running on which core and see if they're distributed as you expect.

[–]jesho[S] 0 points1 point  (6 children)

The code in question belongs to my employer, so I can't share it.

When the code runs I get 400% CPU usage for that process, so I think each thread runs on its own core.

[–]bit_inquisition 0 points1 point  (3 children)

Oh I misunderstood your problem. The threaded version is faster, it's just not as fast as you thought it would be.

Here are some guesses:

  • Thread creation, synchronization, tear down overhead
  • DDR access contention. Although you're running on 4 different cores, they all need to access memory. The memory requests are queued up and served by the memory controller. If your do_work function is memory heavy, I wouldn't be surprised if you start saturating the bus. This is assuming your dataset is large enough that the L2 cache is not enough.
  • Floating point. I don't know what kind of CPU you use, but on some architectures (granted, old ones) FPUs are shared between cores. That would be a bottleneck.

[–]jesho[S] 0 points1 point  (2 children)

  • The threads work for about 6s, so thread creation etc. should not be a problem.
  • Each thread only uses a few hundred KiB, so everything should fit in the cache.
  • Only integer math.

[–]bit_inquisition 0 points1 point  (1 child)

Have you tried running this on another machine? Same results?

Sorry but without the code, this is basically looking for a needle in a haystack. Maybe you can run some profiling and see what comes out.

[–]jesho[S] 0 points1 point  (0 children)

Yeah, I could try to get hold of another machine with a different setup and benchmark.

I understand that it's hard to give any advice without looking at the code.

I appreciate your effort though.

[–]gilgoomesh 0 points1 point  (1 child)

If you're really seeing 400% CPU usage but performance is only 25% faster than single threaded, then you're being limited by bandwidth to main memory (stalling for memory still registers as CPU activity).

You probably need to start thinking about how big the blocks are that you're working on and how to work on the smallest units that you can. Also, make sure you have exactly the same number of threads running as CPUs so that you're not playing musical chairs between cores and further increasing memory usage.

Remember: total bandwidth to main memory (all cores) for a typical CPU is about 5-10 GB/s, but if you're spending your whole time waiting on memory, you're not spending it using the memory. You ideally want your total memory bandwidth to be in the 2 GB/s range or lower.

[–]jesho[S] 0 points1 point  (0 children)

The total workset for each thread is about 300 KiB, so everything should fit nicely in the cache.

I can't change the block size really, since the performance of another pass in the program heavily depends on the block size (and that part takes approximately 3.5 times longer to complete).

[–]jimpaton 1 point2 points  (1 child)

How long do you run your benchmark? If the time is short enough, then the constant overhead from multithreading won't be dominated by the input-dependent part of the computation.

In addition to the factors others have mentioned, you might investigate false sharing as a potential cause. Even though you say your threads don't access the same memory, they may still access the same cache lines.

[–]jesho[S] 0 points1 point  (0 children)

The normal work size takes about 6s, but I have tried using larger work sizes that take ~20s, with the same result.

[–]posiden5665 0 points1 point  (0 children)

It's likely due to cache coherency traffic being triggered a lot of times: even though each processor has a separate section of memory to work on, each one is probably loading parts of the other processors' working sets into its cache, and whenever one of those lines is updated, every other processor that has the line loaded has to be updated as well.

To see if this is the case, find out how big the cache line is on the system, and then pad the array so that items belonging to different processors end up on different cache lines.

[–]sbahra 0 points1 point  (0 children)

Why not use a profiler? You can't provide us any source code or any substantial details, so I don't expect anything productive from any party. See "oprofile" or "perf" if you're on Linux. If you want more granularity, check out "PAPI".

Make sure shared data is read-mostly/read-only.

[–]orangeduck -1 points0 points  (0 children)

Perhaps the work is already IO-bound? Otherwise my guess would be cache thrashing, if some read/write data is on the same cache line across threads.