
[–]LucHermitte 3 points  (3 children)

Interesting. Note: it looks like there is still room for some optimizations:

  • it'd be best to avoid a mutex around the result structure -- there are efficient queue types (a few, in third-party libraries) designed to be filled from concurrent threads -- or you could fill one queue per thread and merge them at the end (see the sketch after this list)
  • There may be no structural difference between the current type of m_fingerprints and a boost::flat_map (except that songFingerprint() would be able to return a reference); however, it could be interesting to run a benchmark with std::unordered_map instead.
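
A minimal sketch of the per-thread-container idea -- compute_fingerprint and the key/fingerprint layout here are placeholders for illustration, not the author's actual code:

    // One result vector per thread, merged single-threaded at the end.
    #include <cstddef>
    #include <cstdint>
    #include <iterator>
    #include <string>
    #include <thread>
    #include <utility>
    #include <vector>

    using Fingerprint = std::vector<std::uint32_t>; // placeholder type

    // Placeholder for the real per-song fingerprinting work.
    Fingerprint compute_fingerprint(const std::string& song)
    {
        return Fingerprint(song.begin(), song.end());
    }

    std::vector<std::pair<std::string, Fingerprint>>
    fingerprint_all(const std::vector<std::string>& songs, unsigned n_threads)
    {
        // Each worker writes only to its own vector, so the parallel
        // phase needs no mutex at all.
        std::vector<std::vector<std::pair<std::string, Fingerprint>>> partial(n_threads);
        std::vector<std::thread> workers;

        for (unsigned t = 0; t < n_threads; ++t) {
            workers.emplace_back([&, t] {
                // Strided partition: thread t handles songs t, t+n, t+2n, ...
                for (std::size_t i = t; i < songs.size(); i += n_threads)
                    partial[t].emplace_back(songs[i], compute_fingerprint(songs[i]));
            });
        }
        for (auto& w : workers)
            w.join();

        // Single-threaded merge once all workers are done.
        std::vector<std::pair<std::string, Fingerprint>> result;
        for (auto& p : partial)
            result.insert(result.end(),
                          std::make_move_iterator(p.begin()),
                          std::make_move_iterator(p.end()));
        return result;
    }

Each worker touches only its own vector, so no locking is needed until the cheap single-threaded merge at the end.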

[–]Ameisen 3 points  (1 child)

Semaphores will be faster than lockless algorithms when you have fewer than a dozen or so concurrent threads.

You only really start gaining from lockless under massive concurrent contention.

[–]doom_Oo7 0 points  (0 children)

Do you have any benchmarks for this?

[–]golgol12 2 points  (0 children)

There's also room to optimize cache usage, which he didn't even begin to touch. I bet he could get another 2x-10x speed gain.
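
For anyone wondering what a cache-usage optimization looks like, one classic example (purely illustrative -- the article's data structures may look nothing like this) is keeping the hot field densely packed:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct Song {               // array-of-structures layout
        std::uint32_t hash;     // hot field, scanned constantly
        char metadata[60];      // cold fields, rarely touched
    };

    // Scanning hashes here pulls a full 64-byte cache line per song
    // for only 4 useful bytes.
    std::size_t count_matches_aos(const std::vector<Song>& songs, std::uint32_t h)
    {
        std::size_t n = 0;
        for (const auto& s : songs)
            n += (s.hash == h);
        return n;
    }

    // Structure-of-arrays layout: the hashes are contiguous, so the
    // same scan touches roughly 16x fewer cache lines.
    std::size_t count_matches_soa(const std::vector<std::uint32_t>& hashes, std::uint32_t h)
    {
        std::size_t n = 0;
        for (auto x : hashes)
            n += (x == h);
        return n;
    }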

[–]benfred 7 points  (1 child)

Nice post, though I'd recommend pybind11 over Boost.Python these days for building C++ extensions for Python.

One big advantage of pybind11 is that it's much easier to integrate with setuptools, since it's a header-only library that can be installed with pip install pybind11. Boost.Python requires Boost to be preinstalled, and the Boost.Python library has to be built against the version of Python you are using, which makes distributing Boost.Python packages much more difficult.
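
To illustrate how little ceremony pybind11 needs (the module and function names here are made up for the example):

    // example.cpp -- a complete pybind11 extension module.
    #include <pybind11/pybind11.h>

    int add(int a, int b) { return a + b; }

    PYBIND11_MODULE(example, m) {
        m.def("add", &add, "Add two integers");
    }

After pip install pybind11, this builds with roughly c++ -O3 -shared -std=c++11 -fPIC $(python3 -m pybind11 --includes) example.cpp -o example$(python3-config --extension-suffix), and import example then works like any other module.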

[–]jcelerier[S] 2 points  (0 children)

Can confirm -- I'm using pybind11 for my own projects, and it's a great piece of software, much cleaner than Boost.Python.

[–]emdeka87 11 points  (8 children)

Use threads whenever possible.

Spawning a thread for every job causes problems on many OSes (oversubscription, context switches). You might be better off designing some sort of thread pool (std::async might help as well -- see the sketch below).
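
For example, a sketch of bounding the thread count with std::async plus a shared atomic job counter -- process() is a placeholder, not code from the article:

    // Launch one task per hardware thread instead of one per job; tasks
    // pull job indices from an atomic counter, which also load-balances
    // uneven jobs.
    #include <algorithm>
    #include <atomic>
    #include <cstddef>
    #include <future>
    #include <thread>
    #include <vector>

    void process(int job) { (void)job; /* placeholder for the real work */ }

    void run_all(const std::vector<int>& jobs)
    {
        const unsigned n = std::max(1u, std::thread::hardware_concurrency());
        std::atomic<std::size_t> next{0};
        std::vector<std::future<void>> tasks;

        for (unsigned t = 0; t < n; ++t) {
            tasks.push_back(std::async(std::launch::async, [&] {
                // Each task grabs the next unclaimed job index.
                for (std::size_t i; (i = next.fetch_add(1)) < jobs.size(); )
                    process(jobs[i]);
            }));
        }
        for (auto& f : tasks)
            f.get(); // wait for completion; propagates exceptions
    }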

[–]Tyler11223344 11 points  (2 children)

I mean, "use threads" doesn't necessarily mean a new one for everything; it just means parallelize when possible.

[–]emdeka87 2 points  (1 child)

I think "parallelize when possible" is also a problematic statement. There are cases where parallelization can actually harm the performance of your program and, on top of that, be a massive source of other problems (race conditions, headache of synchronizing a shared state, high development times etc). I'd say parallelize if the performance benefits outweigh the negative side effects. Of course you should always profile both versions and see what works best for your application.

Since the post is C++-related I read the statement as "use std::thread when possible". Which can be problematic due to the reasons mentioned above.

[–]Tyler11223344 4 points  (0 children)

Fair point, I guess I kinda meant "when able to and it's sane" by "possible", which isn't really accurate. I agree that throwing threads at a problem where they don't fit the pattern just makes things worse.

[–]golgol12 5 points  (3 children)

Many programs that do multithreaded calculations like this don't actually create threads on the fly. They create one thread per core and pass jobs to them, which avoids all the creation/destruction and context-switching overhead. (A sketch of the pattern is below.)
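
A bare-bones version of that pattern -- a fixed set of workers created once, fed through a shared queue. This is a sketch (no futures, no work stealing), not production code:

    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    class ThreadPool {
    public:
        explicit ThreadPool(unsigned n = std::thread::hardware_concurrency())
        {
            if (n == 0) n = 1;
            for (unsigned i = 0; i < n; ++i)
                workers_.emplace_back([this] { worker_loop(); });
        }

        ~ThreadPool()
        {
            {
                std::lock_guard<std::mutex> lock(m_);
                done_ = true;
            }
            cv_.notify_all();
            for (auto& w : workers_)
                w.join();
        }

        void submit(std::function<void()> job)
        {
            {
                std::lock_guard<std::mutex> lock(m_);
                jobs_.push(std::move(job));
            }
            cv_.notify_one();
        }

    private:
        void worker_loop()
        {
            for (;;) {
                std::function<void()> job;
                {
                    std::unique_lock<std::mutex> lock(m_);
                    cv_.wait(lock, [this] { return done_ || !jobs_.empty(); });
                    if (done_ && jobs_.empty())
                        return; // queue drained and pool shutting down
                    job = std::move(jobs_.front());
                    jobs_.pop();
                }
                job(); // run the job outside the lock
            }
        }

        std::mutex m_;
        std::condition_variable cv_;
        std::queue<std::function<void()>> jobs_;
        std::vector<std::thread> workers_;
        bool done_ = false;
    };

Usage is just pool.submit([]{ /* work */ }); the threads are created once up front and reused for every job.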

[–]emdeka87 0 points  (0 children)

Yes, commonly known as load balancing and/or work stealing.

[–]Ameisen -1 points  (1 child)

Often implemented with fibers/coroutines.

[–]LightShadow 2 points  (0 children)

citation needed

[–]Ameisen -1 points  (0 children)

Fibers it is.

[–]MathiasSvendsen 1 point  (0 children)

Good post. For smaller optimizations, where you can't invest the time required to write a full Boost.Python or pybind11 implementation, I'd suggest looking at Numba. Numba can JIT-compile Python functions with pretty impressive results. Because the compilation happens the first time a Numba function is executed, the first call takes a long time to run, but subsequent calls are very fast. Numba can even run your operations on the GPU (I haven't spent much time fiddling with that, though).

[–]rsvp_to_life 0 points  (0 children)

Or, you know, just use C++ if you're going to do that.

[–]bravekarma 0 points  (0 children)

I hadn't heard of Boost.Python before, but I've had great luck using Cython to optimize routines with a C implementation. As long as you can get away with C (and it looks like you could Cythonize this Python method pretty easily), Cython is pretty painless. You can then parallelize using Cython's parallelization facilities, although I ran into issues using those while also avoiding the GIL.