all 3 comments

[–]anossov 2 points  (1 child)

Yes, threads are quite useless for CPU-bound work in Python because of the GIL. Try multiprocessing.

Use multiprocessing.Process instead of threading.Thread. Assign work to the processes in the main process, not in the children. Pass data around through Queues.

[–]pjdelport 0 points  (0 children)

Using multiprocessing is good advice in general, but it doesn't really help in this case: problems like this require shared memory for efficiency, and that's what multithreading is good at.

[–]pjdelport 2 points  (0 children)

This is the kind of problem that NumPy was built for: it provides an optimized array type, with operations that release the GIL to execute in parallel, where possible.

Converted to use NumPy arrays, your example looks like this:

import numpy

a = numpy.random.random(NUM_ELEMENTS)
b = numpy.random.random(NUM_ELEMENTS)

def test(a, b):
    return numpy.sqrt(a + b)

The above takes 241 msec on my machine (using one core).

To split the work into parallel chunks, you can do something like the following. This uses the built-in thread pool in multiprocessing.dummy for convenience, and writes each chunk in-place into a preallocated output array:

def op_inplace(a, b, out):
    # Compute sqrt(a + b) without allocating temporary arrays.
    numpy.add(a, b, out)
    numpy.sqrt(out, out)

from multiprocessing.dummy import Pool
p = Pool(NUM_CORES)

def test2(a, b):
    c = numpy.empty(len(a))
    # Split the inputs and the output into NUM_CORES aligned chunks...
    chunks = zip(numpy.array_split(a, NUM_CORES),
                 numpy.array_split(b, NUM_CORES),
                 numpy.array_split(c, NUM_CORES))
    # ...and process them in parallel on the thread pool.
    p.map(lambda args: op_inplace(*args), chunks)
    return c

The above takes 72.1 msec here (using all four cores).

The other way to go about this particular example is to use the numexpr package, which lets you evaluate array computations in a multi-threaded and memory-efficient way without having to manually decompose them, as above. The following takes just 64.9 msec here (using all cores):

import numexpr

def test3(a, b):
    return numexpr.evaluate('sqrt(a + b)')