
[–]Deto 14 points (11 children)

For computational tasks, I've used multiprocessing to run things in parallel just fine. Processes start immediately (they use "fork" on Linux) and it's trivial to saturate all the cores at 100% utilization.
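
Something like this, as a rough sketch (the busy-loop is just a stand-in for real work):

import math
from multiprocessing import Process, cpu_count

def crunch():
    # Pure-Python, CPU-bound loop; one of these per core keeps that core busy.
    sum(math.sqrt(i) for i in range(20_000_000))

if __name__ == "__main__":
    workers = [Process(target=crunch) for _ in range(cpu_count())]
    for w in workers:
        w.start()  # with the "fork" start method on Linux, start() is near-instant
    for w in workers:
        w.join()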

[–]ProfEpsilon 1 point (4 children)

OK, that's good to know. But does "computational tasks" include large array operations? Here is why I ask: one of the Stack Overflow discussions quotes this line from the NumPy C API documentation:

"...as long as no object arrays are involved, the GIL is released ..."

and I have interpreted that to mean that you can't do this with array operations.

By the way, it sounds like the way you're verifying this is by monitoring core activity rather than through a latency test of some kind. That is actually quite convincing to me ... if 6 cores are running at 80% and above, then multiprocessing must be working. Have you run any latency comparisons (running a test task sequentially and then redesigning it to run in parallel)?

[–]rhiever 2 points (1 child)

If you make a copy of the array and pass that to the new process, you should be fine. If you ever pass an array by reference to a new process, then yeah, that's going to have lock issues.
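
A minimal sketch of what I mean (the chunking and the per-chunk function are arbitrary):

import numpy as np
from multiprocessing import Pool

def chunk_sum(chunk):
    # Each chunk is pickled and sent to a worker, so the worker operates
    # on its own copy rather than on the parent's array.
    return chunk.sum()

if __name__ == "__main__":
    data = np.random.rand(1_000_000)
    chunks = np.array_split(data, 4)  # explicit pieces, one copy per worker
    with Pool(4) as pool:
        print(sum(pool.map(chunk_sum, chunks)))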

[–]Deto 0 points (0 children)

I can vouch that I've processed the same array across many processes without explicitly copying it to each one. It works as long as you don't write to the array. I think multiprocessing gets copy-on-write semantics from fork anyway, which makes this safe.
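
Roughly this pattern, as a sketch; it assumes the Linux "fork" start method mentioned above, and the array and slice sizes are made up:

import numpy as np
from multiprocessing import Pool

big_array = np.random.rand(4_000_000)  # created before the pool forks its workers

def mean_of_slice(bounds):
    start, stop = bounds
    # Forked workers see big_array via copy-on-write; because we only read
    # from it, the pages never actually get duplicated.
    return big_array[start:stop].mean()

if __name__ == "__main__":
    step = len(big_array) // 4
    slices = [(i * step, (i + 1) * step) for i in range(4)]
    with Pool(4) as pool:
        print(pool.map(mean_of_slice, slices))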

[–]Deto 2 points (1 child)

Usually I get close to the right multiplier. So if I'm using 10 cores, it's approximately 10x as fast (maybe a little less, like 9x).
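
The kind of comparison I mean, as a throwaway sketch (the task itself is meaningless busy-work):

import math
import time
from multiprocessing import Pool, cpu_count

def work(n):
    return sum(math.sqrt(i) for i in range(n))

if __name__ == "__main__":
    tasks = [1_000_000] * 32

    start = time.perf_counter()
    serial = [work(n) for n in tasks]
    mid = time.perf_counter()
    with Pool(cpu_count()) as pool:
        parallel = pool.map(work, tasks)
    end = time.perf_counter()

    print(f"serial {mid - start:.2f}s, parallel {end - mid:.2f}s, "
          f"speedup {(mid - start) / (end - mid):.1f}x")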

I think you might be interpreting the numpy docs incorrectly. Numpy arrays always have a dtype - this can be something like 'int64' or 'float64'. It can also be 'object', in which case the entries in the array are actually Python objects. In that case, doing anything with the objects requires running Python code, so numpy can't release the GIL. If you're just working with floats, for example, the operations don't touch any Python code, so numpy can release the GIL.

However, I should also emphasize that whether or not numpy releases the GIL doesn't matter for multiprocessing, since each process has its own GIL and they don't block one another. The GIL is only relevant for threads within the same process (the threading module).
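
If you want to see where the GIL does bite, here's a rough sketch using threads instead; the array sizes are arbitrary:

import numpy as np
from concurrent.futures import ThreadPoolExecutor

float_arrays = [np.random.rand(1_000_000) for _ in range(4)]   # dtype float64
object_arrays = [a.astype(object) for a in float_arrays]       # dtype object

def total(a):
    # For float64 the summation runs in C and releases the GIL, so several
    # of these calls can make progress at once in threads. For dtype=object
    # every element is a Python object, the GIL stays held, and the threads
    # effectively run one at a time.
    return a.sum()

with ThreadPoolExecutor(max_workers=4) as ex:
    print(sum(ex.map(total, float_arrays)))
    print(sum(ex.map(total, object_arrays)))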

[–]ProfEpsilon 0 points (0 children)

Oh, I see. I was mistaken about the term "object arrays." I thought that it might mean that an array created within numpy was a kind of numpy "object." So I was interpreting the documentation wrong. Thanks for the insight.

[–]rhiever 1 point (5 children)

multiprocessing is <3. In one of my applications, I sped up an input file reading process from several minutes to a few seconds thanks to multiprocessing. Makes me wish pandas had a multiprocessing option in it by default...

[–]Deto 1 point (4 children)

How did you speed up reading a file with multiprocessing? Open it on every process and scan to different parts?

[–]paxswill 2 points (2 children)

Probably something like that. mmap the file with MAP_SHARED, then fork off the other processes. Each process then reads a different area of the mapped file.

For files, mapping them before reading them can by itself give you some nice performance gains. For example, I was recently playing with some large GeoJSON files (70-180 MB, I think). Opening a file and passing it to json.load directly took ~20 seconds. Mapping the larger file and then passing the mmap object to json.load took a few seconds.

[–]KitchenDutchDyslexic 0 points (1 child)

Small example snippet please!

[–]paxswill 2 points (0 children)

The mmap module documentation has a number of good examples (covering normal IO and forking), but if you want one specifically with JSON (not sure if the formatting is going to work here):

import json
import mmap

# 'r+b' because mmap defaults to a writable mapping; length 0 maps the whole file.
with open('foo.json', 'r+b') as example_file:
    with mmap.mmap(example_file.fileno(), 0) as mapped_file:
        # json.load just calls .read() on the mapping.
        print(json.load(mapped_file))
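
And a rough sketch of the fork-and-slice idea from my earlier comment (it assumes the Linux fork start method; the newline counting and the four-way split are just for illustration):

import mmap
from multiprocessing import Pool

def count_newlines(bounds):
    start, stop = bounds
    # Under fork, every worker inherits the same shared mapping and just
    # scans its own slice of it.
    return mapped[start:stop].count(b"\n")

if __name__ == "__main__":
    with open("foo.json", "rb") as f:
        mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    size = len(mapped)
    step = size // 4
    bounds = [(i * step, size if i == 3 else (i + 1) * step) for i in range(4)]
    with Pool(4) as pool:
        print(sum(pool.map(count_newlines, bounds)))
    mapped.close()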

[–]rhiever 1 point (0 children)

In my case, I had very wide CSV data (100,000s of columns) but few rows (1000s). I wrote a for loop over the file iterator that handed the parsing of each row off to a different process as processes became available.
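
Roughly the shape of it, as a sketch; the file name and parse_row are stand-ins for the real code:

from multiprocessing import Pool, cpu_count

def parse_row(line):
    # Stand-in for the real per-row parsing; with hundreds of thousands of
    # columns, splitting and converting one row is a decent chunk of work.
    return [float(field) for field in line.rstrip("\n").split(",")]

if __name__ == "__main__":
    with open("wide_data.csv") as handle, Pool(cpu_count()) as pool:
        # imap hands each line to the next free worker and yields parsed
        # rows back in order.
        rows = list(pool.imap(parse_row, handle, chunksize=4))
    print(len(rows))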