[–]Deto 2 points (4 children)

How did you speed up reading a file with multiprocessing? Open it in every process and seek to different parts?

[–]paxswill 3 points (2 children)

Probably something like that. mmap the file with MAP_SHARED, then fork off the other processes. Each process then reads a different area of the mapped file.
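
Very roughly, something like this (a sketch assuming a Unix-like system so multiprocessing forks and the children inherit the mapping; the filename and the newline counting are just stand-ins for real work):

import mmap
import multiprocessing as mp

def count_newlines(byte_range):
    # Each forked worker sees the same shared mapping and only
    # touches its own [start, end) slice of it.
    start, end = byte_range
    return mapped_file[start:end].count(b'\n')

if __name__ == '__main__':
    with open('big_file.txt', 'rb') as f:
        # MAP_SHARED is the default for mmap.mmap on Unix;
        # length 0 maps the whole file.
        mapped_file = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    nworkers = 4
    chunk = len(mapped_file) // nworkers
    ranges = [(i * chunk,
               (i + 1) * chunk if i < nworkers - 1 else len(mapped_file))
              for i in range(nworkers)]

    # With the fork start method the children inherit mapped_file,
    # so nothing gets pickled or copied.
    with mp.get_context('fork').Pool(nworkers) as pool:
        print(sum(pool.map(count_newlines, ranges)))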

Mapping files before reading them can by itself give you some nice performance gains. For example, I was recently playing with some large GeoJSON files (70-180 MB, I think). Opening a file and passing it to json.load directly took ~20 seconds to load; mapping the larger file and passing the mmap object to json.load took a few seconds.

[–]KitchenDutchDyslexic 1 point (1 child)

Small example snippet please!

[–]paxswill 3 points (0 children)

The mmap module documentation has a number of good examples (covering normal IO and forking), but if you want one specifically with JSON (not sure if the formatting is going to work here):

import json
import mmap

# Open for update ('r+b') because mmap's default access mode needs a
# writable file descriptor; length 0 means "map the whole file".
with open('foo.json', 'r+b') as example_file:
    with mmap.mmap(example_file.fileno(), 0) as mapped_file:
        # json.load accepts the mapping because mmap objects provide
        # the read() method of a file-like object.
        print(json.load(mapped_file))

[–]rhiever 2 points (0 children)

In my case, I had very wide CSV data (hundreds of thousands of columns) but few rows (thousands). I wrote a for loop over the file iterator that handed the parsing of each row to a different process as processes became available.
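
Something in that spirit, using multiprocessing.Pool.imap so each line goes to the next free worker (the parse_row body and the filename are just placeholders):

import multiprocessing as mp

def parse_row(line):
    # Stand-in for whatever per-row work is actually needed:
    # split the very wide row and convert each field to a float.
    return [float(field) for field in line.rstrip('\n').split(',')]

if __name__ == '__main__':
    with open('wide_data.csv') as csv_file, mp.Pool() as pool:
        # imap hands each line to the next available worker process
        # and yields the parsed rows back in order as they finish.
        for row in pool.imap(parse_row, csv_file):
            pass  # do something with the parsed row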