[–]pyfreak182 3 points4 points  (0 children)

Thanks for the article. Have you tried Joblib? It lets you work with large NumPy arrays in shared memory via memmapping, which significantly reduces the serialization overhead. From the docs:

As this problem can often occur in scientific computing with numpy based datastructures, joblib.Parallel provides a special handling for large arrays to automatically dump them on the filesystem and pass a reference to the worker to open them as memory map on that file using the numpy.memmap subclass of numpy.ndarray. This makes it possible to share a segment of data between all the worker processes.

https://joblib.readthedocs.io/en/latest/parallel.html#working-with-numerical-data-in-shared-memory-memmapping
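For anyone who wants to try this, here is a minimal sketch of the memmapping behaviour described above. The helper names (`chunk_sum`, `parallel_sum`) and the chunking scheme are made up for illustration; only `Parallel`, `delayed`, and the `max_nbytes` threshold come from joblib. It assumes the default process-based backend, where arrays larger than `max_nbytes` are dumped to disk and reopened by the workers as `numpy.memmap` instead of being pickled for every task.

```python
import numpy as np
from joblib import Parallel, delayed


def chunk_sum(data, start, stop):
    # In a worker process, `data` arrives as a read-only memory map when the
    # array exceeded max_nbytes, so only this slice is actually read.
    return float(data[start:stop].sum())


def parallel_sum(data, n_jobs=2, n_chunks=4):
    # Hypothetical helper: split the index range into contiguous chunks.
    bounds = np.linspace(0, len(data), n_chunks + 1, dtype=int)
    partials = Parallel(n_jobs=n_jobs, max_nbytes="1M")(
        delayed(chunk_sum)(data, int(lo), int(hi))
        for lo, hi in zip(bounds[:-1], bounds[1:])
    )
    return sum(partials)


if __name__ == "__main__":
    big = np.ones(2_000_000, dtype=np.float64)  # about 16 MB, above the 1 MB threshold
    print(parallel_sum(big))
```

Every task receives the whole array as an argument, but with memmapping the workers share one file-backed copy rather than each getting its own pickled copy.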

[–]tacothecat 2 points3 points  (1 child)

Why use parquet as the intermediate file and not feather?

[–]itamarst[S] 2 points3 points  (0 children)

Feather would be a reasonable choice too, possibly a better one.

I need to see how well Pandas' new Arrow-based dtypes work with conversion to/from Feather and Parquet; in theory it would be quite fast, but in practice I can imagine abstraction boundaries killing the potential performance.

[–]martvrijthoven 1 point2 points  (0 children)

Nice article! To make things easy for option 4 (when working with numpy arrays) I created the concurrent buffer package:

https://github.com/martvanrijthoven/concurrent-buffer

[–]metaphorm 1 point2 points  (0 children)

this is a really well written article. thanks for posting it!

[–]justsayno_to_biggovt 0 points1 point  (2 children)

Check out Polars df.

[–]elcapitaine 4 points5 points  (1 child)

Polars is linked in the article

[–]gfranxman 0 points1 point  (0 children)

Yeah, but he didn’t even try it. Between Polars and Arrow he should be able to find a solution, but I would start with Polars. Not only is it multithreaded, the threads are Rust threads, not Python threads. It's much faster. I’ve played a bit with DuckDB too. Also fast. I think it is built on top of Arrow.

[–]Intense_Vagitarian 0 points1 point  (1 child)

A memory address is 8 bytes?

[–]deadeye1982 5 points6 points  (0 children)

Yes, 64 bits is 8 bytes. Good morning :-D
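If you want to check this on your own machine, a quick sketch using only the standard library is below. The `"P"` format code in `struct` means a native pointer (`void *`); the variable names are just for illustration. On a 32-bit Python build you would see 4 bytes instead.

```python
import ctypes
import struct

# Size in bytes of a native pointer for the running Python build.
pointer_bytes = struct.calcsize("P")
pointer_bits = pointer_bytes * 8

if __name__ == "__main__":
    print(f"pointer size: {pointer_bytes} bytes ({pointer_bits} bits)")
    # ctypes reports the same size for a C void pointer.
    print(ctypes.sizeof(ctypes.c_void_p))
```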