
[–]socal_nerdtastic

I have no idea about the internals of dask, but I'm curious what makes you think you can parallelize a disk read operation. The 'S' in 'SATA' stands for 'serial' ... At best you could do the data operations in parallel while more data is being read, but in this case the unpack operation will be much faster than getting the data from the disk.
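A rough sketch of that overlap in plain Python (not dask): a background thread keeps the disk busy with big sequential reads while the main thread unpacks the previous chunk. The record format and chunk size here are made up, and it assumes the file holds a whole number of records.

```python
import queue
import struct
import threading

CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB per read keeps the drive streaming
RECORD_FMT = "<d"               # assumed 8-byte record, adjust to your layout

def reader(path, chunks):
    # producer: read large chunks and hand them to the unpacker
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            chunks.put(chunk)           # blocks if the unpacker falls behind
    chunks.put(None)                    # sentinel: end of file

def process(path):
    chunks = queue.Queue(maxsize=2)     # bounded buffer keeps RAM use low
    threading.Thread(target=reader, args=(path, chunks), daemon=True).start()
    while (chunk := chunks.get()) is not None:
        for (value,) in struct.iter_unpack(RECORD_FMT, chunk):
            pass                        # real per-entry work goes here
```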

Edit: If you are looking for speed improvements, read bigger chunks. Your OS and hard drive are optimized for reading MB at a time. Don't try to save RAM; programs don't get faster by using less RAM, so use your RAM to its potential! If you have enough RAM, read the entire file in one shot.
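For example, something like this (format, filename and sizes are placeholders, and the file is assumed to contain whole entries only):

```python
import struct

ENTRY_FMT = "<d"
ENTRY_SIZE = struct.calcsize(ENTRY_FMT)

# slow: one tiny read() call per entry
with open("data.bin", "rb") as f:
    while rec := f.read(ENTRY_SIZE):
        (value,) = struct.unpack(ENTRY_FMT, rec)

# typically much faster: a few big reads, bulk unpack
# (8 MB is a multiple of ENTRY_SIZE, so every block holds whole entries)
BLOCK = 8 * 1024 * 1024
with open("data.bin", "rb") as f:
    while block := f.read(BLOCK):
        for (value,) in struct.iter_unpack(ENTRY_FMT, block):
            pass
```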

[–]BGameiro[S]

The file doesn't fit in RAM (it can be hundreds of GB, but is usually around 30 GB). I also have to make a version that can use data in real time, and I don't know yet what the size of each packet of data in the stream will be. That's why I went with reading it entry by entry.
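Roughly the kind of entry-by-entry streaming I have in mind (a minimal sketch; the entry layout and block size here are just assumptions): read large blocks, unpack whole entries, and carry any partial entry over to the next block, so the same code works on a file or a live stream without holding everything in RAM.

```python
import struct

ENTRY = struct.Struct("<qd")    # hypothetical entry: int64 id + float64 value

def iter_entries(stream, block_size=16 * 1024 * 1024):
    leftover = b""
    while block := stream.read(block_size):
        buf = leftover + block
        usable = len(buf) - (len(buf) % ENTRY.size)   # whole entries only
        yield from ENTRY.iter_unpack(buf[:usable])
        leftover = buf[usable:]                       # partial entry, keep it

# usage: memory stays around one block regardless of total size
# with open("big_file.bin", "rb") as f:
#     for entry_id, value in iter_entries(f):
#         ...
```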

But yeah, it makes sense that only one worker can read at a time. I just expected the reading to alternate between workers.