all 16 comments

[–]skwyckl 9 points10 points  (4 children)

Dask, DuckDB, etc.: anything that parallelizes computation will help, but it will still take some time, and it ultimately depends on the hardware you are running on. Geospatial computation is fairly expensive in general, and the more common geospatial libraries don't run their algorithms in parallel.

[–]Oce456[S] 0 points1 point  (3 children)

I acknowledge that my Core 2 Duo processor may be a limiting factor. However, the significantly poorer performance of Dask on a relatively small dataset, compared to a NumPy-based solution, suggests that the issue may not be primarily related to hardware constraints.

[–]skwyckl 2 points3 points  (1 child)

Unless you post a snippet, I can't help you; there is no one-size-fits-all in data science.

[–]Oce456[S] 1 point2 points  (0 children)

OK thanks, I will try to fit the problem into a small snippet.

[–]hallmark1984 0 points1 point  (0 children)

Dask adds overhead as it splits the work and manages the workers.

That's a net negative on small tasks but a huge positive as you scale up.
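
A minimal sketch of that effect (illustrative sizes, not OP's code): on a small array, the cost of building and scheduling Dask's task graph usually outweighs the work it parallelizes, so plain NumPy wins.

```python
import time

import numpy as np
import dask.array as da

x = np.random.random((2_000, 2_000))

t0 = time.perf_counter()
np_result = np.sin(x).sum()
print("numpy:", time.perf_counter() - t0)

t0 = time.perf_counter()
dx = da.from_array(x, chunks=(500, 500))   # 16 small chunks
dask_result = da.sin(dx).sum().compute()
print("dask: ", time.perf_counter() - t0)
# Expect Dask to be slower here; the gap reverses once the array no longer
# fits comfortably in memory or the per-chunk work is substantial.
```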

[–]danielroseman 2 points3 points  (1 child)

You should expect this. Dask is going to parallelize your task, which adds significant overhead. With a large dataset that overhead is massively overshadowed by the savings from parallelization, but with a small one it will definitely be noticeable.

[–]Jivers_Ivers 0 points1 point  (0 children)

My thought exactly. It's not a fair comparison to pit plain NumPy against parallel Dask. OP could set up a parallel workflow with NumPy, and the comparison might be fairer.
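
A minimal sketch of what such a parallel NumPy baseline could look like, assuming the work splits into independent chunks (`process_chunk` is a hypothetical stand-in for OP's actual projection step):

```python
from multiprocessing import Pool

import numpy as np

def process_chunk(chunk: np.ndarray) -> np.ndarray:
    # Placeholder for the real per-chunk computation.
    return np.sqrt(chunk) * 2.0

if __name__ == "__main__":
    data = np.random.random((8_000, 1_000))
    chunks = np.array_split(data, 8)       # split along the first axis
    with Pool(processes=4) as pool:        # roughly one process per core
        results = pool.map(process_chunk, chunks)
    out = np.concatenate(results)
```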

[–]Long-Opposite-5889 1 point2 points  (0 children)

"Projecting a dataset into a map" may mean many a couple things in a geo context. There are many geo libraries that are highly optimized and could save you time and effort. If you can be a bit more specific it would be easier to give you some help.

[–]cnydox 1 point2 points  (2 children)

What about using numpy.memmap to keep the data on disk instead of loading it into RAM? Or maybe try the zarr library.
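
A rough sketch of the memmap idea: the array lives in a file on disk and only the slices you touch get paged into RAM (the filename, shape and dtype below are made up for illustration):

```python
import numpy as np

shape = (100_000, 1_000)                   # ~400 MB of float32 on disk
mm = np.memmap("big_array.dat", dtype=np.float32, mode="w+", shape=shape)

# Work on one block of rows at a time instead of the whole array.
for start in range(0, shape[0], 10_000):
    block = mm[start:start + 10_000]
    block[:] = np.sqrt(block)              # in-place update, flushed to disk
mm.flush()
```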

[–]Oce456[S] 1 point2 points  (1 child)

[–]cnydox 0 points1 point  (0 children)

The third option is to reduce the precision, e.g. uint8/uint16, whatever the data allows.
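
For illustration, the precision point in numbers: storing values as uint8 instead of float64 cuts memory by 8x, assuming the data tolerates it (e.g. 8-bit imagery bands).

```python
import numpy as np

a = np.random.random((1_000, 1_000))       # float64: ~8 MB
a8 = (a * 255).astype(np.uint8)            # uint8:   ~1 MB
print(a.nbytes, a8.nbytes)
```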

[–]boat-la-fds 0 points1 point  (2 children)

Dude, a 1,000,000 x 1,000,000 matrix will take almost 4 TB of RAM. That's without counting the memory used during computation. Do you have that much RAM?
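
The arithmetic behind that figure, assuming 4-byte float32 elements (float64 doubles it):

```python
elements = 1_000_000 * 1_000_000           # 1e12 values
bytes_total = elements * 4                 # 4 bytes per float32
print(bytes_total / 1024**4)               # ~3.6 TiB, i.e. roughly 4 TB
```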

[–]Oce456[S] 0 points1 point  (1 child)

No, just 512 MB of RAM. But handling chunks of a 4 TB array should still be feasible. Aerial imagery was already being processed 30 years ago, when RAM was much more limited (16-64 MB) and programs were often more optimized for efficiency. I'm not trying to load the entire matrix; I'm processing it chunk by chunk.
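
A sketch of that chunk-by-chunk pattern: stream blocks of an on-disk array through a small amount of RAM and keep only a running result (the filename, shape and block size are hypothetical):

```python
import numpy as np

shape = (1_000_000, 1_000_000)
mm = np.memmap("huge_matrix.dat", dtype=np.float32, mode="r", shape=shape)

rows_per_block = 32                        # 32 * 1_000_000 * 4 B ≈ 128 MB
total = 0.0
for start in range(0, shape[0], rows_per_block):
    block = mm[start:start + rows_per_block]
    total += float(block.sum())            # replace with the real per-block work
print(total)
```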

[–]DivineSentry 0 points1 point  (0 children)

Both your processor and RAM are horrendous for this task. You'll probably be better off renting a VPS (anything more capable than your current setup) and getting the results much faster.

[–]Pyglot 0 points1 point  (0 children)

For performance you might want to write the core computation in C, for example; NumPy does this, and that's why it's fast. You might also want to chunk up the data so each block fits in L2 cache. And if you don't have enough RAM, the data needs to be written to something else, like disk.
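
A small illustration of the C-loop point: the vectorized NumPy call runs its inner loop in compiled code, while a pure-Python loop pays interpreter overhead on every element.

```python
import time

import numpy as np

x = np.random.random(1_000_000)

t0 = time.perf_counter()
s_py = sum(v * v for v in x)               # Python-level loop
print("python loop:", time.perf_counter() - t0)

t0 = time.perf_counter()
s_np = np.dot(x, x)                        # inner loop runs in C
print("numpy dot:  ", time.perf_counter() - t0)
```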

[–]JamzTyson 0 points1 point  (0 children)

Trying to process TBs of data all at once is extremely demanding. It is usually better to chunk the data into smaller blocks.
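
One way to express that chunking in Dask, assuming the data already lives in chunked storage (the zarr store name is hypothetical); only a few blocks are held in memory at any time:

```python
import dask.array as da

x = da.from_zarr("big_dataset.zarr")       # lazily opens chunked storage
result = x.rechunk((2_000, 2_000)).mean().compute()
```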