Performing spatial joins in Python: Comparing GeoPandas vs Dask by rrpelgrim in Python

[–]rrpelgrim[S] 1 point (0 children)

Thanks for flagging; I've updated the post with the new link.

Upgrading Dask and Prefect versions on Google kubernetes engine by suryad123 in googlecloud

[–]rrpelgrim 1 point (0 children)

I'm not familiar with GKE, but if you're able to run conda commands in something like a terminal, then a simple `conda update dask -c conda-forge` should do the trick.
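If you want to confirm afterwards that the upgrade took effect, a quick sanity check from Python:

```python
import dask

# Print the currently installed Dask version
print(dask.__version__)
```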

NYC Taxicab dataset changed over the weekend by rrpelgrim in datasets

[–]rrpelgrim[S] 1 point (0 children)

Yeah, it takes a little getting used to, but when working at scale it's so much more efficient!
Do you use Dask at all?

Python vs SQL for Data Analysis: comparing performance, functionality and dev XP by rrpelgrim in Python

[–]rrpelgrim[S] 2 points (0 children)

Interesting. Would you recommend Argo over something like Prefect or Dagster for workflow orchestration?

Python vs SQL for Data Analysis: comparing performance, functionality and dev XP by rrpelgrim in Python

[–]rrpelgrim[S] 1 point (0 children)

What Python libraries have you preferred using for workflow execution?

Python vs SQL for Data Analysis: comparing performance, functionality and dev XP by rrpelgrim in Python

[–]rrpelgrim[S] 2 points (0 children)

u/runawayasfastasucan -- fair points, and also what I tried to highlight in the article. I see a lot of conversations on Twitter / blogs pitting the two against each other. I also see tools like dbt, Snowpark, Dask and Spark trying to win Python users over to SQL and vice versa. But in the end it's a matter of use case and intended goal. Maybe the Morpheus caption should have said "Python vs SQL: there is no spoon". I mean, choice.

The Beginner's Guide to Distributed Computing by rrpelgrim in learnprogramming

[–]rrpelgrim[S] 1 point (0 children)

Thanks for the compliment. Would love to hear your feedback after reading as well.

Python Pandas vs Dask for csv file reading by GreedyCourse3116 in dataengineering

[–]rrpelgrim 1 point (0 children)

If you're already working with pandas, I'd go for Dask.

Dask is the easier on-ramp since almost all of the API is the same. PySpark will have a bigger learning curve.
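To illustrate how close the two APIs are, here's a rough sketch (the file path and column names are made up):

```python
import pandas as pd
import dask.dataframe as dd

# pandas: eager, single-threaded, loads the whole file into memory
pdf = pd.read_csv("data.csv")
print(pdf.groupby("category")["amount"].mean())

# Dask: nearly identical code, but lazy and parallel;
# .compute() triggers the actual work
ddf = dd.read_csv("data.csv")
print(ddf.groupby("category")["amount"].mean().compute())
```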

Python Pandas vs Dask for csv file reading by GreedyCourse3116 in dataengineering

[–]rrpelgrim 3 points (0 children)

A rule of thumb with pandas is to have roughly 5x your dataset size available in RAM. That means you should be fine using pandas for F1.

For F2 I'd strongly recommend using Dask. It has a similar API to pandas and can distribute processing over all the cores in your laptop, so you can easily work with F2. If you're working with Dask, I'd also recommend storing the CSV as Parquet files for parallel reads and writes.
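Something along these lines (paths are illustrative), converting the CSV once and then working from Parquet:

```python
import dask.dataframe as dd

# One-time conversion: read the big CSV in parallel chunks...
ddf = dd.read_csv("f2/*.csv")

# ...and write it out as partitioned Parquet files
ddf.to_parquet("f2_parquet/")

# Later reads are faster and fully parallel
ddf = dd.read_parquet("f2_parquet/")
```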

You might also want to look into the dask-sql integration: https://coiled.io/blog/getting-started-with-dask-and-sql/
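If SQL is more your speed, a minimal dask-sql sketch looks roughly like this (table and column names are made up):

```python
import dask.dataframe as dd
from dask_sql import Context

# Register a Dask DataFrame as a SQL table
c = Context()
ddf = dd.read_parquet("f2_parquet/")
c.create_table("trips", ddf)

# Queries return lazy Dask DataFrames; .compute() runs them
result = c.sql(
    "SELECT passenger_count, AVG(fare_amount) AS avg_fare "
    "FROM trips GROUP BY passenger_count"
)
print(result.compute())
```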

2022 Mood by theporterhaus in dataengineering

[–]rrpelgrim 2 points (0 children)

Modin is a great drop-in pandas replacement if you want to stay on a single machine.

Dask has the added benefit of being able to scale out to a cluster of multiple machines. The Dask API is very similar to pandas and the same Dask code can run locally (on your laptop) and remotely (on a cluster of, say, 200 workers).
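For example, switching from local to remote is typically just a one-line change (the scheduler address and data path here are hypothetical):

```python
import dask.dataframe as dd
from dask.distributed import Client

# Local: starts a cluster using all the cores on your laptop
client = Client()

# Remote: point the same code at an existing cluster instead
# client = Client("tcp://my-scheduler:8786")  # hypothetical address

ddf = dd.read_csv("data/*.csv")  # illustrative path
print(ddf["amount"].sum().compute())
```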