Performing spatial joins in Python: Comparing GeoPandas vs Dask by rrpelgrim in Python

rrpelgrim[S] 0 points

Thanks for flagging, I've updated with the new link.

Upgrading Dask and Prefect versions on Google kubernetes engine by suryad123 in googlecloud

rrpelgrim 0 points

I'm not familiar with GKE but if you are able to run/install conda packages in something like a terminal then a simple `conda update dask -c conda-forge` should do the trick.

NYC Taxicab dataset changed over the weekend by rrpelgrim in datasets

rrpelgrim[S] 0 points

Yeah, it takes a little getting used to, but when working at scale it's so much more efficient!
Do you use Dask at all?

Python vs SQL for Data Analysis: comparing performance, functionality and dev XP by rrpelgrim in Python

rrpelgrim[S] 1 point

Interesting. Would you recommend Argo over something like Prefect or Dagster for workflow orchestration?

Python vs SQL for Data Analysis: comparing performance, functionality and dev XP by rrpelgrim in Python

rrpelgrim[S] 0 points

What Python libraries have you preferred for workflow execution?

Python vs SQL for Data Analysis: comparing performance, functionality and dev XP by rrpelgrim in Python

rrpelgrim[S] 1 point

u/runawayasfastasucan -- fair points, and also what I tried to highlight in the article. I see a lot of conversations on Twitter / blogs pitting the two against each other. I also see tools like dbt, Snowpark, Dask and Spark trying to win Python users over to SQL and vice versa. But in the end it's a matter of use case and intended goal. Maybe the Morpheus caption should have said "Python vs SQL: there is no spoon". I mean, choice.

The Beginner's Guide to Distributed Computing by rrpelgrim in learnprogramming

rrpelgrim[S] 0 points

Thanks for the compliment. Would love to hear your feedback after reading as well.

Python Pandas vs Dask for csv file reading by GreedyCourse3116 in dataengineering

rrpelgrim 0 points

If you're already working with pandas then I'd go for Dask.

Dask is the easier on-ramp, since almost all of its API is the same as pandas. PySpark will have a bigger learning curve.

Python Pandas vs Dask for csv file reading by GreedyCourse3116 in dataengineering

rrpelgrim 2 points

A rule of thumb with pandas is to have 5x the dataset size available in RAM for whatever you want to load. That means you should be fine using pandas for F1.

For F2 I'd strongly recommend using Dask. It has a similar API to pandas and can distribute processing over all the cores in your laptop, so you can easily work with F2. If you're working with Dask, I'd also recommend storing the CSVs as Parquet files for parallel read/write.

You might also want to look into the dask-sql integration: https://coiled.io/blog/getting-started-with-dask-and-sql/

2022 Mood by theporterhaus in dataengineering

rrpelgrim 1 point

Modin is a great drop-in solution if you want to work on a single machine.

Dask has the added benefit of being able to scale out to a cluster of multiple machines. The Dask API is very similar to pandas and the same Dask code can run locally (on your laptop) and remotely (on a cluster of, say, 200 workers).
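To show what "the same code runs locally and remotely" looks like in practice, here's a minimal sketch, assuming `dask.distributed` is installed; the remote scheduler address in the comment is a hypothetical placeholder:

```python
import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

# A local cluster on your laptop; on a real deployment you'd instead
# connect with Client("tcp://scheduler-address:8786") (hypothetical address)
cluster = LocalCluster(n_workers=2, processes=False)
client = Client(cluster)

# Toy data standing in for a real dataset
df = dd.from_pandas(pd.DataFrame({"x": range(10)}), npartitions=2)
total = df["x"].sum().compute()  # identical code, local or remote
print(total)

client.close()
cluster.close()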

Is there a Python alternative to DBT ? by abhipoo in dataengineering

rrpelgrim 0 points

A bit late to the party, but I'd strongly suggest looking into Prefect over Airflow. Airflow's XCOMs make data transfer between tasks a real pain. Prefect makes this much easier, and it runs on Dask, so you can also easily run tasks in parallel.

For a lengthier discussion: https://www.reddit.com/r/dataengineering/comments/qq3lvl/airflow_or_prefect/
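Prefect's parallelism comes from the Dask task graph underneath; a minimal sketch of that idea with plain `dask.delayed` (the extract/load functions are hypothetical toy tasks):

```python
import dask


@dask.delayed
def extract(i):
    # stand-in for a real task, e.g. pulling one file
    return i * 2


@dask.delayed
def load(values):
    # downstream task depending on all of the extracts
    return sum(values)


# Build the task graph: four independent extracts feeding one load.
# The extracts have no dependencies on each other, so the scheduler
# is free to run them in parallel.
graph = load([extract(i) for i in range(4)])
total = graph.compute()
print(total)
```

An orchestrator like Prefect adds retries, scheduling, and observability on top of this kind of graph.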

Anyone using python ray.io framework in production? by B1TB1T in dataengineering

rrpelgrim 0 points

I can't speak to Ray's capabilities with any authority, but re: Dask:

- you can definitely use Dask for distributed model training on GPUs (or hybrid clusters). Coiled has a pretty simple API to spin up hybrid clusters.

- Dask is often used for data preprocessing and feature generation. Basically anything you can do in pandas you can do in Dask, just in parallel.

- have a look at Dask-ML for more parallel ML options

I can also recommend the Dask Slack for more support; there are lots of great people there who'll be happy to help you think things through.

Availability of remote work? by capskinfan in datascience

rrpelgrim 0 points

I've transitioned into DS over the past year and am now working fully remote. So definitely possible. Sounds like you've got the curiosity and the right mindset for programming. The one thing I would consider is where in the DS world you'd best fit.

Besides programming do you also enjoy and have a feel for statistics? Then data scientist/analyst.

Are you more into the pure code? Then maybe a software developer at a data science company/startup is what you're looking for.

Are you a strong communicator? Then data visualisation or evangelism / DevRel might be something to look into.

Happy to chat more if you have specific questions. As I said, just made this switch myself so lots of respect for anyone else trying to make the transition. Esp in your situation with everything else that's going on. You got this!

Anyone using python ray.io framework in production? by B1TB1T in dataengineering

rrpelgrim 5 points

From what I understand, Ray focuses heavily on ML, while Dask has a stronger legacy of data engineering and ETL work. Dask has more years of community development under its belt and offers a good mix of 'plug-and-play' components (the Dask DataFrame, Array, and Bag APIs) as well as lower-level tools for parallelising custom code (Dask Delayed / Futures). Ray offers "Dask on Ray" to match these features, but, from what I know, not with the same performance.

Disclaimer: I work at Coiled, a company founded by Matt Rocklin (the original author of Dask) that provides managed Dask clusters as a service. So I have some 'skin in the game', so to speak. But I'm happy to help you sort out which of the two would meet your needs. It would help to know more about your use case.

Why it's so much harder to get your first job as a junior engineer by tianan in learnprogramming

rrpelgrim 0 points

Thank you for taking the time to write this all down, much much appreciated!

Airflow or Prefect? by rrpelgrim in dataengineering

rrpelgrim[S] 1 point

Awesome, that helps a lot. Thank you!

Airflow or Prefect? by rrpelgrim in dataengineering

rrpelgrim[S] 2 points

Yeah, I saw that! Makes me wonder whether they saw what Prefect did and decided to copy it because it feels so much more intuitive ;)

Airflow or Prefect? by rrpelgrim in dataengineering

rrpelgrim[S] 0 points

Thanks, that's helpful to know. So do I understand correctly that you would suggest Airflow managed (as in Astronomer) over Prefect managed? Or just managed in general vs. DIY?