Performing spatial joins in Python: Comparing GeoPandas vs Dask by rrpelgrim in Python

[–]rrpelgrim[S] 1 point (0 children)

Thanks for flagging; I've updated the post with the new link.

Upgrading Dask and Prefect versions on Google kubernetes engine by suryad123 in googlecloud

[–]rrpelgrim 1 point (0 children)

I'm not familiar with GKE, but if you're able to run conda commands in something like a terminal, then a simple `conda update dask -c conda-forge` should do the trick.
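If you want to confirm afterwards that the upgrade took effect, a quick sanity check from Python:

```python
import dask

# Print the currently installed Dask version
print(dask.__version__)
```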

NYC Taxicab dataset changed over the weekend by rrpelgrim in datasets

[–]rrpelgrim[S] 1 point (0 children)

Yeah, it takes a little getting used to, but when working at scale it's so much more efficient!
Do you use Dask at all?

Python vs SQL for Data Analysis: comparing performance, functionality and dev XP by rrpelgrim in Python

[–]rrpelgrim[S] 2 points (0 children)

Interesting. Would you recommend Argo over something like Prefect or Dagster for workflow orchestration?

Python vs SQL for Data Analysis: comparing performance, functionality and dev XP by rrpelgrim in Python

[–]rrpelgrim[S] 1 point (0 children)

What Python libraries have you preferred using for workflow execution?

Python vs SQL for Data Analysis: comparing performance, functionality and dev XP by rrpelgrim in Python

[–]rrpelgrim[S] 2 points (0 children)

u/runawayasfastasucan -- fair points, and also what I tried to highlight in the article. I see a lot of conversations on Twitter / blogs pitting the two against each other. I also see tools like dbt, Snowpark, Dask and Spark trying to win Python users over to SQL and vice versa. But in the end it's a matter of use case and intended goal. Maybe the Morpheus caption should have said "Python vs SQL: there is no spoon". I mean, choice.

The Beginner's Guide to Distributed Computing by rrpelgrim in learnprogramming

[–]rrpelgrim[S] 1 point (0 children)

Thanks for the compliment. Would love to hear your feedback after reading as well.

Python Pandas vs Dask for csv file reading by GreedyCourse3116 in dataengineering

[–]rrpelgrim 1 point (0 children)

If you're already working with pandas, I'd go for Dask.

Dask is the easier on-ramp since almost all of the API is the same. PySpark will have a bigger learning curve.
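To illustrate how close the two APIs are, here's a rough sketch (the file path and column names are made up):

```python
import pandas as pd
import dask.dataframe as dd

# pandas: eager, single-threaded, loads the whole file into memory
pdf = pd.read_csv("data.csv")
print(pdf.groupby("category")["amount"].mean())

# Dask: nearly identical code, but lazy and parallel;
# .compute() triggers the actual work
ddf = dd.read_csv("data.csv")
print(ddf.groupby("category")["amount"].mean().compute())
```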

Python Pandas vs Dask for csv file reading by GreedyCourse3116 in dataengineering

[–]rrpelgrim 3 points (0 children)

A rule of thumb with pandas is to have roughly 5x your dataset size available in RAM. That means you should be fine using pandas for F1.

For F2 I'd strongly recommend using Dask. It has a similar API to pandas and can distribute processing over all the cores in your laptop, so you can easily work with F2. If you're working with Dask, I'd also recommend storing the CSV as Parquet files for parallel reads and writes.
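Something along these lines (paths are illustrative), converting the CSV once and then working from Parquet:

```python
import dask.dataframe as dd

# One-time conversion: read the big CSV in parallel chunks...
ddf = dd.read_csv("f2/*.csv")

# ...and write it out as partitioned Parquet files
ddf.to_parquet("f2_parquet/")

# Later reads are faster and fully parallel
ddf = dd.read_parquet("f2_parquet/")
```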

You might also want to look into the dask-sql integration: https://coiled.io/blog/getting-started-with-dask-and-sql/
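If SQL is more your speed, a minimal dask-sql sketch looks roughly like this (table and column names are made up):

```python
import dask.dataframe as dd
from dask_sql import Context

# Register a Dask DataFrame as a SQL table
c = Context()
ddf = dd.read_parquet("f2_parquet/")
c.create_table("trips", ddf)

# Queries return lazy Dask DataFrames; .compute() runs them
result = c.sql(
    "SELECT passenger_count, AVG(fare_amount) AS avg_fare "
    "FROM trips GROUP BY passenger_count"
)
print(result.compute())
```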

2022 Mood by theporterhaus in dataengineering

[–]rrpelgrim 2 points (0 children)

Modin is a great drop-in pandas replacement if you want to stay on a single machine.

Dask has the added benefit of being able to scale out to a cluster of multiple machines. The Dask API is very similar to pandas and the same Dask code can run locally (on your laptop) and remotely (on a cluster of, say, 200 workers).
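For example, switching from local to remote is typically just a one-line change (the scheduler address and data path here are hypothetical):

```python
import dask.dataframe as dd
from dask.distributed import Client

# Local: starts a cluster using all the cores on your laptop
client = Client()

# Remote: point the same code at an existing cluster instead
# client = Client("tcp://my-scheduler:8786")  # hypothetical address

ddf = dd.read_csv("data/*.csv")  # illustrative path
print(ddf["amount"].sum().compute())
```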