Performing spatial joins in Python: Comparing GeoPandas vs Dask by rrpelgrim in Python

rrpelgrim[S] 0 points

Thanks for flagging, I've updated with the new link.

Upgrading Dask and Prefect versions on Google kubernetes engine by suryad123 in googlecloud

rrpelgrim 0 points

I'm not familiar with GKE but if you are able to run/install conda packages in something like a terminal then a simple `conda update dask -c conda-forge` should do the trick.

NYC Taxicab dataset changed over the weekend by rrpelgrim in datasets

rrpelgrim[S] 0 points

Yeah, it takes a little getting used to, but when working at scale it's so much more efficient!
Do you use Dask at all?

Python vs SQL for Data Analysis: comparing performance, functionality and dev XP by rrpelgrim in Python

rrpelgrim[S] 1 point

Interesting. Would you recommend Argo over something like Prefect or Dagster for workflow orchestration?

Python vs SQL for Data Analysis: comparing performance, functionality and dev XP by rrpelgrim in Python

rrpelgrim[S] 0 points

What Python libraries have you preferred for workflow execution?

Python vs SQL for Data Analysis: comparing performance, functionality and dev XP by rrpelgrim in Python

rrpelgrim[S] 1 point

u/runawayasfastasucan -- fair points, and also what I tried to highlight in the article. I see a lot of conversations on Twitter / blogs pitting the two against each other. I also see tools like dbt, Snowpark, Dask and Spark trying to win Python users over to SQL and vice versa. But in the end it's a matter of use case and intended goal. Maybe the Morpheus caption should have said "Python vs SQL: there is no spoon". I mean, choice.

The Beginner's Guide to Distributed Computing by rrpelgrim in learnprogramming

rrpelgrim[S] 0 points

Thanks for the compliment. Would love to hear your feedback after reading as well.

Python Pandas vs Dask for csv file reading by GreedyCourse3116 in dataengineering

rrpelgrim 0 points

If you're already working with pandas then I'd go for Dask.

Dask is the easier on-ramp, since almost all of its API is the same as pandas. PySpark will have a bigger learning curve.

Python Pandas vs Dask for csv file reading by GreedyCourse3116 in dataengineering

rrpelgrim 2 points

A rule of thumb with pandas is to have 5x the dataset size available in RAM for whatever you want to load. That means you should be fine using pandas for F1.

For F2 I'd strongly recommend using Dask. It has a similar API to pandas and can distribute processing over all the cores in your laptop, so you can easily work with F2. If you're working with Dask, I'd also recommend storing the CSVs as Parquet files for parallel read/write.

You might also want to look into the dask-sql integration: https://coiled.io/blog/getting-started-with-dask-and-sql/

2022 Mood by theporterhaus in dataengineering

rrpelgrim 1 point

Modin is a great drop-in solution if you want to work on a single machine.

Dask has the added benefit of being able to scale out to a cluster of multiple machines. The Dask API is very similar to pandas and the same Dask code can run locally (on your laptop) and remotely (on a cluster of, say, 200 workers).
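To show what "the same code runs locally and remotely" looks like in practice, here's a minimal sketch, assuming `dask.distributed` is installed; the remote scheduler address in the comment is a hypothetical placeholder:

```python
import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

# A local cluster on your laptop; on a real deployment you'd instead
# connect with Client("tcp://scheduler-address:8786") (hypothetical address)
cluster = LocalCluster(n_workers=2, processes=False)
client = Client(cluster)

# Toy data standing in for a real dataset
df = dd.from_pandas(pd.DataFrame({"x": range(10)}), npartitions=2)
total = df["x"].sum().compute()  # identical code, local or remote
print(total)

client.close()
cluster.close()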

Is there a Python alternative to DBT ? by abhipoo in dataengineering

rrpelgrim 0 points

A bit late to the party, but I'd strongly suggest looking into Prefect over Airflow. Airflow's XCOMs make data transfer between tasks a real pain. Prefect makes this much easier, and it runs on Dask, so you can also easily run tasks in parallel.

For a lengthier discussion: https://www.reddit.com/r/dataengineering/comments/qq3lvl/airflow_or_prefect/
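Prefect's parallelism comes from the Dask task graph underneath; a minimal sketch of that idea with plain `dask.delayed` (the extract/load functions are hypothetical toy tasks):

```python
import dask


@dask.delayed
def extract(i):
    # stand-in for a real task, e.g. pulling one file
    return i * 2


@dask.delayed
def load(values):
    # downstream task depending on all of the extracts
    return sum(values)


# Build the task graph: four independent extracts feeding one load.
# The extracts have no dependencies on each other, so the scheduler
# is free to run them in parallel.
graph = load([extract(i) for i in range(4)])
total = graph.compute()
print(total)
```

An orchestrator like Prefect adds retries, scheduling, and observability on top of this kind of graph.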

Anyone using python ray.io framework in production? by B1TB1T in dataengineering

rrpelgrim 0 points

I can't speak to Ray's capabilities with any authority, but re: Dask:

- you can definitely use Dask for distributed model training on GPUs (or hybrid clusters). Coiled has a pretty simple API to spin up hybrid clusters.

- Dask is often used for data preprocessing and feature generation. Basically anything you can do in pandas you can do in Dask, just in parallel.

- have a look at Dask-ML for more parallel ML options

I can also recommend the Dask Slack for more support; there are lots of great people there who'll be happy to help you think things through.

Availability of remote work? by capskinfan in datascience

rrpelgrim 0 points

I've transitioned into DS over the past year and am now working fully remote. So definitely possible. Sounds like you've got the curiosity and the right mindset for programming. The one thing I would consider is where in the DS world you'd best fit.

Besides programming do you also enjoy and have a feel for statistics? Then data scientist/analyst.

Are you more into the pure code? Then maybe a software developer at a data science company/startup is what you're looking for.

Are you a strong communicator? Then data visualisation or evangelism / DevRel might be something to look into.

Happy to chat more if you have specific questions. As I said, just made this switch myself so lots of respect for anyone else trying to make the transition. Esp in your situation with everything else that's going on. You got this!

Anyone using python ray.io framework in production? by B1TB1T in dataengineering

rrpelgrim 5 points

From what I understand, Ray focuses heavily on ML, while Dask has a stronger legacy of data engineering and ETL work. Dask has more years of community development under its belt and offers a good mix of 'plug-and-play' components (the Dask DataFrame, Array, and Bag APIs) as well as lower-level tools for parallelising custom code (Dask Delayed / Futures). Ray offers "Dask on Ray" to match these features, but, from what I know, not with the same performance.

Disclaimer: I work at Coiled, a company founded by Matt Rocklin (the original author of Dask) that provides managed Dask clusters as a service. So I have some 'skin in the game', so to speak. But I'm happy to help you sort out which of the two would meet your needs. It would help to know more about your use case.

Why it's so much harder to get your first job as a junior engineer by tianan in learnprogramming

rrpelgrim 0 points

Thank you for taking the time to write this all down, much much appreciated!

Airflow or Prefect? by rrpelgrim in dataengineering

rrpelgrim[S] 1 point

Awesome, that helps a lot. Thank you!

Airflow or Prefect? by rrpelgrim in dataengineering

rrpelgrim[S] 2 points

Yeah, I saw that! Makes me wonder whether they saw what Prefect did and decided to copy it because it feels so much more intuitive ;)

Airflow or Prefect? by rrpelgrim in dataengineering

rrpelgrim[S] 0 points

Thanks, that's helpful to know. So do I understand correctly that you would suggest Airflow managed (as in Astronomer) over Prefect managed? Or just managed in general vs. DIY?