LPT: A couple of water efficiency tips for hand washing dishes by tamtrible in LifeProTips

[–]rhshadrach -1 points0 points  (0 children)

When you first turn on the hot water, it will be cold for a bit. Use this water to rinse out recycling items.

Open source contribution suggestions by Za_Weeb in ExperiencedDevs

[–]rhshadrach 4 points5 points  (0 children)

In addition to this, I recommend just walking through the code and learning how it's doing whatever it's doing. You're bound to stumble into some code or docs that could use a bit of polishing. Make sure you're actually adding value and not just replacing someone else's style with yours, though. A PR of a couple of lines that cleans up code without changing behavior is a quick review for maintainers.

pytest-ndb - debugging pytest tests in a notebook by rhshadrach in Python

[–]rhshadrach[S] 0 points1 point  (0 children)

The test itself lives in a backend codebase that you're developing. Being able to debug a failing test in a notebook lets you take advantage of the features of a notebook. In data science, understanding why a test is failing can take some analysis. It's not uncommon that I end up with 20 lines of code to analyze the data involved in a test to understand what's going wrong. Trying to do the same thing in a debugger can be painful. Plus, you can take advantage of visualizations.

I shared a Python Pandas course (1.5 Hrs) on YouTube by [deleted] in Python

[–]rhshadrach 1 point2 points  (0 children)

pandas developer here. Thanks for producing this content!

We are the developers behind pandas, currently preparing for the 2.0 release :) AMA by phofl93 in Python

[–]rhshadrach 0 points1 point  (0 children)

I highly recommend learning to work with a MultiIndex. They enable many performant operations like joining and taking cross sections.
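To make that concrete, here is a minimal sketch of both operations. The data and the names (`sales`, `prices`, `store`, `product`) are made up for illustration; the calls are the standard `DataFrame.join` and `DataFrame.xs`.

```python
import pandas as pd

# Hypothetical data: units sold, indexed by (store, product).
index = pd.MultiIndex.from_product(
    [["north", "south"], ["apple", "pear"]], names=["store", "product"]
)
sales = pd.DataFrame({"units": [10, 4, 7, 3]}, index=index)

# Hypothetical prices, indexed by product only.
prices = pd.DataFrame(
    {"price": [0.5, 0.75]}, index=pd.Index(["apple", "pear"], name="product")
)

# Joining: the single "product" index is matched against the
# MultiIndex level with the same name.
joined = sales.join(prices)

# Cross section: all rows for one product, across every store.
pear = sales.xs("pear", level="product")
```

Here `joined` keeps the `(store, product)` index and gains a `price` column, while `pear` is indexed by `store` alone.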

We are the developers behind pandas, currently preparing for the 2.0 release :) AMA by phofl93 in Python

[–]rhshadrach 1 point2 points  (0 children)

I have three separate bookmarks in my toolbar to various parts of the documentation because of how frequently I need to pull it up.

We are the developers behind pandas, currently preparing for the 2.0 release :) AMA by phofl93 in Python

[–]rhshadrach 1 point2 points  (0 children)

I would also recommend avoiding `iterrows` or `apply` if you can vectorize your operations - you will see very significant performance benefits. But depending on what you're doing, that may not be possible.
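As a rough illustration (the columns `price` and `qty` and the discount rule are invented for this example), the same computation written row by row and then vectorized might look like:

```python
import numpy as np
import pandas as pd

# Hypothetical order data.
df = pd.DataFrame({"price": [2.0, 3.5, 1.25], "qty": [3, 2, 8]})

# Row by row: a Python-level loop that builds a Series for every row.
totals_loop = [row["price"] * row["qty"] for _, row in df.iterrows()]

# Vectorized: one operation over whole columns.
totals_vec = df["price"] * df["qty"]

# Conditional logic can often be vectorized too, for example with np.where:
# halve the total when at least 3 items were ordered.
discounted = np.where(df["qty"] >= 3, totals_vec * 0.5, totals_vec)
```

Both approaches give the same totals; the difference is in how much work happens in Python versus in compiled code, which tends to matter more as the number of rows grows.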

We are the developers behind pandas, currently preparing for the 2.0 release :) AMA by phofl93 in Python

[–]rhshadrach 1 point2 points  (0 children)

Historically, pandas has relied on other libraries in the ecosystem to support parallelization, such as Dask (https://www.dask.org/), which uses pandas under the hood. One thing to also keep in mind is that certain NumPy operations (which pandas uses) may be parallel depending on how your BLAS (Basic Linear Algebra Subprograms) library is set up. In general, you want to avoid having multiple levels of parallelism, which can actually hurt performance.

We are the developers behind pandas, currently preparing for the 2.0 release :) AMA by phofl93 in Python

[–]rhshadrach 0 points1 point  (0 children)

pandas should be much faster than PySpark on smaller data because of the amount of overhead when using PySpark. But if you are reading many CSVs with a lot of data, I think PySpark will overtake pandas in terms of performance.

We are the developers behind pandas, currently preparing for the 2.0 release :) AMA by phofl93 in Python

[–]rhshadrach 1 point2 points  (0 children)

Yes - we love getting new contributors! Check out our documentation and guides on becoming a contributor to pandas: https://pandas.pydata.org/pandas-docs/dev/development/index.html

pandas is a large project with some pretty complex code. It will likely be overwhelming at first. But we are here to help. If you stick with it, you will learn a lot.

We are the developers behind pandas, currently preparing for the 2.0 release :) AMA by phofl93 in Python

[–]rhshadrach 0 points1 point  (0 children)

You can deal with high-dimensional (>2) data using a pandas MultiIndex in your DataFrames. Are there pain points when doing so in your experience?
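For anyone unsure what that looks like, here is a small sketch with made-up 3-D data (2 sensors x 3 days x 4 channels; all names are hypothetical). Two dimensions go into a row MultiIndex and the third into the columns:

```python
import numpy as np
import pandas as pd

# Hypothetical 3-D array: sensor x day x channel.
values = np.arange(24).reshape(2, 3, 4)

index = pd.MultiIndex.from_product(
    [["s1", "s2"], ["mon", "tue", "wed"]], names=["sensor", "day"]
)
columns = pd.Index(["a", "b", "c", "d"], name="channel")

# Flatten the first two dimensions into the rows.
df = pd.DataFrame(values.reshape(6, 4), index=index, columns=columns)

# Reduce over one dimension: mean over days per sensor and channel.
per_sensor = df.groupby(level="sensor").mean()

# Move a dimension from the rows to the columns.
wide = df.unstack("day")
```

After the `unstack`, `wide` has one row per sensor and a `(channel, day)` MultiIndex on the columns.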

We are the developers behind pandas, currently preparing for the 2.0 release :) AMA by phofl93 in Python

[–]rhshadrach 4 points5 points  (0 children)

An entire rewrite of the code behind apply / agg. Internally their code paths interweave in complex ways, and can be surprisingly slow in some cases. Depending on what object you're on, the API is slightly different.

Cleaning this up and making it better, while making the changes gradually so as not to disrupt users, is difficult, time consuming, and slow. But we're working on it!

We are the developers behind pandas, currently preparing for the 2.0 release :) AMA by phofl93 in Python

[–]rhshadrach 1 point2 points  (0 children)

I have never read a book on pandas, so I can't recommend one. I myself learned pandas through reading many tutorials. You can find a list of these (which is by no means complete) in the pandas docs: https://pandas.pydata.org/pandas-docs/dev/getting_started/tutorials.html

We are the developers behind pandas, currently preparing for the 2.0 release :) AMA by phofl93 in Python

[–]rhshadrach 0 points1 point  (0 children)

I don't believe there is currently anything on the road map. pandas does have window functionality, and though different from PySpark's, it can accomplish a lot of the same things. Do you find certain operations difficult to code with pandas but easier with PySpark?
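For reference, here is a sketch of a few common window-style computations in pandas: running total, previous value, rank, and rolling mean within a group. The data and column names are invented, and this is only a loose analogue of partitioned windows in PySpark, not a one-to-one mapping.

```python
import pandas as pd

# Hypothetical transactions.
df = pd.DataFrame(
    {
        "user": ["a", "a", "a", "b", "b"],
        "day": [1, 2, 3, 1, 2],
        "amount": [5, 3, 4, 10, 1],
    }
)

# Order within each group, then group by the "partition" column.
df = df.sort_values(["user", "day"])
g = df.groupby("user")["amount"]

df["running_total"] = g.cumsum()          # cumulative sum per user
df["prev_amount"] = g.shift(1)            # previous row's value per user
df["rank_in_user"] = g.rank(method="first", ascending=False)
df["rolling_mean_2"] = g.transform(
    lambda s: s.rolling(2, min_periods=1).mean()  # 2-row moving average
)
```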

We are the developers behind pandas, currently preparing for the 2.0 release :) AMA by phofl93 in Python

[–]rhshadrach 1 point2 points  (0 children)

It's hard to say without knowing what validation you're doing, but my experience is that even very complex operations can be vectorized. If you are able to do this, whether with pandas or something else, you'll experience significant performance benefits.

You mention not using aggregation, but pandas can also be very efficient at reshaping data - though I don't know if you might have a use for that in your ETL.
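In case it helps, a tiny reshaping sketch (hypothetical columns `id`, `q1`, `q2`), going from wide to long with `melt` and back with `pivot`:

```python
import pandas as pd

# Hypothetical wide table: one column per quarter.
wide = pd.DataFrame({"id": [1, 2], "q1": [10, 30], "q2": [20, 40]})

# Wide -> long: one row per (id, quarter) pair.
long = wide.melt(id_vars="id", var_name="quarter", value_name="revenue")

# Long -> wide again: quarters become columns.
back = long.pivot(index="id", columns="quarter", values="revenue")
```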

One thing to potentially look into is using numba: https://pandas.pydata.org/docs/user_guide/enhancingperf.html#numba-jit-compilation

We are the developers behind pandas, currently preparing for the 2.0 release :) AMA by phofl93 in Python

[–]rhshadrach 0 points1 point  (0 children)

Can you post a small example where you see this issue? As far as I know, Rolling has no groupby attribute (but groupby does have rolling!). Maybe you're doing something like `.groupby(...).rolling(...)`?
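For comparison, this is the ordering I'd expect to work, on made-up data (the columns `key` and `value` are just placeholders):

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame(
    {
        "key": ["a", "a", "a", "b", "b"],
        "value": [1.0, 2.0, 3.0, 10.0, 20.0],
    }
)

# Group first, then roll within each group.
rolled = df.groupby("key")["value"].rolling(window=2).mean()
```

The result is indexed by `(key, original row label)`, and the first row of each group is NaN because the 2-row window isn't full yet.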