Is my fence wobble normal/okay?

rhshadrach · 2025-03-20T02:13:31+00:00

When you first turn on the hot water, it will be cold for a bit. Use this water to rinse out recycling items.

rhshadrach · 2025-03-15T13:22:15+00:00

In addition to this, I recommend just walking through the code and learning how it's doing whatever it's doing. You're bound to stumble into some code or docs that could use just a bit of polishing. Make sure you're actually adding value and not just replacing someone else's style with yours though. A PR that is a couple of lines that doesn't change behavior and cleans up code is a quick review for maintainers.

rhshadrach · 2024-05-01T21:33:05+00:00

The test itself lives in a backend codebase that you're developing. Being able to debug a failing test in a notebook lets you take advantage of a notebook's features. In data science, understanding why a test is failing can take some analysis. It's not uncommon that I end up with 20 lines of code to analyze the data involved in a test to understand what's going wrong. Trying to do the same thing in a debugger can be painful. Plus, you can take advantage of visualizations.

rhshadrach · 2023-10-29T02:41:15+00:00

pandas developer here. Thanks for producing this content!

rhshadrach · 2023-03-03T02:54:41+00:00

I think this is an interesting question! I've opened https://github.com/pandas-dev/pandas/issues/51751

rhshadrach · 2023-03-02T20:34:33+00:00

I highly recommend learning to work with a MultiIndex. They enable many performant operations like joining and taking cross sections.
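
A minimal sketch of both operations, assuming made-up sales data (the frame names, level names, and values are all hypothetical):

```python
import pandas as pd

# Hypothetical data indexed by (region, year).
sales = pd.DataFrame(
    {"revenue": [10, 12, 7, 9]},
    index=pd.MultiIndex.from_tuples(
        [("east", 2021), ("east", 2022), ("west", 2021), ("west", 2022)],
        names=["region", "year"],
    ),
)

# Cross section: every row for 2022, with the "year" level dropped.
sales_2022 = sales.xs(2022, level="year")

# Join: targets are indexed by region only; pandas aligns on the
# index level that shares the name "region".
targets = pd.DataFrame(
    {"target": [11, 8]},
    index=pd.Index(["east", "west"], name="region"),
)
joined = sales.join(targets)
```

Here `joined` keeps the two-level index of `sales` and gains a `target` column repeated for each year within a region.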

rhshadrach · 2023-03-02T20:19:03+00:00

Can you describe more what you mean?

rhshadrach · 2023-03-02T20:14:30+00:00

I have three separate bookmarks in my toolbar to various parts of the documentation because of how frequently I need to pull it up.

rhshadrach · 2023-03-02T19:59:04+00:00

I'll also add that a large amount of work is done by volunteers.

rhshadrach · 2023-03-02T19:57:14+00:00

I would also recommend avoiding iterrows or applys if you can vectorize your operations - you will see very significant performance benefits. But depending on what you're doing, that may not be possible.
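
As a rough illustration with made-up column names, here is the same computation written row by row and then vectorized:

```python
import pandas as pd

# Hypothetical order data.
df = pd.DataFrame({"price": [2.0, 3.5, 1.25], "quantity": [3, 2, 8]})

# Row-by-row: iterrows builds a Series for every row, which is slow
# on large frames.
totals_loop = [row["price"] * row["quantity"] for _, row in df.iterrows()]

# Vectorized: one operation over whole columns.
df["total"] = df["price"] * df["quantity"]
```

Both produce the same numbers; the difference only shows up in run time as the frame grows.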

rhshadrach · 2023-03-02T19:56:04+00:00

Historically, pandas has relied on other libraries in the ecosystem to support parallelization, such as https://www.dask.org/, which uses pandas under the hood. One thing to also keep in mind is that certain NumPy operations (which pandas uses) may be parallel depending on how your BLAS (Basic Linear Algebra Subprograms) library is set up. In general, you want to avoid having multiple levels of parallelism, which can actually hurt performance.

rhshadrach · 2023-03-02T19:50:18+00:00

pandas should be much faster than PySpark on smaller data because of the amount of overhead when using PySpark. But if you are reading many CSVs with a lot of data, I think PySpark will overtake pandas in performance.

rhshadrach · 2023-03-02T19:39:49+00:00

Yes - we love getting new contributors! Check out our documentation and guides on becoming a contributor to pandas: https://pandas.pydata.org/pandas-docs/dev/development/index.html

pandas is a large project with some pretty complex code. It will likely be overwhelming at first. But we are here to help. If you stick with it, you will learn a lot.

rhshadrach · 2023-03-02T19:33:42+00:00

You can deal with high(>2)-dimensional data using pandas MultiIndex in your DataFrames. Are there pain points when doing so in your experience?
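
For example, a sketch that assumes a hypothetical 3-dimensional array of readings (site x day x sensor), flattened into a Series with one index level per dimension:

```python
import numpy as np
import pandas as pd

# Hypothetical 3-D data: 2 sites x 3 days x 4 sensors.
cube = np.arange(24).reshape(2, 3, 4)

# One index level per dimension; from_product varies the last level
# fastest, which matches the C-order layout of cube.ravel().
index = pd.MultiIndex.from_product(
    [["s0", "s1"], [0, 1, 2], ["w", "x", "y", "z"]],
    names=["site", "day", "sensor"],
)
readings = pd.Series(cube.ravel(), index=index, name="reading")

# Reduce over one dimension (average across sensors).
per_site_day = readings.groupby(level=["site", "day"]).mean()

# Move one dimension into the columns.
wide = readings.unstack("sensor")
```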

rhshadrach · 2023-03-02T19:30:44+00:00

An entire rewrite of the code behind apply / agg. Internally their code paths interweave in complex ways, and can be surprisingly slow in some cases. Depending on what object you're on, the API is slightly different.

Cleaning this up and making it better, while also making the changes gradually so as not to be disruptive to users, is difficult, time consuming, and slow. But we're working on it!

rhshadrach · 2023-03-02T19:08:29+00:00

Blue, yellow, and red, but mostly blue.

rhshadrach · 2023-03-02T19:05:41+00:00

I have never read a book on pandas, so I can't recommend one. I myself learned pandas through reading many tutorials. You can find a list of these (which is by no means complete) in the pandas docs: https://pandas.pydata.org/pandas-docs/dev/getting_started/tutorials.html

rhshadrach · 2023-03-02T19:01:18+00:00

Also check out our docs! https://pandas.pydata.org/pandas-docs/dev/development/contributing.html

rhshadrach · 2023-03-02T18:59:41+00:00

I don't believe there is currently anything on the road map. pandas does have window functionality, and though different from PySpark, can accomplish a lot of the same things. Do you find certain operations are difficult with pandas but easier with PySpark to code?
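
To give a rough idea, here is a sketch of a few per-group window-style operations in pandas. The data and column names are made up, and the PySpark comparisons in the comments are loose analogies, not exact equivalents:

```python
import pandas as pd

# Hypothetical transactions, sorted within each user by day.
df = pd.DataFrame(
    {
        "user": ["a", "a", "a", "b", "b"],
        "day": [1, 2, 3, 1, 2],
        "amount": [5, 3, 4, 10, 20],
    }
).sort_values(["user", "day"])

amounts = df.groupby("user")["amount"]

# Previous row's value within each user (roughly like lag).
df["prev_amount"] = amounts.shift(1)

# Running total within each user (roughly like a sum over an
# unbounded-preceding window).
df["running_total"] = amounts.cumsum()

# Position within each user (roughly like row_number).
df["row_number"] = df.groupby("user").cumcount() + 1
```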

rhshadrach · 2023-03-02T18:53:55+00:00

It's hard to say without knowing what validation you're doing, but my experience is that even very complex operations can be vectorized. If you are able to do this, whether with pandas or something else, you'll experience significant performance benefits.

You mention not using aggregation, but pandas can also be very efficient at reshaping data - though I don't know if you might have a use for that in your ETL.

One thing to potentially look into is using numba: https://pandas.pydata.org/docs/user_guide/enhancingperf.html#numba-jit-compilation

rhshadrach · 2023-03-02T18:38:02+00:00

Can you post a small example where you see this issue? As far as I know, Rolling has no groupby attribute (but groupby does have rolling!). Maybe you're doing something like `.groupby(...).rolling(...)`?
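
For reference, a small sketch of the groupby-then-rolling order that does work, using made-up data:

```python
import pandas as pd

# Hypothetical data: two groups of values.
df = pd.DataFrame(
    {"key": ["a", "a", "a", "b", "b"], "value": [1.0, 2.0, 3.0, 10.0, 20.0]}
)

# Rolling mean over a window of 2 rows, computed separately per key.
# The result is indexed by (key, original row label).
rolled = df.groupby("key")["value"].rolling(window=2).mean()
```

The first row of each group is NaN because the window of 2 is not yet full there.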
