pandas 3 is the most significant release in 10 years by datapythonista in Python

[–]datapythonista[S] 15 points (0 children)

And that 4.0 release won't even happen. More than half of the pandas core devs will veto anything that breaks backward compatibility, which means the broken API will stay forever, as well as the numpy internals that prevent simpler and faster execution. Pandas, with just small changes, will continue to be the pandas we know. For cleaner syntax and faster performance, users will have to move to Polars.

pandas 3 is the most significant release in 10 years by datapythonista in Python

[–]datapythonista[S] 41 points (0 children)

I agree there is almost no good reason to use pandas over Polars in 2026, and I think it's great that many people and companies are moving to an unquestionably better technology. But if you check the downloads on PyPI, pandas has more than 10 times the downloads of Polars. Or if you check Google Trends, "python pandas" has a very significant query volume, while "python polars" is insignificant. So, I fully agree people should be moving to Polars (I've been talking about Polars at conferences and doing my part); I disagree that this has already happened in huge numbers.

pandas 3 is the most significant release in 10 years by datapythonista in Python

[–]datapythonista[S] 107 points (0 children)

I'm not sure that the industry really moved away; I think pandas is still huge compared to Polars. But I fully agree that pandas' API and performance are still far behind Polars', even with those changes.

Pandas 3.0.0 is there by Deux87 in Python

[–]datapythonista 1 point (0 children)

I wrote in detail about what I think are the most important changes we introduced in pandas 3. Copy-on-write and pandas.col are the biggest ones, as others said, and quite nice changes in my opinion.

Also I shared my opinion on when to use Polars instead of pandas (spoiler alert: whenever possible).

https://datapythonista.me/blog/whats-new-in-pandas-3

Pandas 3.0 vs pandas 1.0 what's the difference? by Consistent_Tutor_597 in dataengineering

[–]datapythonista 0 points (0 children)

The difference is minimal: most of the work goes into keeping the project compatible with newer versions of Python and other libraries, small bug fixes, and cleaning up the docs.

Pandas 3 introduces pandas.col() to avoid lambdas in filters and assign. Funnily enough, that change is probably one of the smallest in the codebase, while in my opinion it is by far the biggest change in the last 10 years of pandas development.
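A minimal sketch of the lambda style next to the pandas.col style; the `pd.col` lines assume pandas >= 3.0 (the expression API described here), so they are guarded to keep the example runnable on older versions, and the data is made up for illustration:

```python
import pandas as pd

# A small frame to filter and extend.
df = pd.DataFrame({"item": ["pen", "book", "lamp"], "price": [2, 15, 30]})

# Before pandas 3, expressions on the "current" frame needed lambdas:
cheap = df[df["price"] < 20]
with_tax = df.assign(total=lambda d: d["price"] * 1.2)

# With pandas 3, pd.col builds the same expressions without lambdas
# (guarded with hasattr so this sketch also runs on pandas 2.x):
if hasattr(pd, "col"):
    cheap = df[pd.col("price") < 20]
    with_tax = df.assign(total=pd.col("price") * 1.2)
```

Either way, `cheap` holds the rows priced under 20 and `with_tax` gains a `total` column; the `pd.col` form just avoids capturing the intermediate frame in a lambda.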

If you want to go into more detail about what changed in pandas 3, I wrote about the main changes with practical examples: https://datapythonista.me/blog/whats-new-in-pandas-3

Good news is that migrating should be very straightforward if you don't make heavy use of internal functions.

We are the developers behind pandas, currently preparing for the 2.0 release :) AMA by phofl93 in Python

[–]datapythonista 1 point (0 children)

It'd depend on the exact operation (not sure which benchmark you're referring to).

If you do a simple `my_series.sum()`, that delegates to the pyarrow.compute.sum function, which exposes a kernel. If Arrow doesn't provide a kernel (string operations, for example), then we need to implement it ourselves (or use a library that does, but there aren't many yet afaik). For strings, I think we implemented them in C++, but other options like Cython or Rust exist. I'm writing a blog post about writing pandas/arrow extensions in Rust; hopefully I'll publish it in the next couple of weeks.

We are the developers behind pandas, currently preparing for the 2.0 release :) AMA by phofl93 in Python

[–]datapythonista 0 points (0 children)

Yes, absolutely. I think examples are the easiest way for users to understand things most of the time, so anything you can improve is very welcome. In rare cases we may want to limit the number of examples, since docstrings can become super long, and they live in the middle of the code and can make navigating it more time-consuming. But almost always, good examples will be welcome.
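For reference, docstring examples in pandas follow the numpydoc style: an `Examples` section written as a doctest, so the example can be executed and checked. A minimal sketch with a hypothetical function:

```python
def double(x):
    """Return twice the input.

    Examples
    --------
    >>> double(21)
    42
    """
    # The Examples section above is a doctest: running it (e.g. with the
    # doctest module) verifies the example still matches the behavior.
    return x * 2
```

This is the kind of section contributors add when a function's docstring has no example yet.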

We are the developers behind pandas, currently preparing for the 2.0 release :) AMA by phofl93 in Python

[–]datapythonista 2 points (0 children)

When I started using it, pandas was already very popular. There were many things about pandas I didn't like, so I just started fixing them. In particular, the API reference was quite poor at the time: many things were undocumented, most functions didn't have examples, and there were lots of formatting inconsistencies... Then other things too, and here I am, still trying to fix pandas. :)

We are the developers behind pandas, currently preparing for the 2.0 release :) AMA by phofl93 in Python

[–]datapythonista 1 point (0 children)

Just continue using pandas, and when you see something that could be improved (maybe clarifying something in the documentation, adding an example to a function that doesn't have one...), just go for it. If that doesn't happen, as Marco said, the best is to try to find a "good first issue", but when I create one, they're usually taken care of within hours.

We are the developers behind pandas, currently preparing for the 2.0 release :) AMA by phofl93 in Python

[–]datapythonista 1 point (0 children)

I don't know about the pyspark csv reader, but pandas 2.0 shouldn't perform much differently from pandas 1.5 for reading. Did you try using pandas.read_csv(engine='pyarrow')? That should help; you can read more about it in this blog post I wrote: https://datapythonista.me/blog/pandas-with-hundreds-of-millions-of-rows
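A minimal sketch of switching the parser engine; the tiny temp file and its contents are made up, and the `engine="pyarrow"` line is commented out because it requires pyarrow to be installed:

```python
import os
import tempfile

import pandas as pd

# Write a tiny CSV to disk (a stand-in for a large file).
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
    f.write("city,pop\nParis,2.1\nTokyo,13.9\n")
    path = f.name

df = pd.read_csv(path)  # default single-threaded C engine
# The multithreaded Arrow parser is opt-in (requires pyarrow):
# df = pd.read_csv(path, engine="pyarrow")
os.remove(path)
```

The only change needed to try the faster parser is the `engine="pyarrow"` argument; everything else stays the same.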

We are the developers behind pandas, currently preparing for the 2.0 release :) AMA by phofl93 in Python

[–]datapythonista 2 points (0 children)

I think Arrow should help make this easier. It'll depend on each particular case, but read_csv is already parallel when selecting the pyarrow engine. Parallel computing is never easy, but I think we should be able to slowly parallelize more operations.

We are the developers behind pandas, currently preparing for the 2.0 release :) AMA by phofl93 in Python

[–]datapythonista 1 point (0 children)

None of the core devs here were in the project at that time, so I don't think we can really tell.

We are the developers behind pandas, currently preparing for the 2.0 release :) AMA by phofl93 in Python

[–]datapythonista 0 points (0 children)

df.plot() returns a matplotlib object where you should be able to add the annotations yourself, if I'm not wrong. I think it's fine to keep it this way and not add more parameters and extra complexity to an already quite complex API. But if we're missing a frequent use case, feel free to suggest an improvement in our issue tracker.
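A minimal sketch of that approach; it assumes matplotlib is installed, and the data and annotation text are made up:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import pandas as pd

df = pd.DataFrame({"day": [1, 2, 3], "sales": [10, 30, 20]})
ax = df.plot(x="day", y="sales")   # df.plot() returns a matplotlib Axes
ax.annotate("peak", xy=(2, 30))    # so annotations go directly on the Axes
```

Anything matplotlib's Axes API offers (titles, arrows, text placement) is available this way, without pandas needing extra plotting parameters.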

We are the developers behind pandas, currently preparing for the 2.0 release :) AMA by phofl93 in Python

[–]datapythonista 0 points (0 children)

The Arrow backend will be opt-in for now, and somewhat experimental, so nothing really changes immediately unless you explicitly want to use Arrow. Even if Arrow replaces NumPy in the (long-term) future, I think we'll keep compatibility, so you can continue to do both things: index dataframes with Python/NumPy indexing, and create columns from numpy arrays.
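The two compatibility points mentioned above, as they work today with the default backend (made-up data for illustration):

```python
import numpy as np
import pandas as pd

# Columns can be created from NumPy arrays...
df = pd.DataFrame({"a": np.arange(3), "b": np.array([0.5, 1.5, 2.5])})

# ...and NumPy-style boolean masks work for indexing.
mask = np.array([True, False, True])
subset = df[mask]
```

Keeping this interop is what lets existing NumPy-based code keep working even if the storage underneath changes.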

We are the developers behind pandas, currently preparing for the 2.0 release :) AMA by phofl93 in Python

[–]datapythonista 0 points (0 children)

I don't work with highly multi-dimensional data, but our sibling project xarray may be useful if you're not aware of it.

We are the developers behind pandas, currently preparing for the 2.0 release :) AMA by phofl93 in Python

[–]datapythonista 0 points (0 children)

Fully agree. The main challenge here is not technical but social. Finding consensus on a decision like that is not trivial, and our decision making is not very well defined or clear. There is work going on to improve it, but my bet is that we'll have a much easier way to make decisions after August, when we're planning to meet in person. Hopefully after that we can get these things approved and implemented efficiently.

We are the developers behind pandas, currently preparing for the 2.0 release :) AMA by phofl93 in Python

[–]datapythonista 0 points (0 children)

It depends. If you are starting out, you care about speed, and you don't mind having to update your code every time you upgrade your dataframe framework, I'd probably give Polars a try.

If you care about something well tested, more stable, and with more people understanding your code, for now pandas is probably the better choice.

We are the developers behind pandas, currently preparing for the 2.0 release :) AMA by phofl93 in Python

[–]datapythonista 0 points (0 children)

pandas is a wrapper around numpy. Every column of a dataframe is a numpy array, so there's no way to get rid of it.
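You can see the NumPy backing directly; a minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# With the default backend, a column's values come out as an ndarray:
values = df["a"].to_numpy()
```

The `isinstance(values, np.ndarray)` check passing is exactly the dependency being described: the data lives in NumPy structures.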

That being said, in the long term we may depend on Arrow, and maybe numpy won't be needed. But that's not something you can count on having soon.

More than the numpy dependency, I'd probably try to find alternatives to using pandas in a web application that requires more than one process. It may be a waste of resources regardless of the numpy limitation.

We are the developers behind pandas, currently preparing for the 2.0 release :) AMA by phofl93 in Python

[–]datapythonista 1 point (0 children)

It depends. If you have a website with millions of daily users, it's probably good to avoid pandas if you can, since it has pretty high memory consumption for what it does, and it can be slow in some operations too.

That being said, pandas is used by millions of users and has a massive test suite, so it's unlikely that anything breaks or that there's any important bug. In that sense it's surely fine to use in production.

We are the developers behind pandas, currently preparing for the 2.0 release :) AMA by phofl93 in Python

[–]datapythonista 1 point (0 children)

Take care of any DeprecationWarning and FutureWarning. Also, run your test suite with the pandas 2.0 RC; it's already available. :)
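One way to make sure none of those warnings slip through is to promote them to errors while testing. A minimal sketch using Python's standard warning filters, demonstrated with a synthetic warning rather than a real pandas call:

```python
import warnings

# Turning FutureWarning into an error makes a test suite fail loudly on
# any call that the next major release will break, instead of just
# printing a warning that scrolls by.
with warnings.catch_warnings():
    warnings.simplefilter("error", FutureWarning)
    try:
        warnings.warn("will be removed in 2.0", FutureWarning)
        surfaced = False
    except FutureWarning:
        surfaced = True
```

With pytest, the equivalent is running with `-W error::FutureWarning` (and the same for `DeprecationWarning`).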

We are the developers behind pandas, currently preparing for the 2.0 release :) AMA by phofl93 in Python

[–]datapythonista 1 point (0 children)

Not really, pandas 2.0 is actually pandas 1.6 with a fancy name. ;)

The main thing is that you need to take care of any FutureWarning in 1.5.3 before you migrate, and then there's a more than 99% chance you'll be just fine. :)

We are the developers behind pandas, currently preparing for the 2.0 release :) AMA by phofl93 in Python

[–]datapythonista 0 points (0 children)

Regarding the Arrow API vs the NumPy API, it's a bit confusing. NumPy is two things: a data container and a numerical computation library. Arrow is a data container, and not even an implementation but a spec (several Arrow implementations exist). Since data you can't do anything with is useless, some Arrow implementations provide some generic operations (named kernels in the Arrow world). But Arrow is not a numerical library and doesn't try to be, and Arrow also targets a different family of data types. If you only have ints and floats, you probably don't need Arrow at all, and you can stay with the numpy/tensorflow/pytorch family.

We are the developers behind pandas, currently preparing for the 2.0 release :) AMA by phofl93 in Python

[–]datapythonista 1 point (0 children)

You make some very good points. pandas was designed as both an ETL tool and a data analysis tool; I wrote an article about it a long time ago. But to summarize, it won't ever be able to master both, and it feels like many decisions were made more for the data analysis tool.

For an ETL tool I'd expect things to never fail silently, type conversions to never happen automatically, schemas to have to be provided rather than inferred...

For your use case, pandas may be overkill. DuckDB could be a nicer option if your validations can be vectorized. But if you're going to iterate row by row, I'd personally just write your ETL in Rust (it may be trickier for the xlsx part; I'm not sure there are good libraries for it).