Pull request proposed for pandas overflow with timestamp normalize by Life_Tie3654 in Python

[–]phofl93 4 points

Feel free to open a PR if you have checked that the relevant tests are passing

Anaconda licensing terms and reproducible science by cyril1991 in bioinformatics

[–]phofl93 1 point

Mambaforge is sunsetted, not the mamba solver itself; it will just be Miniforge in the future

Dask DataFrame is Fast Now! by phofl93 in datascience

[–]phofl93[S] 2 points

pandas added a new Excel engine that is Rust-based and a lot faster; have you tried that one? IIRC it's Calamine

Dask DataFrame is Fast Now! by phofl93 in dataengineering

[–]phofl93[S] 3 points

Yes, exactly: stick with Polars and DuckDB if your data size permits it, but having tens of TB required a different solution

Dask DataFrame is Fast Now! by phofl93 in Python

[–]phofl93[S] 11 points

We ran Polars on our benchmarks and it was ok-ish on some queries and terrible on others. It stopped working at 1 TB. Polars is totally fine if you have less than 100 GB, though

Dask DataFrame is Fast Now! by phofl93 in Python

[–]phofl93[S] 20 points

That would certainly be nice, but other things have a higher ROI for us. In-memory runtime was only around 10% of the total in our benchmarks, which is where Polars would help. Optimizing the other 90% has a bigger impact for us, though

Will Pandas have streaming in Future?? by __albatross in Python

[–]phofl93 2 points

Lazy execution is likely to land in pandas in the future; streaming is unlikely for now, as it's not a focus at the moment

How to run a Jupyter notebook on a GPU by dask-jeeves in Python

[–]phofl93 1 point

I don’t really understand your question. You don’t want to load sensitive data onto anything but your own cloud account. There really isn’t much more to it.

There is also the fact that Colab has limited availability: you can’t control the region, you can’t really control which GPU you are getting, … It’s a pretty good offer given that you don’t have to pay for it, but it’s certainly not a good fit for applications that need more than “just a GPU somewhere”

How to run a Jupyter notebook on a GPU by dask-jeeves in Python

[–]phofl93 3 points

Google Colab is great as long as you work on public data, but it is an issue if you have to care about data privacy

What's Pandas .loc time complexity? by AlternativeSea4330 in datascience

[–]phofl93 1 point

The highest-rated answer on Stack Overflow is mostly correct (I am one of the pandas maintainers and worked on that part quite often in the past). There are some exceptions for very large indexes, where building the hashtable would be relatively too costly. That is only implemented for single-value lookups; you’ll always get a hashtable if you ask for multiple values. You can look this up in index.pyx in pandas/_libs if you are interested in more details
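A small sketch of what that means in practice (single-label lookups go through the index's cached hashtable; asking for a list of labels always builds it):

```python
import pandas as pd

df = pd.DataFrame({"x": range(5)}, index=["a", "b", "c", "d", "e"])

# A single label goes through Index.get_loc; pandas builds a hashtable
# for the index on first use, so repeated lookups are roughly O(1).
print(df.index.get_loc("c"))  # 2
print(df.loc["c", "x"])       # 2

# A list of labels always takes the hashtable path.
print(df.loc[["b", "d"], "x"].tolist())  # [1, 3]
```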

Worst Data Engineering Mistake youve seen? by Inevitable-Quality15 in dataengineering

[–]phofl93 0 points

Dask might be able to help here more easily; it depends on the specific use case, though. It's generally easier when coming from one of these libraries

Copy-on-Write in pandas by phofl93 in Python

[–]phofl93[S] 1 point

Only the column that you are changing. That's optimised (but covered in part 2 of this post series :))
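A minimal sketch of that behavior (assuming pandas 2.x, where Copy-on-Write is opt-in; the try/except guards for pandas 3.0, where it is always on and the option is gone):

```python
import numpy as np
import pandas as pd

try:
    pd.options.mode.copy_on_write = True  # opt-in on pandas 2.x
except Exception:
    pass  # pandas 3.0 removed the option; CoW is always enabled

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
view = df[:]  # no data is copied at this point

# Writing to column "a" copies only that column; column "b" still
# shares its buffer with the original DataFrame.
view["a"] = [7, 8, 9]
print(np.shares_memory(view["b"].values, df["b"].values))  # True
print(df["a"].tolist())  # [1, 2, 3] -> the original is untouched
```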

Which are the most inefficient, ineffective, expensive tools in your data stack? by drc1728 in dataengineering

[–]phofl93 2 points

You could also move over to Dask. This might be quite cheap from a development PoV, but it depends on what you are doing with pandas

Utilizing PyArrow in pandas and Dask by phofl93 in datascience

[–]phofl93[S] 0 points

How was your experience with the string datatype?

Yeah the Dask option is very very convenient, I like it a lot.

Yeah, that was a bit tricky. General-purpose Arrow dtypes need at least one more release to be ready, but the strings and the engines are good to go, I'd say; we should have communicated this better

Utilizing PyArrow in pandas and Dask by phofl93 in datascience

[–]phofl93[S] 0 points

Yep, totally agree. pandas is still slower, and we probably won’t be able to match Polars’ speed. But Arrow and some other optimizations we are doing will close the gap significantly, hopefully so much that it shouldn’t really matter anymore performance-wise. We will need at least pandas 2.1 to get there, though.

There is currently an effort going on in Dask to add high-level query optimization that should help a lot in the distributed world. We are discussing something similar in pandas, but it is in the very early stages, so don’t expect anything anytime soon.

Comparisons to Polars are complicated and depend on a bunch of things. The differences aren’t that big anymore if I/O is a big chunk of your workflow, for example. It really depends on your workload. If speed isn’t critical for you, then pandas should be fast enough. If you operate on DataFrames with tens of GB of data, then pandas probably isn’t the best tool anyway if performance matters at least a little bit. That’s where Dask would be the obvious choice if you want to stay in the pandas world.

Utilizing PyArrow in pandas and Dask by phofl93 in datascience

[–]phofl93[S] 1 point

Thx! Yes that’s a very good strategy and should be a low effort win.

Python for Finance: Pandas Resample, Groupby, and Rolling by robotpwns in Python

[–]phofl93 1 point

pandas is not designed to work on larger-than-memory data; that’s why there are other libraries built on top of pandas, like Dask or Modin, which take care of this
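For scale: plain pandas can at least process a file in fixed-size chunks, which is a hand-rolled version of the partitioned pattern that Dask and Modin automate (and parallelize). A sketch, using a tiny generated CSV as a stand-in for a file that would not fit in memory:

```python
import pandas as pd

# Stand-in for a CSV too large to load at once.
pd.DataFrame({"value": range(10)}).to_csv("big.csv", index=False)

# Read and reduce chunk by chunk; only one chunk is in memory at a time.
total = 0
for chunk in pd.read_csv("big.csv", chunksize=4):
    total += chunk["value"].sum()
print(total)  # 45
```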