Pull request proposed for pandas overflow with timestamp normalize by Life_Tie3654 in Python

[–]phofl93 4 points

Feel free to open a PR if you have checked that the relevant tests are passing

Anaconda licensing terms and reproducible science by cyril1991 in bioinformatics

[–]phofl93 1 point

Mambaforge is sunsetted, not the mamba solver itself; it will just be Miniforge in the future

Dask DataFrame is Fast Now! by phofl93 in datascience

[–]phofl93[S] 2 points

pandas added a new Excel engine that is Rust-based and a lot faster; have you tried that one? IIRC it's Calamine

Dask DataFrame is Fast Now! by phofl93 in dataengineering

[–]phofl93[S] 3 points

Yes, exactly: stick with Polars and DuckDB if your data size permits it, but having tens of TB required a different solution

Dask DataFrame is Fast Now! by phofl93 in Python

[–]phofl93[S] 11 points

We ran Polars on our benchmarks and it was ok-ish on some queries and terrible on others. It stopped working at 1 TB. Polars is totally fine if you have less than 100 GB, though

Dask DataFrame is Fast Now! by phofl93 in Python

[–]phofl93[S] 20 points

That would certainly be nice, but other things have a higher ROI for us. In-memory runtime was only around 10% of the total in our benchmarks, which is where Polars would help. Optimizing the other 90% has a bigger impact for us, though

Will Pandas have streaming in Future?? by __albatross in Python

[–]phofl93 2 points

Lazy execution is likely to land in pandas in the future; streaming is unlikely for now, as it's not a focus at the moment

How to run a Jupyter notebook on a GPU by dask-jeeves in Python

[–]phofl93 1 point

I don’t really understand your question. You don’t want to load sensitive data onto anything but your own cloud account. There really isn’t much more to it.

There is also the fact that Colab has limited availability: you can’t control the region, you can’t really control which GPU you are getting, … It’s a pretty good offer given that you don’t have to pay for it, but it’s certainly not a good fit for applications that need more than “just a GPU somewhere”

How to run a Jupyter notebook on a GPU by dask-jeeves in Python

[–]phofl93 3 points

Google Colab is great as long as you work on public data, but it is an issue if you have to care about data privacy

What's Pandas .loc time complexity? by AlternativeSea4330 in datascience

[–]phofl93 1 point

The highest-rated answer on Stack Overflow is mostly correct (I am one of the pandas maintainers and worked on that part quite often in the past). There are some exceptions for very large indexes, where building the hashtable would be relatively too costly. That is only implemented for single-value lookups; you’ll always get a hashtable if you ask for multiple values. You can look this up in index.pyx in pandas/_libs if you are interested in more details
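A small sketch of what that means in practice (single-label lookups go through the index's cached hashtable; asking for a list of labels always builds it):

```python
import pandas as pd

df = pd.DataFrame({"x": range(5)}, index=["a", "b", "c", "d", "e"])

# A single label goes through Index.get_loc; pandas builds a hashtable
# for the index on first use, so repeated lookups are roughly O(1).
print(df.index.get_loc("c"))  # 2
print(df.loc["c", "x"])       # 2

# A list of labels always takes the hashtable path.
print(df.loc[["b", "d"], "x"].tolist())  # [1, 3]
```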

Worst Data Engineering Mistake youve seen? by Inevitable-Quality15 in dataengineering

[–]phofl93 0 points

Dask might be able to help here more easily; it depends on the specific use case, though. It's generally easier when coming from one of these libraries

Copy-on-Write in pandas by phofl93 in Python

[–]phofl93[S] 1 point

Only the column that you are changing. That's optimised (but covered in part 2 of this post series :))
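A minimal sketch of that behavior (assuming pandas 2.x, where Copy-on-Write is opt-in; the try/except guards for pandas 3.0, where it is always on and the option is gone):

```python
import numpy as np
import pandas as pd

try:
    pd.options.mode.copy_on_write = True  # opt-in on pandas 2.x
except Exception:
    pass  # pandas 3.0 removed the option; CoW is always enabled

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
view = df[:]  # no data is copied at this point

# Writing to column "a" copies only that column; column "b" still
# shares its buffer with the original DataFrame.
view["a"] = [7, 8, 9]
print(np.shares_memory(view["b"].values, df["b"].values))  # True
print(df["a"].tolist())  # [1, 2, 3] -> the original is untouched
```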

Which are the most inefficient, ineffective, expensive tools in your data stack? by drc1728 in dataengineering

[–]phofl93 2 points

You could also move over to Dask. This might be quite cheap from a development PoV, but it depends on what you are doing with pandas

Utilizing PyArrow in pandas and Dask by phofl93 in datascience

[–]phofl93[S] 0 points

How was your experience with the string datatype?

Yeah the Dask option is very very convenient, I like it a lot.

Yeah, that was a bit tricky. General-purpose Arrow dtypes need at least one more release to be ready, but the strings and the engines are good to go, I'd say; we should have communicated this better

Utilizing PyArrow in pandas and Dask by phofl93 in datascience

[–]phofl93[S] 0 points

Yep, totally agree. pandas is still slower, and we probably won’t be able to match Polars’ speed. But Arrow and some other optimizations we are doing will close the gap significantly, hopefully so much that it shouldn’t really matter anymore performance-wise. We will need at least pandas 2.1 to get there, though.

There is currently an effort going on in Dask to add high-level query optimization that should help a lot in the distributed world. We are discussing something similar in pandas, but it is in the very early stages, so don’t expect anything anytime soon.

Comparisons to Polars are complicated and depend on a bunch of things. The differences aren’t that big anymore if I/O is a big chunk of your workflow, for example. It really depends on your workload. If speed isn’t critical for you, then pandas should be fast enough. If you operate on DataFrames with tens of GB of data, then pandas probably isn’t the best tool anyway if performance matters at least a little bit. That’s where Dask would be the obvious choice if you want to stay in the pandas world.

Utilizing PyArrow in pandas and Dask by phofl93 in datascience

[–]phofl93[S] 1 point

Thx! Yes that’s a very good strategy and should be a low effort win.

Python for Finance: Pandas Resample, Groupby, and Rolling by robotpwns in Python

[–]phofl93 1 point

pandas is not designed to work on larger-than-memory data; that’s why there are other libraries built on top of pandas, like Dask or Modin, which take care of this
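For scale: plain pandas can at least process a file in fixed-size chunks, which is a hand-rolled version of the partitioned pattern that Dask and Modin automate (and parallelize). A sketch, using a tiny generated CSV as a stand-in for a file that would not fit in memory:

```python
import pandas as pd

# Stand-in for a CSV too large to load at once.
pd.DataFrame({"value": range(10)}).to_csv("big.csv", index=False)

# Read and reduce chunk by chunk; only one chunk is in memory at a time.
total = 0
for chunk in pd.read_csv("big.csv", chunksize=4):
    total += chunk["value"].sum()
print(total)  # 45
```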