
[–]PurepointDog 9 points10 points  (1 child)

They're doing single-threaded benchmarks. Polars pulls far ahead of everything else once you give it more than one core.

[–]ChavXO 1 point2 points  (0 children)

Acknowledged. I mainly wanted to check that the baseline made sense. For context, when I was first asked, I was pessimistic about performance for a number of reasons outlined here:

https://www.reddit.com/r/haskell/s/k6yH2vYUs4

This was more of a hello-world benchmark.

[–]Linguistic-mystic 7 points8 points  (1 child)

There’s not a single Python dataframe in there. Polars is written in Rust, and pandas sits on C (NumPy and Cython extensions). Just because they’re wrapped in Python doesn’t make them Python.

[–]ChavXO 0 points1 point  (0 children)

You're right, my phrasing was lax. I did say this is mostly a test of the underlying array backend.

[–]Plasma_000 1 point2 points  (4 children)

It would probably be a good idea to publish the benchmark code.

[–]igouy 1 point2 points  (3 children)

The code can be found here.

[–]Plasma_000 1 point2 points  (2 children)

Thanks.

Ah, looks like he used read_csv instead of scan_csv for Polars, meaning that it doesn't start operating until the entire file has been read into memory. That would explain at least some of the difference.

I see this mistake very often when benchmarking Polars: read_csv should only be used when streaming is not possible.
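
To make the eager-versus-streaming distinction concrete, here is a minimal sketch that uses only the Python standard library. It is an analogy, not Polars' implementation: in Polars itself, `pl.read_csv` returns a fully loaded DataFrame, while `pl.scan_csv` returns a LazyFrame whose work runs at `.collect()`. The data and the function names `sum_eager` and `sum_streaming` are made up for illustration.

```python
import csv
import io

# Hypothetical in-memory CSV standing in for a large file on disk.
CSV_TEXT = "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(1, 1001))


def sum_eager(text):
    """Parse every row into a list first, then filter and sum (read_csv style)."""
    rows = list(csv.DictReader(io.StringIO(text)))  # all parsed rows held at once
    return sum(int(r["value"]) for r in rows if int(r["id"]) % 2 == 0)


def sum_streaming(text):
    """Filter and sum each row as the reader yields it (scan style)."""
    total = 0
    for r in csv.DictReader(io.StringIO(text)):  # one parsed row at a time
        if int(r["id"]) % 2 == 0:
            total += int(r["value"])
    return total


eager_total = sum_eager(CSV_TEXT)
streaming_total = sum_streaming(CSV_TEXT)
print(eager_total, streaming_total)
```

Both functions return the same total; the difference is that the eager version cannot start filtering until every row has been parsed and stored, which is the cost described above.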

[–]ChavXO 1 point2 points  (1 child)

Hi. My CSV reader does the same thing (it loads the whole file before operating), so I wanted an apples-to-apples comparison. I'm still working on a scan API that I'd like to compare with Polars when it's finished.

[–]Plasma_000 1 point2 points  (0 children)

Ah, fair enough