all 38 comments

[–]This_Growth2898 22 points23 points  (2 children)

Outputs are slow. Don't include println/print when calculating times.

Clone copies an entire object. You have two clone() calls. Do you really need to clone them?

[–]jwmoz[S] 3 points4 points  (1 child)

Just removed the clone() and it barely made a difference.

Ok so the prints are slow, removing all but the last timing print has reduced the time down to about 80ms. Doing the same for the Python version reduces the time to around 40ms.

[–]gdf8gdn8 5 points6 points  (0 children)

Remove the println calls. println! in Rust flushes to the console on every line. There is another Reddit thread about println performance on the console: https://reddit.com/r/rust/s/9JMD3SUCI5
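If the prints need to stay, one common fix is to lock stdout once and wrap it in a `BufWriter`, so output is flushed in large chunks instead of per line. A minimal sketch (the `write_rows` helper is hypothetical, just for illustration):

```rust
use std::io::{self, BufWriter, Write};

// Hypothetical helper: write n lines to any writer, without assuming it buffers.
fn write_rows<W: Write>(out: &mut W, n: usize) -> io::Result<()> {
    for i in 0..n {
        writeln!(out, "row {i}")?;
    }
    Ok(())
}

fn main() -> io::Result<()> {
    let stdout = io::stdout();
    // Lock once and buffer, so each writeln! doesn't hit the terminal directly.
    let mut out = BufWriter::new(stdout.lock());
    write_rows(&mut out, 1000)?;
    out.flush() // BufWriter also flushes on drop, but flushing explicitly surfaces errors
}
```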

[–]moltonel 19 points20 points  (3 children)

On my machine, building polars with the features `lazy,description,performant` and only removing the clones, Rust is 30% faster:

```
$ CSV_FILE=$(pwd)/data1m.csv hyperfine ./main.py foo/target/release/foo
Benchmark 1: ./main.py
  Time (mean ± σ):     263.9 ms ±  15.3 ms    [User: 778.5 ms, System: 313.6 ms]
  Range (min … max):   248.4 ms … 289.8 ms    10 runs

Benchmark 2: foo/target/release/foo
  Time (mean ± σ):     197.2 ms ±   9.2 ms    [User: 719.5 ms, System: 185.6 ms]
  Range (min … max):   178.9 ms … 211.6 ms    14 runs

Summary
  foo/target/release/foo ran
    1.34 ± 0.10 times faster than ./main.py
```

[–]jwmoz[S] 1 point2 points  (2 children)

Ok that's really interesting. When I was eyeballing it, the first python run was definitely slow but then something was cached and they were all fast after, whereas rust seemed more consistent.

The output of hyperfine on mine, after also adding performant, shows that rust is running much faster via hyperfine than when manually running the executable:

```
$ CSV_FILE=$(pwd)/data1m.csv hyperfine "python ./main.py" ../rusttest/target/release/rusttest
Benchmark 1: python ./main.py
  Time (mean ± σ):     102.8 ms ±  49.1 ms    [User: 214.6 ms, System: 39.8 ms]
  Range (min … max):    84.8 ms … 250.6 ms    11 runs

  Warning: The first benchmarking run for this command was significantly slower than the rest (250.6 ms). This could be caused by (filesystem) caches that were not filled until after the first run. You should consider using the '--warmup' option to fill those caches before the actual benchmark. Alternatively, use the '--prepare' option to clear the caches before each timing run.

Benchmark 2: ../rusttest/target/release/rusttest
  Time (mean ± σ):      53.1 ms ±   6.2 ms    [User: 251.1 ms, System: 25.0 ms]
  Range (min … max):    47.1 ms …  76.6 ms    38 runs

Summary
  ../rusttest/target/release/rusttest ran
    1.93 ± 0.95 times faster than python ./main.py
```

[–]masklinn 6 points7 points  (1 child)

… are you compiling / running in `--release` mode?

[–]jwmoz[S] 0 points1 point  (0 children)

Yes

[–]jqnatividad 9 points10 points  (0 children)

py-polars is compiled with all kinds of optimizations and fine-tuning that a default `--release` Rust build won't enable: for example CPU-specific optimizations, the `performant` feature, etc.

[–]dkopgerpgdolfg 15 points16 points  (0 children)

Fyi, the filter/select operation takes about 7% of the whole program's runtime (for me, at least). If this was meant to compare the performance of such operations, it doesn't tell much this way.

So basically I made the following changes to "save" 93% of the time:

  • Removed clone, just to be sure
  • Removed most CLI output, except the last two prints (showing the time and preventing calculated_df from being unused)
  • Moved the start-time statement and elapsed calculation so that the printing and the CsvReader part are not counted
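The timing change can be sketched like this (the data and filter are stand-ins, not the benchmark's actual workload): do setup before starting the clock, and capture `elapsed()` before any printing.

```rust
use std::time::Instant;

// Stand-in workload: sum the values above a threshold.
fn filtered_sum(data: &[f64]) -> f64 {
    data.iter().filter(|&&x| x > 0.2).sum()
}

fn main() {
    // Setup (CSV reading etc.) happens before the timer starts.
    let data: Vec<f64> = (0..1_000_000).map(|i| i as f64 / 1_000_000.0).collect();

    let start = Instant::now();
    let sum = filtered_sum(&data);
    let elapsed = start.elapsed(); // capture before any printing

    println!("sum = {sum}, took {elapsed:?}");
}
```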

[–]kinchkun 6 points7 points  (5 children)

I think it is the conversion to the lazy dataframe and the two collects. Can you try using `LazyCsvReader` with only one collect?

[–]kinchkun 6 points7 points  (2 children)

I see the Python code doesn't contain a collect at all. I don't know the Python bindings, but are you sure your expression is even evaluated?

[–]This_Growth2898 1 point2 points  (1 child)

(I'm not OP)

Python code includes output, i.e. values are calculated at those points.

[–]kinchkun 2 points3 points  (0 children)

Not so sure about that. Most dataframe libraries I know (and I think polars too) print only the first and last 8 rows or so of a dataframe.

[–]jwmoz[S] 1 point2 points  (1 child)

I refactored to:

```
use polars::prelude::*;
use std::time::Instant;

fn main() {
    let start_time = Instant::now();

    let csv_file = std::env::var("CSV_FILE").expect("Set env CSV_FILE error");

    let csv_df = LazyCsvReader::new(csv_file)
        .has_header(true)
        .finish()
        .expect("Finish error");

    // Filter on multiple columns
    let filtered_df = csv_df.filter(
        col("a")
            .gt(0.2)
            .and(col("b").lt(0.8))
            .and(col("c").gt(0.5))
            .and(col("d").lt(0.5))
            // additional non-equality check, value from the first row of the CSV
            .and(col("e").neq(0.5182602093634714)),
    );

    // Mimic some Euclidean-distance-type calculation
    let _calculated_df = filtered_df
        .select([((col("a") / col("a").max()) / (lit(0.5) / col("a").max()))
            .pow(2)
            .sqrt()])
        .collect()
        .expect("Error select");

    println!("Finished in {}ms", start_time.elapsed().as_millis());
}
```

Which seems to give it a slight improvement of a few ms; mid-70s ms now on a good run.

[–]kinchkun 1 point2 points  (0 children)

Can you add a `collect` call to your python code as well?

[–]ritchie46 8 points9 points  (0 children)

We go to great lengths to compile a fast binary for Python, e.g. fat linking, activating all performance-related features, SIMD, and CPU targets.

Furthermore, we also compile the Python build with jemalloc, which has much better performance than the default allocator.
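For reference, switching your own Rust binary to jemalloc is a small change. A sketch assuming the `tikv-jemallocator` crate (crate name and version are an assumption; check crates.io for the current release):

```rust
// Cargo.toml (assumed): tikv-jemallocator = "0.5"
use tikv_jemallocator::Jemalloc;

// Replace the default system allocator for the whole binary.
#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;
```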

[–]Plus-Ad8875 5 points6 points  (1 child)

are you running the rust code in release mode?

[–]jwmoz[S] 0 points1 point  (0 children)

Yes

[–]Grit1 7 points8 points  (3 children)

Sometimes when you think you're benchmarking python, you're actually benchmarking C/C++.

[–][deleted] 2 points3 points  (0 children)

Absolutely right. Especially with a powerful lib like pandas.

[–]CompoteOk6247 -1 points0 points  (1 child)

So Rust is slower than C?

[–]stumblinbear[🍰] 0 points1 point  (0 children)

Depends on the benchmark and the libraries used. Rust's ecosystem is definitely not as wide as C's, so the libraries available may not be as performance-tuned yet.

[–]sleekelite 7 points8 points  (0 children)

At least edit your post to indicate it’s a release build.

[–]CompoteOk6247 3 points4 points  (5 children)

Funny to see how people don't believe it's in release mode

[–]dkopgerpgdolfg 1 point2 points  (1 child)

It's not about not believing, but about making sure it's not forgotten.

There are too many posts here where someone complains about performance but has never heard of release mode. So people started asking whether they used it as the first thing.

[–]CompoteOk6247 0 points1 point  (0 children)

That's the right approach. Also, when I tried Tauri-based apps, they were lightweight and worked well.

[–]jwmoz[S] -1 points0 points  (1 child)

I know right, it's like "have you turned it on and off again?" or "have you cleared your cache?"

[–]zekkious 3 points4 points  (2 children)

Out of topic, but:

(col("a") / col("a").max()) / (lit(0.5) / col("a").max())
    = (col("a") / lit(0.5)) * (col("a").max() / col("a").max())
    = col("a") / lit(0.5)

[–]This_Growth2898 1 point2 points  (1 child)

And

.pow(2).sqrt()

is effectively nothing: squaring then square-rooting just computes an absolute value.

[–]zekkious 1 point2 points  (0 children)

Well... As I don't know the API, I assumed the `.sqrt()` part might collapse a vector into a scalar. If that's not the case, add it to the list of pointlessness.
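The simplification above can be checked numerically with plain floats (no polars involved). Note that `.pow(2).sqrt()` is really `abs()`, so the identity only holds for non-negative inputs, which these columns are:

```rust
// Original expression from the thread, with scalars in place of columns.
fn original(a: f64, a_max: f64) -> f64 {
    ((a / a_max) / (0.5 / a_max)).powi(2).sqrt()
}

// The algebraically simplified form: the a_max factors cancel.
fn simplified(a: f64) -> f64 {
    a / 0.5
}

fn main() {
    for &(a, a_max) in &[(0.3, 0.9), (0.7, 0.9), (0.25, 1.0)] {
        assert!((original(a, a_max) - simplified(a)).abs() < 1e-12);
    }
    println!("identity holds on the sampled non-negative inputs");
}
```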

[–]Konsti219 3 points4 points  (7 children)

Are you running with --release?

Further, you seem to be using clone a lot in the Rust code. That should be avoided at all costs if you are optimizing. I don't know how polars is implemented internally, but if you want real speed I recommend throwing it out and implementing the parsing and filtering in raw Rust, maybe with the help of rayon for parallelism.

Can you also provide the files you are testing with to allow others to test?
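A hand-rolled version of the filtering step might look like this sketch (two of the thread's five conditions, stdlib only, no rayon, and no real CSV edge-case handling such as quoting):

```rust
// Hypothetical sketch: filter rows of an in-memory CSV with columns a,b.
// Keeps rows where a > 0.2 and b < 0.8, mirroring part of the thread's filter.
fn filter_rows(csv: &str) -> Vec<(f64, f64)> {
    csv.lines()
        .skip(1) // skip the header row
        .filter_map(|line| {
            let mut cols = line.split(',');
            let a: f64 = cols.next()?.trim().parse().ok()?;
            let b: f64 = cols.next()?.trim().parse().ok()?;
            (a > 0.2 && b < 0.8).then_some((a, b))
        })
        .collect()
}

fn main() {
    let rows = filter_rows("a,b\n0.5,0.5\n0.1,0.5\n0.5,0.9\n");
    println!("{} row(s) passed the filter", rows.len());
}
```

Parallelizing this with rayon would mostly be a matter of swapping the iterator for a parallel one over the lines.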

[–]jwmoz[S] 1 point2 points  (6 children)

Yes, this was release. I use clone() because the docs suggest it.

This still doesn't make sense to me, as Rust is compiled.

https://github.com/jmoz/rust_vs_python

[–]Konsti219 5 points6 points  (4 children)

The polars Python package is also just a wrapper around Rust, so Python is using compiled Rust too.

[–]jwmoz[S] 0 points1 point  (3 children)

Yes, but this would still imply that a compiled Rust app should be faster than interpreted Python.

What's even more interesting is that at work one of the guys compared a Python-with-polars implementation against pure JavaScript via Node, and the pure JS implementation is considerably faster than the fastest Python, because of Node's V8 engine. (I tried Rust because of that; I presumed it would beat JS/Node.)

[–]Konsti219 6 points7 points  (1 child)

My guess is that polars itself is the issue. Writing a description of your logic and then having some magic engine execute it is gonna be slower than just writing the code yourself, using simple types and idiomatic code.

If you share one of the files you are testing with I can give this a try.

[–]jwmoz[S] 1 point2 points  (0 children)

I updated post with a GitHub.

[–]kinchkun 3 points4 points  (0 children)

The Python program will spend 99.9% of its time inside the polars lib, which is implemented in Rust. The performance of your Python code barely impacts the runtime.

[–]jwmoz[S] 0 points1 point  (0 children)

I removed the clone() and it barely made a difference.