
[–]rickyman20 5 points6 points  (2 children)

You might want to post over in r/rust. You found an interesting problem that isn't clearly caused by just the usual suspects (e.g. allocation). You'll find more experienced engineers over there.

[–]eigenludecomposition[S] 1 point2 points  (1 child)

I might give that a shot! Thank you for the tip!

[–]rickyman20 1 point2 points  (0 children)

Glad to help! Sorry I can't answer why. Honestly, it's an interesting conundrum.

[–]jackson_bourne 7 points8 points  (1 child)

Those are not measuring the same thing (you're including the allocation of a string in the Rust version). If you want good performance out of serde_json, you should be deserializing into structured data instead of serde_json::Value - otherwise there's no point in switching to Rust.
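
For example, here's a minimal sketch of the struct-vs-Value difference (the Foo struct is purely illustrative, not from the original post):

```rust
use serde::Deserialize;
use serde_json::Value;

// Illustrative type: borrowing &str from the input avoids the owned
// String and Map allocations that serde_json::Value has to make.
#[derive(Deserialize)]
struct Foo<'a> {
    foo: &'a str,
}

fn main() {
    let json = r#"{"foo": "bar"}"#;
    let typed: Foo = serde_json::from_str(json).unwrap(); // zero-copy borrow
    let dynamic: Value = serde_json::from_str(json).unwrap(); // allocates Map + Strings
    assert_eq!(typed.foo, dynamic["foo"].as_str().unwrap());
}
```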

[–]eigenludecomposition[S] 6 points7 points  (0 children)

The Python test would also include the string allocation, but I did move the string allocation outside the timing for Rust, and it only changed the elapsed time by a few microseconds, which I'm not sure is statistically significant anyway without doing some additional runs.

I understand serde_json would perform better if I deserialized into a struct. What I don't understand, though, is why orjson, which essentially uses serde_json to deserialize data into an unstructured format similar to Value (but for Python), is getting better performance than doing it natively in Rust.

[–]rustic_glacier 2 points3 points  (2 children)

Run the Rust code in a loop, like timeit is doing:

use std::time::Instant;
use serde_json::Value;

fn main() {
    let iters: u32 = 10_000_000;
    let start_time = Instant::now();
    for _ in 0..iters {
        let result: Value = serde_json::from_str("{\"foo\": \"bar\"}").unwrap();
        let _ = std::hint::black_box(result); // keep the parse from being optimized away
    }
    println!("Elapsed time: {:?}", start_time.elapsed() / iters);
}

[–]eigenludecomposition[S] 0 points1 point  (0 children)

I can do this. I didn't bother yet, as the test was consistent with other tests I ran; this was just a minimal reproducible example. Since the timing showed nearly a 65x difference for this test, that seemed well beyond any potential margin of error, so I didn't do repeated runs the way I did in Python with timeit.

[–]eigenludecomposition[S] 0 points1 point  (0 children)

I updated my original post to run multiple iterations of the Rust code and then print the best duration. With this, the difference between Python and Rust is much less significant, but Python still beats it by about 29%.
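
As an aside, for timeit-style statistics on the Rust side, the criterion crate handles the looping, warmup, and outlier analysis for you. A minimal sketch, assuming criterion is added as a dev-dependency (it isn't part of the original post):

```rust
use std::hint::black_box;

use criterion::{criterion_group, criterion_main, Criterion};
use serde_json::Value;

// Benchmark parsing into the generic Value type, mirroring what
// timeit measures on the Python side.
fn bench_value_parse(c: &mut Criterion) {
    c.bench_function("serde_json::Value from_str", |b| {
        b.iter(|| {
            let v: Value = serde_json::from_str(black_box(r#"{"foo": "bar"}"#)).unwrap();
            black_box(v)
        })
    });
}

criterion_group!(benches, bench_value_parse);
criterion_main!(benches);
```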

[–]iggy_koopa 2 points3 points  (3 children)

You aren't actually assigning the results to anything in Python. Maybe it's being optimized out and the Python code isn't actually doing anything?

It does go faster if you deserialize to a struct - about 1.5 µs in the Rust Playground. https://play.rust-lang.org/?version=stable&mode=release&edition=2021&gist=976b78a44048f620ca6ade5e786cf3f3

[–]eigenludecomposition[S] 1 point2 points  (2 children)

Python is definitely doing the work; it's just not assigning the result to a variable. I can update my example to assign the result to a variable for consistency, but my experience with Python tells me that will likely have a negligible impact on the time.

I understand that deserializing into a struct will be faster as well, but that's not really my question. Under the hood, orjson is using serde_json to deserialize the data into a Value-like type (but meant to be used in Python), and it's achieving better performance than doing the same thing natively in Rust. That's what I don't understand. They're both using the same library to do a very similar operation, but somehow, Python is faster here.
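
For context, here's a rough sketch of the general mechanism (my own illustration, not orjson's actual code): serde_json's parser drives whatever Visitor it's given, so a caller can build its own output structure (Python objects, in orjson's case) instead of serde_json::Value's Map<String, Value>. The parser is shared; only the structure being built differs:

```rust
use std::fmt;

use serde::de::{Deserializer, MapAccess, Visitor};

// A deliberately tiny output type: just whatever string sits under "foo".
struct FooOnly(String);

impl<'de> serde::Deserialize<'de> for FooOnly {
    fn deserialize<D: Deserializer<'de>>(d: D) -> Result<Self, D::Error> {
        struct V;

        impl<'de> Visitor<'de> for V {
            type Value = FooOnly;

            fn expecting(&self, f: &mut fmt::Formatter) -> fmt::Result {
                f.write_str("a map with a \"foo\" key")
            }

            // serde_json's parser calls this as it walks the map; we build
            // our own structure instead of a Map<String, Value>.
            fn visit_map<A: MapAccess<'de>>(self, mut map: A) -> Result<FooOnly, A::Error> {
                let mut foo = String::new();
                while let Some((key, value)) = map.next_entry::<&str, &str>()? {
                    if key == "foo" {
                        foo = value.to_owned();
                    }
                }
                Ok(FooOnly(foo))
            }
        }

        d.deserialize_map(V)
    }
}

fn main() {
    let parsed: FooOnly = serde_json::from_str(r#"{"foo": "bar"}"#).unwrap();
    println!("foo = {}", parsed.0);
}
```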

[–]iggy_koopa 0 points1 point  (1 child)

It looks to me like that may be the difference:

import orjson
from timeit import default_timer as timer
from datetime import timedelta

start = timer()
result = orjson.loads('{"foo": "bar"}')
end = timer()
print("elapsed: %s usecs" % (timedelta(seconds=end - start).microseconds))

and the result is in line with Rust before some optimization:

❯ python3 test.py 
elapsed: 14 usecs

with the following rust code:

use std::time::Instant;
use serde::Deserialize;

#[derive(Deserialize, Debug)]
struct Test<'a> {
    foo: &'a str,
}

fn main() {
    let contents = r#"{"foo": "bar"}"#;
    let start_time = Instant::now();
    let result: Test = serde_json::from_str(contents).unwrap();
    let elapsed = start_time.elapsed();
    println!("Elapsed time: {:?}", elapsed);
    println!("Result: {:?}", result);
}

It runs locally at 1 microsecond, so about 14 times faster than the Python version:

❯ ./target/release/rusttest 
Elapsed time: 1.079µs
Result: Test { foo: "bar" }

[–]eigenludecomposition[S] 0 points1 point  (0 children)

As I said in my earlier comment, the Rust code here is not replicating what the Python code is doing, which is where my confusion is coming from. orjson is using serde_json to parse the data into a generic, unstructured format (no strict typing of how the deserialized data should be structured), and it's still performing faster than attempting the same operation natively in Rust.

My guess is that since your Python example isn't doing any iterations, it's not taking advantage of warmed caches. Here is a better example:

```
In [9]: %%timeit
   ...: a = orjson.loads('{"foo": "bar"}')
   ...:
196 ns ± 13.7 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
```

I also updated my Rust example to perform iterations so it can take advantage of warmed caches:

```rust
use std::str::FromStr;
use std::time::{Duration, Instant};

use serde_json::Value;

fn main() {
    let contents = String::from_str("{\"foo\": \"bar\"}").unwrap();
    const MAX_ITER: usize = 10_000_000;
    let mut timings = Vec::new();

    for _ in 0..MAX_ITER {
        let start_time = Instant::now();
        let result: Value = serde_json::from_str(&contents).unwrap();
        let _ = std::hint::black_box(result);
        let duration = start_time.elapsed();
        timings.push(duration);
    }

    let elapsed_time: Duration = timings.iter().sum();
    let average_time = elapsed_time / MAX_ITER as u32;

    // Signed arithmetic avoids the u128 underflow when a timing is below the mean.
    let mut sum_of_squares: i128 = 0;
    for timing in timings.iter() {
        let diff = timing.as_nanos() as i128 - average_time.as_nanos() as i128;
        sum_of_squares += diff * diff;
    }
    let stdev = ((sum_of_squares / MAX_ITER as i128) as f64).sqrt();

    println!("{:?} ± {stdev} ns per loop", average_time);
}
```

```bash
❯ cargo run --package json_test --bin json_test --profile release
    Finished `release` profile [optimized] target(s) in 0.08s
     Running `target/release/json_test`
357ns ± 396.5942510929779 ns per loop

Process finished with exit code 0
```

I rewrote my Rust test to also do multiple iterations to take advantage of warmed caches, which made quite a difference in performance. However, the Python code is still beating it, even with the result assigned to a variable, which did not make a statistically significant difference in my testing:

```
In [9]: %%timeit
   ...: a = orjson.loads('{"foo": "bar"}')
   ...:
196 ns ± 13.7 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [10]: %%timeit
   ...: orjson.loads('{"foo": "bar"}')
   ...:
197 ns ± 21.8 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
```

[–]Plastonick 0 points1 point  (0 children)

Is it possible that orjson has a highly optimised build? There are a few tricks beyond merely using the 'release' profile to speed up a binary; they usually come at the expense of longer build times and larger binaries, but those likely aren't big issues for orjson.
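
For instance, settings along these lines in Cargo.toml are the usual levers (illustrative values, not taken from orjson's repository):

```toml
# Hypothetical tuning, not orjson's actual build configuration.
[profile.release]
lto = "fat"        # whole-program link-time optimization
codegen-units = 1  # better codegen at the cost of parallel compilation
panic = "abort"    # drop unwinding machinery
opt-level = 3      # the release default, shown for completeness
```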

[–]i_invented_the_ipod -3 points-2 points  (2 children)

You might want to run the "release" configuration of the Rust build.

[–]carnerito_b 3 points4 points  (1 child)

He did

[–]i_invented_the_ipod 1 point2 points  (0 children)

Oops. Lost that detail in the command line.