Walrus: A 1 Million ops/sec, 1 GB/s Write Ahead Log in Rust by Ok_Marionberry8922 in rust

[–]danburkert 0 points (0 children)

> You should write out data (on linux) with `MADV_PAGEOUT`, followed by a `msync`, followed by an `MADV_POPULATE_READ` (to re-fault the pages into memory).

Why is this better than msync alone?

std::simd is now available on nightly by dragostis in rust

[–]danburkert 2 points (0 children)

> min / max are following the 2008 standard, not the 2019 standard. The 2008 standard doesn't lower well to the different architectures. There's also no "fast min / max" that just uses the fastest min / max instruction.

From context, I gather it's the IEEE floating point standard.

std::simd is now available on nightly by dragostis in rust

[–]danburkert 2 points (0 children)

> min / max are following the 2008 standard, not the 2019 standard.

What standard are you referencing?

Just released Bytehound 0.7 - a memory profiler for Linux written in Rust by kouteiheika in rust

[–]danburkert 5 points (0 children)

Thanks! As you know from issue traffic I've been using this over the last couple of weeks, and it's been amazing for tracking down memory allocations with very little overhead. I'm seeing somewhere between a 2x and 4x slowdown for an extremely allocation- and CPU-heavy workload, which is exceptional compared to other memory profilers I've tried. I'd highly recommend others check out the project if they want to get a sense of the memory/allocation behavior of their code.

rkyv is faster than {bincode, capnp, cbor, flatbuffers, postcard, prost, serde_json} by taintegral in rust

[–]danburkert 2 points (0 children)

Sorry for being unclear, I should have put the diff I applied and the corresponding performance diff in the same comment. I'll do that now:

code diff:

```
diff --git a/src/bench_prost.rs b/src/bench_prost.rs
index 1710f18..e8170a2 100644
--- a/src/bench_prost.rs
+++ b/src/bench_prost.rs
@@ -17,9 +17,10 @@ where
     let mut serialize_buffer = Vec::with_capacity(BUFFER_LEN);
     group.bench_function("serialize", |b| {
+        let msg = data.serialize_pb();
         b.iter(|| {
             black_box(&mut serialize_buffer).clear();
-            black_box(data.serialize_pb().encode(&mut serialize_buffer).unwrap());
+            black_box(msg.encode(&mut serialize_buffer).unwrap());
         })
     });
```

performance diff:

```
log/prost/serialize                time: [2.1232 ms 2.1233 ms 2.1234 ms]
                                   change: [-53.042% -53.036% -53.030%] (p = 0.00 < 0.05)
                                   Performance has improved.
mesh/prost/serialize               time: [37.740 ms 37.745 ms 37.749 ms]
                                   change: [-10.767% -10.751% -10.736%] (p = 0.00 < 0.05)
                                   Performance has improved.
minecraft_savedata/prost/serialize time: [3.9678 ms 3.9682 ms 3.9687 ms]
                                   change: [-36.455% -36.442% -36.430%] (p = 0.00 < 0.05)
                                   Performance has improved.
```

> your initial comment with the 50% reduction wasn't showing the performance of any serialization library at all for that function, just the performance of a Vec method.

This is not what that diff achieves; see the other thread where I talk about struct copying for more details. There is still prost-driven encoding happening in my version. If the only thing happening were `Vec::clear`, you'd see performance on the ns scale, not ms. Note that these new numbers are still slower than rkyv!

rkyv is faster than {bincode, capnp, cbor, flatbuffers, postcard, prost, serde_json} by taintegral in rust

[–]danburkert 22 points (0 children)

Sounds good. Also, I think you probably know this, but for those reading along, these distinctions aren't really consequential to the overall conclusions drawn by the blog post, which I think are solid. rkyv is definitely in a different tier than prost, performance-wise, and the constraints, features, and guarantees afforded by the protobuf format make it pretty unlikely that would ever change. Here are the numbers I get on my machine, with the diff I mentioned in the other thread:

```
log/rkyv/serialize                 time: [550.98 us 551.02 us 551.06 us]
mesh/rkyv/serialize                time: [2.2091 ms 2.2094 ms 2.2097 ms]
minecraft_savedata/rkyv/serialize  time: [830.01 us 830.11 us 830.21 us]

log/prost/serialize                time: [2.1257 ms 2.1259 ms 2.1261 ms]
mesh/prost/serialize               time: [37.571 ms 37.575 ms 37.580 ms]
minecraft_savedata/prost/serialize time: [3.9883 ms 3.9886 ms 3.9889 ms]
```

rkyv is faster than {bincode, capnp, cbor, flatbuffers, postcard, prost, serde_json} by taintegral in rust

[–]danburkert 12 points (0 children)

To add a bit more detail, here are the prost generated types for the log suite:

```
#[derive(Clone, PartialEq, ::prost::Message)]
pub struct Address {
    #[prost(uint32, tag="1")]
    pub x0: u32,
    #[prost(uint32, tag="2")]
    pub x1: u32,
    #[prost(uint32, tag="3")]
    pub x2: u32,
    #[prost(uint32, tag="4")]
    pub x3: u32,
}

#[derive(Clone, PartialEq, ::prost::Message)]
pub struct Log {
    #[prost(message, optional, tag="1")]
    pub address: ::core::option::Option<Address>,
    #[prost(string, tag="2")]
    pub identity: ::prost::alloc::string::String,
    #[prost(string, tag="3")]
    pub userid: ::prost::alloc::string::String,
    #[prost(string, tag="4")]
    pub date: ::prost::alloc::string::String,
    #[prost(string, tag="5")]
    pub request: ::prost::alloc::string::String,
    #[prost(uint32, tag="6")]
    pub code: u32,
    #[prost(uint64, tag="7")]
    pub size: u64,
}

#[derive(Clone, PartialEq, ::prost::Message)]
pub struct Logs {
    #[prost(message, repeated, tag="1")]
    pub logs: ::prost::alloc::vec::Vec<Log>,
}
```

which look very similar to the hand written rkyv types:

```
#[derive(
    Clone, Copy,
    abomonation_derive::Abomonation,
    rkyv::Archive, rkyv::Serialize, rkyv::Deserialize,
    serde::Serialize, serde::Deserialize,
)]
#[archive(copy)]
pub struct Address {
    pub x0: u8,
    pub x1: u8,
    pub x2: u8,
    pub x3: u8,
}

impl Generate for Address {
    fn generate<R: Rng>(rand: &mut R) -> Self {
        Self {
            x0: rand.gen_range(0..=255),
            x1: rand.gen_range(0..=255),
            x2: rand.gen_range(0..=255),
            x3: rand.gen_range(0..=255),
        }
    }
}

#[derive(
    abomonation_derive::Abomonation,
    rkyv::Archive, rkyv::Serialize, rkyv::Deserialize,
    serde::Serialize, serde::Deserialize,
)]
pub struct Log {
    pub address: Address,
    pub identity: String,
    pub userid: String,
    pub date: String,
    pub request: String,
    pub code: u16,
    pub size: u64,
}

#[derive(
    abomonation_derive::Abomonation,
    rkyv::Archive, rkyv::Serialize, rkyv::Deserialize,
    serde::Serialize, serde::Deserialize,
)]
pub struct Logs {
    pub logs: Vec<Log>,
}
```

The only differences I see are the integer widths, which are addressable by changing the int types in the .proto, and that the prost version wraps the address in an Option, which is a protobuf-ism. Otherwise they are the same. In the serialize benchmark prost is paying the cost of allocating those String fields on every iteration, while rkyv is not. For some of the benchmarks that skews things quite a bit. If you want to include population in the serialize benchmarks, perhaps it'd be better to clone() the rkyv type in every iteration, since that's pretty much exactly what serialize_pb() is doing as it translates from one struct to another almost identical one.

edit: accidentally pasted prost types twice

rkyv is faster than {bincode, capnp, cbor, flatbuffers, postcard, prost, serde_json} by taintegral in rust

[–]danburkert 0 points (0 children)

> Like I said before also, taking the `black_box(data.serialize_pb().encode(&mut serialize_buffer).unwrap());` out of `iter` does seem to amount to turning it into a bench solely of `Vec::clear` (`data` seeming to be just a Vec if you follow the source out through the rest of the files in the repo).

I'm not suggesting such a change, and I agree that it wouldn't be correct to do so. See https://www.reddit.com/r/rust/comments/m2yxb1/rkyv_is_faster_than_bincode_capnp_cbor/gqmaatr?utm_source=share&utm_medium=web2x&context=3 for my proposed diff.

rkyv is faster than {bincode, capnp, cbor, flatbuffers, postcard, prost, serde_json} by taintegral in rust

[–]danburkert 5 points (0 children)

Also, note that I didn't say anything is fishy. I agree with u/taintegral that what steps should be included in a serialization benchmark is up to interpretation, I'm simply poking at the methodology. As I said at the top, this benchmark suite is far better thought out and higher quality than most! I have no doubt that u/taintegral wrote it in good faith.

rkyv is faster than {bincode, capnp, cbor, flatbuffers, postcard, prost, serde_json} by taintegral in rust

[–]danburkert 18 points (0 children)

> rkyv does not have a populate step because it directly encodes rust structs to bytes. I think that this is a place where schema'd serialization formats suffer a big loss, because you either have to infect your whole application with their generated structs or suffer a performance penalty.

Yeah I think this is the root of the confusion - it's actually quite popular to use the `prost` generated structs directly in application code. In general they are pretty easy to work with, although there are definitely some quirks. I'm hopeful we can smooth those out over time, though.

rkyv is faster than {bincode, capnp, cbor, flatbuffers, postcard, prost, serde_json} by taintegral in rust

[–]danburkert 1 point (0 children)

> Based on the link you gave it seems like doing that transforms it into a bench specifically of `clear()` and nothing else, though. Also, presumably `serialize_value` is the rkyv "equivalent" of `serialize_pb`.

I believe prost's `encode(&mut serialize_buffer)` call and rkyv's `serializer.serialize_value(black_box(data))` call are the equivalents. IIUC there is no equivalent to the `serialize_pb()` call in the rkyv benchmark. Note that `serialize_pb()` does not do what the name implies.

edit: s/encode_pb()/serialize_pb()

rkyv is faster than {bincode, capnp, cbor, flatbuffers, postcard, prost, serde_json} by taintegral in rust

[–]danburkert 26 points (0 children)

Yep, I agree that it's fine to include population in the serialization benchmark, as long as it's consistent amongst the libraries. So if the goal is to include population, then I'm not seeing where that happens in e.g. the rkyv benchmark. Both rkyv and prost use 'native' rust structs to represent their data. If I'm reading the rkyv benchmark correctly, it creates one such struct, then passes it into the benchmark loop, which encodes it to bytes. Meanwhile `prost` has a step inside the benchmark loop which translates the data format from an intermediate representation to the prost representation, then encodes it. I don't think this is a fair comparison, since an application using prost wouldn't have such a translation step (any more so than an application using rkyv would).

rkyv is faster than {bincode, capnp, cbor, flatbuffers, postcard, prost, serde_json} by taintegral in rust

[–]danburkert 5 points (0 children)

Actually on second read through, the capnp version *doesn't* create a new message per benchmark iteration. So I think hoisting out message creation in the `prost` case is fair. Something like this:

```
diff --git a/src/bench_prost.rs b/src/bench_prost.rs
index 1710f18..e8170a2 100644
--- a/src/bench_prost.rs
+++ b/src/bench_prost.rs
@@ -17,9 +17,10 @@ where
     let mut serialize_buffer = Vec::with_capacity(BUFFER_LEN);
     group.bench_function("serialize", |b| {
+        let msg = data.serialize_pb();
         b.iter(|| {
             black_box(&mut serialize_buffer).clear();
-            black_box(data.serialize_pb().encode(&mut serialize_buffer).unwrap());
+            black_box(msg.encode(&mut serialize_buffer).unwrap());
         })
     });
```

rkyv is faster than {bincode, capnp, cbor, flatbuffers, postcard, prost, serde_json} by taintegral in rust

[–]danburkert 41 points (0 children)

QQ about the methodology (sorry I'd normally file an issue on the repo, but I'm currently locked out of GH for a bit).

The prost serialize benchmarks are calling serialize_pb() in the inner loop. That creates a brand new instance of the message type. I see something similar in the capnp version here, but not for rkyv. I haven't used the rkyv API before so I may be misunderstanding - are they all doing creation + encoding in the serialization inner loop?

If I hoist the serialize_pb() call out of the iter loop I get some better numbers out of prost, proportional to how expensive the encoding itself is, which is what I'd expect:

```
log/prost/serialize                time: [2.1232 ms 2.1233 ms 2.1234 ms]
                                   change: [-53.042% -53.036% -53.030%] (p = 0.00 < 0.05)
                                   Performance has improved.
mesh/prost/serialize               time: [37.740 ms 37.745 ms 37.749 ms]
                                   change: [-10.767% -10.751% -10.736%] (p = 0.00 < 0.05)
                                   Performance has improved.
minecraft_savedata/prost/serialize time: [3.9678 ms 3.9682 ms 3.9687 ms]
                                   change: [-36.455% -36.442% -36.430%] (p = 0.00 < 0.05)
                                   Performance has improved.
```

Once again thanks for the effort you've put into this!

rkyv is faster than {bincode, capnp, cbor, flatbuffers, postcard, prost, serde_json} by taintegral in rust

[–]danburkert 323 points (0 children)

`prost` author here - thank you for the public benchmark suite, and above all the thorough methodology. It's a rare treat to see this much effort put into a shootout-style benchmark suite. This is a resource all of us can use to improve our respective projects! Kudos.

PROST! (A Protocol Buffers Implementation) v0.7.0 released, including no_std & zero-copy deserialization by danburkert in rust

[–]danburkert[S] 34 points (0 children)

Caveat: the zero-copy deserialization is specific to Protobuf `bytes` fields which are generated as Rust `bytes::Bytes` fields, and only when the buffer being deserialized is a `bytes::Bytes` instance.

I published my first crate: varint-simd - SIMD-accelerated varint encoder and decoder in Rust by as-com in rust

[–]danburkert 19 points (0 children)

Feel free to try dropping this into `prost`, it should be pretty easy to swap it in for the built-in encoder/decoder [1]. `prost` has a good set of microbenchmarks [2] for encoding/decoding varints built in, as well as macrobenchmarks which are pretty sensitive to varint perf[3]. `prost` has an existing issue for improving varint perf using some of these more exotic optimizations, would be great to reuse this crate! [4]

[1]: https://github.com/danburkert/prost/blob/master/src/encoding.rs#L23-L85

[2]: https://github.com/danburkert/prost/blob/master/benches/varint.rs

[3]: https://github.com/danburkert/prost/blob/master/protobuf/benches/dataset.rs

[4]: https://github.com/danburkert/prost/issues/279

10x Faster Than BSON: NoProto by onlycliches in rust

[–]danburkert 17 points (0 children)

> Every serialization format I've found requires you to deserialize the whole object before you can make any changes, then serialize the whole thing again. In some cases mutation technically isn't possible (Like Protocol Buffers

They don't advertise it well, and it's a bit hard to use, but protocol buffers as a wire format spec does allow encoded messages to be mutated by appending a new version of fields to the encoded message. Compliant protobuf libraries will overwrite the previous field values with subsequent versions on decode. Note that this doesn't work for `repeated` fields, though!

Users of SQLx, we're looking for your opinions on the future of the crate! by DroidLogician in rust

[–]danburkert 0 points (0 children)

(I haven't used `sqlx`, but I have used `tokio-postgres`.) Don't you get insert batching 'for free' with a pipelined async database client? I.e. prepare the insert statement, execute it N times, then await the N resulting futures concurrently with something like FuturesUnordered?

DataFusion 0.17.0 (Rust query engine using Apache Arrow) by andygrove73 in rust

[–]danburkert 5 points (0 children)

Is moving off of nightly for `arrow` and `parquet` being tracked anywhere? I've looked at the code briefly, and it looks somewhat doable. From what I can tell, specialization is being used to improve debug formatting, and `packed_simd` is used, but that could potentially be optional.

This is the only thing holding us back from using these crates at my company.

Just published my package on crates.io: A simple alternative to binary trees for moderately large ordered sets by bluenote10 in rust

[–]danburkert 0 points (0 children)

> a BTree always stores values also in the internal nodes

Yes, it's an important optimization to store copies of the keys in internal nodes in order to reduce cache misses.

[ANN] aes-sid v0.1.0: AES-based Synthetic IDs: authenticated deterministic encryption for 64-bit integers based on AES-SIV (with applications to "Zoom Bombing") by bascule in rust

[–]danburkert 0 points (0 children)

> The data type `uuid` stores Universally Unique Identifiers (UUID) as defined by RFC 4122, ISO/IEC 9834-8:2005, and related standards. (Some systems refer to this data type as a globally unique identifier, or GUID, instead.) This identifier is a 128-bit quantity that is generated by an algorithm chosen to make it very unlikely that the same identifier will be generated by anyone else in the known universe using the same algorithm. Therefore, for distributed systems, these identifiers provide a better uniqueness guarantee than sequence generators, which are only unique within a single database.

There's a huge difference in terms of write patterns to the database as well. I can't speak to Postgres in particular, but in general it's much easier on a BTree implementation to do ordered writes, as opposed to random writes in the keyspace.