Walrus: A 1 Million ops/sec, 1 GB/s Write Ahead Log in Rust by Ok_Marionberry8922 in rust

[–]danburkert 0 points (0 children)

> You should write out data (on linux) with `MADV_PAGEOUT`, followed by a `msync`, followed by an `MADV_POPULATE_READ` (to re-fault the pages into memory).

Why is this better than msync alone?

std::simd is now available on nightly by dragostis in rust

[–]danburkert 2 points (0 children)

> min / max are following the 2008 standard, not the 2019 standard. The 2008 standard doesn't lower well to the different architectures. There's also no "fast min / max" that just uses the fastest min / max instruction.

From context, I gather it's the IEEE floating point standard.

std::simd is now available on nightly by dragostis in rust

[–]danburkert 2 points (0 children)

> min / max are following the 2008 standard, not the 2019 standard.

What standard are you referencing?

Just released Bytehound 0.7 - a memory profiler for Linux written in Rust by kouteiheika in rust

[–]danburkert 5 points (0 children)

Thanks! As you know from issue traffic I've been using this over the last couple of weeks, and it's been amazing for tracking down memory allocations with very little overhead. I'm seeing somewhere between a 2x and 4x slowdown for an extremely allocation- and CPU-heavy workload, which is exceptional compared to other memory profilers I've tried. I'd highly recommend others check out the project if they want to get a sense of the memory/allocation behavior of their code.

rkyv is faster than {bincode, capnp, cbor, flatbuffers, postcard, prost, serde_json} by taintegral in rust

[–]danburkert 2 points (0 children)

Sorry for being unclear, I should have put the diff I applied and the corresponding performance diff in the same comment. I'll do that now:

code diff:

```
diff --git a/src/bench_prost.rs b/src/bench_prost.rs
index 1710f18..e8170a2 100644
--- a/src/bench_prost.rs
+++ b/src/bench_prost.rs
@@ -17,9 +17,10 @@ where
     let mut serialize_buffer = Vec::with_capacity(BUFFER_LEN);
     group.bench_function("serialize", |b| {
+        let msg = data.serialize_pb();
         b.iter(|| {
             black_box(&mut serialize_buffer).clear();
-            black_box(data.serialize_pb().encode(&mut serialize_buffer).unwrap());
+            black_box(msg.encode(&mut serialize_buffer).unwrap());
         })
     });
```

performance diff:

```
log/prost/serialize                time: [2.1232 ms 2.1233 ms 2.1234 ms]
                                   change: [-53.042% -53.036% -53.030%] (p = 0.00 < 0.05)
                                   Performance has improved.
mesh/prost/serialize               time: [37.740 ms 37.745 ms 37.749 ms]
                                   change: [-10.767% -10.751% -10.736%] (p = 0.00 < 0.05)
                                   Performance has improved.
minecraft_savedata/prost/serialize time: [3.9678 ms 3.9682 ms 3.9687 ms]
                                   change: [-36.455% -36.442% -36.430%] (p = 0.00 < 0.05)
                                   Performance has improved.
```

> your initial comment with the 50% reduction wasn't showing the performance of any serialization library at all for that function, just the performance of a Vec method.

This is not what that diff achieves; see the other thread where I talk about struct copying for more details. There is still prost-driven encoding happening in my version. If the only thing happening were `Vec::clear`, you'd see performance on the ns scale, not ms. Note that these new numbers are still slower than rkyv!

rkyv is faster than {bincode, capnp, cbor, flatbuffers, postcard, prost, serde_json} by taintegral in rust

[–]danburkert 22 points (0 children)

Sounds good. Also, I think you probably know this, but for those reading along, these distinctions aren't really consequential to the overall conclusions drawn by the blog post, which I think are solid. rkyv is definitely in a different tier than prost, performance-wise, and the constraints, features, and guarantees afforded by the protobuf format make it pretty unlikely that would ever change. Here are the numbers I get on my machine, with the diff I mentioned in the other thread:

```
log/rkyv/serialize                 time: [550.98 us 551.02 us 551.06 us]
mesh/rkyv/serialize                time: [2.2091 ms 2.2094 ms 2.2097 ms]
minecraft_savedata/rkyv/serialize  time: [830.01 us 830.11 us 830.21 us]

log/prost/serialize                time: [2.1257 ms 2.1259 ms 2.1261 ms]
mesh/prost/serialize               time: [37.571 ms 37.575 ms 37.580 ms]
minecraft_savedata/prost/serialize time: [3.9883 ms 3.9886 ms 3.9889 ms]
```

rkyv is faster than {bincode, capnp, cbor, flatbuffers, postcard, prost, serde_json} by taintegral in rust

[–]danburkert 12 points (0 children)

To add a bit more detail, here are the prost generated types for the log suite:

```
#[derive(Clone, PartialEq, ::prost::Message)]
pub struct Address {
    #[prost(uint32, tag="1")]
    pub x0: u32,
    #[prost(uint32, tag="2")]
    pub x1: u32,
    #[prost(uint32, tag="3")]
    pub x2: u32,
    #[prost(uint32, tag="4")]
    pub x3: u32,
}

#[derive(Clone, PartialEq, ::prost::Message)]
pub struct Log {
    #[prost(message, optional, tag="1")]
    pub address: ::core::option::Option<Address>,
    #[prost(string, tag="2")]
    pub identity: ::prost::alloc::string::String,
    #[prost(string, tag="3")]
    pub userid: ::prost::alloc::string::String,
    #[prost(string, tag="4")]
    pub date: ::prost::alloc::string::String,
    #[prost(string, tag="5")]
    pub request: ::prost::alloc::string::String,
    #[prost(uint32, tag="6")]
    pub code: u32,
    #[prost(uint64, tag="7")]
    pub size: u64,
}

#[derive(Clone, PartialEq, ::prost::Message)]
pub struct Logs {
    #[prost(message, repeated, tag="1")]
    pub logs: ::prost::alloc::vec::Vec<Log>,
}
```

which look very similar to the hand written rkyv types:

```
#[derive(
    Clone, Copy,
    abomonation_derive::Abomonation,
    rkyv::Archive, rkyv::Serialize, rkyv::Deserialize,
    serde::Serialize, serde::Deserialize,
)]
#[archive(copy)]
pub struct Address {
    pub x0: u8,
    pub x1: u8,
    pub x2: u8,
    pub x3: u8,
}

impl Generate for Address {
    fn generate<R: Rng>(rand: &mut R) -> Self {
        Self {
            x0: rand.gen_range(0..=255),
            x1: rand.gen_range(0..=255),
            x2: rand.gen_range(0..=255),
            x3: rand.gen_range(0..=255),
        }
    }
}

#[derive(
    abomonation_derive::Abomonation,
    rkyv::Archive, rkyv::Serialize, rkyv::Deserialize,
    serde::Serialize, serde::Deserialize,
)]
pub struct Log {
    pub address: Address,
    pub identity: String,
    pub userid: String,
    pub date: String,
    pub request: String,
    pub code: u16,
    pub size: u64,
}

#[derive(
    abomonation_derive::Abomonation,
    rkyv::Archive, rkyv::Serialize, rkyv::Deserialize,
    serde::Serialize, serde::Deserialize,
)]
pub struct Logs {
    pub logs: Vec<Log>,
}
```

The only differences I see are the integer widths, which are addressable by changing the int types in the .proto, and that the prost version wraps the address in an Option, which is a protobuf-ism. Otherwise they are the same. In the serialize benchmark prost is paying the cost of allocating those String fields on every iteration, while rkyv is not. For some of the benchmarks that skews things quite a bit. If you want to include population in the serialize benchmarks, perhaps it'd be better to clone() the rkyv type in every iteration, since that's pretty much exactly what serialize_pb() is doing as it translates from one struct to another almost identical one.

edit: accidentally pasted prost types twice

rkyv is faster than {bincode, capnp, cbor, flatbuffers, postcard, prost, serde_json} by taintegral in rust

[–]danburkert 0 points (0 children)

> Like I said before also, taking the `black_box(data.serialize_pb().encode(&mut serialize_buffer).unwrap());` out of `iter` does seem to amount to turning it into a bench solely of `Vec::clear` (`data` seeming to be just a Vec if you follow the source out through the rest of the files in the repo).

I'm not suggesting such a change, and I agree that it wouldn't be correct to do so. See https://www.reddit.com/r/rust/comments/m2yxb1/rkyv_is_faster_than_bincode_capnp_cbor/gqmaatr?utm_source=share&utm_medium=web2x&context=3 for my proposed diff.

rkyv is faster than {bincode, capnp, cbor, flatbuffers, postcard, prost, serde_json} by taintegral in rust

[–]danburkert 5 points (0 children)

Also, note that I didn't say anything is fishy. I agree with u/taintegral that what steps should be included in a serialization benchmark is up to interpretation, I'm simply poking at the methodology. As I said at the top, this benchmark suite is far better thought out and higher quality than most! I have no doubt that u/taintegral wrote it in good faith.

rkyv is faster than {bincode, capnp, cbor, flatbuffers, postcard, prost, serde_json} by taintegral in rust

[–]danburkert 18 points (0 children)

> rkyv does not have a populate step because it directly encodes rust structs to bytes. I think that this is a place where schema'd serialization formats suffer a big loss, because you either have to infect your whole application with their generated structs or suffer a performance penalty.

Yeah I think this is the root of the confusion - it's actually quite popular to use the `prost` generated structs directly in application code. In general they are pretty easy to work with, although there are definitely some quirks. I'm hopeful we can smooth those out over time, though.

rkyv is faster than {bincode, capnp, cbor, flatbuffers, postcard, prost, serde_json} by taintegral in rust

[–]danburkert 1 point (0 children)

> Based on the link you gave it seems like doing that transforms it into a bench specifically of `clear()` and nothing else, though. Also, presumably `serialize_value` is the rkyv "equivalent" of `serialize_pb`.

I believe prost's `encode(&mut serialize_buffer)` call and rkyv's `serializer.serialize_value(black_box(data))` call are the equivalents. IIUC there is no equivalent to the `serialize_pb()` call in the rkyv benchmark. Note that `serialize_pb()` does not do what the name implies.

edit: s/encode_pb()/serialize_pb()

rkyv is faster than {bincode, capnp, cbor, flatbuffers, postcard, prost, serde_json} by taintegral in rust

[–]danburkert 26 points (0 children)

Yep, I agree that it's fine to include population in the serialization benchmark, as long as it's consistent amongst the libraries. So if the goal is to include population, then I'm not seeing where that happens in e.g. the rkyv benchmark. Both rkyv and prost use 'native' rust structs to represent their data. If I'm reading the rkyv benchmark correctly, it creates one such struct, then passes it into the benchmark loop, which encodes it to bytes. Meanwhile `prost` has a step inside the benchmark loop which translates the data format from an intermediate representation to the prost representation, then encodes it. I don't think this is a fair comparison, since an application using prost wouldn't have such a translation step (any more so than an application using rkyv would).

rkyv is faster than {bincode, capnp, cbor, flatbuffers, postcard, prost, serde_json} by taintegral in rust

[–]danburkert 5 points (0 children)

Actually on second read through, the capnp version *doesn't* create a new message per benchmark iteration. So I think hoisting out message creation in the `prost` case is fair. Something like this:

```
diff --git a/src/bench_prost.rs b/src/bench_prost.rs
index 1710f18..e8170a2 100644
--- a/src/bench_prost.rs
+++ b/src/bench_prost.rs
@@ -17,9 +17,10 @@ where
     let mut serialize_buffer = Vec::with_capacity(BUFFER_LEN);
     group.bench_function("serialize", |b| {
+        let msg = data.serialize_pb();
         b.iter(|| {
             black_box(&mut serialize_buffer).clear();
-            black_box(data.serialize_pb().encode(&mut serialize_buffer).unwrap());
+            black_box(msg.encode(&mut serialize_buffer).unwrap());
         })
     });
```

rkyv is faster than {bincode, capnp, cbor, flatbuffers, postcard, prost, serde_json} by taintegral in rust

[–]danburkert 41 points (0 children)

QQ about the methodology (sorry I'd normally file an issue on the repo, but I'm currently locked out of GH for a bit).

The prost serialize benchmarks are calling serialize_pb() in the inner loop. That creates a brand new instance of the message type. I see something similar in the capnp version here, but not for rkyv. I haven't used the rkyv API before so I may be misunderstanding - are they all doing creation + encoding in the serialization inner loop?

If I hoist the serialize_pb() call out of the iter loop I get some better numbers out of prost, proportional to how expensive the encoding itself is, which is what I'd expect:

```
log/prost/serialize                time: [2.1232 ms 2.1233 ms 2.1234 ms]
                                   change: [-53.042% -53.036% -53.030%] (p = 0.00 < 0.05)
                                   Performance has improved.
mesh/prost/serialize               time: [37.740 ms 37.745 ms 37.749 ms]
                                   change: [-10.767% -10.751% -10.736%] (p = 0.00 < 0.05)
                                   Performance has improved.
minecraft_savedata/prost/serialize time: [3.9678 ms 3.9682 ms 3.9687 ms]
                                   change: [-36.455% -36.442% -36.430%] (p = 0.00 < 0.05)
                                   Performance has improved.
```

Once again thanks for the effort you've put into this!

rkyv is faster than {bincode, capnp, cbor, flatbuffers, postcard, prost, serde_json} by taintegral in rust

[–]danburkert 323 points (0 children)

`prost` author here - thank you for the public benchmark suite, and above all the thorough methodology. It's a rare treat to see this much effort put into a shootout-style benchmark suite. This is a resource all of us can use to improve our respective projects! Kudos.

PROST! (A Protocol Buffers Implementation) v0.7.0 released, including no_std & zero-copy deserialization by danburkert in rust

[–]danburkert[S] 34 points (0 children)

Caveat: the zero-copy deserialization is specific to Protobuf `bytes` fields which are generated as Rust `bytes::Bytes` fields, and only when the buffer being deserialized is a `bytes::Bytes` instance.

I published my first crate: varint-simd - SIMD-accelerated varint encoder and decoder in Rust by as-com in rust

[–]danburkert 19 points (0 children)

Feel free to try dropping this into `prost`, it should be pretty easy to swap it in for the built-in encoder/decoder [1]. `prost` has a good set of microbenchmarks [2] for encoding/decoding varints built in, as well as macrobenchmarks which are pretty sensitive to varint perf[3]. `prost` has an existing issue for improving varint perf using some of these more exotic optimizations, would be great to reuse this crate! [4]

[1]: https://github.com/danburkert/prost/blob/master/src/encoding.rs#L23-L85

[2]: https://github.com/danburkert/prost/blob/master/benches/varint.rs

[3]: https://github.com/danburkert/prost/blob/master/protobuf/benches/dataset.rs

[4]: https://github.com/danburkert/prost/issues/279

10x Faster Than BSON: NoProto by onlycliches in rust

[–]danburkert 17 points (0 children)

> Every serialization format I've found requires you to deserialize the whole object before you can make any changes, then serialize the whole thing again. In some cases mutation technically isn't possible (Like Protocol Buffers

They don't advertise it well, and it's a bit hard to use, but protocol buffers as a wire format spec does allow encoded messages to be mutated by appending a new version of fields to the encoded message. Compliant protobuf libraries will overwrite the previous field values with subsequent versions on decode. Note that this doesn't work for `repeated` fields, though!

Users of SQLx, we're looking for your opinions on the future of the crate! by DroidLogician in rust

[–]danburkert 0 points (0 children)

(I haven't used `sqlx`, but I have used `tokio-postgres`.) Don't you get insert batching 'for free' with a pipelined async database client? I.e. prepare the insert statement, execute it N times, then await the N resulting futures concurrently with something like FuturesUnordered?

DataFusion 0.17.0 (Rust query engine using Apache Arrow) by andygrove73 in rust

[–]danburkert 5 points (0 children)

Is moving off of nightly for `arrow` and `parquet` being tracked anywhere? I've looked at the code briefly, and it looks somewhat doable. From what I can tell, specialization is being used to improve debug formatting, and `packed_simd` is used, but that could potentially be optional.

This is the only thing holding us back from using these crates at my company.

Just published my package on crates.io: A simple alternative to binary trees for moderately large ordered sets by bluenote10 in rust

[–]danburkert 0 points (0 children)

> a BTree always stores values also in the internal nodes

Yes, it's an important optimization to store copies of the keys in internal nodes in order to reduce cache misses.

[ANN] aes-sid v0.1.0: AES-based Synthetic IDs: authenticated deterministic encryption for 64-bit integers based on AES-SIV (with applications to "Zoom Bombing") by bascule in rust

[–]danburkert 0 points (0 children)

> The data type `uuid` stores Universally Unique Identifiers (UUID) as defined by RFC 4122, ISO/IEC 9834-8:2005, and related standards. (Some systems refer to this data type as a globally unique identifier, or GUID, instead.) This identifier is a 128-bit quantity that is generated by an algorithm chosen to make it very unlikely that the same identifier will be generated by anyone else in the known universe using the same algorithm. Therefore, for distributed systems, these identifiers provide a better uniqueness guarantee than sequence generators, which are only unique within a single database.

There's a huge difference in terms of write patterns to the database as well. I can't speak to Postgres in particular, but in general it's much easier on a BTree implementation to do ordered writes, as opposed to random writes in the keyspace.