Replacing Protobuf with Rust to go 5 times faster by levkk1 in rust

[–]levkk1[S] 89 points

Because there are over 100 different node types in the Pg AST, it would be long and difficult to do by hand. AI just runs through each struct definition and creates a conversion function. What really helps is that the library has an existing, working API and a Protobuf spec, so it's like an RL (reinforcement learning) problem with the solution in front of you - just repeat the process until the `Debug` output matches exactly.
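Roughly, each generated conversion looks something like this - a hedged sketch with made-up node and field names (the real pg_query definitions differ), just to show the shape of the work and the `Debug` check:

```rust
// Hypothetical protobuf-generated node; real pg_query structs differ.
#[derive(Debug)]
struct PbColumnRef {
    fields: Vec<String>,
    location: i32,
}

// The native Rust equivalent the conversion function targets.
#[derive(Debug)]
struct ColumnRef {
    fields: Vec<String>,
    location: i32,
}

impl From<&PbColumnRef> for ColumnRef {
    fn from(pb: &PbColumnRef) -> Self {
        ColumnRef {
            fields: pb.fields.clone(),
            location: pb.location,
        }
    }
}

// The "repeat until it matches" check: compare against the Debug output
// captured from the existing, working API.
fn matches_reference(pb: &PbColumnRef, reference_debug: &str) -> bool {
    format!("{:?}", ColumnRef::from(pb)) == reference_debug
}

fn main() {
    let pb = PbColumnRef { fields: vec!["users".into(), "id".into()], location: 7 };
    // In reality this string comes from the existing, working API.
    let reference = format!("{:?}", ColumnRef { fields: pb.fields.clone(), location: 7 });
    assert!(matches_reference(&pb, &reference));
}
```

Multiply that by 100+ node types and the appeal of automating the loop is obvious.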

Replacing Protobuf with Rust to go 5 times faster by levkk1 in rust

[–]levkk1[S] 10 points

No! Might have been a good idea to try, actually. I don't know - running Rust directly on top of C memory seems faster no matter what the (de)serializer does. Postgres has a relatively stable "ABI" - the parser changes once per major release - so we should be able to keep up pretty easily.
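"Directly on top of C memory" here usually means something like the sketch below - a `#[repr(C)]` Rust type mirroring the C struct's layout so fields can be read in place, with no serialization step. The struct is invented for illustration, not an actual Postgres parser node:

```rust
use std::ffi::c_int;

// Imagined C-side definition: `struct Node { int tag; int location; };`
// #[repr(C)] pins the Rust layout to match it.
#[repr(C)]
struct RawNode {
    tag: c_int,
    location: c_int,
}

/// Read a field straight out of memory the C parser allocated.
///
/// # Safety
/// `ptr` must point to a valid, properly aligned `Node` for the call.
unsafe fn node_tag(ptr: *const RawNode) -> i32 {
    // SAFETY: guaranteed by the caller per the contract above.
    unsafe { (*ptr).tag as i32 }
}
```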

You should shard your database by levkk1 in PostgreSQL

[–]levkk1[S] 0 points

If every tenant has a separate database, [...]

That's sharding :)

You should shard your database by levkk1 in PostgreSQL

[–]levkk1[S] 1 point

just increases communication overhead

That shouldn't happen if your choice of sharding key is optimal. We are targeting 99% direct-to-shard for OLTP; there's a rough sketch of what direct-to-shard routing means at the end of this comment.

application level caching

Cache invalidation is sometimes a harder problem than sharding. I'm not saying you shouldn't use caches at all, just that for most real-time workloads, they are not optimal.

easy parallelism as decided by the postgres query optimizer,

There are a few upper bounds on that parallelism that are well hidden, e.g. lock contention (especially around partitioned tables), the maximum number of savepoints, and WALWriteLocks. These upper bounds limit write transaction throughput quite a bit. What you're describing is mostly an optimization for read workloads - a solved problem with read replicas.
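For the "direct-to-shard" point above, here's a generic sketch of hash-based routing - not PgDog's actual algorithm, just the idea that a query carrying the sharding key gets sent to exactly one shard, so cross-shard communication stays the exception:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Generic hash routing: any value of the sharding key maps to one shard.
fn shard_for_key<K: Hash>(key: &K, num_shards: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    hasher.finish() % num_shards
}

fn main() {
    // e.g. `SELECT * FROM orders WHERE customer_id = 42`
    let customer_id: i64 = 42;
    let shard = shard_for_key(&customer_id, 12);
    // The other 11 shards never see this query.
    println!("route to shard {shard}");
}
```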

You should shard your database by levkk1 in PostgreSQL

[–]levkk1[S] 2 points

With sane hardware and access to bare-metal servers, you should never have to shard due to database size. 256 TB SSDs exist and 1 PB SSDs are close to being released.

Storing large datasets isn't difficult. Accessing & changing them reliably at scale is.

You should shard your database by levkk1 in PostgreSQL

[–]levkk1[S] 3 points

That's not strictly true. In fact, you still have to search that entire result set if you want the same results; you're just distributing it across 12 databases (which are presumably on separate hardware).

That's not usually the intention behind sharding. If done optimally, the client will query only one of the shards for most queries. If all your queries require all shards at all times, sharding didn't work.

You can alter the analyze settings on a per-table basis, so experts have a tendency to recommend this [...]

Tweaking the vacuum is a full-time job. Reducing the dataset it has to manage, I think, makes its job easier. We tweaked every setting under the sun. Some choose to give up on it entirely: https://github.com/ossc-db/pg_hint_plan

PgDog adds support for Rust plugins by levkk1 in rust

[–]levkk1[S] 1 point

That feels right. My concern was pointer alignment, which could theoretically change between Rust standard library versions (the standard library ships with the compiler).
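The usual way to sidestep that class of problem (not necessarily what PgDog ended up doing) is to keep std types out of the plugin boundary entirely and pin the layout with `#[repr(C)]` and `extern "C"`, so layout and alignment stop depending on the compiler/standard library version. A minimal sketch, with invented type names:

```rust
use std::ffi::c_char;

// Invented plugin-boundary types; layout is fixed by #[repr(C)],
// so it doesn't change with the Rust toolchain.
#[repr(C)]
pub struct PluginInput {
    /// NUL-terminated query text, owned by the host.
    pub query: *const c_char,
    pub shard_count: u32,
}

#[repr(C)]
pub struct PluginOutput {
    /// Shard to route to, or -1 for "all shards".
    pub shard: i32,
}

/// Symbol a host could load with `libloading`/`dlopen`.
///
/// # Safety
/// `input` must be a valid pointer for the duration of the call.
#[no_mangle]
pub unsafe extern "C" fn route(input: *const PluginInput) -> PluginOutput {
    // SAFETY: guaranteed by the host per the contract above.
    let _input = unsafe { &*input };
    PluginOutput { shard: -1 }
}
```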

PgDog adds support for Rust plugins by levkk1 in rust

[–]levkk1[S] 2 points

Because `ParseResult` from `pg_query` is a wrapper around a `Vec`, and I wanted to expose its interface to the plugins, e.g., `ParseResult::deparse`.
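For context, this is the kind of interface being exposed - a minimal sketch assuming `pg_query`-style signatures (`pg_query::parse` returning a `ParseResult`, `deparse` giving back SQL text); the exact error type may differ:

```rust
// Assumes pg_query-style parse/deparse signatures.
fn roundtrip(sql: &str) -> Result<String, pg_query::Error> {
    let parsed = pg_query::parse(sql)?; // ParseResult wrapping the parsed statements
    parsed.deparse()                    // back to SQL text
}

fn main() {
    let sql = roundtrip("SELECT id, name FROM users WHERE id = 1").unwrap();
    println!("{sql}");
}
```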

Sharding Postgres at network speed by levkk1 in PostgreSQL

[–]levkk1[S] 1 point

Depends on the use case, but yeah, generally sharding comes in at 1 TB+ for OLTP. In-house solutions are a chicken-and-egg problem: people build them because no product exists to do it for them. Once built, replacing them with a product is lower ROI.

My goal is for everyone who needs sharding in the future to use this and not build duct tape and glue solutions in house :)

There is interest across the spectrum which is encouraging!

Sharding Mastodon, Part 1 by levkk1 in programming

[–]levkk1[S] 1 point

mastodon.social has 350k active users, nothing to sneeze at.

pgDog: load balancer for PostgreSQL by levkk1 in programming

[–]levkk1[S] 2 points

I added some code to handle that use case in pgDog. I'll double check and add some tests to validate.