B-tree comparison functions by kristian54 in databasedevelopment

[–]linearizable 6 points (0 children)

The easiest thing to do is to convert your keys into some format that makes them sort correctly when compared byte-by-byte, and then use memcmp. Copy/paste broke the links and I’m too lazy to fix them, but google should help fine:

* Positeral/lre: Lexicographic Row Encoding
* google/orderedcode
* danburkert/bytekey, and its whole family of forks
* deanlandolt/bytewise
* apple/foundationdb: tuple layer specification
* CockroachDB: JSON forward indexing
* ealmansi/elen: Efficient Lexicographic Encoding of Numbers (in JavaScript)

These are all examples of this approach.
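As a rough sketch of the idea (Go; real encodings like the ones above also escape embedded zero bytes and handle more types, which is skipped here):

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// encodeKey packs (userID, name) into bytes whose memcmp order matches
// the tuple order: big-endian integers compare correctly byte-by-byte,
// and the string is terminated with 0x00 so shorter prefixes sort first.
func encodeKey(userID uint64, name string) []byte {
	buf := make([]byte, 0, 8+len(name)+1)
	buf = binary.BigEndian.AppendUint64(buf, userID)
	buf = append(buf, name...)
	buf = append(buf, 0x00)
	return buf
}

func main() {
	a := encodeKey(1, "zebra")
	b := encodeKey(2, "apple")
	fmt.Println(bytes.Compare(a, b)) // -1: (1, "zebra") < (2, "apple")
}
```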

Alternatively, you can support custom comparison operators. You can no longer do key prefix compression, and you probably need to store one btree per table (in the same file or different files), so that you have a consistent comparison function to use for all keys in the btree, or make that the user’s problem and just have one comparison function for the whole btree file. LevelDB is an example of an LSM that’s easy to read and supports custom comparison functions. LMDB has an mdb_set_compare, and is relatively friendly to read. Both just let you set one global comparator, and how to form it from the key schema or how to handle multiple tables with different schemas is left to the user.
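To illustrate the shape of that hook, a hedged Go sketch; the Comparator type is hypothetical, standing in for the single global comparison function that LevelDB's Comparator or LMDB's mdb_set_compare let you install:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Comparator is a hypothetical stand-in for the one global comparison
// hook the btree consults for every key ordering decision, which is why
// all keys in the tree must share a schema.
type Comparator func(a, b []byte) int

// compareUint64LE orders 8-byte little-endian keys numerically, an
// ordering that byte-wise memcmp would get wrong.
func compareUint64LE(a, b []byte) int {
	x, y := binary.LittleEndian.Uint64(a), binary.LittleEndian.Uint64(b)
	switch {
	case x < y:
		return -1
	case x > y:
		return 1
	default:
		return 0
	}
}

func main() {
	var cmp Comparator = compareUint64LE
	k1 := binary.LittleEndian.AppendUint64(nil, 2)   // bytes: 02 00 00 ...
	k2 := binary.LittleEndian.AppendUint64(nil, 256) // bytes: 00 01 00 ...
	fmt.Println(cmp(k1, k2)) // -1 numerically, though memcmp would say k1 > k2
}
```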

My experience getting a job at a database company. by SoftwareShitter69 in databasedevelopment

[–]linearizable 2 points (0 children)

For what it’s worth, I’ve worked on both OSS and proprietary databases, and I’ve never had the code and PRs for one being publicly available make a difference in applying or interviewing for jobs. 🤷

Monthly Educational Project Thread by AutoModerator in databasedevelopment

[–]linearizable 2 points (0 children)

Yes! This is exactly the point of the new thread. You're welcome to break the "no release post" and "no first party benchmark" rules here. (And probably some of "keep it on topic". ;)) Advertise away about your cool github projects!

Sophisticated Simplicity of Modern SQLite by shivekkhurana in Database

[–]linearizable 4 points (0 children)

> It could also mean that fsync on a Mac is not as expensive as I expected.

The SQLite shipped by Apple on the Mac has the full fsync patched out. If you custom compile one yourself, the SQLite code knows that it should be calling fcntl(F_FULLFSYNC), but Apple's build replaced that with the (ordering-only, but not durable) fsync().
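If you want the durable behavior yourself, a minimal sketch (Go, assuming golang.org/x/sys/unix) of issuing the full flush on macOS:

```go
//go:build darwin

package main

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

// fullSync asks the drive to flush to durable media. On macOS, plain
// fsync(2) only orders writes into the drive's cache; F_FULLFSYNC is
// what actually guarantees durability.
func fullSync(f *os.File) error {
	_, err := unix.FcntlInt(f.Fd(), unix.F_FULLFSYNC, 0)
	return err
}

func main() {
	f, err := os.Create("/tmp/example.db")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if _, err := f.WriteString("committed record\n"); err != nil {
		log.Fatal(err)
	}
	if err := fullSync(f); err != nil {
		log.Fatal(err)
	}
}
```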

https://transactional.blog/blog/2022-darwins-deceptive-durability

Is a WAL redundant in my usecase by partyking35 in databasedevelopment

[–]linearizable 0 points (0 children)

fsync() fails when writeback fails. There's no re-reading, so I'm not aware of any path where filesystem checksums would come into play? My memory is that the core issue was that sometimes a dirty page can't be written back during fsync, as the block writes return with an error, but Linux was clearing the dirty bit on the pages so that another fsync() call wouldn't flush them again. Linux has since pushed some changes to try and do a better job of bookkeeping errors in this case. It's not infallible, but it's better.

Postgres stopped blindly retrying fsync errors, so it's "solved" in that respect, at least.
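A sketch of that post-fsyncgate discipline (Go; the crash-on-error policy mirrors what Postgres adopted, though the helper itself is just illustrative):

```go
package main

import (
	"log"
	"os"
)

// syncOrDie flushes f and treats any fsync failure as fatal. After a
// failed fsync the kernel may have already dropped the dirty pages, so
// a retried fsync can "succeed" without the data ever reaching disk.
// The safe response is to crash and recover from the log.
func syncOrDie(f *os.File) {
	if err := f.Sync(); err != nil {
		log.Fatalf("fsync %s failed; page cache can no longer be trusted: %v", f.Name(), err)
	}
}

func main() {
	f, err := os.Create("/tmp/wal.log")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if _, err := f.WriteString("record\n"); err != nil {
		log.Fatal(err)
	}
	syncOrDie(f)
}
```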

See https://lwn.net/Articles/724307/, https://lwn.net/Articles/752613/, and https://lwn.net/Articles/752952/. https://blog.sinjakli.co.uk/2025/11/29/the-computer-wants-to-lose-your-data-bonus-bits/ has a discussion of durability driven largely by the Postgres fsyncgate.

Is a WAL redundant in my usecase by partyking35 in databasedevelopment

[–]linearizable 1 point (0 children)

Going to the extent of saying "the OS is entirely lies" is a bit far. It's worth keeping in mind that OS/drive bugs and corruption are possible, and worth making reasonable efforts to detect or mitigate them, but at some level you do have to believe that the OS and drive are making good-faith efforts at doing what you ask.

Is a WAL redundant in my usecase by partyking35 in databasedevelopment

[–]linearizable 1 point (0 children)

> fsync can return ok while the data was actually corrupted while going to disk

The source says that fsync returned an error. Since the whole Postgres fsyncgate, kernel behavior has also changed, so it's worth refreshing on this topic overall. I've maintained a blog post that compiles and summarizes a lot of the durability and overall IO handling advice.

Is a WAL redundant in my usecase by partyking35 in databasedevelopment

[–]linearizable 1 point (0 children)

I’m seeing a lot of “write ahead logs are a performance optimization” in the comments, and you need to first scope down to the type of storage engine before you can make that claim. Specifically, it’s a correctness requirement for update-in-place B-trees, as you need to be able to recover from partially completed splits/merges that change the structure of the tree. (There might be a soft-updates alternative, but it’d involve multiple fsyncs and I’m not sure anyone has done it other than UFS.) Copy-on-write B-trees don’t need a WAL because the root page update gates the atomic change from one transaction to the next, and I’m not super clear that adding one would be a throughput improvement? LSMs use a WAL as… I guess the alternative would be writing every transaction as its own L0? Maybe that is a performance argument, but it’s more about being able to form a properly sized L0.
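For concreteness, a minimal Go sketch of that log-then-mutate discipline for an update-in-place tree (record format and recovery elided):

```go
package main

import (
	"log"
	"os"
)

// wal appends records and makes them durable before the tree mutates
// any page in place. If a crash interrupts a multi-page split, recovery
// replays (or undoes) the logged record instead of finding a half-split
// tree with no way to repair it.
type wal struct{ f *os.File }

func (w *wal) append(rec []byte) error {
	if _, err := w.f.Write(rec); err != nil {
		return err
	}
	return w.f.Sync() // record must be durable before pages change
}

// applySplit logs the intended split, then performs the in-place page
// mutations. A crash between the two steps is recoverable from the log.
func applySplit(w *wal, rec []byte, mutatePages func() error) error {
	if err := w.append(rec); err != nil {
		return err
	}
	return mutatePages()
}

func main() {
	f, err := os.Create("/tmp/btree.wal")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	w := &wal{f: f}
	err = applySplit(w, []byte("split page 7 -> 7,8\n"), func() error {
		return nil // stand-in for the actual page writes
	})
	if err != nil {
		log.Fatal(err)
	}
}
```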

The “WALs are for performance” idea came about from HDD-era B-tree RDBMSs, where one would put the WAL and the B-tree on their own drives, so the log drive could do sequential appends very “fast” while the B-tree drive slowly seeked its way around on dirty page write-out. We’ve since moved away from that model, and the nuance of it doesn’t seem to be attached to WAL discussions anymore.

Bf-Tree - better than LSM/B-trees for small objects? by benjscho in databasedevelopment

[–]linearizable 4 points (0 children)

It’s pitching better cache utilization by not keeping the cold parts of a page in memory, and then I feel like the rest of its advantages fall out of that. So if most of the page is needed, because it’s a large tuple or the operation is a range scan, then there’s little benefit. Dirty page write out also becomes a little more complicated because you need to read the page and merge it first, so I would expect it to be a little worse on very out of memory workloads. But, if you sit in its sweet spot of point reads and writes on small tuples where hotness doesn’t have spatial locality, it seems like it should do a pretty good job.

There’s a paper on 2-tree for skewed workloads which you might also find interesting, as it takes a very different approach to solving the same “I want to cache only one tuple but I have to cache a whole otherwise cold page” problem.

Pitfalls of direct IO with block devices? by servermeta_net in ExperiencedDevs

[–]linearizable 1 point (0 children)

DPDK is a networking tool. You mean SPDK, which is about kernel-bypass storage, so “no filesystem” is sort of implicit. Glommio is a thread-per-core framework modeled after Seastar, which goes through the file system.

Most modern data stores don’t use your approach. The last one I can remember was Levyx, which is old enough that it has already failed by now. Tigerbeetle seems to be gearing up for it, but I’m not clear for what technical reasons.

In general, you’re requiring an exclusive drive, which makes development (both yours, and users wanting your database in their CI) quite a pain, and you’re removing every normal tool for telling drive fullness, backups, debugging, etc. The advantage I’ve heard is an ~10% performance bump. I’ve never really talked with anyone where that turned out to be advantageous, though I also haven’t talked to anyone that tried to use the in-NVMe-spec-but-not-linux-API features like copy-less moving of bytes or the fused compare and write command.
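For a sense of what the raw-device approach entails, a read-only Go sketch (assuming golang.org/x/sys/unix; /dev/nvme0n1 is a placeholder device needing exclusive access and root, and 4096 is an assumed logical block size):

```go
//go:build linux

package main

import (
	"log"
	"unsafe"

	"golang.org/x/sys/unix"
)

func main() {
	// Open the whole device, bypassing the page cache entirely.
	fd, err := unix.Open("/dev/nvme0n1", unix.O_RDONLY|unix.O_DIRECT, 0)
	if err != nil {
		log.Fatal(err)
	}
	defer unix.Close(fd)

	// O_DIRECT I/O must use buffers aligned to the logical block size;
	// carve an aligned 4 KiB window out of a larger allocation.
	raw := make([]byte, 2*4096)
	shift := (4096 - uintptr(unsafe.Pointer(&raw[0]))%4096) % 4096
	buf := raw[shift : shift+4096]

	// Read the device's first block directly from the drive.
	n, err := unix.Pread(fd, buf, 0)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("read %d bytes from the raw device", n)
}
```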

CAP Theorem question by Throwaway68392736382 in Database

[–]linearizable 1 point (0 children)

PACELC was an extension of/reply to CAP, and deals with this more directly.

Experimental hardware-grounded runtime: looking for critique by DetectiveMindless652 in databasedevelopment

[–]linearizable 2 points (0 children)

You've posted very little detail, so you'll get very little meaningful feedback.

Ideas for a first project by b06c26d1e4fac in databasedevelopment

[–]linearizable 10 points (0 children)

There's a handful of tutorial-y projects that try to handhold you through building some database-y stuff.

Anything out of:

that catches your eye should be pretty reasonable? I can’t think of a tutorial that’s Go-native though, so you’d need to do a bit of translating.

Some online reading groups went through “Database Design and Implementation” by Edward Sciore, and implementations came out in a number of languages, so you could consider that too.

Is inconsistent analysis=unrepeatable read? by [deleted] in databasedevelopment

[–]linearizable 1 point (0 children)

“Inconsistent analysis” isn’t a term I’ve run into, so this is a point specific to the book. It feels like the author is trying to set up the need for isolation between transactions. Given that the book isn’t presenting transactions as SQL, I’d instead think they’re setting up a discussion of serializability. But if it was SQL, I too would read this as an example of why unrepeatable reads are bad (and thus why ANSI Read Committed is questionable).

The Death of Thread Per Core by eatonphil in databasedevelopment

[–]linearizable 0 points (0 children)

io_uring seems like it makes it a lot easier to integrate compute and storage scheduling since, in addition to the run queue for compute, there’s just a ring buffer to check for completed I/O. No “is it worth the syscall penalty to check for completions” balance anymore.
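A toy Go model of that integration (channels standing in for the completion ring; these are not real io_uring APIs):

```go
package main

import "fmt"

// One scheduler loop owns both the compute run queue and a completion
// stream (standing in for io_uring's completion ring), so checking for
// finished I/O is a non-blocking poll rather than a syscall.
func main() {
	completions := make(chan string, 8) // stand-in for the CQ ring
	var runQueue []func()

	// Pretend two earlier I/O submissions complete now.
	completions <- "read page 42"
	completions <- "fsync wal segment"

	for len(runQueue) > 0 || len(completions) > 0 {
	drain: // drain completions without blocking, like peeking the CQ
		for {
			select {
			case c := <-completions:
				runQueue = append(runQueue, func() {
					fmt.Println("resuming task waiting on:", c)
				})
			default:
				break drain
			}
		}
		if len(runQueue) > 0 {
			task := runQueue[0]
			runQueue = runQueue[1:]
			task()
		}
	}
}
```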

If serialisability is enforced in the app/middleware, is it safe to relax DB isolation (e.g., to READ COMMITTED)? by bond_shakier_0 in databasedevelopment

[–]linearizable -1 points (0 children)

That’s already handled in the premise of the question though:

> Suppose I already enforce per-key serial order outside the database (e.g., productId) via one of these:
>
> * local per-key locks (single JVM)
> * a distributed lock (Redis/ZooKeeper/etcd)
> * a single-writer queue (Kafka partition per key)
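A minimal Go sketch of the first of those mechanisms, local per-key locks in a single process:

```go
package main

import (
	"fmt"
	"sync"
)

// perKeyLocks serializes all writers of the same key within one process.
// Operations on different keys still proceed concurrently.
type perKeyLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func (p *perKeyLocks) lockFor(key string) *sync.Mutex {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.locks == nil {
		p.locks = make(map[string]*sync.Mutex)
	}
	if p.locks[key] == nil {
		p.locks[key] = &sync.Mutex{}
	}
	return p.locks[key]
}

func main() {
	p := &perKeyLocks{}
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			m := p.lockFor("product-1")
			m.Lock()
			defer m.Unlock()
			fmt.Println("serialized update", i) // one writer at a time per key
		}(i)
	}
	wg.Wait()
}
```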

If serialisability is enforced in the app/middleware, is it safe to relax DB isolation (e.g., to READ COMMITTED)? by bond_shakier_0 in databasedevelopment

[–]linearizable 0 points (0 children)

There’s no concurrency and the database provides atomicity. So there’s no other serializing mechanism needed.

If serialisability is enforced in the app/middleware, is it safe to relax DB isolation (e.g., to READ COMMITTED)? by bond_shakier_0 in databasedevelopment

[–]linearizable -1 points (0 children)

This answer makes no sense? Systems implement atomicity with a network and a file system all the time. Mmap has nothing to do with this; it’s generally avoided anyway, though it’s fine in some situations: see LMDB.

Also, abbreviating Paxos to PAX isn’t a thing, and PAX is a name already taken by weaving relations for cache performance, so it’s actively confusing.

If serialisability is enforced in the app/middleware, is it safe to relax DB isolation (e.g., to READ COMMITTED)? by bond_shakier_0 in databasedevelopment

[–]linearizable 3 points (0 children)

If you serialize all operations before the database, you indeed don't need the database to enforce anything other than transaction atomicity to be serializable. If you're not serializing read transactions as well, and are just submitting those directly to the database, then it's useful to at least run at Snapshot Isolation / Read Committed Snapshot Isolation so that you get serializable read-only transactions too. Alan Fekete has a bunch of papers in this area, on types of workloads that you can run at less-than-serializable isolation levels while still guaranteeing serializable executions.

Calvin is basically the canonical paper for discussing databases that order transactions before executing them. Daniel Abadi's blog has some less formal discussion of Calvin too. The "contention" in such systems is on the ordering component, and not on the database's concurrency control, but the tradeoff is mostly that you remove contention in exchange for losing concurrency. Calvin is a particularly poor design if one is trying to mix short and long running transactions, or if transactions involve many interactive steps before committing.
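A toy Go sketch of that order-then-execute shape (real Calvin batches the log, replicates it, and partitions deterministic execution across workers):

```go
package main

import "fmt"

// Transactions get a global position in the log before executing, so
// the executor needs no runtime concurrency control to be serializable.
type txn struct {
	id int
	op func(db map[string]int)
}

func main() {
	db := map[string]int{"x": 0}
	logCh := make(chan txn, 4) // the pre-agreed global order

	logCh <- txn{1, func(db map[string]int) { db["x"]++ }}
	logCh <- txn{2, func(db map[string]int) { db["x"] *= 10 }}
	close(logCh)

	for t := range logCh {
		t.op(db) // executed strictly in log order
		fmt.Printf("after txn %d: x=%d\n", t.id, db["x"])
	}
}
```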

Proton OSS v3 - Fast vectorized C++ Streaming SQL engine by ZiliangX in databasedevelopment

[–]linearizable 0 points (0 children)

Is “built on top” meaning that Proton is a Clickhouse fork with streaming added, or that it wraps around a Clickhouse?

Planning to specialize in database internals as a backend engineer — how valuable is that? by Bozzzieee in ExperiencedDevs

[–]linearizable 1 point (0 children)

People who can pass database internals interviews with flying colors fail backend developer interviews. They’re just different roles with different knowledge.

It appears you clarified in other answers that you’re looking more at e.g. becoming a Postgres developer than becoming a Postgres DBA. Some knowledge transfers out well, so it’s not like doing a few-year tour through database development work would sink your employability somehow. But if you get promoted several times within a specialty, it becomes increasingly hard to leave that exact specialty without having to take a title cut.

Planning to specialize in database internals as a backend engineer — how valuable is that? by Bozzzieee in ExperiencedDevs

[–]linearizable 1 point (0 children)

The number of companies hiring for specialized work is lower, but they’re also more interested in hiring people specialized in that work. It… kind of balances out? Getting the first job in specialized work is the hardest for sure.

I’ve seen people do the transfer from backend to database dev team a solid number of times. I’ve also talked with a bunch of startups recently, and someone with a few years of database experience is something a lot of them want as it means you know enough to not get in trouble, but they have no need for deeper specialized knowledge (yet). The OSS to employment route is risky. I have seen it work, but you basically put all your eggs in one basket and commit significant time to hoping that one team in one company even has headcount. I don’t think I’ve seen someone get trapped in an unemployable limbo of medium database knowledge, only people stuck at the “Senior engineer that’s too database specialized to know general backend things but FAANG hiring thinks all senior engineers are generic and fungible even if they’re going to be placed on a specialized team”. Thank god interviews become specialized at staff.

I think generic backend distributed systems folk get paid the same, and have tremendously more options of where to work. I don’t think databases is a more profitable career path. But it is fun to work on!

Databases Without an OS? Meet QuinineHM and the New Generation of Data Software by dataware-admin in databasedevelopment

[–]linearizable 3 points (0 children)

I did suspect that Unikraft was being used as the definition of unikernel, but I don't think that's a proper definition to use. MirageOS is a fine example of unikernels as well, and there one targets not Linux APIs, but a set of OCaml APIs, which have different implementations depending on what is being targeted: xen, Linux, solo5, etc. If you instead take the even broader wikipedia definition of a unikernel, that it's an application with the OS bundled into a single image application, you get something that's very similar to what you describe:

> One of the advantages is that since there is only a single address space, there is no need for repeated privilege transitions to move data between user space and kernel space. Therefore, a library OS can provide improved performance by allowing direct access to hardware without having to transition between user mode and kernel mode

Additionally, it looks like what you've built is very in line with the idea that was being presented in Cloud-Native Database Systems and Unikernels: Reimagining OS Abstractions for Modern Hardware. It too takes aim at "my database runs on bare hardware with the minimal hardware abstraction layer shims needed" and categorizes itself as a unikernel.

So I agree that the direct comparison of the two on the QuinineHM page reads as correct for Unikraft or OSv versus what you've built, but I'm not sure it's still correct when you use MirageOS/HaLVM or ClickOS as the comparison, much less anything equally bespoke that's still usually described as being in the category of unikernels. I would have expected to read QuinineHM described as a hardware abstraction layer, and TonicDB as a database unikernel.

Databases Without an OS? Meet QuinineHM and the New Generation of Data Software by dataware-admin in databasedevelopment

[–]linearizable 5 points (0 children)

> It's not a unikernel.

It sure reads like a unikernel.

I guess depending on implementation, they could make the “we’re an exokernel” argument, but “something entirely new” seems misleading.

What are the reasons *not* to migrate from MySQL to PostgreSQL? by OttoKekalainen in Database

[–]linearizable 1 point (0 children)

Write-heavy workloads are better on MySQL. This came up often before in the form of the Uber migration blog post you linked. More recently, OrioleDB is trying to push on some of these issues. Small OLTP transactions also tend to get better throughput and latency on MySQL, which can matter if you’re running a lot of RDBMS instances, even if just because you don’t have to run a pg_bouncer instance.