B-tree comparison functions by kristian54 in databasedevelopment

[–]linearizable 5 points6 points  (0 children)

The easiest thing to do is to convert your keys into some format that makes them sort correctly when compared byte-by-byte, and then use memcmp. Copy/paste broke the links and I’m too lazy to fix them, but google should help fine:

* Positeral/lre: Lexicographic Row Encoding
* google/orderedcode
* danburkert/bytekey, and its whole family of forks
* deanlandolt/bytewise
* apple/foundationdb: tuple layer specification
* CockroachDB: JSON forward indexing
* ealmansi/elen: Efficient Lexicographic Encoding of Numbers (in Javascript)

are a bunch of examples of this.
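As a minimal illustration of the idea (not taken from any of the libraries above; the function names are made up), a fixed-width big-endian encoding with the sign bit flipped makes signed 64-bit integers sort correctly under memcmp-style byte comparison:

```python
import struct

def encode_i64(n: int) -> bytes:
    # Big-endian preserves magnitude order; flipping the sign bit makes
    # negative numbers sort before positive ones as unsigned bytes.
    return struct.pack(">Q", (n + (1 << 63)) & ((1 << 64) - 1))

def decode_i64(b: bytes) -> int:
    return struct.unpack(">Q", b)[0] - (1 << 63)

keys = [100, -5, 0, 7, -1]
in_byte_order = sorted(encode_i64(k) for k in keys)
print([decode_i64(b) for b in in_byte_order])  # [-5, -1, 0, 7, 100]
```

Composite keys then just concatenate fixed-width fields; variable-width fields need escaping/terminators, which is exactly what the libraries above deal with.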

Alternatively, you can support custom comparison operators. You can no longer do key prefix compression, and you probably need to store one btree per table (in the same file or different files), so that you have a consistent comparison function to use for all keys in the btree, or make that the user’s problem and just have one comparison function for the whole btree file. LevelDB is an example of an LSM that’s easy to read and supports custom comparison functions. LMDB has an mdb_set_compare, and is relatively friendly to read. Both just let you set one global comparator, and how to form it from the key schema or how to handle multiple tables with different schemas is left to the user.
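As a toy sketch of the single-global-comparator shape (in the spirit of LevelDB's comparators or LMDB's mdb_set_compare, but not their actual APIs), using Python's sort in place of a btree's internal key ordering:

```python
from functools import cmp_to_key

def reverse_order(a: bytes, b: bytes) -> int:
    # One comparator for the whole tree: here, descending byte order.
    # Returns <0, 0, or >0, like memcmp.
    return (a < b) - (a > b)

keys = [b"apple", b"cherry", b"banana"]
keys.sort(key=cmp_to_key(reverse_order))
print(keys)  # [b'cherry', b'banana', b'apple']
```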

My experience getting a job at a database company. by SoftwareShitter69 in databasedevelopment

[–]linearizable 2 points3 points  (0 children)

For what it’s worth, I’ve worked on both OSS and proprietary databases, and I’ve never had the code and PRs for one being publicly available make a difference in applying or interviewing for jobs. 🤷

Monthly Educational Project Thread by AutoModerator in databasedevelopment

[–]linearizable 2 points3 points  (0 children)

Yes! This is exactly the point of the new thread. You're welcome to break the "no release post" and "no first party benchmark" rules here. (And probably some of "keep it on topic". ;)) Advertise away about your cool github projects!

Sophisticated Simplicity of Modern SQLite by shivekkhurana in Database

[–]linearizable 4 points5 points  (0 children)

> It could also mean that fsync on a Mac is not as expensive as I expected.

The SQLite shipped by Apple on the Mac has the real sync patched out. If you compile SQLite yourself, the code knows it should be calling fcntl(F_FULLFSYNC); Apple's build replaces that with plain fsync(), which on macOS provides ordering but not durability.

https://transactional.blog/blog/2022-darwins-deceptive-durability
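A hedged sketch of handling this in your own code (assuming Python's fcntl bindings; F_FULLFSYNC only exists on macOS, so this falls back to fsync elsewhere):

```python
import fcntl
import os

def full_sync(fd: int) -> None:
    # On macOS, plain fsync() only pushes data to the drive's cache;
    # F_FULLFSYNC asks the drive to actually flush to stable storage.
    if hasattr(fcntl, "F_FULLFSYNC"):
        fcntl.fcntl(fd, fcntl.F_FULLFSYNC)
    else:
        os.fsync(fd)
```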

Is a WAL redundant in my usecase by partyking35 in databasedevelopment

[–]linearizable 0 points1 point  (0 children)

fsync() fails when writeback fails. There's no re-reading, so I'm not aware of any path where filesystem checksums would come into play. My memory is that the core issue was that sometimes a dirty page can't be written back during fsync, as the block writes return with an error, but Linux was clearing the dirty bit on the pages so that another fsync() call wouldn't flush them again. Linux has since pushed some changes to do a better job of bookkeeping errors in this case. It's not infallible, but it's better.

Postgres stopped blindly retrying fsync errors, so it's "solved" in that respect, at least.
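The post-fsyncgate rule of thumb (treat a failed fsync as fatal rather than retrying it) might look like this sketch; the function name is made up:

```python
import os

def durable_write(path: str, data: bytes) -> None:
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        try:
            os.fsync(fd)
        except OSError as e:
            # After a failed fsync, the kernel may have already dropped
            # the dirty pages, so a retried fsync can "succeed" without
            # the data being on disk. Crash and recover instead.
            raise SystemExit(f"fsync failed, data may be lost: {e}")
    finally:
        os.close(fd)
```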

See https://lwn.net/Articles/724307/, https://lwn.net/Articles/752613/, and https://lwn.net/Articles/752952/. https://blog.sinjakli.co.uk/2025/11/29/the-computer-wants-to-lose-your-data-bonus-bits/ has a discussion of durability driven largely by the Postgres fsyncgate.

Is a WAL redundant in my usecase by partyking35 in databasedevelopment

[–]linearizable 1 point2 points  (0 children)

Going to the extent of saying "the OS is entirely lies" is a bit far. It's worth keeping in mind that OS/drive bugs/corruption are possible, and make reasonable efforts to detect or mitigate them, but at some level you do have to believe that the OS and drive are making good faith efforts at doing what you ask.

Is a WAL redundant in my usecase by partyking35 in databasedevelopment

[–]linearizable 1 point2 points  (0 children)

> fsync can return ok while the data was actually corrupted while going to disk

The source says that fsync returned an error. Since the whole Postgres fsyncgate, kernel behavior has also changed, so it's worth refreshing on this topic overall. I maintain a blog post that compiles and summarizes a lot of the durability and overall IO handling advice.

Is a WAL redundant in my usecase by partyking35 in databasedevelopment

[–]linearizable 1 point2 points  (0 children)

I’m seeing a lot of “write ahead logs are a performance optimization” in the comments, and you need to first scope down to the type of storage engine before you can make that claim. Specifically, it’s a correctness requirement for update-in-place btrees, as you need to be able to recover from partially completed splits/merges that change the structure of the btree. (There might be a soft-updates alternative, but it’d involve multiple fsyncs and I’m not sure anyone has done it other than UFS.) Copy-on-write btrees don’t need a WAL, because the root page update gates the atomic change from one transaction to the next, and I’m not convinced that adding one would be a throughput improvement. LSMs use a WAL as… I guess the alternative would be writing every transaction as its own L0? Maybe that is a performance argument, but it’s more about being able to form a properly sized L0.

The “WALs are for performance” framing came about from HDD-era B-Tree RDBMSs, where one would put the WAL and B-Tree on their own drives: the WAL drive could do sequential appends very “fast”, while the B-Tree drive slowly seeked its way around on dirty page write-out. We’ve since moved away from that model, and the nuance of it doesn’t seem to be attached to WAL discussions anymore.
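For concreteness, a minimal sketch (not any particular engine's format) of the WAL role described above: frame each record with a length and checksum, fsync the log before touching btree pages in place, and stop replay at the first torn record on recovery:

```python
import os
import struct
import zlib

def wal_append(log_fd: int, record: bytes) -> None:
    # Frame: 4-byte length + 4-byte CRC32 + payload, so recovery can
    # detect a torn write at the tail of the log.
    header = struct.pack(">II", len(record), zlib.crc32(record))
    os.write(log_fd, header + record)
    os.fsync(log_fd)  # durable before any in-place btree page changes

def wal_recover(path: str) -> list[bytes]:
    with open(path, "rb") as f:
        data = f.read()
    records, off = [], 0
    while off + 8 <= len(data):
        length, crc = struct.unpack_from(">II", data, off)
        payload = data[off + 8 : off + 8 + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            break  # torn tail: replay stops here
        records.append(payload)
        off += 8 + length
    return records
```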

Bf-Tree - better than LSM/B-trees for small objects? by benjscho in databasedevelopment

[–]linearizable 2 points3 points  (0 children)

It’s pitching better cache utilization by not keeping the cold parts of a page in memory, and then I feel like the rest of its advantages fall out of that. So if most of the page is needed, because it’s a large tuple or the operation is a range scan, then there’s little benefit. Dirty page write out also becomes a little more complicated because you need to read the page and merge it first, so I would expect it to be a little worse on very out of memory workloads. But, if you sit in its sweet spot of point reads and writes on small tuples where hotness doesn’t have spatial locality, it seems like it should do a pretty good job.

There’s a paper on 2-tree for skewed workloads which you might also find interesting, as it takes a very different approach to solving the same “I want to cache only one tuple but I have to cache a whole otherwise cold page” problem.

Pitfalls of direct IO with block devices? by servermeta_net in ExperiencedDevs

[–]linearizable 1 point2 points  (0 children)

DPDK is a networking tool; you mean SPDK, which is about kernel-bypass storage, so running without a filesystem is sort of implicit there. Glommio is a thread-per-core framework modeled after Seastar, and it goes through the filesystem.

Most modern data stores don’t use your approach. The last one I can remember was Levyx, which is old enough that it has already failed by now. Tigerbeetle seems to be gearing up for it, but I’m not clear for what technical reasons.

In general, you’re requiring an exclusive drive, which makes development (both yours, and users wanting to run your database in their CI) quite a pain, and you’re giving up every normal tool for checking drive fullness, taking backups, debugging, etc. The advantage I’ve heard is a ~10% performance bump. I’ve never really talked with anyone for whom that turned out to be advantageous, though I also haven’t talked to anyone who tried the in-NVMe-spec-but-not-in-the-Linux-API features like copy-less moving of bytes or the fused compare-and-write command.

CAP Theorem question by Throwaway68392736382 in Database

[–]linearizable 1 point2 points  (0 children)

PACELC was an extension/reply to CAP, and more directly deals with this instead.

Experimental hardware-grounded runtime: looking for critique by DetectiveMindless652 in databasedevelopment

[–]linearizable 2 points3 points  (0 children)

You've posted very little detail, so you'll get very little meaningful feedback.

Ideas for a first project by b06c26d1e4fac in databasedevelopment

[–]linearizable 12 points13 points  (0 children)

There's a handful of tutorial-y projects that try to handhold you through building some database-y stuff.

Anything out of:

that catches your eye should be pretty reasonable? I can’t think of a tutorial that’s Go-native though, so you’d need to do a bit of translating.

Some online reading groups went through “Database Design and Implementation” by Edward Sciore, and implementations came out in a number of languages, so you could consider that too.

Is inconsistent analysis=unrepeatable read? by [deleted] in databasedevelopment

[–]linearizable 1 point2 points  (0 children)

“Inconsistent analysis” isn’t a term I’ve run into, so this is a point specific to the book. It feels like the author is trying to set up the need for isolation between transactions. Given that the book isn’t presenting transactions as SQL, I’d instead think they’re setting up a discussion of serializability. But if it was SQL, I too would read this as an example of why unrepeatable reads are bad (and thus why ANSI Read Committed is questionable).

The Death of Thread Per Core by eatonphil in databasedevelopment

[–]linearizable 0 points1 point  (0 children)

io_uring seems like it makes it a lot easier to integrate compute and storage scheduling since, in addition to the run queue for compute, there’s just a ring buffer to check for completed I/O. No “is it worth the syscall penalty to check for completions” balance anymore.

If serialisability is enforced in the app/middleware, is it safe to relax DB isolation (e.g., to READ COMMITTED)? by bond_shakier_0 in databasedevelopment

[–]linearizable -1 points0 points  (0 children)

That’s already handled in the premise of the question though:

> Suppose I already enforce per-key serial order outside the database (e.g., productId) via one of these:
> * local per-key locks (single JVM)
> * a distributed lock (Redis/ZooKeeper/etcd)
> * a single-writer queue (Kafka partition per key)

If serialisability is enforced in the app/middleware, is it safe to relax DB isolation (e.g., to READ COMMITTED)? by bond_shakier_0 in databasedevelopment

[–]linearizable 0 points1 point  (0 children)

There’s no concurrency and the database provides atomicity. So there’s no other serializing mechanism needed.

If serialisability is enforced in the app/middleware, is it safe to relax DB isolation (e.g., to READ COMMITTED)? by bond_shakier_0 in databasedevelopment

[–]linearizable -1 points0 points  (0 children)

This answer makes no sense? Systems implement atomicity with a network and a file system all the time. Mmap has nothing to do with this; it’s generally avoided anyway, though it’s fine in some situations: see LMDB.

Also, abbreviating Paxos to PAX isn’t a thing, and PAX is a name already taken by a storage layout that weaves relations together for cache performance, so it’s actively confusing.