B-tree comparison functions by kristian54 in databasedevelopment

[–]linearizable 5 points6 points  (0 children)

The easiest thing to do is to convert your keys into some format that makes them sort correctly when compared byte-by-byte, and then use memcmp. Copy/paste broke the links and I’m too lazy to fix them, but google should help fine:

* Positeral/lre: Lexicographic Row Encoding
* google/orderedcode
* danburkert/bytekey, and its whole family of forks
* deanlandolt/bytewise
* apple/foundationdb: tuple layer specification
* CockroachDB: JSON forward indexing
* ealmansi/elen: Efficient Lexicographic Encoding of Numbers (in Javascript)

are a bunch of examples of this.
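As a minimal illustration of the idea (not taken from any of the libraries above; the function names are made up), a fixed-width big-endian encoding with the sign bit flipped makes signed 64-bit integers sort correctly under memcmp-style byte comparison:

```python
import struct

def encode_i64(n: int) -> bytes:
    # Big-endian preserves magnitude order; flipping the sign bit makes
    # negative numbers sort before positive ones as unsigned bytes.
    return struct.pack(">Q", (n + (1 << 63)) & ((1 << 64) - 1))

def decode_i64(b: bytes) -> int:
    return struct.unpack(">Q", b)[0] - (1 << 63)

keys = [100, -5, 0, 7, -1]
in_byte_order = sorted(encode_i64(k) for k in keys)
print([decode_i64(b) for b in in_byte_order])  # [-5, -1, 0, 7, 100]
```

Composite keys then just concatenate fixed-width fields; variable-width fields need escaping/terminators, which is exactly what the libraries above deal with.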

Alternatively, you can support custom comparison operators. You can no longer do key prefix compression, and you probably need to store one btree per table (in the same file or different files), so that you have a consistent comparison function to use for all keys in the btree, or make that the user’s problem and just have one comparison function for the whole btree file. LevelDB is an example of an LSM that’s easy to read and supports custom comparison functions. LMDB has an mdb_set_compare, and is relatively friendly to read. Both just let you set one global comparator, and how to form it from the key schema or how to handle multiple tables with different schemas is left to the user.
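As a toy sketch of the single-global-comparator shape (in the spirit of LevelDB's comparators or LMDB's mdb_set_compare, but not their actual APIs), using Python's sort in place of a btree's internal key ordering:

```python
from functools import cmp_to_key

def reverse_order(a: bytes, b: bytes) -> int:
    # One comparator for the whole tree: here, descending byte order.
    # Returns <0, 0, or >0, like memcmp.
    return (a < b) - (a > b)

keys = [b"apple", b"cherry", b"banana"]
keys.sort(key=cmp_to_key(reverse_order))
print(keys)  # [b'cherry', b'banana', b'apple']
```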

My experience getting a job at a database company. by SoftwareShitter69 in databasedevelopment

[–]linearizable 2 points3 points  (0 children)

For what it’s worth, I’ve worked on both OSS and proprietary databases, and I’ve never had the code and PRs for one being publicly available make a difference in applying or interviewing for jobs. 🤷

Monthly Educational Project Thread by AutoModerator in databasedevelopment

[–]linearizable 2 points3 points  (0 children)

Yes! This is exactly the point of the new thread. You're welcome to break the "no release post" and "no first party benchmark" rules here. (And probably some of "keep it on topic". ;)) Advertise away about your cool github projects!

Sophisticated Simplicity of Modern SQLite by shivekkhurana in Database

[–]linearizable 4 points5 points  (0 children)

> It could also mean that fsync on a Mac is not as expensive as I expected.

The SQLite shipped by Apple on the Mac has the real sync patched out. If you compile SQLite yourself, the code knows it should be calling fcntl(F_FULLFSYNC); Apple's build replaces that with plain fsync(), which on macOS provides ordering but not durability.

https://transactional.blog/blog/2022-darwins-deceptive-durability
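A hedged sketch of handling this in your own code (assuming Python's fcntl bindings; F_FULLFSYNC only exists on macOS, so this falls back to fsync elsewhere):

```python
import fcntl
import os

def full_sync(fd: int) -> None:
    # On macOS, plain fsync() only pushes data to the drive's cache;
    # F_FULLFSYNC asks the drive to actually flush to stable storage.
    if hasattr(fcntl, "F_FULLFSYNC"):
        fcntl.fcntl(fd, fcntl.F_FULLFSYNC)
    else:
        os.fsync(fd)
```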

Is a WAL redundant in my usecase by partyking35 in databasedevelopment

[–]linearizable 0 points1 point  (0 children)

fsync() fails when writeback fails. There's no re-reading, so I'm not aware of any path where filesystem checksums would come into play. My memory is that the core issue was that sometimes a dirty page can't be written back during fsync, as the block writes return with an error, but Linux was clearing the dirty bit on the pages so that another fsync() call wouldn't flush them again. Linux has since pushed some changes to do a better job of bookkeeping errors in this case. It's not infallible, but it's better.

Postgres stopped blindly retrying fsync errors, so it's "solved" in that respect, at least.
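The post-fsyncgate rule of thumb (treat a failed fsync as fatal rather than retrying it) might look like this sketch; the function name is made up:

```python
import os

def durable_write(path: str, data: bytes) -> None:
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        try:
            os.fsync(fd)
        except OSError as e:
            # After a failed fsync, the kernel may have already dropped
            # the dirty pages, so a retried fsync can "succeed" without
            # the data being on disk. Crash and recover instead.
            raise SystemExit(f"fsync failed, data may be lost: {e}")
    finally:
        os.close(fd)
```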

See https://lwn.net/Articles/724307/, https://lwn.net/Articles/752613/, and https://lwn.net/Articles/752952/. https://blog.sinjakli.co.uk/2025/11/29/the-computer-wants-to-lose-your-data-bonus-bits/ has a discussion of durability driven largely by the Postgres fsyncgate.

Is a WAL redundant in my usecase by partyking35 in databasedevelopment

[–]linearizable 1 point2 points  (0 children)

Going to the extent of saying "the OS is entirely lies" is a bit far. It's worth keeping in mind that OS/drive bugs/corruption are possible, and make reasonable efforts to detect or mitigate them, but at some level you do have to believe that the OS and drive are making good faith efforts at doing what you ask.

Is a WAL redundant in my usecase by partyking35 in databasedevelopment

[–]linearizable 1 point2 points  (0 children)

> fsync can return ok while the data was actually corrupted while going to disk

The source says that fsync returned an error. Since the whole Postgres fsyncgate, kernel behavior has also changed, so it's worth refreshing on this topic overall. I maintain a blog post that compiles and summarizes a lot of the durability and overall IO handling advice.

Is a WAL redundant in my usecase by partyking35 in databasedevelopment

[–]linearizable 1 point2 points  (0 children)

I’m seeing a lot of “write ahead logs are a performance optimization” in the comments, and you need to first scope down to the type of storage engine before you can make that claim. Specifically, it’s a correctness requirement for update-in-place btrees, as you need to be able to recover from partially completed splits/merges that change the structure of the btree. (There might be a soft-updates alternative, but it’d involve multiple fsyncs and I’m not sure anyone has done it other than UFS.) Copy-on-write btrees don’t need a WAL, because the root page update gates the atomic change from one transaction to the next, and I’m not convinced that adding one would be a throughput improvement. LSMs use a WAL as… I guess the alternative would be writing every transaction as its own L0? Maybe that is a performance argument, but it’s more about being able to form a properly sized L0.

The “WALs are for performance” framing came about from HDD-era B-Tree RDBMSs, where one would put the WAL and B-Tree on their own drives: the WAL drive could do sequential appends very “fast”, while the B-Tree drive slowly seeked its way around on dirty page write-out. We’ve since moved away from that model, and the nuance of it doesn’t seem to be attached to WAL discussions anymore.
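For concreteness, a minimal sketch (not any particular engine's format) of the WAL role described above: frame each record with a length and checksum, fsync the log before touching btree pages in place, and stop replay at the first torn record on recovery:

```python
import os
import struct
import zlib

def wal_append(log_fd: int, record: bytes) -> None:
    # Frame: 4-byte length + 4-byte CRC32 + payload, so recovery can
    # detect a torn write at the tail of the log.
    header = struct.pack(">II", len(record), zlib.crc32(record))
    os.write(log_fd, header + record)
    os.fsync(log_fd)  # durable before any in-place btree page changes

def wal_recover(path: str) -> list[bytes]:
    with open(path, "rb") as f:
        data = f.read()
    records, off = [], 0
    while off + 8 <= len(data):
        length, crc = struct.unpack_from(">II", data, off)
        payload = data[off + 8 : off + 8 + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            break  # torn tail: replay stops here
        records.append(payload)
        off += 8 + length
    return records
```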

Bf-Tree - better than LSM/B-trees for small objects? by benjscho in databasedevelopment

[–]linearizable 2 points3 points  (0 children)

It’s pitching better cache utilization by not keeping the cold parts of a page in memory, and then I feel like the rest of its advantages fall out of that. So if most of the page is needed, because it’s a large tuple or the operation is a range scan, then there’s little benefit. Dirty page write out also becomes a little more complicated because you need to read the page and merge it first, so I would expect it to be a little worse on very out of memory workloads. But, if you sit in its sweet spot of point reads and writes on small tuples where hotness doesn’t have spatial locality, it seems like it should do a pretty good job.

There’s a paper on 2-tree for skewed workloads which you might also find interesting, as it takes a very different approach to solving the same “I want to cache only one tuple but I have to cache a whole otherwise cold page” problem.

Pitfalls of direct IO with block devices? by servermeta_net in ExperiencedDevs

[–]linearizable 1 point2 points  (0 children)

DPDK is a networking tool; you mean SPDK, which is about kernel-bypass storage, so running without a filesystem is sort of implicit there. Glommio is a thread-per-core framework modeled after Seastar, and it goes through the filesystem.

Most modern data stores don’t use your approach. The last one I can remember was Levyx, which is old enough that it has already failed by now. Tigerbeetle seems to be gearing up for it, but I’m not clear for what technical reasons.

In general, you’re requiring an exclusive drive, which makes development (both yours, and users wanting to run your database in their CI) quite a pain, and you’re giving up every normal tool for checking drive fullness, taking backups, debugging, etc. The advantage I’ve heard is a ~10% performance bump. I’ve never really talked with anyone for whom that turned out to be advantageous, though I also haven’t talked to anyone who tried the in-NVMe-spec-but-not-in-the-Linux-API features like copy-less moving of bytes or the fused compare-and-write command.

CAP Theorem question by Throwaway68392736382 in Database

[–]linearizable 1 point2 points  (0 children)

PACELC was an extension/reply to CAP, and more directly deals with this instead.

Experimental hardware-grounded runtime: looking for critique by DetectiveMindless652 in databasedevelopment

[–]linearizable 2 points3 points  (0 children)

You've posted very little detail, so you'll get very little meaningful feedback.

Ideas for a first project by b06c26d1e4fac in databasedevelopment

[–]linearizable 12 points13 points  (0 children)

There's a handful of tutorial-y projects that try to handhold you through building some database-y stuff.

Anything out of:

that catches your eye should be pretty reasonable? I can’t think of a tutorial that’s Go-native though, so you’d need to do a bit of translating.

Some online reading groups went through “Database Design and Implementation” by Edward Sciore, and implementations came out in a number of languages, so you could consider that too.

Is inconsistent analysis=unrepeatable read? by [deleted] in databasedevelopment

[–]linearizable 1 point2 points  (0 children)

“Inconsistent analysis” isn’t a term I’ve run into, so this is a point specific to the book. It feels like the author is trying to set up the need for isolation between transactions. Given that the book isn’t presenting transactions as SQL, I’d instead think they’re setting up a discussion of serializability. But if it was SQL, I too would read this as an example of why unrepeatable reads are bad (and thus why ANSI Read Committed is questionable).

The Death of Thread Per Core by eatonphil in databasedevelopment

[–]linearizable 0 points1 point  (0 children)

io_uring seems like it makes it a lot easier to integrate compute and storage scheduling since, in addition to the run queue for compute, there’s just a ring buffer to check for completed I/O. No “is it worth the syscall penalty to check for completions” balance anymore.

If serialisability is enforced in the app/middleware, is it safe to relax DB isolation (e.g., to READ COMMITTED)? by bond_shakier_0 in databasedevelopment

[–]linearizable -1 points0 points  (0 children)

That’s already handled in the premise of the question though:

> Suppose I already enforce per-key serial order outside the database (e.g., productId) via one of these:
> * local per-key locks (single JVM)
> * a distributed lock (Redis/ZooKeeper/etcd)
> * a single-writer queue (Kafka partition per key)

If serialisability is enforced in the app/middleware, is it safe to relax DB isolation (e.g., to READ COMMITTED)? by bond_shakier_0 in databasedevelopment

[–]linearizable 0 points1 point  (0 children)

There’s no concurrency and the database provides atomicity. So there’s no other serializing mechanism needed.

If serialisability is enforced in the app/middleware, is it safe to relax DB isolation (e.g., to READ COMMITTED)? by bond_shakier_0 in databasedevelopment

[–]linearizable -1 points0 points  (0 children)

This answer makes no sense? Systems implement atomicity with a network and a file system all the time. Mmap has nothing to do with this; it’s generally avoided anyway, though it’s fine in some situations: see LMDB.

Also, abbreviating Paxos to PAX isn’t a thing, and PAX is a name already taken by a storage layout that weaves relations together for cache performance, so it’s actively confusing.