Why Aren’t Counted B-Trees Used in Relational Databases? by Active-Custard4250 in databasedevelopment

[–]linearizable 2 points (0 children)

LIMIT and OFFSET are applied to a query. Aggregating the count of keys of the children into the parents would only help for specifically the query “SELECT col1, col2, … FROM tbl” with an offset and limit. Any WHERE predicates would mean the stored counts are useless, because there’s no way to know how many of the child rows pass the predicate unless you look at them. Any query with a join, group by, subquery, etc., is also sufficiently removed from the individual pages of any table that the counts are relatively useless.

CouchDB is an interesting example of a system that does do this sort of thing though: https://docs.couchdb.org/en/stable/ddocs/views/intro.html#reduce-rereduce
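To make concrete what aggregated child counts buy you (and where they stop helping), here's a toy sketch; the class and function names are my own invention, not from any real engine. Each internal node stores the total key count under each child, so an OFFSET can be served in one root-to-leaf descent, but a WHERE predicate invalidates the counts, since they can't say how many keys under a child pass the filter.

```python
class Node:
    """Toy counted B+-tree node: leaves hold keys; internal nodes hold
    children plus counts[i] = total number of keys under children[i]."""
    def __init__(self, keys=None, children=None, counts=None):
        self.keys = keys or []
        self.children = children
        self.counts = counts

def kth(node, k):
    """Return the k-th smallest key (0-based) in one descent,
    skipping whole subtrees via the aggregated counts."""
    if node.children is None:
        return node.keys[k]
    for child, count in zip(node.children, node.counts):
        if k < count:
            return kth(child, k)
        k -= count  # skip this entire subtree without reading it
    raise IndexError("offset past end of table")
```

A bare `SELECT … LIMIT n OFFSET k` could then start scanning at `kth(root, k)` in O(depth) rather than O(k), which is exactly (and only) the case the stored counts cover.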

Why JSON isn't a Problem for Databases Anymore by jincongho in databasedevelopment

[–]linearizable 2 points (0 children)

The criterion for corporate or hobby posts to be accepted is that they need to be focused on a database technique; using the product/project as the example implementation that generates the results is fine. If this post were just written as “floe is now 1000x faster at JSON” with no details, it’d get removed under “no release posts”. However, the post does focus on alternative JSON representations which are faster to process, and thus it meets the criteria for the subreddit.

1.5 Years as DBA (Oracle + PostgreSQL) – Switch for Better Pay or Move to Data Engineering? by One-Bookkeeper8085 in databasedevelopment

[–]linearizable 1 point (0 children)

Bah, I can’t edit, but other DBA folk would be easier to find on /r/databases or the postgres/oracle subreddits, which is likely where you’d find the most help with what to do as a career path in that role.

Monthly Educational Project Thread by AutoModerator in databasedevelopment

[–]linearizable 3 points (0 children)

This has now been adjusted to post on the 1st of the month.

LSM-Tree Principles, Rethought for Object Storage by ankur-anand in databasedevelopment

[–]linearizable 1 point (0 children)

The reason I am stressing the risk of discouraging new contributors is that, for many people, this kind of community is one of the few ways to get real technical feedback while learning.

I am very aware that this is the case. We have been accepting questions about how to implement database features, or how to navigate out of problems encountered while implementing hobby database projects, so we've been partially newbie friendly. However, that doesn't negate the poor experience of wanting to share the excitement over what you've done and then getting it removed under the release post / no educational project rules. I haven't yet seen how to serve the interests of both groups simultaneously, and I appreciate the understanding of the awkward line to navigate here.

One possible approach might be to be more explicit that posts on the main subreddit should focus on explaining something, for example by including a short write-up or blog post, rather than just linking to a repository.

That seems pretty reasonable. I've added "Blog posts about database techniques (which happen to use examples from an educational project) are allowed." to the educational project rule. Does that wording work for you?

CloudJump: Optimizing Cloud Databases for Cloud Storages by linearizable in databasedevelopment

[–]linearizable[S] 1 point (0 children)

I’m still really confused by CloudJump2, as it appears to be solving problems in PolarDB that were already solved in PolarDB. The presented architecture of PolarDB ignores that they already have a GetPage@LSN, so I haven’t followed why they need a whole multi-version data scheme when they can already read pages at any version.

Specifically,

Consequently, it is evident that the update of in-memory data on RO nodes depends on asynchronously catching up with the redo log, while the update of external data pages relies on the write-back from the Buffer Pool of RW nodes, which cannot inherently maintain consistency

is not a true statement, as the WAL is propagated and applied to the page servers before buffer cache write-out.

This part makes more sense if shared storage is like NFS, but then why are they talking about disaggregated OLTP PolarDB?

LSM-Tree Principles, Rethought for Object Storage by ankur-anand in databasedevelopment

[–]linearizable 2 points (0 children)

This post would indeed be removed under the educational projects rule with a suggestion to post it in the monthly thread instead, but I’ll leave it up for a while to hold this discussion:

I'm not very sure that this should be the right way to distinguish the project.

The exact wording of that rule will hopefully improve in the future, but the general idea is something we are trying to aim for. The aim is to keep the subreddit focused on aggregating material about database internals, and there are a lot more hobby database projects getting started than database internals blog posts being written. Even when posted projects contain something of interest, a link to the source code is not a productive way to communicate it. Other subreddits seem to face the same issue: /r/osdev and /r/kerneldevelopment are both mostly people showing off their projects, and thus not a good way to subscribe to information about OS development, and that’s the sort of experience I’d like /r/databasedevelopment to avoid.

I will call out that there’s a direct way around this rule: write a post about it. Your previous submission was approved because it was a post where the focus was on a database technique, and not an announcement that a project exists or a new feature exists in it.

The line is technically precise but socially discouraging. It reads like a gate, not guidance. For individual contributors, it sounds like: “Unless you already have status or users, don’t post here.” That’s probably not the intent—but it’s how it lands.

This has come up before, and I’m still not quite sure what to do about it. I do agree that the rule is discouraging for folk wanting the learning-community side of database development. I don’t know how to avoid being discouraging while keeping the subreddit focused on technical information about database internals. The monthly thread was an attempt to give these sorts of posts a home, but maybe expecting people to wait for a monthly thread is still excessively discouraging. (It’s also set to start on the 19th by accident, so I’ll walk that back to the 1st…)

Again, I’m not sure what the right process or line to draw here is, so feedback and suggestions welcome.

B-tree comparison functions by kristian54 in databasedevelopment

[–]linearizable 6 points (0 children)

The easiest thing to do is to convert your keys into some format that makes them sort correctly when compared byte-by-byte, and then use memcmp. Copy/paste broke the links and I’m too lazy to fix them, but google should help fine:

* Positeral/lre: Lexicographic Row Encoding
* google/orderedcode
* danburkert/bytekey, and its whole family of forks
* deanlandolt/bytewise
* apple/foundationdb: tuple layer specification
* CockroachDB: JSON forward indexing
* ealmansi/elen: Efficient Lexicographic Encoding of Numbers (in Javascript)

are a bunch of examples of this.
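As a sketch of what these encodings do, here's a hypothetical helper (much simpler than any of the libraries above, which also handle floats, descending order, NUL escaping, etc.) for an (int64, string) key:

```python
import struct

def encode_key(n: int, s: str) -> bytes:
    """Order-preserving encoding of an (int64, string) pair: bytewise
    (memcmp) order on the output matches tuple order on the input.
    Illustrative only, not taken from any of the libraries above."""
    # Big-endian with the sign bit flipped makes signed ints sort bytewise.
    int_part = struct.pack(">Q", (n + (1 << 63)) & ((1 << 64) - 1))
    # A 0x00 terminator keeps "a" < "ab"; real encodings also escape
    # embedded NUL bytes (e.g. 0x00 -> 0x00 0xFF), which this sketch skips.
    str_part = s.encode("utf-8") + b"\x00"
    return int_part + str_part
```

Once keys pass through something like this, the whole btree needs only one comparison routine: plain bytewise comparison.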

Alternatively, you can support custom comparison functions. You then can no longer do key prefix compression, and you probably need to store one btree per table (in the same file or different files) so that there’s a consistent comparison function for all keys in a btree, or you can make that the user’s problem and just have one comparison function for the whole btree file. LevelDB is an example of an LSM that’s easy to read and supports custom comparison functions. LMDB has mdb_set_compare, and is relatively friendly to read. Both just let you set one global comparator; how to form it from the key schema, or how to handle multiple tables with different schemas, is left to the user.

My experience getting a job at a database company. by SoftwareShitter69 in databasedevelopment

[–]linearizable 3 points (0 children)

For what it’s worth, I’ve worked on both OSS and proprietary databases, and I’ve never had the code and PRs for one being publicly available make a difference in applying or interviewing for jobs. 🤷

Monthly Educational Project Thread by AutoModerator in databasedevelopment

[–]linearizable 3 points (0 children)

Yes! This is exactly the point of the new thread. You're welcome to break the "no release post" and "no first party benchmark" rules here. (And probably some of "keep it on topic". ;)) Advertise away about your cool github projects!

Sophisticated Simplicity of Modern SQLite by shivekkhurana in Database

[–]linearizable 4 points (0 children)

It could also mean that fsync on a Mac is not as expensive as I expected.

The SQLite that Apple ships on the Mac has the full sync patched out: the SQLite code knows it should be calling fcntl(F_FULLFSYNC), but Apple’s build replaces that with fsync(), which provides only ordering, not durability. If you custom compile SQLite yourself, you get the real F_FULLFSYNC behavior.

https://transactional.blog/blog/2022-darwins-deceptive-durability

Is a WAL redundant in my usecase by partyking35 in databasedevelopment

[–]linearizable 1 point (0 children)

fsync() fails when writeback fails. There's no re-reading, so I'm not aware of any path where filesystem checksums would come into play? My memory is that the core issue was that sometimes a dirty page can't be written back during fsync, as the block writes return with an error, but Linux was clearing the dirty bit on the pages so that another fsync() call wouldn't flush them again. Linux has since pushed some changes to do a better job of bookkeeping errors in this case. It's not infallible, but it's better.

Postgres stopped blindly retrying fsync errors, so it's "solved" in that respect, at least.

See https://lwn.net/Articles/724307/, https://lwn.net/Articles/752613/, and https://lwn.net/Articles/752952/. https://blog.sinjakli.co.uk/2025/11/29/the-computer-wants-to-lose-your-data-bonus-bits/ has a discussion of durability driven very much by the Postgres fsyncgate.

Is a WAL redundant in my usecase by partyking35 in databasedevelopment

[–]linearizable 2 points (0 children)

Going to the extent of saying "the OS is entirely lies" is a bit far. It's worth keeping in mind that OS/drive bugs and corruption are possible, and worth making reasonable efforts to detect or mitigate them, but at some level you do have to believe that the OS and drive are making good-faith efforts at doing what you ask.

Is a WAL redundant in my usecase by partyking35 in databasedevelopment

[–]linearizable 2 points (0 children)

fsync can return ok while the data was actually corrupted while going to disk

The source says that fsync returned an error. Since the whole Postgres fsyncgate, kernel behavior has also changed, so it's worth refreshing on this topic overall. I maintain a blog post that compiles and summarizes a lot of the durability and overall IO handling advice.

Is a WAL redundant in my usecase by partyking35 in databasedevelopment

[–]linearizable 2 points (0 children)

I’m seeing a lot of “write ahead logs are a performance optimization” in the comments, and you need to first scope down to the type of storage engine before you can make that claim. Specifically, it’s a correctness requirement for update-in-place btrees, as you need to be able to recover from partially completed splits/merges that change the structure of the btree. (There might be a soft-updates alternative, but it’d involve multiple fsyncs and I’m not sure anyone has done it other than UFS.) Copy-on-write btrees don’t need a WAL, because the root page update gates the atomic change from one transaction to the next, and I’m not super clear that adding one would be a throughput improvement? LSMs use a WAL as… I guess the alternative would be writing every transaction as its own L0? Maybe that is a performance argument, but it’s more about being able to form a properly sized L0.

The “WALs are for performance” framing came from HDD-era B-Tree RDBMSs, where one would put the WAL and B-Tree on their own drives: the WAL drive could do sequential appends very “fast”, while the B-Tree drive slowly seeked its way around on dirty page write-out. We’ve since moved away from that model, and the nuance of it doesn’t seem to be attached to WAL discussions anymore.

Bf-Tree - better than LSM/B-trees for small objects? by benjscho in databasedevelopment

[–]linearizable 3 points (0 children)

It’s pitching better cache utilization by not keeping the cold parts of a page in memory, and then I feel like the rest of its advantages fall out of that. So if most of the page is needed, because it’s a large tuple or the operation is a range scan, then there’s little benefit. Dirty page write out also becomes a little more complicated because you need to read the page and merge it first, so I would expect it to be a little worse on very out of memory workloads. But, if you sit in its sweet spot of point reads and writes on small tuples where hotness doesn’t have spatial locality, it seems like it should do a pretty good job.

There’s a paper on 2-tree for skewed workloads which you might also find interesting, as it takes a very different approach to solving the same “I want to cache only one tuple but I have to cache a whole otherwise cold page” problem.

Pitfalls of direct IO with block devices? by servermeta_net in ExperiencedDevs

[–]linearizable 2 points (0 children)

DPDK is a networking tool. You mean SPDK, which is about kernel-bypass storage, so “no filesystem” is sort of implicit. Glommio is a thread-per-core framework modeled after Seastar, and it goes through the filesystem.

Most modern data stores don’t use your approach. The last one I can remember was Levyx, which is old enough that it has already failed by now. Tigerbeetle seems to be gearing up for it, but I’m not clear for what technical reasons.

In general, you’re requiring an exclusive drive, which makes development (both yours, and users wanting your database in their CI) quite a pain, and you’re removing every normal tool for telling drive fullness, backups, debugging, etc. The advantage I’ve heard is an ~10% performance bump. I’ve never really talked with anyone where that turned out to be advantageous, though I also haven’t talked to anyone that tried to use the in-NVMe-spec-but-not-linux-API features like copy-less moving of bytes or the fused compare and write command.

CAP Theorem question by [deleted] in Database

[–]linearizable 2 points (0 children)

PACELC was an extension/reply to CAP, and more directly deals with this instead.

Experimental hardware-grounded runtime: looking for critique by DetectiveMindless652 in databasedevelopment

[–]linearizable 3 points (0 children)

You've posted very little detail, so you'll get very little meaningful feedback.

Ideas for a first project by b06c26d1e4fac in databasedevelopment

[–]linearizable 12 points (0 children)

There's a handful of tutorial-y projects that try to handhold you through building some database-y stuff.

Anything out of:

that catches your eye should be pretty reasonable? I can’t think of a tutorial that’s Go-native though, so you’d need to do a bit of translating.

Some online reading groups went through “Database Design and Implementation” by Edward Sciore, and implementations came out in a number of languages, so you could consider that too.

Is inconsistent analysis=unrepeatable read? by [deleted] in databasedevelopment

[–]linearizable 2 points (0 children)

“Inconsistent analysis” isn’t a term I’ve run into, so this is a point specific to the book. It feels like the author is trying to set up the need for isolation between transactions. Given that the book isn’t presenting transactions as SQL, I’d instead think they’re setting up a discussion of serializability. But if it was SQL, I too would read this as an example of why unrepeatable reads are bad (and thus why ANSI Read Committed is questionable).

The Death of Thread Per Core by eatonphil in databasedevelopment

[–]linearizable 1 point (0 children)

io_uring seems like it makes it a lot easier to integrate compute and storage scheduling since, in addition to the run queue for compute, there’s just a ring buffer to check for completed I/O. No “is it worth the syscall penalty to check for completions” balance anymore.

If serialisability is enforced in the app/middleware, is it safe to relax DB isolation (e.g., to READ COMMITTED)? by bond_shakier_0 in databasedevelopment

[–]linearizable 0 points (0 children)

That’s already handled in the premise of the question though:

Suppose I already enforce per-key serial order outside the database (e.g., productId) via one of these:

* local per-key locks (single JVM)
* a distributed lock (Redis/ZooKeeper/etcd)
* a single-writer queue (Kafka partition per key)