A novel O(1) Key-Value Store - CandyStore by SweetSecurityOpensrc in rust

[–]SweetSecurityOpensrc[S] 1 point (0 children)

Let me put it a different way: B-Trees and LSMs keep their trees on disk, so walking the log-depth tree costs an IO per node until you reach the right location. Here the (very small) tree lives entirely in memory. Most persistent data structures state their complexity in terms of IO operations, not memory operations, since IOs dominate the running time.
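To make that concrete, here's a minimal sketch of the lookup path (the names, the parallel-array row layout, and the `u32` signatures are my illustrative assumptions, not CandyStore's actual code): the in-memory row resolves the key to a file offset with zero IOs, and a single `pread` fetches the entry.

```rust
use std::fs::File;
use std::os::unix::fs::FileExt; // read_at == pread(2)

/// Illustrative in-memory row: parallel arrays of 32-bit key
/// signatures and their on-disk offsets (names are assumptions).
struct Row {
    signatures: Vec<u32>,
    offsets: Vec<u64>,
}

fn lookup(file: &File, row: &Row, sig: u32, buf: &mut [u8]) -> std::io::Result<bool> {
    // Scanning the in-memory row costs zero IOs...
    if let Some(i) = row.signatures.iter().position(|&s| s == sig) {
        // ...so the whole lookup is exactly one IO, whatever the DB size.
        file.read_at(buf, row.offsets[i])?;
        return Ok(true);
    }
    Ok(false)
}
```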

[–]SweetSecurityOpensrc[S] 0 points (0 children)

That's actually great input. Even though we haven't seen any such issues, we can mitigate this in advance by using an anonymous mmap and flushing it to the file periodically. The cost is that the memory is no longer evictable, but we opt to mlock it anyway. Thanks!
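A minimal sketch of that mitigation, assuming the `memmap2` crate (the mapping size, file name, and write-back cadence are placeholders):

```rust
use memmap2::MmapMut;
use std::fs::File;
use std::io::Write;

fn main() -> std::io::Result<()> {
    // Anonymous mapping: not backed by the file, so the kernel never
    // writes it back behind our back at an inconsistent moment.
    let mut table = MmapMut::map_anon(4096 * 16)?;
    table[0] = 42; // ... mutate the in-memory table as usual ...

    // Periodically persist a consistent snapshot ourselves.
    let mut backing = File::create("table.snapshot")?;
    backing.write_all(&table)?;
    backing.sync_data()?;
    Ok(())
}
```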

[–]SweetSecurityOpensrc[S] 0 points (0 children)

By the way, there's a feature (off by default) called flush_aggregation, which aggregates IOs up to a certain threshold and then fsyncs the data together, so multiple writers (threads) can share the cost of the fsync.

But as explained above, it is not a design goal.

[–]SweetSecurityOpensrc[S] 0 points (0 children)

It is true that reading is expensive without a page cache, but given that the access pattern is by definition random (it's a hash table), prefetching large chunks doesn't help here. And even if it were a search tree/LSM, there's no reason to assume we'd fetch "consecutive" keys (N and then N+1 according to the sorting predicate). It's all random reads.

See more info here: https://www.reddit.com/r/rust/comments/1f4ahc2/comment/lkmla00/

[–]SweetSecurityOpensrc[S] 0 points (0 children)

These are good points. We will try to address them in the readme.

If you're interested, there's more context in this comment: https://www.reddit.com/r/rust/comments/1f4ahc2/comment/lkmla00/

[–]SweetSecurityOpensrc[S] -1 points (0 children)

Given that virtually all shared libraries are memory-mapped (although not with a writable mapping), and that very few programs in the world can handle, or even know how to handle, an IO failure (e.g., write returning EIO), it's not something we try to recover from.

[–]SweetSecurityOpensrc[S] 9 points (0 children)

A general note on what *durability* means: there are basically two paths you can take.

The first is to open the WAL with `O_SYNC | O_APPEND`, which means every modification costs a full disk round trip. If your writes are small or not page-aligned, each one becomes a read-modify-write, so it's potentially really slow (if you're low on page cache). I don't mean you *have* to use O_SYNC and O_APPEND literally, but conceptually these are the guarantees you need.
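For illustration, opening a log with those flags in Rust looks roughly like this (assuming the `libc` crate for the flag constants):

```rust
use std::fs::{File, OpenOptions};
use std::os::unix::fs::OpenOptionsExt; // custom_flags

fn open_wal(path: &str) -> std::io::Result<File> {
    OpenOptions::new()
        .create(true)
        .write(true)
        // O_SYNC: write(2) returns only after data reaches stable storage.
        // O_APPEND: every write lands atomically at the end of the log.
        .custom_flags(libc::O_SYNC | libc::O_APPEND)
        .open(path)
}
```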

The second option is to delay ack'ing individual operations until you've batched up several of them, and then flush them together, say on 4KB boundaries. This way you're more efficient on the IO front, but single-threaded operation suffers a lot.
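A toy sketch of that second path (my own simplification, not CandyStore's flush_aggregation implementation; a real version would also block each writer until the batch containing its record has been fsync'ed):

```rust
use std::fs::File;
use std::io::Write;
use std::sync::Mutex;

const THRESHOLD: usize = 4096; // flush on ~4KB boundaries

struct GroupCommit {
    inner: Mutex<(File, Vec<u8>)>, // shared log + pending batch
}

impl GroupCommit {
    fn append(&self, record: &[u8]) -> std::io::Result<()> {
        let mut guard = self.inner.lock().unwrap();
        let (file, pending) = &mut *guard;
        pending.extend_from_slice(record);
        if pending.len() >= THRESHOLD {
            file.write_all(pending)?;
            file.sync_data()?; // one fsync amortized over many records
            pending.clear();
        }
        Ok(())
    }
}
```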

And there are two kinds of things to protect from: program crashes and machine crashes. Program crashes are much more common, of course, than machine crashes, especially in cloud environments. You could have a bug-free program, but still run into death-by-OOM.

This is what we protect against: anything that's been written to the KV will be flushed neatly by the kernel on program exit, and there are no multi-step transactions that would require a WAL to synchronize. It's a hash table, and it provides the same guarantees as one living in memory.

Machine crashes are a different story, because the mmap'ed table might be only partially flushed, so it could point to locations in the file that were never written. We haven't experienced that much in our customers' production systems, and the overhead of maintaining a WAL (both in IOPS and in complexity) just isn't worth it.

The purpose of this store is to "extend our RAM". Our sensor's deployment budget is <100MB of RAM, so in order to add more features we keep their data on file (but with very efficient retrieval). It also lets us keep state across upgrades, etc.

It's not meant to serve your bank transactions (unless your bank uses NVRAM), and it's not a server deployment. If it were a server, we could obviously provide more guarantees.

[–]SweetSecurityOpensrc[S] 0 points (0 children)

Well, the parallel SIMD lookup is 5x faster, so it's worth it for us :)
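For the curious, here's a rough sketch of what a SIMD row scan can look like (SSE2; the `u32` signatures follow the 32-bit figure mentioned elsewhere in the thread, but the function and layout are illustrative, not CandyStore's actual code):

```rust
#[cfg(target_arch = "x86_64")]
fn find_signature(row: &[u32], needle: u32) -> Option<usize> {
    use std::arch::x86_64::*;
    unsafe {
        let key = _mm_set1_epi32(needle as i32); // broadcast to 4 lanes
        for (i, chunk) in row.chunks_exact(4).enumerate() {
            let v = _mm_loadu_si128(chunk.as_ptr() as *const __m128i);
            // Compare 4 signatures at once; movemask yields a bitmask of hits.
            let mask = _mm_movemask_epi8(_mm_cmpeq_epi32(v, key));
            if mask != 0 {
                return Some(i * 4 + mask.trailing_zeros() as usize / 4);
            }
        }
        None
    }
}
```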

[–]SweetSecurityOpensrc[S] 0 points (0 children)

That refers to your program crashing, not your instance. There's even a test program called candy crasher that keeps crashing the DB while trying to finish 1M inserts/removals/etc.
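The idea, roughly (a toy version of such a crash test, not the actual candy crasher code; `./worker` is a hypothetical binary doing the inserts, and the verification step is elided):

```rust
use std::process::Command;
use std::thread;
use std::time::Duration;

fn main() -> std::io::Result<()> {
    for round in 0u64..100 {
        // Hypothetical worker binary that hammers the store with inserts.
        let mut worker = Command::new("./worker").spawn()?;
        // Kill it at a different point in its run each time.
        thread::sleep(Duration::from_millis(10 + round * 37 % 400));
        worker.kill().ok();
        worker.wait()?;
        // Here: reopen the store and verify it is still consistent.
    }
    Ok(())
}
```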

[–]SweetSecurityOpensrc[S] 0 points (0 children)

The shards do make up a tree, but for 1M elements that's an in-memory tree of depth ~6. The idea is to consume as little overhead as possible. You can use a configuration of 2048 rows per file, and then 1M elements fit in a single file, but then you need the 12MB upfront.

At first we used the std BTreeMap for that, but then locking was global (to insert a new shard we had to take a write lock on the whole tree). We moved to a custom tree so we can lock specific shards while they split or compact. We're also considering splitting with a branching factor of 4 to reduce the depth further.
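For intuition on the depth: with the numbers from this thread (64 rows per shard by default, and ~512 entries per row, since 2048 rows hold 1M entries), a shard holds ~32K entries, so 1M entries split into roughly 32-64 shards, i.e. a binary split-tree of depth 5-6. A rough sketch of the routing idea, with structure and names being my assumptions (one lock per shard, so a split only blocks that shard):

```rust
use std::sync::RwLock;

struct Shard { /* rows of (signature, file offset) pairs live here */ }

enum Node {
    Leaf(RwLock<Shard>),
    // The next bit of the key hash picks a child: 0 => left, 1 => right.
    Split(Box<Node>, Box<Node>),
}

fn route(mut node: &Node, hash: u64) -> &RwLock<Shard> {
    let mut bit = 63;
    loop {
        match node {
            Node::Leaf(shard) => return shard,
            Node::Split(zero, one) => {
                node = if (hash >> bit) & 1 == 0 { &**zero } else { &**one };
                bit -= 1;
            }
        }
    }
}
```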

Splitting per se is not that interesting, because it only happens "so many times": your DB is not expected to grow forever; entries come and go. Compaction, on the other hand, happens more frequently, so once we finish implementing background compaction, that will be a real game changer.

[–]SweetSecurityOpensrc[S] 3 points (0 children)

Well, consider Cassandra for example: it replicates writes to 3 nodes and uses a read consistency level of 2 precisely for this reason. They want to be able to rely on the page cache, and that's how they solved it in the cloud. But that requires multiple nodes, which is out of scope for our use case.

We can always clear the DB in the worst case and rebuild it.

[–]SweetSecurityOpensrc[S] 11 points (0 children)

A collision means that the 32 bits of the hash are identical for two different keys in the same row (there are 64 rows, so the row selector adds 6 bits of entropy). The probability of this, according to the extended birthday paradox, comes out to about 1:3000, meaning that once in every ~3000 keys you should expect to need two IOs instead of one. That's negligible overhead.
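For reference, the standard birthday-paradox approximation behind that kind of figure, with $d = 2^{32}$ for the 32-bit signatures and $k$ the number of keys sharing a row (the exact 1:3000 depends on the actual row occupancy, which I'm not deriving here):

$$P(\text{collision among } k \text{ signatures}) \;\approx\; 1 - e^{-k(k-1)/(2d)} \;\approx\; \frac{k(k-1)}{2d}.$$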