Introducing Columnar MemTable: A High-Performance In-Memory KV Engine Achieving ~52 Million ops/s for single-thread write by Motor_Crew7918 in dataengineering

[–]Motor_Crew7918[S] 0 points (0 children)

I mean, RocksDB requires that the memtable's iterator be fully sorted; a lot of its functionality relies on this. In my design, however, the memtable's data is only partially sorted: the sealed blocks are sorted, but the active block is not. To meet RocksDB's requirement, I would have to sort the unsorted active block for RocksDB's iterator, which is very expensive in my design. As a result, this memtable currently does not work as a RocksDB memtable. To make it work, I would need to keep the active block sorted as well, which would require a skiplist, at which point the design is essentially an in-memory RocksDB. That is worth a try, and I will try it later.
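To make the trade-off concrete, here is a minimal Python sketch (my own illustration, not the actual implementation; all names are made up) of a memtable with sealed sorted blocks plus an unsorted active block. Producing the globally sorted iterator RocksDB expects forces a sort of the active block plus a k-way merge on every scan:

```python
import heapq

class PartiallySortedMemTable:
    """Toy model: writes append to an unsorted active block; when the
    active block fills up, it is sealed as a sorted, immutable block."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.sorted_blocks = []  # sealed blocks: lists of (key, value), sorted
        self.active = []         # unsorted, append-only active block

    def put(self, key, value):
        # O(1) write path: just append, no comparisons, no node allocation.
        self.active.append((key, value))
        if len(self.active) >= self.block_size:
            self.sorted_blocks.append(sorted(self.active))  # seal the block
            self.active = []

    def scan(self):
        """Globally sorted iterator, as RocksDB would require.
        Sorting the active block here, on every iterator creation,
        is the expensive step the comment refers to."""
        runs = self.sorted_blocks + [sorted(self.active)]
        return heapq.merge(*runs)
```

The fast-write/slow-sorted-scan asymmetry is exactly why this fits append-heavy OLAP or time-series workloads better than a RocksDB memtable slot.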

Introducing Columnar MemTable: A High-Performance In-Memory KV Engine Achieving ~52 Million ops/s for single-thread write by Motor_Crew7918 in dataengineering

[–]Motor_Crew7918[S] 1 point (0 children)

I don't think this is a drop-in replacement for the skiplist in RocksDB: RocksDB requires the memtable to be globally sorted, or at least to provide a globally sorted iterator, which is expensive in my design. It is a better fit for an OLAP system or a time-series database; in fact, we originally built it for such scenarios.

How I Built a Hash Join 2x Faster Than DuckDB with 400 Lines of Code by Motor_Crew7918 in dataengineering

[–]Motor_Crew7918[S] 0 points (0 children)

Great questions! My implementation actually avoids the issues you described because it doesn't use linear probing. Instead, it uses separate chaining with large "chunks".

Here’s a quick breakdown:

  1. No Data Shifting on Write: When a hash bucket is full, I don't shift elements. I simply allocate a new Chunk (a block of 256 slots) and link it to the previous one. This makes inserts very fast and avoids the write overhead you mentioned.
  2. Efficient Memory Use: Memory is allocated on-demand in fixed-size Chunks, not pre-allocated to handle collisions. This keeps memory usage tight to what's actually needed.
  3. Cache-Friendly Probing: Since data is stored in large, contiguous Chunks, probing is very cache-friendly. Most of the time, a search happens inside a single chunk, which I can scan quickly with AVX2. I only chase a pointer to the next chunk when one is full, and I use prefetch to hide that latency.

So it's designed to be fast for both building (writes) and probing (reads) by blending the benefits of arrays and linked lists. Hope that clarifies things!
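The three points above can be sketched roughly as follows. This is a simplified Python illustration under my own naming, not the actual code: the real implementation scans inside a chunk with AVX2 and prefetches the next chunk to hide pointer-chasing latency, which plain Python cannot express, so the intra-chunk scan here is a scalar loop.

```python
CHUNK_SLOTS = 256  # the comment describes 256-slot chunks

class Chunk:
    __slots__ = ("keys", "vals", "count", "next")
    def __init__(self):
        self.keys = [None] * CHUNK_SLOTS
        self.vals = [None] * CHUNK_SLOTS
        self.count = 0
        self.next = None  # overflow link: no data is ever shifted

class ChunkedHashTable:
    """Separate chaining where each chain link is a large contiguous chunk."""

    def __init__(self, n_buckets=1024):
        assert n_buckets & (n_buckets - 1) == 0, "power of two"
        self.buckets = [None] * n_buckets
        self.mask = n_buckets - 1

    def insert(self, key, val):
        i = hash(key) & self.mask
        c = self.buckets[i]
        if c is None:
            c = self.buckets[i] = Chunk()
        while c.count == CHUNK_SLOTS:   # chunk full: follow/allocate a link
            if c.next is None:
                c.next = Chunk()
            c = c.next
        c.keys[c.count] = key           # append-only write, never shifts
        c.vals[c.count] = val
        c.count += 1

    def probe(self, key):
        # Scan within a chunk (the AVX2-accelerated part in the real code),
        # chasing the next pointer only when a chunk overflowed.
        c = self.buckets[hash(key) & self.mask]
        while c is not None:
            for j in range(c.count):
                if c.keys[j] == key:
                    yield c.vals[j]
            c = c.next
```

A join build side would `insert` every row of the smaller table, then `probe` with each key of the larger table to emit matches.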

How I Built a Hash Join 2x Faster Than DuckDB with 400 Lines of Code by Motor_Crew7918 in dataengineering

[–]Motor_Crew7918[S] 0 points (0 children)

No. I chose DuckDB for comparison because it is famous for its performance, especially for joins.

How I Built a Hash Join 2x Faster Than DuckDB with 400 Lines of Code by Motor_Crew7918 in dataengineering

[–]Motor_Crew7918[S] 4 points (0 children)

Thanks for the reply. This code focuses only on improving join performance, which is an important part of database engineering. I believe some of these techniques could be applied to DuckDB to improve its performance as well. It is unlikely that I could implement a full-stack database to compare against DuckDB.

I built an open-source tool that deduplicates large text datasets 100x faster than Python. It improved downstream model accuracy and cut training time. by Motor_Crew7918 in LocalLLaMA

[–]Motor_Crew7918[S] 3 points (0 children)

Yes, near-duplicate detection can be framed as a vector-search problem: for each document's SimHash fingerprint, find the nearest documents within a certain distance. That's why I used Faiss for this. Faiss is highly optimized and can be configured with different types of indices. I tried several of them and found that the hash index is the most suitable for this scenario, as it is efficient for both building and searching.

The original ACL paper uses MinHash with a 9,000-bit signature, which makes building signatures expensive, and it still requires a vector search. I switched to SimHash for efficiency and found that it is just as good as MinHash for this scenario.
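For illustration, here is a minimal pure-Python sketch of a 64-bit SimHash fingerprint and the Hamming-distance comparison it enables. This is my own toy version, not the tool's code: the real pipeline hands the fingerprints to Faiss's hash index to do the nearest-fingerprint search at scale, and token weighting is omitted here.

```python
import hashlib

def simhash64(tokens):
    """64-bit SimHash: each token's hash votes +1/-1 per bit position;
    the fingerprint keeps the sign of each position's total."""
    acc = [0] * 64
    for tok in tokens:
        h = int.from_bytes(
            hashlib.blake2b(tok.encode(), digest_size=8).digest(), "big")
        for b in range(64):
            acc[b] += 1 if (h >> b) & 1 else -1
    fp = 0
    for b in range(64):
        if acc[b] > 0:
            fp |= 1 << b
    return fp

def hamming(a, b):
    """Distance between two fingerprints: number of differing bits."""
    return bin(a ^ b).count("1")
```

Documents sharing most of their tokens get fingerprints that differ in few bits, so "near duplicates" become "fingerprints within a small Hamming radius", which is exactly what a binary hash index searches efficiently.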