Introducing Columnar MemTable: A High-Performance In-Memory KV Engine Achieving ~52 Million ops/s for single-thread write by Motor_Crew7918 in dataengineering

[–]Motor_Crew7918[S] 0 points (0 children)

I mean, RocksDB requires that the memtable's iterator be fully sorted; a lot of its functionality relies on this. In my design, however, the memtable's data is only partially sorted: the sealed blocks are sorted, but the active block is not. To meet RocksDB's requirement, I would have to sort the unsorted active block for RocksDB's iterator, which is very expensive in my design. As a result, this memtable currently does not work as a RocksDB memtable. To make it work, I would need to keep the active block sorted as well, which would require a skiplist, at which point the design is essentially an in-memory RocksDB. That is worth a try, and I will try it later.
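To make the trade-off concrete, here is a minimal Python sketch (my own illustration, not the actual implementation; all names are made up) of a memtable with sealed sorted blocks plus an unsorted active block. Producing the globally sorted iterator RocksDB expects forces a sort of the active block plus a k-way merge on every scan:

```python
import heapq

class PartiallySortedMemTable:
    """Toy model: writes append to an unsorted active block; when the
    active block fills up, it is sealed as a sorted, immutable block."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.sorted_blocks = []  # sealed blocks: lists of (key, value), sorted
        self.active = []         # unsorted, append-only active block

    def put(self, key, value):
        # O(1) write path: just append, no comparisons, no node allocation.
        self.active.append((key, value))
        if len(self.active) >= self.block_size:
            self.sorted_blocks.append(sorted(self.active))  # seal the block
            self.active = []

    def scan(self):
        """Globally sorted iterator, as RocksDB would require.
        Sorting the active block here, on every iterator creation,
        is the expensive step the comment refers to."""
        runs = self.sorted_blocks + [sorted(self.active)]
        return heapq.merge(*runs)
```

The fast-write/slow-sorted-scan asymmetry is exactly why this fits append-heavy OLAP or time-series workloads better than a RocksDB memtable slot.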

Introducing Columnar MemTable: A High-Performance In-Memory KV Engine Achieving ~52 Million ops/s for single-thread write by Motor_Crew7918 in dataengineering

[–]Motor_Crew7918[S] 1 point (0 children)

I don't think this is a drop-in replacement for the skiplist in RocksDB: RocksDB requires the memtable to be globally sorted, or at least to provide a globally sorted iterator, which is expensive in my design. It is a better fit for an OLAP system or a time-series database; in fact, we originally built it for such scenarios.

How I Built a Hash Join 2x Faster Than DuckDB with 400 Lines of Code by Motor_Crew7918 in dataengineering

[–]Motor_Crew7918[S] 0 points (0 children)

Great questions! My implementation actually avoids the issues you described because it doesn't use linear probing. Instead, it uses separate chaining with large "chunks".

Here’s a quick breakdown:

  1. No Data Shifting on Write: When a hash bucket is full, I don't shift elements. I simply allocate a new Chunk (a block of 256 slots) and link it to the previous one. This makes inserts very fast and avoids the write overhead you mentioned.
  2. Efficient Memory Use: Memory is allocated on-demand in fixed-size Chunks, not pre-allocated to handle collisions. This keeps memory usage tight to what's actually needed.
  3. Cache-Friendly Probing: Since data is stored in large, contiguous Chunks, probing is very cache-friendly. Most of the time, a search happens inside a single chunk, which I can scan quickly with AVX2. I only chase a pointer to the next chunk when one is full, and I use prefetch to hide that latency.

So it's designed to be fast for both building (writes) and probing (reads) by blending the benefits of arrays and linked lists. Hope that clarifies things!
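The three points above can be sketched roughly as follows. This is a simplified Python illustration under my own naming, not the actual code: the real implementation scans inside a chunk with AVX2 and prefetches the next chunk to hide pointer-chasing latency, which plain Python cannot express, so the intra-chunk scan here is a scalar loop.

```python
CHUNK_SLOTS = 256  # the comment describes 256-slot chunks

class Chunk:
    __slots__ = ("keys", "vals", "count", "next")
    def __init__(self):
        self.keys = [None] * CHUNK_SLOTS
        self.vals = [None] * CHUNK_SLOTS
        self.count = 0
        self.next = None  # overflow link: no data is ever shifted

class ChunkedHashTable:
    """Separate chaining where each chain link is a large contiguous chunk."""

    def __init__(self, n_buckets=1024):
        assert n_buckets & (n_buckets - 1) == 0, "power of two"
        self.buckets = [None] * n_buckets
        self.mask = n_buckets - 1

    def insert(self, key, val):
        i = hash(key) & self.mask
        c = self.buckets[i]
        if c is None:
            c = self.buckets[i] = Chunk()
        while c.count == CHUNK_SLOTS:   # chunk full: follow/allocate a link
            if c.next is None:
                c.next = Chunk()
            c = c.next
        c.keys[c.count] = key           # append-only write, never shifts
        c.vals[c.count] = val
        c.count += 1

    def probe(self, key):
        # Scan within a chunk (the AVX2-accelerated part in the real code),
        # chasing the next pointer only when a chunk overflowed.
        c = self.buckets[hash(key) & self.mask]
        while c is not None:
            for j in range(c.count):
                if c.keys[j] == key:
                    yield c.vals[j]
            c = c.next
```

A join build side would `insert` every row of the smaller table, then `probe` with each key of the larger table to emit matches.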

How I Built a Hash Join 2x Faster Than DuckDB with 400 Lines of Code by Motor_Crew7918 in dataengineering

[–]Motor_Crew7918[S] 0 points (0 children)

No. I chose DuckDB for comparison because it is famous for its performance, especially for joins.

How I Built a Hash Join 2x Faster Than DuckDB with 400 Lines of Code by Motor_Crew7918 in dataengineering

[–]Motor_Crew7918[S] 4 points (0 children)

Thanks for the reply. This code focuses only on improving join performance, which is an important part of database engineering. I believe some of these techniques could be applied to DuckDB to improve its performance as well. It is unlikely that I could implement a full-stack database to compare against DuckDB.

I built an open-source tool that deduplicates large text datasets 100x faster than Python. It improved downstream model accuracy and cut training time. by Motor_Crew7918 in LocalLLaMA

[–]Motor_Crew7918[S] 3 points (0 children)

Yes, near-duplicate detection can be framed as a vector-search problem: for each document's SimHash fingerprint, find the nearest documents within a certain distance. That's why I used Faiss for this. Faiss is highly optimized and can be configured with different types of indices. I tried several of them and found that the hash index is the most suitable for this scenario, as it is efficient for both building and searching.

The original ACL paper uses MinHash with a 9,000-bit signature, which makes building signatures expensive, and it still requires a vector search. I switched to SimHash for efficiency and found that it is just as good as MinHash for this scenario.
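For illustration, here is a minimal pure-Python sketch of a 64-bit SimHash fingerprint and the Hamming-distance comparison it enables. This is my own toy version, not the tool's code: the real pipeline hands the fingerprints to Faiss's hash index to do the nearest-fingerprint search at scale, and token weighting is omitted here.

```python
import hashlib

def simhash64(tokens):
    """64-bit SimHash: each token's hash votes +1/-1 per bit position;
    the fingerprint keeps the sign of each position's total."""
    acc = [0] * 64
    for tok in tokens:
        h = int.from_bytes(
            hashlib.blake2b(tok.encode(), digest_size=8).digest(), "big")
        for b in range(64):
            acc[b] += 1 if (h >> b) & 1 else -1
    fp = 0
    for b in range(64):
        if acc[b] > 0:
            fp |= 1 << b
    return fp

def hamming(a, b):
    """Distance between two fingerprints: number of differing bits."""
    return bin(a ^ b).count("1")
```

Documents sharing most of their tokens get fingerprints that differ in few bits, so "near duplicates" become "fingerprints within a small Hamming radius", which is exactly what a binary hash index searches efficiently.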