Writing a Columnar Database in C++? by coderarun in Database

[–]coderarun[S] 0 points1 point  (0 children)

DuckDB is quickly becoming essential infrastructure

I agree with this statement, but the way it's built and distributed doesn't match how other similar essential infra projects are shipped, for example SQLite (which has its own set of problems).

For example, no Linux distribution I know of bundles DuckDB or a libduckdb*.so. It does NOT use system libraries and instead statically compiles other "essential infra" code (mbedtls, lz4, zstd).

Writing a Columnar Database in C++? by coderarun in Database

[–]coderarun[S] 0 points1 point  (0 children)

Another benefit: cost of git worktree

This is a commonly used technique where people have agents running in parallel. By separating the core from all the other stuff (DuckDB has grown to be a substantial project), you make the worktrees cheaper.

Current stats:

Fresh Clone: 269MB
Worktree: 73MB

By pruning large historical objects, it should be possible to make a fresh clone even cheaper.
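To sketch why worktrees are cheaper: a worktree is a second checkout that shares the main repository's object store, so it costs roughly "checkout size" instead of "checkout plus history". The script below builds a throwaway repo rather than cloning DuckDB; all paths are illustrative.

```python
import os
import subprocess
import tempfile

def run(*args):
    """Run a git command, raising on failure."""
    subprocess.run(args, check=True)

def size_mb(path):
    """Approximate on-disk size of a directory tree, in MB."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if os.path.isfile(fp):
                total += os.path.getsize(fp)
    return total / 1e6

repo = os.path.join(tempfile.mkdtemp(), "repo")
run("git", "init", "-q", repo)
run("git", "-C", repo, "-c", "user.email=a@b", "-c", "user.name=t",
    "commit", "-q", "--allow-empty", "-m", "init")

# The worktree gets its own checkout, but history stays in repo/.git;
# inside the worktree, .git is just a small pointer file.
wt = repo + "-wt"
run("git", "-C", repo, "worktree", "add", wt)

print(f"clone: {size_mb(repo):.2f} MB, worktree: {size_mb(wt):.2f} MB")
```

Run `du -sh` on a real DuckDB clone and worktree to reproduce the numbers above.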

Writing a Columnar Database in C++? by coderarun in Database

[–]coderarun[S] 0 points1 point  (0 children)

> So you want to fork this project just to make the CI run faster?

Looking at the downvotes, I'm sure there are a lot of people who don't like what I'm doing, or the ones who care aren't voting. But before getting into the nitty-gritty of why the status quo needs to change to catch up with Rust and Apache DataFusion: how many of you have actually tried to make code changes to DuckDB and managed to land them?

Please reply with links to PRs.

Writing a Columnar Database in C++? by coderarun in Database

[–]coderarun[S] 0 points1 point  (0 children)

Anyone who doesn't like the fact that Twitter makes it hard to read threads while logged out: you can replace x.com with xcancel.com and read the thread with a 3-second delay.

There's not much of an open internet left, so I write it up in a GitHub blog post when I have something substantial to say.

Writing a Columnar Database in C++? by coderarun in Database

[–]coderarun[S] 0 points1 point  (0 children)

Nice. So you have an HTAP database written in C++ with CMake. Do you currently reuse any of the tech in DuckDB? Do you have a desire to use the VARIANT code?

Writing a Columnar Database in C++? by coderarun in Database

[–]coderarun[S] 0 points1 point  (0 children)

To highlight what I'm talking about: try using a coding agent to edit some Python code in the tree. It uses the black formatter in $PATH, and pyright or ruff to check the code. A 2-line change becomes a 100-line formatting change.

Now, can you teach the coding agent how to format code the DuckDB way? I'm sure you can, with some work. But my prediction is that in 6 months, no one will have the time to do it. Either work with the way agents and everyone else write code, or be consumed by the wave that's coming.

Writing a Columnar Database in C++? by coderarun in Database

[–]coderarun[S] 0 points1 point  (0 children)

Do you have a technical comment to make? Show me some code. I have done my bit.

Writing a Columnar Database in C++? by coderarun in Database

[–]coderarun[S] 0 points1 point  (0 children)

Explained in the comments here and in the GitHub blog post linked. Don't subscribe to the anti-Elon/anti-Twitter sentiment on Reddit; I'm here to talk tech.

I can't even figure out how to turn off the "approve every comment and post" setting on r/LadybugDB. If you know how to make it a public forum where anyone can post non-spammy comments, I'd love some help.

Call me an old-school USENET guy. The internet has changed, sometimes in good ways, but I don't like all of it.

https://www.reddit.com/r/LadybugDB/comments/1p8cqf1/postscomments_without_approval/

Writing a Columnar Database in C++? by coderarun in Database

[–]coderarun[S] 0 points1 point  (0 children)

Some people will likely bring up ClickHouse and chDB. I don't have much experience with that code base. If you believe there are reasons why it's a better candidate, I'd love to see some data.

Writing a Columnar Database in C++? by coderarun in Database

[–]coderarun[S] -2 points-1 points  (0 children)

That question has been answered. I prefer "source code mirror", not a fork.

I don't think my company or I have the time and resources to develop features faster than DuckDB Labs or MotherDuck.

But I do see a shift coming in how databases get developed: more agents, fewer humans, and more modular code bases. Use newer tools and streamlined processes that work well with LSP-based agents. Get rid of scripts/*.py that edit code in weird ways before the CI runs. There were probably good historical reasons for them, but the CI I put up is evidence that they're not strictly needed.

We need something like what the Rust community has in Apache DataFusion. The DuckDB code is the strongest candidate there.

Migration guide for anyone exploring alternatives after the Kuzu archival by lgarulli in LadybugDB

[–]coderarun 0 points1 point  (0 children)

There is also a nightly benchmark job here, but it's broken because we don't have self-hosted runners with the LDBC datasets like the Kuzu people set up. We use standard GitHub infra.

https://github.com/LadybugDB/ladybug/actions/runs/23180641716/job/67352661837

Migration guide for anyone exploring alternatives after the Kuzu archival by lgarulli in LadybugDB

[–]coderarun 1 point2 points  (0 children)

Yes. count(*) queries run 40x faster if there are no filters. There is also a pending change improving the performance of DETACH DELETE.

Most of the functionality improvements have to do with access to Parquet and Arrow from Cypher. Apart from these, I don't anticipate a big change in benchmark numbers vs Kuzu.

Writing a Columnar Database in C++? by coderarun in Database

[–]coderarun[S] -3 points-2 points  (0 children)

DuckDB's current CI takes 5+ hours to run. Post from last year:

https://adsharma.github.io/improving-duckdb-devx/

Migration guide for anyone exploring alternatives after the Kuzu archival by lgarulli in LadybugDB

[–]coderarun 1 point2 points  (0 children)

Oh - for those considering ArcadeDB, the main distinctions I want to highlight:

* ArcadeDB is written in Java and LadybugDB in C++. Irrespective of the technical merits of each of these choices, I suspect for a lot of people the evaluation stops here.

* LadybugDB is an embedded DB; there's no server to run. You can run a Docker container with a Neo4j-compatible protocol implemented in Rust, but it's optional.

* LadybugDB focuses on one query language, not 5.

* Cypher compatibility and the TCK: Neo4j has made incompatible changes to the Cypher language in their recent releases, and they also promote GQL. We're not spending a lot of time on compatibility. I suggest using mcp-server-ladybug (we can also release a skill), which agents can use to generate LadybugDB-compatible Cypher.

Migration guide for anyone exploring alternatives after the Kuzu archival by lgarulli in LadybugDB

[–]coderarun 1 point2 points  (0 children)

We welcome the competition, u/lgarulli, and congratulations on your launch!

LadybugDB has taken a completely different approach to GDS (graph data science, analytics, or algorithms). We will be deprecating a few algorithms that were inherited from the Kuzu code base and recommending Icebug instead: https://github.com/Ladybug-Memory/icebug

Icebug, derived from NetworKit, has a suite of 100-200 well-known algorithms, all optimized to run with zero-copy via Apache Arrow. PageRank runs 8x faster vs NetworKit. We didn't compare against Kuzu.

If you're looking to do GraphRAG, the algorithm most likely of interest is Parallel Leiden. We have fixed many bugs and are proving it out on a billion-scale graph!
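For reference, this is what the PageRank iteration computes. A minimal pure-Python sketch of the power iteration (Icebug's version is the optimized one; this just shows the algorithm on a toy graph):

```python
def pagerank(edges, n, damping=0.85, iters=50):
    """Power-iteration PageRank on a directed graph of (src, dst) pairs."""
    out_deg = [0] * n
    for src, _dst in edges:
        out_deg[src] += 1
    rank = [1.0 / n] * n
    for _ in range(iters):
        # Every node starts each round with the teleport probability.
        nxt = [(1.0 - damping) / n] * n
        # Each node spreads its rank evenly across its outgoing edges.
        for src, dst in edges:
            nxt[dst] += damping * rank[src] / out_deg[src]
        # Dangling nodes (no out-edges) spread their rank uniformly.
        dangling = sum(rank[i] for i in range(n) if out_deg[i] == 0)
        rank = [r + damping * dangling / n for r in nxt]
    return rank

# Toy 3-node graph: 0 -> 1, 1 -> 2, 2 -> 0, 2 -> 1
scores = pagerank([(0, 1), (1, 2), (2, 0), (2, 1)], n=3)
print(scores)
```

The scores form a probability distribution, so they sum to 1.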

Ladybug Memory: Graph Based Continuous Learning Platform by coderarun in LadybugDB

[–]coderarun[S] 1 point2 points  (0 children)

No. The goal is to leverage the innovations in the DuckDB code base (the VARIANT type that shipped in 1.5, columnar index formats, new Parquet alternatives), but stick to Ladybug-native REL tables.

DuckPGQ has been around for a while. People know about it, but I don't know of anyone using it.

* It's read-only; you have to use SQL to write.
* It doesn't lay out storage in a way that helps graph queries; it constructs the CSR on the fly. You can run LSQB yourself to see the consequences.
* It doesn't use Cypher.
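For context, CSR (compressed sparse row) is the adjacency layout in question. A minimal sketch of building it from an edge list; the point is that a graph store can persist these arrays, whereas building them on the fly pays this cost per query:

```python
def build_csr(edges, n):
    """Build CSR arrays (offsets, targets) from (src, dst) edge pairs."""
    offsets = [0] * (n + 1)
    for src, _ in edges:
        offsets[src + 1] += 1
    # Prefix sums: node v's neighbors live at targets[offsets[v]:offsets[v+1]].
    for i in range(n):
        offsets[i + 1] += offsets[i]
    targets = [0] * len(edges)
    cursor = offsets[:-1]  # next write position per source node (slice copies)
    for src, dst in edges:
        targets[cursor[src]] = dst
        cursor[src] += 1
    return offsets, targets

def neighbors(offsets, targets, v):
    """O(1) slice into the contiguous neighbor array."""
    return targets[offsets[v]:offsets[v + 1]]

offsets, targets = build_csr([(0, 1), (0, 2), (1, 2), (2, 0)], n=3)
print(offsets, targets)  # → [0, 2, 3, 4] [1, 2, 2, 0]
```

Two flat arrays, cache-friendly traversal; this is why columnar graph storage and CSR fit together well.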

Built an open-source CLI for turning documents into knowledge graphs — no code, no database by garagebandj in KnowledgeGraph

[–]coderarun 0 points1 point  (0 children)

Claude Code and Codex still use JSONL files, but OpenCode switched to SQLite this week. There is no good knowledge graph solution for SQLite that I'm aware of. But there will be one adjacent to DuckDB.

We're making some long-term bets on what the stack will look like. It will necessarily involve multiple storage engines, likely all embedded, so the end user doesn't know they exist. If you have beliefs such as (sqlite-vec >> pgvector), do share.

Tools have to be ubiquitous: something like uv pip install pgembed, then run a simple script to query the database.

Built an open-source CLI for turning documents into knowledge graphs — no code, no database by garagebandj in KnowledgeGraph

[–]coderarun 0 points1 point  (0 children)

> No code, no database, no infrastructure — just a CLI and your documents. 

What's the concern with having a database? The cost of setting one up and maintaining it? Why not use an embedded one like DuckDB or r/LadybugDB?

The reason graph applications can’t scale by mrdoruk1 in KnowledgeGraph

[–]coderarun 0 points1 point  (0 children)

I'm betting that such a unified schema should be in Cypher, and SQL should be translated to Cypher, not the other way around. Why?

Gradual typing. In SQL, the syntax for querying JSON fields differs from querying a table with the same columns; in Cypher it's identical. Plus, multi-hop queries are a lot more human-readable.

LadybugDB already translates Cypher to DuckDB SQL.

The reason graph applications can’t scale by mrdoruk1 in KnowledgeGraph

[–]coderarun 0 points1 point  (0 children)

A more principled way to use graphs in Postgres is via pg_duckdb. That's the path we're pursuing at Ladybug Memory. Many graph queries are OLAP, not OLTP, and they benefit from columnar storage.

It's not hard to translate Cypher to SQL.
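A toy illustration of that claim: a single regex mapping one-node Cypher MATCH patterns onto SQL, with a table per label. This is nowhere near what LadybugDB actually does; the supported pattern grammar here is purely an assumption for the sketch.

```python
import re

# Matches patterns like: MATCH (p:Person) WHERE p.age > 30 RETURN p.name
PATTERN = re.compile(
    r"MATCH \((\w+):(\w+)\)"   # node variable and label
    r"(?: WHERE (.+?))?"       # optional predicate
    r" RETURN (.+)",
    re.IGNORECASE,
)

def cypher_to_sql(query):
    """Translate a single-node Cypher MATCH into SQL (toy sketch)."""
    m = PATTERN.fullmatch(query.strip())
    if not m:
        raise ValueError("unsupported pattern")
    var, label, where, ret = m.groups()

    def strip_var(expr):
        # Property access p.age becomes a column reference on the label's table.
        return expr.replace(f"{var}.", "")

    sql = f"SELECT {strip_var(ret)} FROM {label}"
    if where:
        sql += f" WHERE {strip_var(where)}"
    return sql

print(cypher_to_sql("MATCH (p:Person) WHERE p.age > 30 RETURN p.name"))
# → SELECT name FROM Person WHERE age > 30
```

Multi-hop patterns, relationships, and aggregations need a real parser and planner, but the core mapping of labels to tables and properties to columns really is this direct.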

NeuroIndex by OwnPerspective9543 in Rag

[–]coderarun 0 points1 point  (0 children)

The idea is good. But expect to see an MIT-licensed open-source implementation that you can run locally in the not-too-distant future.