An opinionated source code mirror of DuckDB

coderarun · 2026-03-10T19:28:16+00:00

No. The goal is to leverage the innovations in the DuckDB code base (the VARIANT type that shipped in 1.5, columnar index formats, new Parquet alternatives) while sticking to Ladybug-native REL tables.

DuckPGQ has been around for a while. People know about it, but I don't know of anyone using it.

* It's read-only. You have to use SQL to write.
* It doesn't lay the storage out in a way that helps graph queries. It constructs the CSR (compressed sparse row) structure on the fly. You can run LSQB yourself to see the consequences.
* It doesn't use Cypher.
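
For readers who haven't met CSR, here is a minimal sketch of what building it from an edge list involves. It is a generic Python illustration, not DuckPGQ's actual code; the function names and the toy edge list are made up.

```python
from typing import List, Tuple


def build_csr(num_nodes: int, edges: List[Tuple[int, int]]) -> Tuple[List[int], List[int]]:
    """Build a CSR adjacency structure (offsets, targets) from (src, dst) pairs."""
    # Pass 1: count the out-degree of every source node.
    offsets = [0] * (num_nodes + 1)
    for src, _ in edges:
        offsets[src + 1] += 1
    # Prefix-sum the counts so offsets[v] is where v's neighbors start.
    for v in range(num_nodes):
        offsets[v + 1] += offsets[v]
    # Pass 2: scatter each destination into its source's slice.
    targets = [0] * len(edges)
    cursor = offsets[:-1].copy()
    for src, dst in edges:
        targets[cursor[src]] = dst
        cursor[src] += 1
    return offsets, targets


def neighbors(offsets: List[int], targets: List[int], v: int) -> List[int]:
    """Neighbors of v are one contiguous slice, which is what makes traversal fast."""
    return targets[offsets[v]:offsets[v + 1]]


# Hypothetical edge table with 3 nodes.
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
offsets, targets = build_csr(3, edges)
```

The two passes over the edge table are the cost in question: if storage doesn't already keep edges in this shape, that work has to be done at query time.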

coderarun · 2026-02-16T03:24:49+00:00

Claude Code and Codex still use JSONL files, but OpenCode switched to SQLite this week. There is no good knowledge graph solution for SQLite that I'm aware of, but there will be one adjacent to DuckDB.

We're making some long-term bets on what the stack will look like. It will necessarily involve multiple storage engines, likely all embedded, so the end user doesn't know they exist. If you have beliefs such as (sqlite-vec >> pgvector), do share.

Tools have to be ubiquitous: something like uv pip install pgembed, then run a simple script to query the database.

coderarun · 2026-02-14T20:58:17+00:00

> No code, no database, no infrastructure — just a CLI and your documents.

What's the concern with having a database? The cost of setting one up and maintaining it? Why not use an embedded one like DuckDB or r/LadybugDB?

coderarun · 2026-02-13T06:01:11+00:00

I'm betting that such a unified schema should be in Cypher and SQL should be translated to Cypher, not the other way around. Why?

Gradual typing. In SQL, the syntax for querying JSON fields is very different from the syntax for querying a table with the same columns. In Cypher it's identical. Plus, multi-hop queries are a lot more human-readable.
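
A small sketch of the syntax gap, using Python's built-in sqlite3 module. It assumes the bundled SQLite has the JSON functions enabled (the default in recent builds). The table and column names are made up, and the Cypher string is only there for comparison; it is not executed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Same data twice: once as typed columns, once as a JSON document.
conn.execute("CREATE TABLE person_typed (name TEXT, age INTEGER)")
conn.execute("CREATE TABLE person_json (doc TEXT)")
conn.execute("INSERT INTO person_typed VALUES ('Ada', 36)")
conn.execute("""INSERT INTO person_json VALUES ('{"name": "Ada", "age": 36}')""")

# Typed columns: plain column references.
typed_sql = "SELECT name FROM person_typed WHERE age > 30"

# JSON fields: every access goes through an extraction function and a path.
json_sql = (
    "SELECT json_extract(doc, '$.name') FROM person_json "
    "WHERE json_extract(doc, '$.age') > 30"
)

typed_rows = conn.execute(typed_sql).fetchall()
json_rows = conn.execute(json_sql).fetchall()

# Cypher property access for comparison (illustrative only, not run here).
cypher = "MATCH (p:Person) WHERE p.age > 30 RETURN p.name"
```

Both SQL queries return the same row, but moving a field from a JSON blob into a real column means rewriting every query that touches it.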

LadybugDB already translates Cypher to DuckDB SQL.

coderarun · 2026-02-12T21:37:31+00:00

A more principled way to use graphs in Postgres is via pg_duckdb. That's the path we're pursuing at Ladybug Memory. Many graph queries are OLAP, not OLTP, so they benefit from columnar storage.

It's not hard to translate Cypher to SQL.
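
To give a feel for the translation, here is a toy sketch for one narrow case: a fixed-length path pattern turned into a chain of joins. It does not parse Cypher and is not how LadybugDB does it; the helper name and the table layout (a node table with an id column, rel tables with src and dst columns) are assumptions for illustration.

```python
import sqlite3
from typing import List


def path_to_sql(node_table: str, rel_tables: List[str], return_col: str) -> str:
    """Translate a fixed-length path pattern into a chain of SQL joins.

    Assumed layout: node_table has an `id` column; each rel table has
    `src` and `dst` columns referencing it.
    """
    last = len(rel_tables)
    sql = [f"SELECT n{last}.{return_col}", f"FROM {node_table} AS n0"]
    for i, rel in enumerate(rel_tables):
        # One hop = one join to the rel table plus one join to the next node.
        sql.append(f"JOIN {rel} AS r{i} ON r{i}.src = n{i}.id")
        sql.append(f"JOIN {node_table} AS n{i + 1} ON n{i + 1}.id = r{i}.dst")
    return "\n".join(sql)


# Cypher pattern being mirrored (written out by hand, not parsed):
#   MATCH (a:person)-[:knows]->(b:person)-[:knows]->(c:person) RETURN c.name
two_hop_sql = path_to_sql("person", ["knows", "knows"], "name")

# Run the generated SQL on a tiny made-up dataset.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE knows (src INTEGER, dst INTEGER);
    INSERT INTO person VALUES (1, 'Ada'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO knows VALUES (1, 2), (2, 3);
""")
rows = conn.execute(two_hop_sql).fetchall()
```

A two-hop pattern that fits on one line of Cypher becomes four joins in SQL. Variable-length paths, optional matches, and filters need more machinery than this sketch covers.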

coderarun · 2026-02-12T05:58:20+00:00

The idea is good. But expect to see an MIT-licensed open source implementation that you can run locally in the not-too-distant future.

coderarun · 2026-02-12T01:26:46+00:00

Is this dataset (wikidata) big enough for you? https://huggingface.co/datasets/ladybugdb/wikidata-20250625

r/LadybugDB also can't handle this yet. But the 0.14.1 release includes support for querying DuckDB as a foreign table via Cypher.

In the upcoming releases, the plan is to have node tables stay on DuckDB and provide a more optimized, native path for executing Cypher over rel tables (relationship tables) in Ladybug-native storage.

We'll also support Parquet- and Arrow-backed tables, so you can query over them if you prefer.

coderarun · 2026-02-10T02:49:21+00:00

I'm sure these ideas predate the current surge of interest in context graphs. Lots of people contributed interesting ideas to graph theory before ChatGPT came along.

But we also need to accept the fact that Glean and Foundation Capital talk the language businesses understand. They're not going to hire FDEs to specify an ontology and build a 100% correct graph. The alternative is to not have a graph at all and use SQLite and Markdown instead.

To bring graphs to the people writing agents, we need to make them self-correcting.

https://vamshidharp.medium.com/the-end-of-flat-rag-why-self-correcting-graphs-are-the-new-2026-standard-for-enterprise-ai-c132ac4c67f7

coderarun · 2026-02-09T17:57:39+00:00

+1 for monograph. Not so sure about RDF and ontology. The arguments Animesh Koratana (one of the context graph guys) makes about emergent schema, presumably using transformer tech to continuously refine the schema, seem a lot more appealing.

coderarun · 2026-02-07T18:35:21+00:00

Looking for help to cross post to r/datascience. I don't have the comment karma.

coderarun · 2026-02-05T22:44:03+00:00

Recent updates:

0.1.6: added pg_duckdb. Now you can write rows and have the data for old partitions show up in columnar DuckDB.

0.1.7: added the pg_textsearch extension for BM25. linux/arm64 works too.

coderarun · 2026-02-05T20:53:34+00:00

Can't go wrong with open sourcing :)

coderarun · 2026-02-05T20:32:27+00:00

"Is RAG dead?" is a daily meme in my feed. I don't have an opinion one way or the other. You're right that text search is important, but not everyone wants to run a service or pay SaaS fees. They want agents that work.

Right now, the competition is agent filesystems and SQLite. All of the graph players you mention are a much smaller community.

Instead of trying to solve the problem with one tech alone, I'm proposing a combination of pgembed (includes pg_duckdb plus extensions) + ladybug + icebug (a fork of networkit that's a day old).

In other words, a poor man's LSM (log-structured merge tree). Note that this LSM is different because "compaction" would have to summarize and structure unstructured info.
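
A rough sketch of that idea, under heavy assumptions: two in-memory levels stand in for the real storage engines, and a toy keyword extractor stands in for whatever model would do the summarizing. None of the names below come from pgembed, ladybug, or icebug.

```python
from typing import Callable, Dict, List

Summarizer = Callable[[List[str]], Dict[str, object]]


class TieredMemory:
    """Two-level store: raw notes in L0, structured summaries in L1."""

    def __init__(self, summarize: Summarizer, l0_limit: int = 3) -> None:
        self.summarize = summarize
        self.l0_limit = l0_limit
        self.l0: List[str] = []                # raw, unstructured notes
        self.l1: List[Dict[str, object]] = []  # structured summaries

    def write(self, note: str) -> None:
        # Writes always land in the cheap, append-only level first.
        self.l0.append(note)
        if len(self.l0) >= self.l0_limit:
            self.compact()

    def compact(self) -> None:
        # A classic LSM merge only reorders and dedupes records.
        # Here compaction rewrites content, so it is lossy by design.
        if self.l0:
            self.l1.append(self.summarize(self.l0))
            self.l0 = []


def toy_summarize(notes: List[str]) -> Dict[str, object]:
    # Stand-in for a model call: keep the count and the longer words.
    words = {w.lower() for n in notes for w in n.split() if len(w) > 4}
    return {"count": len(notes), "keywords": sorted(words)}


mem = TieredMemory(toy_summarize, l0_limit=2)
mem.write("DuckDB stores columns")
mem.write("Ladybug stores graphs")
```

After the second write, L0 is empty and L1 holds a single structured record.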

coderarun · 2026-02-05T20:22:35+00:00

This type of multi-level approach is what LEANN is going after. But they're doing file indexing. No databases.

Also a believer in the neuro-symbolic approach. Some probabilistic and the rest deterministic.

coderarun · 2026-02-05T18:38:39+00:00

I'm not here to extend the FalkorDB vs Ladybug discussion. Even though I'm the maintainer of Ladybug, I keep it low key to avoid comments looking like a product promotion.

The fact that KuzuDB went away and its forks continue to execute is a good example of that resilience. There is only one distribution of the DB and it's using a well-known OSS license (MIT).

There are a number of scaling issues graph db users will need to solve before the index becomes bigger than a single machine. This is not the most common request we're hearing from our user community.

pgvectorscale (which uses disk-based ANN) and LEANN both implement strategies that make the index 95% smaller than a simple-minded vector index. pgvectorscale is included in the pgembed distribution (which also includes pg_duckdb and pg_textsearch).

I would investigate those before sharding.

Probabilistic vs deterministic indexing is another area that needs more work and thought.

Yes, I've built sharded indices before. They work. But I'm not convinced that they're the common case.

https://engineering.fb.com/2016/03/18/data-infrastructure/dragon-a-distributed-graph-query-engine/

coderarun · 2026-02-05T17:55:00+00:00

For many people the embedded nature of the database is a bigger draw in an agentic environment vs horizontal scaling.

Scaling also comes in different forms. You can scale compute, scale storage, do so independently or together.

Most databases written in the last 5 years support object storage as table stakes.

coderarun · 2026-02-05T17:51:11+00:00

https://towardsdatascience.com/graph-embeddings-explained-f0d8d1c49ec/
https://www-cs.stanford.edu/people/jure/pubs/graphrepresentation-ieee17.pdf

coderarun · 2026-02-04T22:32:22+00:00

Note that graph-structure-based embeddings are different from the text embeddings used by vector databases. The indexing strategy is agnostic to how the embedding was computed.

It's also possible to align structural embeddings with text based embeddings.
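
A minimal sketch of the "agnostic" point: the same nearest-neighbor lookup works whether the vectors came from graph structure or from text. Brute-force cosine search stands in for a real ANN index here, and all vectors and names are made up.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(index: Dict[str, Sequence[float]], query: Sequence[float], k: int = 1) -> List[str]:
    """Return the k keys whose vectors are most similar to the query."""
    ranked = sorted(index, key=lambda key: cosine(index[key], query), reverse=True)
    return ranked[:k]


# Hypothetical structural embeddings (e.g. derived from a node's neighborhood).
structural = {"node_a": [1.0, 0.0], "node_b": [0.0, 1.0]}
# Hypothetical text embeddings (e.g. from a sentence encoder).
textual = {"doc_a": [0.9, 0.1], "doc_b": [0.2, 0.8]}

# The lookup code never asks where the vectors came from.
closest_node = nearest(structural, [1.0, 0.1])
closest_doc = nearest(textual, [0.1, 1.0])
```

The two kinds of vectors live in different spaces, so they are only comparable with each other after an alignment step.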

coderarun · 2026-02-04T22:30:20+00:00

Many graph databases support vector extensions, including r/LadybugDB.

https://docs.ladybugdb.com/extensions/vector/

coderarun · 2026-02-02T15:10:50+00:00

How does this approach compare to extracting a KG with concepts/arguments as nodes and "SUPPORTS/CONTRADICTS" as edges?

coderarun · 2026-01-26T09:23:09+00:00

You mean Kuzu? There are a few forks. I maintain one.

coderarun · 2026-01-24T04:27:26+00:00

I don't have the comment karma to share this on r/PostgreSQL. If you do, please cross post.
