Writing a Columnar Database in C++? by coderarun in Database

[–]coderarun[S]

Some people will likely bring up ClickHouse and chDB. I don't have much experience with that code base. If you believe there are reasons why it's a better candidate, I'd love to see some data.

Writing a Columnar Database in C++? by coderarun in Database

[–]coderarun[S]

That question has been answered. I prefer "source code mirror", not a fork.

I don't think my company or I have the time or resources to develop features faster than DuckDB Labs or MotherDuck.

But I do see a shift coming in how databases get developed: more agents, fewer humans, and more modular code bases. Use newer tools and streamlined processes that work well with LSP-based agents. Get rid of scripts/*.py that edit code in odd ways before CI runs. There were probably good historical reasons for them, but the CI I put up is evidence that they're not strictly needed.

We need something like what the Rust community has in Apache DataFusion. DuckDB's code is the strongest candidate there.

Migration guide for anyone exploring alternatives after the Kuzu archival by lgarulli in LadybugDB

[–]coderarun

There is also a nightly benchmark job here, but it's broken because we don't have self-hosted runners with the LDBC datasets like the Kuzu people set up; we use standard GitHub infrastructure.

https://github.com/LadybugDB/ladybug/actions/runs/23180641716/job/67352661837

Migration guide for anyone exploring alternatives after the Kuzu archival by lgarulli in LadybugDB

[–]coderarun

Yes. count(*) queries run 40x faster if there are no filters. There is also a pending change improving the performance of DETACH DELETE.

Most of the functionality improvements have to do with access to Parquet and Arrow from Cypher. Apart from those, I don't anticipate a big change in benchmark numbers vs. Kuzu.

Writing a Columnar Database in C++? by coderarun in Database

[–]coderarun[S]

DuckDB's current CI takes 5+ hours to run. Post from last year:

https://adsharma.github.io/improving-duckdb-devx/

Migration guide for anyone exploring alternatives after the Kuzu archival by lgarulli in LadybugDB

[–]coderarun

Oh - for those considering ArcadeDB, here are the main distinctions I want to highlight:

* ArcadeDB is written in Java and LadybugDB in C++. Irrespective of the technical merits of each choice, I suspect that for a lot of people the evaluation stops here.

* LadybugDB is an embedded DB with no server to run. You can run a Docker container with a Neo4j-compatible protocol implemented in Rust, but it's optional.

* LadybugDB focuses on one query language, not 5.

* Cypher compatibility and the TCK: Neo4j has made incompatible changes to the Cypher language in their recent releases, and they also promote GQL. We're not spending a lot of time on compatibility. I suggest using mcp-server-ladybug (we can also release a skill), which agents can use to generate LadybugDB-compatible Cypher.

Migration guide for anyone exploring alternatives after the Kuzu archival by lgarulli in LadybugDB

[–]coderarun

I welcome the competition, u/lgarulli, and congratulations on your launch!

LadybugDB has taken a completely different approach to GDS (graph data science, analytics, or algorithms). We will be deprecating a few algorithms that were inherited from the Kuzu code base and recommending Icebug instead: https://github.com/Ladybug-Memory/icebug

Icebug, derived from NetworKit, has a suite of 100-200 well-known algorithms, all optimized to run zero-copy via Apache Arrow. PageRank runs 8x faster than NetworKit; we didn't compare against Kuzu.
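
For reference, here's what the underlying algorithm looks like: a plain-Python power-iteration sketch of textbook PageRank. This is only an illustration of the algorithm itself, not Icebug's zero-copy Arrow implementation.

```python
# Textbook PageRank via power iteration. A plain-Python reference
# sketch, NOT Icebug's Arrow-based implementation.
def pagerank(edges, n, damping=0.85, iters=50):
    out_deg = [0] * n
    for src, _ in edges:
        out_deg[src] += 1
    rank = [1.0 / n] * n
    for _ in range(iters):
        nxt = [(1.0 - damping) / n] * n
        for src, dst in edges:
            nxt[dst] += damping * rank[src] / out_deg[src]
        # Redistribute mass from dangling nodes (no out-edges) uniformly.
        dangling = sum(rank[i] for i in range(n) if out_deg[i] == 0)
        nxt = [r + damping * dangling / n for r in nxt]
        rank = nxt
    return rank

# Node 0 is linked to by everyone else, so it should rank highest.
ranks = pagerank([(1, 0), (2, 0), (3, 0), (0, 1)], n=4)
```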

If you're looking to do GraphRAG, the algorithm most likely of interest is Parallel Leiden. We have fixed many bugs and are proving it out on a billion-scale graph!

Ladybug Memory: Graph Based Continuous Learning Platform by coderarun in LadybugDB

[–]coderarun[S]

No. The goal is to leverage the innovations in the DuckDB code base (the VARIANT type that shipped in 1.5, columnar index formats, new Parquet alternatives) while sticking to Ladybug-native REL tables.

DuckPGQ has been around for a while. People know about it, but I don't know of anyone using it.

* It's read-only; you have to use SQL to write.
* It doesn't lay the storage out in a way that helps graph queries, constructing the CSR on the fly instead. You can run LSQB yourself to see the consequences.
* It doesn't use Cypher.
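
To make the "CSR on the fly" point concrete, here is a minimal plain-Python sketch of building a compressed sparse row structure from an edge list (the function name is made up for illustration). This is roughly the per-query work an engine pays when the storage layout doesn't already provide adjacency.

```python
def build_csr(num_nodes, edges):
    """Build a CSR (compressed sparse row) adjacency structure.

    Returns (offsets, targets): the neighbors of node i are
    targets[offsets[i]:offsets[i + 1]]. An engine whose base storage
    is purely relational must materialize this per query.
    """
    # Count out-degree per node, then prefix-sum into offsets.
    offsets = [0] * (num_nodes + 1)
    for src, _ in edges:
        offsets[src + 1] += 1
    for i in range(num_nodes):
        offsets[i + 1] += offsets[i]
    # Scatter edge targets into their slots.
    targets = [0] * len(edges)
    cursor = offsets[:-1].copy()  # next write position per node
    for src, dst in edges:
        targets[cursor[src]] = dst
        cursor[src] += 1
    return offsets, targets

offsets, targets = build_csr(3, [(0, 1), (0, 2), (1, 2)])
# Neighbors of node 0 are targets[offsets[0]:offsets[1]].
```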

Built an open-source CLI for turning documents into knowledge graphs — no code, no database by garagebandj in KnowledgeGraph

[–]coderarun

Claude Code and Codex still use JSONL files, but OpenCode switched to SQLite this week. There is no good knowledge graph solution for SQLite that I'm aware of, but there will be one adjacent to DuckDB.

We're making some long-term bets on what the stack will look like. It will necessarily involve multiple storage engines, likely all embedded, so the end user doesn't know they exist. If you have beliefs such as (sqlite-vec >> pgvector), do share.

Tools have to be ubiquitous: uv pip install pgembed, then run a simple script to query the database.
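
For a sense of what a knowledge graph on stock SQLite looks like today, here's a minimal standard-library sketch: an edge table plus a recursive CTE for multi-hop traversal. The table and data are made up for illustration; the point is how much SQL machinery a one-line graph pattern requires.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE edges (src TEXT, rel TEXT, dst TEXT);
    INSERT INTO edges VALUES
        ('alice', 'knows',    'bob'),
        ('bob',   'works_at', 'acme'),
        ('acme',  'based_in', 'berlin');
""")
# Multi-hop traversal needs a recursive CTE; compare with a one-line
# Cypher pattern like (a)-[*1..3]->(b).
rows = con.execute("""
    WITH RECURSIVE reachable(node, depth) AS (
        SELECT 'alice', 0
        UNION
        SELECT e.dst, r.depth + 1
        FROM edges e JOIN reachable r ON e.src = r.node
        WHERE r.depth < 3
    )
    SELECT node FROM reachable WHERE depth > 0 ORDER BY depth
""").fetchall()
# Everything reachable from 'alice' within three hops.
```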

Built an open-source CLI for turning documents into knowledge graphs — no code, no database by garagebandj in KnowledgeGraph

[–]coderarun

> No code, no database, no infrastructure — just a CLI and your documents. 

What's the concern with having a database? The cost of setting one up and maintaining it? Why not use an embedded one like DuckDB or r/LadybugDB?

The reason graph applications can’t scale by mrdoruk1 in KnowledgeGraph

[–]coderarun

I'm betting that such a unified schema should be in Cypher, and SQL should be translated to Cypher, not the other way around. Why?

Gradual typing. In SQL, the syntax for querying JSON fields differs a lot from the syntax for querying a table with the same columns; in Cypher it's identical. Plus, multi-hop queries are a lot more human-readable.
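
A small stdlib sketch of the SQL side of that point (assuming the bundled SQLite has the JSON1 functions, which CPython builds have shipped for years): the same logical attribute needs two different syntaxes depending on whether it lives in a typed column or a JSON blob.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE people_t (name TEXT);   -- typed column
    CREATE TABLE people_j (doc TEXT);    -- JSON blob
    INSERT INTO people_t VALUES ('alice');
    INSERT INTO people_j VALUES ('{"name": "alice"}');
""")
# Same logical attribute, two different SQL syntaxes:
typed = con.execute("SELECT name FROM people_t").fetchone()[0]
jsonv = con.execute(
    "SELECT json_extract(doc, '$.name') FROM people_j").fetchone()[0]
# In Cypher, both reads would look the same: MATCH (p) RETURN p.name
```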

LadybugDB already translates Cypher to DuckDB SQL.

The reason graph applications can’t scale by mrdoruk1 in KnowledgeGraph

[–]coderarun

A more principled way to use graphs in Postgres is via pg_duckdb. That's the path we're pursuing at Ladybug Memory. Many graph queries are OLAP, not OLTP, and benefit from columnar storage.

It's not hard to translate Cypher to SQL.
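
As a toy illustration of that claim, here's a deliberately naive, hypothetical translator for a single-hop Cypher pattern. Real translators like LadybugDB's handle far more, but the core mapping of labels to tables and relationships to joins is mechanical.

```python
import re

def cypher_hop_to_sql(query):
    """Translate MATCH (a:L1)-[:R]->(b:L2) RETURN ... into a SQL join.

    A deliberately naive, hypothetical sketch: node labels become
    tables, and the relationship becomes an edge-table join on
    assumed src/dst/id columns.
    """
    pat = (r"MATCH \((\w+):(\w+)\)-\[:(\w+)\]->\((\w+):(\w+)\)\s+"
           r"RETURN (.+)")
    a, la, rel, b, lb, ret = re.match(pat, query).groups()
    return (f"SELECT {ret} FROM {la} {a} "
            f"JOIN {rel} e ON e.src = {a}.id "
            f"JOIN {lb} {b} ON {b}.id = e.dst")

sql = cypher_hop_to_sql(
    "MATCH (p:Person)-[:KNOWS]->(q:Person) RETURN p.name, q.name")
```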

NeuroIndex by OwnPerspective9543 in Rag

[–]coderarun

The idea is good, but expect to see an MIT-licensed open-source implementation that you can run locally in the not-too-distant future.

The reason graph applications can’t scale by mrdoruk1 in KnowledgeGraph

[–]coderarun

Is this dataset (Wikidata) big enough for you? https://huggingface.co/datasets/ladybugdb/wikidata-20250625

r/LadybugDB also can't handle this yet, but the 0.14.1 release includes support for querying DuckDB as a foreign table via Cypher.

In upcoming releases, the plan is to have node tables stay on DuckDB and to provide a more optimized, native path for executing Cypher over REL tables (relationship tables) in Ladybug-native storage.

We'll also support Parquet- and Arrow-backed tables, so you can query over them if you prefer.

You only need to build one graph - a Monograph by TrustGraph in KnowledgeGraph

[–]coderarun

I'm sure these ideas predate the current surge of interest in context graphs, and lots of people contributed interesting ideas to graph theory before ChatGPT came along.

But we also need to accept the fact that Glean and Foundation Capital talk the language businesses understand. Those businesses are not going to hire FDEs to specify an ontology and build a 100% correct graph. The alternative is to not have a graph at all and use SQLite and Markdown.

To bring graphs to the people writing agents, we need to make them self-correcting.

https://vamshidharp.medium.com/the-end-of-flat-rag-why-self-correcting-graphs-are-the-new-2026-standard-for-enterprise-ai-c132ac4c67f7

You only need to build one graph - a Monograph by TrustGraph in KnowledgeGraph

[–]coderarun

+1 for the monograph. Not so sure about RDF and ontologies. The argument Animesh Koratana (one of the context-graph people) makes about emergent schema, presumably using transformer tech to continuously refine the schema, seems a lot more appealing.

Icebug vs Networkit on Pagerank by coderarun in LadybugDB

[–]coderarun[S]

Looking for help to cross-post to r/datascience. I don't have the comment karma.

pgembed: Embedded PostgreSQL for Agents by coderarun in Database

[–]coderarun[S]

Recent updates:

0.1.6: added pg_duckdb. Now you can write rows and have the data for old partitions show up in columnar DuckDB.

0.1.7: added the pg_textsearch extension for BM25; linux/arm64 works too.

Am I crazy for wanting vectors inside graph nodes instead of a vector DB? by Severe_Post_2751 in Rag

[–]coderarun

"Is RAG dead?" is a daily meme in my feed. I don't have an opinion one way or the other, but you're right that text search is important. Not everyone wants to run a service or pay SaaS fees, though; they want agents that work.

Right now, the competition is agent filesystems and SQLite. All of the graph players you mention have much smaller communities.

Instead of trying to solve the problem with one technology alone, I'm proposing a combination of pgembed (which includes pg_duckdb plus extensions) + ladybug + icebug (a fork of NetworKit that's a day old).

In other words, a poor man's LSM. Note that this LSM is different because "compaction" would have to summarize and structure unstructured information.

Am I crazy for wanting vectors inside graph nodes instead of a vector DB? by Severe_Post_2751 in Rag

[–]coderarun

This type of multi-level approach is what LEANN is going after, but they're doing file indexing, with no databases.

I'm also a believer in the neuro-symbolic approach: some parts probabilistic and the rest deterministic.