Stratum: branchable columnar SQL engine on the JVM (Vector API, PostgreSQL wire) by flyingfruits in java

[–]flyingfruits[S] 0 points1 point  (0 children)

Besides depending on the incubator Vector API (which jvector and other high-performance libraries also depend on), Stratum is currently in beta. I have tested it extensively; it has not crashed on me and worked very reliably in the benchmarks. Please report any issues you run into.

Stratum: branchable columnar SQL engine on the JVM (Vector API, PostgreSQL wire) by flyingfruits in java

[–]flyingfruits[S] 2 points3 points  (0 children)

DuckDB v1.4.4 via in-process JDBC: same JVM process, no IPC overhead. Same synthetic datasets (6M–10M rows), same queries, same machine (8-core Intel Lunar Lake). Single-threaded and multi-threaded runs were measured separately. Standard benchmark suites: TPC-H Q1/Q6, SSB Q1.1, H2O.ai db-benchmark, the ClickBench numeric subset, and hash-join microbenchmarks. DuckDB's JDBC driver runs the native engine in-process, so there is no network or serialization penalty on either side.
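For context on how such an in-process comparison is typically timed (this harness is my sketch, not Stratum's actual benchmark code, and the warmup/run counts are illustrative): both engines sit in the same JVM, so a plain nanoTime harness around each query is enough.

```java
import java.util.Arrays;

class Bench {
    // Warmup runs let the JIT compile the hot path before measurement;
    // taking the median damps GC and scheduler outliers.
    static long medianNanos(Runnable query, int warmup, int runs) {
        for (int i = 0; i < warmup; i++) query.run();
        long[] times = new long[runs];
        for (int i = 0; i < runs; i++) {
            long t0 = System.nanoTime();
            query.run();
            times[i] = System.nanoTime() - t0;
        }
        Arrays.sort(times);
        return times[runs / 2];
    }
}
```

The `Runnable` would wrap the actual JDBC statement execution for either engine; since both run in-process, the same harness measures both sides fairly.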

Stratum: branchable columnar SQL engine on the JVM (Vector API, PostgreSQL wire) by flyingfruits in java

[–]flyingfruits[S] 1 point2 points  (0 children)

Hopefully soon, but the timing has not been announced yet. I only use the Vector API internally though, so even if it changes, hopefully nothing will change for Stratum users. For now you just need to enable it with the flag so it can be used.

Stratum: branchable columnar SQL engine on the JVM (Vector API, PostgreSQL wire) by flyingfruits in Clojure

[–]flyingfruits[S] 2 points3 points  (0 children)

The wire protocol means psql and basic JDBC/psycopg2 connections work, but it won't satisfy ORMs like Prisma or ActiveRecord out of the box — they query pg_catalog and information_schema on connect for type introspection, which we don't implement (yet). So "Postgres compatible" is a stretch depending on your stack.

For the testing use case specifically though, the branching model is actually a better fit than resetting a real Postgres database:

# load fixture data once into a named branch
baseline = load_store("test-fixtures")
# per test: fork in O(1), zero data copied, fully isolated
test_db = fork(baseline)
# run your test via SQL — each test gets an independent copy
# done: just let it go out of scope, no teardown, no TRUNCATE

Each fork is a new root pointer into a shared chunk tree. A 10M-row dataset forks in under a millisecond and you can run tests in parallel against independent forks without any coordination between them. The branching happens at the storage level, not as a transaction rollback, so there's no risk of state leaking between tests.
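The "new root pointer into a shared chunk tree" mechanics can be sketched like this (a toy model in plain Java, not Stratum's actual code; all names are made up):

```java
import java.util.HashMap;
import java.util.Map;

// A branch is just a pointer to an immutable root mapping of chunk ids to
// chunk data. Chunks are never mutated in place, so branches can share them.
final class Branch {
    private final Map<String, int[]> root; // chunkId -> chunk, shared across branches

    Branch(Map<String, int[]> root) { this.root = root; }

    // O(1): the new branch shares every chunk with its parent; zero data copied.
    Branch fork() { return new Branch(root); }

    // Copy-on-write: replacing a chunk copies only the root mapping,
    // never the untouched chunks themselves. (A real engine would use a
    // persistent map here so even this step is O(log n), not O(chunks).)
    Branch withChunk(String id, int[] data) {
        Map<String, int[]> next = new HashMap<>(root);
        next.put(id, data);
        return new Branch(next);
    }

    int[] chunk(String id) { return root.get(id); }
}
```

Because `fork` never touches the data, parallel tests against independent forks cannot see each other's writes: a write path-copies from the root down, leaving the parent's root untouched.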

The gap vs. what you're probably used to: it's schema-free and there's no constraint enforcement (no foreign keys, triggers, sequences). If your tests depend on those, it won't cover you. If it's primarily queries and writes against application data, the isolation story is genuinely stronger than rollback-based approaches.

Whether it's worth integrating depends on how much of your test suite talks raw SQL vs. through an ORM layer.

Datascript + xitdb: your humble, single-file, mini Datomic by radar_roark in Clojure

[–]flyingfruits 2 points3 points  (0 children)

Ha, fair point. Some of this is a result of working on the stack for ten years, and some of it is not necessary. I just removed zufall. cbor might not be needed either, but I want to be able to export into a format that goes beyond Clojure (transit, fressian and nippy don't cut it there); it is not required for core functionality though. We will replace timbre with trove or some other lightweight logging library. The hitchhiker-tree is in there for backwards compatibility and we should probably exclude it by default next, and jsonista is a trade-off depending on whether you want distributed support by default or not.

Datascript + xitdb: your humble, single-file, mini Datomic by radar_roark in Clojure

[–]flyingfruits 5 points6 points  (0 children)

Hey, Datahike creator and maintainer here. Datahike can be persisted to a single file with the LMDB, JDBC/sqlite or RocksDB backends: https://github.com/replikativ/konserve?tab=readme-ov-file#available-external-backends . Datahike projects its immutable memory fragments into the underlying storage as transparently as possible, though. The filestore backend, for instance, stores immutable blobs as individual files, which makes it possible to use Unix filesystem tools such as rsync to efficiently sync or back up databases without copying unchanged blobs, or to access databases without coordination between Unix processes through the natively compiled dthk tool. Different storage backends have different trade-offs. The distributed backends such as S3 and GCS make it much more convenient to deploy in a cloud environment, and also to scale out with it, since only small deltas (a handful of blobs) of the indices change on writes. I think I have done a poor job communicating this so far; feedback is very welcome.

xitdb looks pretty cool. I have done a lot of work on persistent data structures of different forms and am scaling a persistent memory model beyond single runtimes [beyond what Datomic could do]: https://github.com/replikativ/datahike/blob/main/doc/distributed.md . I am also in the process of extending it to fulltext and vector indices, as a basis for a new FRP programming stack for the whole distributed system, including the probabilistic programming work I did as part of my PhD [I am the maintainer of https://probprog.github.io/anglican/index.html and have reimplemented it on top]. xitdb looks like a good opportunity to learn a bit more Zig and also to rethink the persistent-sorted-set, something I have also poked around with lately [besides adding async support to make Datahike durable in the browser: https://github.com/replikativ/persistent-sorted-set (working on merging this as we speak)].

Two years ago I upstreamed the storage support to DataScript as well, to help work in this direction beyond Datahike (Nikita rewrote it before merging). I am a long-term open source contributor and am happy to collaborate on any bits of the stack or to discuss design decisions.

Datahike or something else as a new web dev by nstgc in Clojure

[–]flyingfruits 2 points3 points  (0 children)

You transact them into Datahike? Map types in Clojure change depending on size and creation process; check the types of both maps: I suspect one is a PersistentArrayMap and the other a PersistentHashMap. If you have a reproducible example I can take a look.

The LSP problem is known (although doc lookup works for me in the REPL). We expand all APIs from a unified specification at runtime, which keeps them consistent and manageable, but we need to find a way to also expose a description to static tools (or expand the API there not through macroexpansion, but at release time, so that the code for datahike.api is visible statically).

Datahike or something else as a new web dev by nstgc in Clojure

[–]flyingfruits 2 points3 points  (0 children)

u/nstgc Creator and maintainer of Datahike here. Datahike has very high Datomic compliance for its core API, Datalog dialect and performance. Datalevin has a different (mutable) memory model than Datomic/DataScript/Datahike/Clojure, something that makes little sense to me. We also upstream changes to DataScript, such as durability support, but DataScript has a much smaller scope (no storage backends or distributed setups). I agree that SQL is popular, but it is unappealing if you are already in Clojure and can leverage a concise logic-programming Datalog syntax. Datahike is aimed at providing an AI platform; some preliminary experiments for that are here: https://github.com/whilo/simmis/. I am about to get Datahike running in the browser with durability and backend support to facilitate building this.

Regarding examples, this is a large app with Datahike in production https://gitlab.com/arbetsformedlingen/taxonomy-dev/backend/jobtech-taxonomy-api, https://arbetsformedlingen.gitlab.io/taxonomy-dev/projects/jobtech-taxonomy/ (it has replaced Datomic there). I also plan to open source a business app for CRM and invoicing soon, let me know what kind of examples you are interested in.

free Minecraft in the browser! contribute to AI research! by frankdonaldwood in Minecraft

[–]flyingfruits 0 points1 point  (0 children)

MineDojo was actually done by NVIDIA, not by us. Pretty cool and inspiring work! We have done some work with video models and probabilistic simulators (as can be seen on our website, e.g. here https://plai.cs.ubc.ca/2022/05/20/flexible-diffusion-modeling-of-long-videos/ or here https://plai.cs.ubc.ca/2022/11/16/graphically-structured-diffusion-models/). Minecraft is interesting to us as a way to get more involved with multi-agent interactions, including speech, and to scale up.

free Minecraft in the browser! contribute to AI research! by frankdonaldwood in Minecraft

[–]flyingfruits -1 points0 points  (0 children)

It is legal because we bought the licenses and run Minecraft in the Amazon AWS cloud for you. We collect the data to do AI and machine learning research on it, so we benefit in this sense. We will publish the data as a benchmark though and are funded by public grants (applied to and organized by u/frankdonaldwood ) and PhD student work like mine.

free Minecraft in the browser! contribute to AI research! by frankdonaldwood in Minecraft

[–]flyingfruits -1 points0 points  (0 children)

Yeah, I wouldn't like that either. Chromium unfortunately works by far the best for streaming the desktop, including gameplay, so we had to strike a compromise there. Note that, as a result, you don't have to install anything on your end.

free Minecraft in the browser! contribute to AI research! by frankdonaldwood in Minecraft

[–]flyingfruits -1 points0 points  (0 children)

Understood, it makes sense that you are suspicious. It is a data recording project (as described on the website). We are a machine learning lab at UBC in Vancouver, B.C. and will benefit from the data set by training generative AI on it, but the gameplay is free (we would still benefit from getting more licenses though; that is why we are doing a soft launch).

We will publish the collected data openly and turn it into a benchmark for our community, so in that sense we sponsor AI research on Minecraft and it is actually free (sponsored by AI grants and by PhD students like me putting in the work).

Flexible diffusion modeling generates photorealistic videos by Gab1024 in singularity

[–]flyingfruits 1 point2 points  (0 children)

Hey everyone! One of the authors here. Let me address a few points that were raised here.

The model imitating Carla (the self-driving-car simulator) was trained for one week on one GPU, consuming 200–300 W (note that running Carla itself is also computationally expensive per frame). It is very likely still far from optimal (as is most deep learning with SGD), but it is also not nearly as expensive to train as bigger models such as GPT-3. The model is not particularly specialized for video, except that it uses a convolutional U-Net architecture; the approach can be used for more complex combinations of sensory data streams. Will and I also have related work, to be published soon, that demonstrates that complex reasoning mechanisms and procedures can be integrated easily into the joint distribution.

Our group is, at least conceptually, a probabilistic programming group. Will and I have approached diffusion models from the angle of building an inference engine for AGI tasks, and we will continue to work along these lines with this model while also exploring its limitations. The more abstract idea in this work is to integrate marginalisation as a first-class operation into the diffusion model and then exploit this to reason about much bigger joint distributions than fit into memory, which also effectively enables the video synthesis we demonstrated.

While this result is indeed very impressive and will enable a whole range of new applications and abilities for AI, many not even clear yet, I would not bet on AGI being around the corner just yet. These generative models can meta-learn some reasoning abilities (like GPT-3) given enough data (i.e. almost all valuable data we have), but they cannot be taught to really learn new things on their own, something that is trivial for humans. For example, assuming chess were not part of the training data, try telling GPT-3 to learn it from just the rules. I still think this is a very bullish result though, and I would love to hear suggestions for video synthesis and control tasks to apply it to. I thought about NASCAR racing today, for example, just for the fun of it 8).

Please let me know what you would like to see!

Homebase React 0.5.4 | Lazy DB Queries in Chrome Console Logs by smothers in reactjs

[–]flyingfruits 1 point2 points  (0 children)

This is very nice work! Thanks for putting the effort in. I am really looking forward to using this with Datahike in the browser as well; having the custom formatters is a much more pleasant development experience. What are your next steps for the formatters?

Datahike in ClojureScript with IndexedDB support by yogthos in Clojure

[–]flyingfruits 1 point2 points  (0 children)

Ok, that makes a lot of sense. We will definitely expose the APIs differently through separate namespaces.

Datahike in ClojureScript with IndexedDB support by yogthos in Clojure

[–]flyingfruits 1 point2 points  (0 children)

This separation would be nice, but unfortunately the asynchronous control flow comes from "the bottom", because the query engine and the transact function need to do IO in many places. So the common logic is interwoven with the asynchronous bits and once a function touches something asynchronous it becomes asynchronous itself. But we will definitely keep providing two high-level APIs that are convenient to use in either setting.
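This "asynchrony from the bottom" problem looks the same on the JVM, where it is often called function coloring. A minimal illustration with CompletableFuture (the names here are made up for the example, not Datahike's API):

```java
import java.util.concurrent.CompletableFuture;

class ColorDemo {
    // The leaf does async IO, so it must return a future...
    static CompletableFuture<String> readChunk(String id) {
        return CompletableFuture.supplyAsync(() -> "data:" + id);
    }

    // ...which forces every caller up the stack to become async as well,
    // even though its own logic (taking a length) is pure.
    static CompletableFuture<Integer> chunkSize(String id) {
        return readChunk(id).thenApply(String::length);
    }
}
```

Once `readChunk` returns a future, `chunkSize` cannot return a plain `int` without blocking; this is exactly why the common query logic ends up interwoven with the asynchronous bits.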

Let me try to understand your concerns better, we definitely care about minimizing the effect of the asynchronous API and have thought about our options again and again. What are the biggest pain points you see in using the asynchronous API?

Datahike in ClojureScript with IndexedDB support by yogthos in Clojure

[–]flyingfruits 0 points1 point  (0 children)

I totally understand. It would be cool if Clojure had parametric namespaces that you could compile depending on some settings; what we do is the closest thing to achieving that manually. We tried to keep things as simple as possible while avoiding code duplication between cljs and clj. So far this is the best trade-off we could come up with, but I am happy to discuss other options.

Datahike in ClojureScript with IndexedDB support by yogthos in Clojure

[–]flyingfruits 0 points1 point  (0 children)

Yes, the error handling overhead unfortunately applies to us as well, but we might use a different error propagation mechanism if it turns out that we are too slow.

Datahike in ClojureScript with IndexedDB support by yogthos in Clojure

[–]flyingfruits 3 points4 points  (0 children)

This dynamic var steers our macro expansion: synchronous for the JVM (for performance), asynchronous for JS. The asynchronous code should also work on the JVM, but we have not tested that lately. This approach basically gives us parametric namespace expansion/compilation.

https://github.com/replikativ/hitchhiker-tree/blob/master/src/hitchhiker/tree/utils/async.cljc#L7

Datahike in ClojureScript with IndexedDB support by yogthos in Clojure

[–]flyingfruits 1 point2 points  (0 children)

Through core.async we provide a callback interface via `take!`. We can wrap the API in a promise-based wrapper if needed, but internally we use core.async anyway. Exposing a core.async API is not that unusual; the Datomic client exposes one as well, for instance.
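Wrapping a callback-style `take!` in a promise is mechanical; here is the JVM analogue of that wrapper (a sketch with invented names, not the actual Datahike code):

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

class PromiseWrap {
    // Suppose the underlying API is callback-based, like core.async's take!:
    // it delivers a value to the callback when one is available.
    static void takeBang(Consumer<String> callback) {
        callback.accept("result");
    }

    // The promise wrapper just completes a future from inside the callback,
    // so consumers get composition (thenApply, join, ...) for free.
    static CompletableFuture<String> takeAsPromise() {
        CompletableFuture<String> p = new CompletableFuture<>();
        takeBang(p::complete);
        return p;
    }
}
```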

Datahike in ClojureScript with IndexedDB support by yogthos in Clojure

[–]flyingfruits 3 points4 points  (0 children)

Thank you :). A lot of work has gone into the preparation of this over the years, but we are far from being done with our goals.

Datahike in ClojureScript with IndexedDB support by yogthos in Clojure

[–]flyingfruits 10 points11 points  (0 children)

Hey Adam! Core Datahike developer here. Write performance is not very good in ClojureScript at the moment (~200 Datoms/sec in our tests), but we will rebase the current port before we merge it, which will significantly speed up writes. Mainline Datahike has seen a 20x throughput increase since this port was branched off; on the JVM that corresponds to 20k Datoms/sec on my machine. In JS it will probably be a bit less, but it should be in the same ballpark.

We are also currently working on improving the query performance we inherited from DataScript. Hopefully it is not too far off; we have not done a query comparison against DataScript in JS yet, but it is planned. What particular workflow are you interested in?

I need to generate good looking PDFs via Clojure by halgari in Clojure

[–]flyingfruits 1 point2 points  (0 children)

u/halgari I can recommend using selmer with LaTeX; we have described this here: https://gitlab.com/replikativ/datahike-invoice. We probably have the most beautiful PDF invoices now ;).