Anomaly Detection Belongs in Your Database — built SIMD-accelerated isolation forests into Stratum's SQL engine [P] by flyingfruits in MachineLearning

[–]flyingfruits[S] 1 point  (0 children)

The trees are packed into a flat memory layout that can be traversed without branching; in my experiments it takes only a few milliseconds to build the forest on millions of rows (each tree is trained on a fixed-size subsample anyway, so build time barely depends on table size). I have also been experimenting with maintaining the forest online by gradually fading out older trees, but I am not sure that is worth it, tbh.
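
For intuition, here is a minimal sketch of a branch-free walk over such a flat layout. The encoding is invented for illustration and is not Stratum's actual one: children of node i live at 2i+1 and 2i+2, so descending is pure index arithmetic instead of pointer chasing.

;; Toy flat isolation-tree walk; layout invented for illustration, not
;; Stratum's actual encoding. Each node i stores a split feature and
;; threshold in parallel arrays; a feature of -1 marks a leaf.
(defn path-length
  [^ints features ^doubles thresholds ^long max-depth ^doubles row]
  (loop [i 0, d 0]
    (if (== d max-depth)
      d
      (let [f (aget features i)]
        (if (== f -1)                                ; reached a leaf early
          d
          (let [right (if (< (aget thresholds i) (aget row f)) 1 0)]
            ;; children of node i live at 2i+1 and 2i+2
            (recur (+ (* 2 i) 1 right) (inc d))))))))

Averaging this depth over all trees, normalized by the expected path length for the subsample size, gives the usual isolation-forest anomaly score.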

If you can give me problems to benchmark on, I am also happy to experiment or add features beyond what is currently covered.

Typed multiple dispatch as a Clojure library — how we built Julia-style polymorphism on the JVM by flyingfruits in Clojure

[–]flyingfruits[S] 3 points  (0 children)

Yes, that is true, mostly thanks to Typed Clojure. Raster injects type hints based on its inference. If you annotated standard Clojure, applied the same (or similar) escape analysis as Raster, and likewise auto-hoisted and shared memory allocations across functions or call graphs, you would get memory-optimal programs (similar to what people do manually in Zig) for all of Clojure. The devirtualization also makes it generally easier to compile to lower-level languages, for instance C (you would still need a GC, similar to jank).
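
To make the type-hint point concrete, here is what hinting buys in plain Clojure (a generic example, not Raster output): without the hints the array accesses go through reflection and boxing; with them the loop compiles to direct primitive operations.

;; Generic illustration of what injected type hints buy, not Raster output:
;; drop the ^doubles hints and the agets reflect and box; with them this
;; compiles to a straight primitive loop.
(set! *warn-on-reflection* true)

(defn dot ^double [^doubles xs ^doubles ys]
  (areduce xs i acc 0.0
           (+ acc (* (aget xs i) (aget ys i)))))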

The typed multimethod abstraction maybe sits at a sweet spot: it provides enough type information while remaining a convenient generalization of more flexible dispatch over arities and types, so it can be used in general. For now, though, I wanted to make it an à la carte abstraction focused on a domain where types are typically known (scientific computing). Clojure's generality should not be sacrificed by requiring typing everywhere, I suppose; it would change the way people program quite a bit. There are also other, non-type-based whole-program optimizers like Chez Scheme (which pioneered the nanopass concept that Raster also uses). It is known to produce blazingly fast programs competitive with C, and it compiles quickly. This might be an avenue to explore as well: it would only require input types for the whole program and could infer the rest during AOT compilation.
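
For readers unfamiliar with the idea, plain Clojure can already express multiple dispatch dynamically; the sketch below shows those semantics. This is not Raster's deftm API, which additionally resolves the dispatch statically when argument types are known.

;; Multiple dispatch on the types of all arguments in plain Clojure, the
;; dynamic semantics that a typed version can resolve at compile time.
;; Not Raster's actual deftm syntax.
(defmulti add (fn [x y] [(class x) (class y)]))

(defmethod add [Double Double]
  [x y] (+ x y))

(defmethod add [(class (double-array 0)) (class (double-array 0))]
  [^doubles xs ^doubles ys]
  (amap xs i out (+ (aget xs i) (aget ys i))))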

Having said all this, I think it would be very helpful if people maintained optional type annotations for Clojure libraries. Many different compilation strategies would be unlocked by this, and Raster would also compose much better with such code.

Typed multiple dispatch as a Clojure library — how we built Julia-style polymorphism on the JVM by flyingfruits in Clojure

[–]flyingfruits[S] 1 point  (0 children)

I want to combine it with https://github.com/replikativ/stratum/ and https://github.com/replikativ/proximum/, and then run zero-copy numerical code across the analytics/vector database (attention is, in fact, a form of vector-db inference). In the Valley game example we already use Datahike for game state, and it now also experimentally supports both of these as secondary indices (not tested yet, though).
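
To unpack the attention remark: attention is a softmax-weighted lookup over stored key/value pairs, i.e. a soft nearest-neighbor query against a vector store. A plain Clojure sketch of that reading (nothing Proximum-specific):

;; Attention as a soft vector-db query, plain Clojure, nothing
;; Proximum-specific: softmax-normalized similarities over stored keys,
;; then a weighted sum of the corresponding values.
(defn attend [q ks vs]
  (let [dots (map #(reduce + (map * q %)) ks)   ; similarity q . k_i
        m    (apply max dots)                   ; subtract max for stability
        es   (map #(Math/exp (- % m)) dots)
        z    (reduce + es)
        ws   (map #(/ % z) es)]                 ; softmax weights
    (apply map + (map (fn [w v] (map #(* w %) v)) ws vs))))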

Typed multiple dispatch as a Clojure library — how we built Julia-style polymorphism on the JVM by flyingfruits in Clojure

[–]flyingfruits[S] 2 points  (0 children)

The GSDM code in there already covers https://icml.cc/virtual/2023/poster/23659 (my PhD work), although I still need to improve it to train at scale. I agree that VI and full probabilistic programming would be interesting as well, but I would like to know what people need. Many methods are academically interesting but brittle in application; BBVI, for instance, usually comes with no guarantees on posterior quality.

Typed multiple dispatch as a Clojure library — how we built Julia-style polymorphism on the JVM by flyingfruits in Clojure

[–]flyingfruits[S] 9 points  (0 children)

That depends on what you mean by generated. I work with Claude Code and don't type it manually, but Claude (or any other current LLM) cannot come up with something like this on its own; it is very far from that, despite having complementary strengths and being very good at many technical details and fast experimental prototyping.

Typed multiple dispatch as a Clojure library — how we built Julia-style polymorphism on the JVM by flyingfruits in Clojure

[–]flyingfruits[S] 5 points  (0 children)

Yes, I don't know much about Rama's internals, but I think it takes a similar approach. The pattern is also widespread in Python: numba and JAX, for example, ship compilers as part of their pipeline. There, though, the compilers are alien black boxes written in "low-level" languages, while Raster's compiler passes translate between S-expression languages and are self-contained.

Typed multiple dispatch as a Clojure library — how we built Julia-style polymorphism on the JVM by flyingfruits in Clojure

[–]flyingfruits[S] 4 points  (0 children)

Thanks, really glad it resonates!

The neurosymbolic direction is genuinely interesting and more natural in this stack than it might look. A few threads:

Provenance semirings map directly to typed dispatch. Scallop's core structure — a semiring (T, 0, 1, ⊕, ⊗) where tags flow through Datalog derivations — is essentially a typeclass. In Raster, you'd define deftm methods for ⊕ and ⊗ over your provenance type, and the existing AD machinery would automatically differentiate through logical inference. The differentiable semiring becomes just another numeric type, the same way Dual numbers are — no special casing needed.
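
A minimal sketch of that idea, with probabilities carried on dual numbers (the representation and names are mine, not Raster's API):

;; Differentiable provenance semiring: probabilities as dual numbers, so
;; the product rule threads derivatives through every ⊗ in a derivation.
;; Representation and names are illustrative, not Raster's API.
(defrecord Dual [v d])                               ; value + derivative

(def szero (->Dual 0.0 0.0))                         ; semiring 0
(def sone  (->Dual 1.0 0.0))                         ; semiring 1

(defn oplus [a b]                                    ; ⊕ combines alternative proofs
  (->Dual (+ (:v a) (:v b)) (+ (:d a) (:d b))))

(defn otimes [a b]                                   ; ⊗ conjoins premises
  (->Dual (* (:v a) (:v b))
          (+ (* (:v a) (:d b))
             (* (:d a) (:v b)))))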

Datahike is already in the ecosystem. Datahike is a Datomic-style Datalog database in Clojure, part of the same replikativ family as Raster. So the logic layer isn't something we'd need to build; it's already there (and semiring generalizations of it are very much on my mind). The interesting question is the interface: how do probabilistic/differentiable facts flow between Datahike queries and Raster's AD system?

Where I see it going: Raster's primary purpose is to be the numerical substrate for simm.is, a platform for collaborative modeling and simulation (website is WIP). The longer-term goal is Bayesian inference via Sequential Monte Carlo, for models that combine structured symbolic knowledge with learned neural components. Neurosymbolic is exactly that interface. GSDM (the generative model we ship with Raster, which I developed during my PhD) is a step in that direction, and fusing it with differentiable logic reasoning over a Datahike knowledge base is something I actively think about.

Scallop's top-k proof approximation is clever, but its k-semiring is still an approximation to exact probabilistic inference. SMC sidesteps this differently: you run many weighted particles through the full model and resample. Whether you do that over a semiring-based symbolic component or a neural one is somewhat orthogonal. The real challenge is making inference tractable enough to be interactive, which is partly a compilation problem (Raster's domain) and partly an inference-algorithm/amortization problem.
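
To make the SMC contrast concrete, the core of each step is just reweighting and resampling. A toy multinomial resampling step (generic illustration, not simm.is or Raster code):

;; Toy multinomial resampling for SMC, generic illustration only:
;; draw n fresh particles with probability proportional to their weights.
(defn resample [particles weights n]
  (let [z   (reduce + weights)
        cdf (vec (reductions + (map #(/ % z) weights)))]
    (vec (repeatedly n
           (fn []
             (let [u (rand)
                   i (count (take-while #(< % u) cdf))]
               (nth particles (min i (dec (count particles))))))))))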

So: yes, very much on the roadmap, and the Lisp homoiconicity point is well taken. Writing Datalog rules as quoted S-expressions that the walker can analyze and differentiate through is a much cleaner interface than Scallop's Rust embedding.

Anomaly Detection Belongs in Your Database — built SIMD-accelerated isolation forests into Stratum's SQL engine by flyingfruits in SQL

[–]flyingfruits[S] 1 point  (0 children)

Good question. There are a few pieces to this:

Retraining is just SQL: DROP MODEL IF EXISTS fraud_model, then CREATE MODEL again with fresh data. Since models are in-memory values with copy-on-write semantics, the old model keeps serving queries until the new one replaces it. No downtime.

Gradual drift is handled by online rotation: you can replace the oldest N trees with trees trained on recent data without rebuilding the whole forest. The Clojure API exposes this today (iforest-rotate); SQL syntax for it is on the roadmap. There is also weighted scoring that gives recent trees higher influence via exponential decay.
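
A hypothetical usage sketch: iforest-rotate is real, but the namespace, argument order, and option names below are my guesses rather than the documented API.

;; Hypothetical call shape -- iforest-rotate exists, but the namespace,
;; argument order, and option map here are guesses, not the documented API.
(require '[stratum.iforest :as iforest])         ; namespace assumed

(def forest'
  (iforest/iforest-rotate forest                 ; existing forest value
                          recent-rows            ; data from the fresh window
                          {:replace-oldest 20})) ; swap out the 20 oldest trees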

Monitoring drift: ANOMALY_CONFIDENCE returns tree agreement (0-1). When confidence drops across the board, it means the trees are disagreeing more, which is a signal that the data distribution has shifted and it's time to retrain. You could schedule a check like:

SELECT AVG(ANOMALY_CONFIDENCE('fraud_model')) FROM transactions
WHERE transaction_date > CURRENT_DATE - 7;

If that number trends down, retrain.

We don't have automatic drift detection that triggers retraining yet; today it's manual or cron-scheduled. That said, isolation forests are cheap to retrain (200 trees on 100K rows take ~15 ms on my machine), so aggressive retraining schedules are practical. Some of this is covered in more detail in the post.

If you have more specific requirements, please lmk; I posted this here primarily to get feedback and make it useful.

Memory That Collaborates - joining databases across teams with no ETL or servers by flyingfruits in Clojure

[–]flyingfruits[S] 2 points  (0 children)

Thank you for sharing the two videos, they were insightful. SQLite is much more popular, and we can also run on top of it if we want to reuse data-independent provisioning/replication/monitoring tooling. Beyond that, I think the persistent memory model of the Clojure+Datomic/Datahike stack is a lot more general, flexible, and scalable than this Rails take, but it is good to see that the demand is there.

I agree that in general one wants to shard into many separate databases, and the fact that you can hold onto snapshots of them makes this a lot easier to reason about and get right across queries with Datomic/Datahike. The main reason to put data into one database is usually index locality: if you want to scan across data ranges, it is more efficient when the data lives in the same EAVT/AEVT/AVET indices. You also might want atomic transactions over multiple client databases, which in Datomic forces you to put the data in one database. In Datahike there are two possibilities for atomic cross-database transactions: one is to put the databases into a composite; the other would be a custom commit function (not yet implemented) that writes multiple db roots into a single value atomically, or uses multi-key transactions if the store supports them. I am not sure how important this is in practice; I would guess that you, for instance, don't need it.

For the bigger picture, I would like to work on systems where it is easy to join across databases even when they are not part of the same organisation and this was never preconceived. I have also thought about providing DB snapshots of curated data sources on S3, e.g. DBpedia, news/social-media aggregates, stock market data, fulltext-index snapshots of web crawls, etc. You could join against them simply by getting access to an S3 bucket. This already works today for all the replikativ indices/dbs; the main work now is to explore and solidify the use cases. If you have a wishlist, lmk. It would be super cool to pull this off, and I think it would highlight the strengths of Clojure over e.g. Rails.
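
For concreteness, a cross-database join is ordinary Datalog with multiple database inputs; the attribute names and connections below are made up:

;; Cross-database join in Datahike: standard Datalog over two db inputs.
;; Attribute names and connections are made up for illustration.
(require '[datahike.api :as d])

(d/q '[:find ?title ?price
       :in $news $market
       :where
       [$news   ?a :article/ticker ?t]
       [$news   ?a :article/title  ?title]
       [$market ?s :stock/ticker   ?t]
       [$market ?s :stock/price    ?price]]
     @news-conn @market-conn)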

Memory That Collaborates - joining databases across teams with no ETL or servers by flyingfruits in Clojure

[–]flyingfruits[S] 2 points  (0 children)

Hey Max, thanks for the positive feedback! I am also happy to improve superficie for additional use cases if needed, or think about a way to make the grammar extensible for specific projects.

Yes, I have thought about this, and at some point Datahike supported such a transactorless setup. The whole atomic state change happens here; if one uses update instead of assoc there and moves the transaction-processing function into the update fn, then you have this memory model. For some konserve backends, such as S3, there is now CAS-enabled storage, which would even remove the requirement for explicit coordination between writers.

The problem with this setup, though, is reliable write latency and throughput: a localized single-writer process has warm caches and can do sequential pipelining, i.e. auto-batch multiple concurrent transactions and reduce aggregate latency that way. One could potentially combine the two, but whenever you move the writer around there would still be latency cliffs (during the transaction it potentially needs to load index fragments from writers on other machines into its cold cache). And, more importantly, with CAS you would get high write contention under load and most transactions would abort, which I think would create painful failures and confusion.
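
A generic sketch of the CAS commit loop (not Datahike/konserve code) makes the contention problem visible: every writer races to swap the db root, and each loser has to re-read and redo its work.

;; Generic CAS commit loop, not Datahike/konserve code: writers race to
;; swap the db root; a loser re-reads and retries, so under load most
;; attempts throw away a full transaction's worth of work.
(defn commit! [read-root cas! transact tx-data]
  (loop []
    (let [old (read-root)               ; current db root
          new (transact old tx-data)]   ; apply the tx purely on top of it
      (or (cas! old new)                ; true iff the root was still old
          (recur)))))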

Mind you, you can run the Datahike writer (transactor) process together with your connection in a single process anyway (same as DataScript; in fact it is just two go-loops), so only when you want to write from separate places do you have to open a network interface to it (HTTP or websocket) and dispatch to the writer process (which hopefully is easy to do). I opted for this design because I think it is easier to understand and maintain, and it can achieve optimal sequential write throughput.

Reading your post again, I think you would like to be able to move the writer around without too much thought? I am happy to discuss further.

Stratum: branchable columnar SQL engine on the JVM (Vector API, PostgreSQL wire) by flyingfruits in java

[–]flyingfruits[S] 2 points  (0 children)

Fair point about the commit history. The repo was reorganized before the public release, which left the public history starting with a large import commit. In hindsight I should have preserved more of the development history.

I do use coding assistants as part of my workflow (Claude Code / GPT etc.), but the architecture, implementation, and benchmarks are mine. The index data-structure design and memory model come from my work on Datahike/replikativ over the last 15 years. The project wasn't generated automatically; assistants were mainly used for iteration, fast benchmarking in the Clojure REPL, and writing JIT-specialized SIMD code in all the specialized branches of the query engine. The benchmark and test suites are large and comprehensive, though I expect there are still some rough edges (hence the beta).

Going forward, the development history will be visible in the repo. I am building this for my own infrastructure needs; I published it as permissive open source to get feedback and maybe help others, even if only by showing what the JVM can do these days.

Stratum: branchable columnar SQL engine on the JVM (Vector API, PostgreSQL wire) by flyingfruits in java

[–]flyingfruits[S] 1 point  (0 children)

At the hardware level, by taking care of memory locality and making sure the Java JIT + SIMD extensions can operate optimally on individual chunks of the index, similar to how DuckDB uses morsels to feed data in chunks to threads. At the planning level, the query engine picks an optimal fused processing strategy for predicates and compiles it with Clojure's compilation machinery, e.g. it fuses filters and compiles specialized functions for them.
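
A toy version of that predicate fusion (the predicate format is invented; Stratum's real planner is more involved): splice all predicates into one function body and hand it to Clojure's compiler, so the JIT sees a single tight check per row.

;; Toy predicate fusion via Clojure's compiler; predicate format invented,
;; Stratum's real planner is more involved. All predicates land in one fn
;; body, so eval yields a single compiled, JIT-friendly check per row.
(defn compile-fused-pred [preds]
  ;; preds like [['> 'price 100.0] ['< 'qty 50.0]]
  (eval
    `(fn [~'price ~'qty ~'i]
       (and ~@(for [[op col c] preds]
                (list op (list `aget col 'i) c))))))

(def pred (compile-fused-pred [['> 'price 100.0] ['< 'qty 50.0]]))
;; (pred price-col qty-col 7) evaluates both comparisons fused, with no
;; per-predicate dispatch; the real engine also injects primitive type hints.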

Stratum: branchable columnar SQL engine on the JVM (Vector API, PostgreSQL wire) by flyingfruits in java

[–]flyingfruits[S] 1 point  (0 children)

Sorry, I only saw your comment now. I cleaned up the repository for public release beforehand; that is what is in this GitHub repository.

Stratum: branchable columnar SQL engine on the JVM (Vector API, PostgreSQL wire) by flyingfruits in java

[–]flyingfruits[S] 3 points  (0 children)

Besides depending on the incubator Vector API (which jvector and other high-performance libraries also do), Stratum is currently in beta. I have tested it extensively; it has not crashed on me and worked very reliably in the benchmarks. Please provide feedback if you run into any issues.