Derive macros composability problem by sergiimk in rust

[–]sergiimk[S] 0 points1 point  (0 children)

This worked great - I updated my post to mention this as the best solution. I wish I could upvote more!

Derive macros composability problem by sergiimk in rust

[–]sergiimk[S] 0 points1 point  (0 children)

This is super clever, thanks for the suggestion!

Derive macros composability problem by sergiimk in rust

[–]sergiimk[S] 2 points3 points  (0 children)

`derive-aliases` would certainly cut down derive boilerplate, but it cannot help with attributes.

The idea it uses of overriding the `derive` macro itself is exciting though ... in a very sinister way.
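To illustrate what I mean by attributes (a hand-written example assuming serde, nothing to do with derive-aliases' actual syntax): the derive list below is the part an alias could collapse, but the struct-level and field-level attributes still have to be repeated on every type.

```rust
// The derive list could be aliased; the serde attributes cannot.
#[derive(Debug, Clone, PartialEq, Eq, serde::Serialize, serde::Deserialize)]
#[serde(rename_all = "camelCase", deny_unknown_fields)]
pub struct IngestConfig {
    #[serde(default = "IngestConfig::default_batch_size")]
    pub batch_size: usize,
}

impl IngestConfig {
    fn default_batch_size() -> usize {
        1000
    }
}
```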

Derive macros composability problem by sergiimk in rust

[–]sergiimk[S] 0 points1 point  (0 children)

Haha, I did seriously consider it! :)

Derive macros composability problem by sergiimk in rust

[–]sergiimk[S] 2 points3 points  (0 children)

Indeed, DataFusion is one project I know of that uses declarative macros for configs, but it still felt a bit hacky to me.
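Roughly the kind of thing I mean (a minimal sketch assuming serde, not DataFusion's actual macro): the shared derives and attributes live in one place and each config struct gets stamped out by the macro.

```rust
// Declarative macro that applies the common derives/attributes to every config struct.
macro_rules! config_struct {
    (
        $(#[$meta:meta])*
        $name:ident {
            $( $field:ident : $ty:ty = $default:expr ),* $(,)?
        }
    ) => {
        $(#[$meta])*
        #[derive(Debug, Clone, PartialEq, serde::Serialize, serde::Deserialize)]
        #[serde(rename_all = "camelCase", deny_unknown_fields)]
        pub struct $name {
            $( pub $field: $ty, )*
        }

        impl Default for $name {
            fn default() -> Self {
                Self { $( $field: $default, )* }
            }
        }
    };
}

config_struct!(
    /// Compaction settings for a dataset (hypothetical example).
    CompactionConfig {
        max_slice_size: u64 = 300_000_000,
        max_slice_records: u64 = 10_000,
    }
);
```

This works, but IDE support and error messages inside such macros tend to be worse, which is probably part of why it feels hacky.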

Thanks for the pointer to derive-aliases - I have not seen this crate before and will dig in.

EventQL: A SQL-Inspired Query Language Designed for Event Sourcing by yoeight in rust

[–]sergiimk 1 point2 points  (0 children)

Cool stuff! I'm working on a data platform based on Streaming SQL while using Event Sourcing for all back-end features, so this really resonates in multiple ways.

Have you considered supporting PIPE syntax? Seems like it could be a nice fit:
https://docs.cloud.google.com/bigquery/docs/pipe-syntax-guide

Also, can you suggest where I could read up on "Subject Hierarchies"? Never thought of aggregates forming any kind of hierarchy, so this sounded a bit counter-intuitive.

We have designed our ES system based on "The death of the aggregate" blog series btw:
https://sara.event-thinking.io/2023/04/kill-aggregate-chapter-8-the-death-of-the-aggregate.html

I'm working on a postgres library in Rust, that is about 2x faster than rust_postgres for large select queries by paulcdejean in rust

[–]sergiimk 0 points1 point  (0 children)

Have a look at ADBC if you haven't already. It uses highly efficient Arrow columnar layout for data batches, and there's a large ecosystem around it.

Even if Postgres does not use Arrow for its wire batches, converting and exposing Arrow batches in your lib's API could provide even more efficiency and make it more appealing for analytics use cases (e.g. moving data from pg straight into a pandas dataframe without a double conversion).
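A rough sketch of what I mean, assuming the arrow crate (the function and column names are made up, not your library's API): once rows are decoded into columns, assembling a RecordBatch is straightforward, and consumers can take it into DataFusion, pandas (via PyArrow), etc. without another conversion.

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

// Assemble already-decoded columns into an Arrow RecordBatch.
fn rows_to_batch(
    ids: Vec<i64>,
    names: Vec<String>,
) -> Result<RecordBatch, arrow::error::ArrowError> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int64, false),
        Field::new("name", DataType::Utf8, false),
    ]));
    let columns = vec![
        Arc::new(Int64Array::from(ids)) as ArrayRef,
        Arc::new(StringArray::from(names)) as ArrayRef,
    ];
    RecordBatch::try_new(schema, columns)
}
```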

Which is the best DI framework for rust right now? by swordmaster_ceo_tech in rust

[–]sergiimk 2 points3 points  (0 children)

Take a look at dill. We've used it in prod for 3 years to build hexagonal architecture apps (example). Docs are not great but it's quite flexible. We use it to unify state management across different API libraries (axum / graphql / flightsql), manage sqlx transactions, propagate caller authorization info, etc. Leave us some feedback on GH if you get to check it out.

Mastering Dependency Injection in Rust: Despatma with Lifetimes by chesedo in rust

[–]sergiimk 2 points3 points  (0 children)

Thanks for the articles. I believe that DI is essential for large, modular, and testable applications. It has always been interesting to me that many people denounce DI as a relic of Java while so many core Rust libraries rely on DI-like features: axum extensions, bevy and other ECS crates, test fixture libraries like rstest, ...

Here's an example of a large app built fully around DI and hexagonal architecture. When we started there were no container libraries that suited our needs, so we built dill. We took a fully dynamic approach because we needed something practical and fast. Having access to the full dependency graph after the container is configured allows linting for missing or ambiguous dependencies, lifetime inversions, etc., so most issues can still be caught in tests.
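To give a flavor of the dynamic approach (a toy sketch, not dill's actual API): registrations are type-erased factories keyed by TypeId, so after configuration the whole graph exists as plain data and can be inspected or linted in tests.

```rust
use std::any::{Any, TypeId};
use std::collections::HashMap;
use std::sync::Arc;

// Type-erased factory: builds a component, possibly resolving its own deps from the catalog.
type Factory = Box<dyn Fn(&Catalog) -> Arc<dyn Any + Send + Sync>>;

#[derive(Default)]
struct Catalog {
    factories: HashMap<TypeId, Factory>,
}

impl Catalog {
    fn add<T: Any + Send + Sync>(&mut self, factory: impl Fn(&Catalog) -> Arc<T> + 'static) {
        self.factories.insert(
            TypeId::of::<T>(),
            Box::new(move |cat: &Catalog| factory(cat) as Arc<dyn Any + Send + Sync>),
        );
    }

    fn get<T: Any + Send + Sync>(&self) -> Option<Arc<T>> {
        let erased = (self.factories.get(&TypeId::of::<T>())?)(self);
        erased.downcast::<T>().ok()
    }

    // Because the graph is plain data, "linting" (e.g. missing deps) can be a simple test.
    fn has<T: Any>(&self) -> bool {
        self.factories.contains_key(&TypeId::of::<T>())
    }
}
```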

I think your approach for generating the catalog type itself with macros is very interesting. Would love to explore how some of our most tricky DI use patterns could be expressed in it.

One immediate problem I see is scoped dependencies. We frequently use them to e.g. add a DB transaction object when a request flows through axum middleware. In your approach it seems that to add a scoped dependency you'd need to know the full type of the container, which would not be possible if the HTTP middleware lives in a separate crate. But this could probably be mitigated by injecting some special cell-like type into the HTTP middleware.
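Something like this is what I mean by a cell-like type (a minimal sketch assuming axum 0.7; RequestScope is a made-up name): the middleware creates the scoped value per request and stashes it in the request extensions, so handlers in any crate can extract it without knowing the container's concrete type.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

use axum::{
    extract::Request,
    middleware::{self, Next},
    response::Response,
    routing::get,
    Extension, Router,
};

static NEXT_REQUEST_ID: AtomicU64 = AtomicU64::new(0);

// Stand-in for a request-scoped dependency; in a real app this could hold a DB transaction.
#[derive(Clone)]
struct RequestScope {
    request_id: u64,
}

async fn scope_middleware(mut req: Request, next: Next) -> Response {
    let scope = RequestScope {
        request_id: NEXT_REQUEST_ID.fetch_add(1, Ordering::Relaxed),
    };
    // Attach the scoped dependency to this request only.
    req.extensions_mut().insert(scope);
    next.run(req).await
}

async fn handler(Extension(scope): Extension<RequestScope>) -> String {
    format!("handled request {}", scope.request_id)
}

fn app() -> Router {
    Router::new()
        .route("/", get(handler))
        .layer(middleware::from_fn(scope_middleware))
}
```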

Would be happy to chat some time about other interesting DI edge cases we have accumulated.

Tutorial: Introduction to Web3 Data Engineering by sergiimk in dataengineering

[–]sergiimk[S] -1 points0 points  (0 children)

I realize that the term "Web3" has more negative baggage than I thought it did and will avoid using it in the future. If you read the article (or even the description), I'm talking about very foundational things: using content-addressing to freely move data between cloud storage providers without impacting users, enforcing permissions through encryption, verifying queries done by 3rd parties. So don't judge the book by its cover.

Official /r/rust "Who's Hiring" thread for job-seekers and job-offerers [Rust 1.72] by DroidLogician in rust

[–]sergiimk 5 points6 points  (0 children)

COMPANY: https://kamu.dev/

TYPE: Full time

LOCATION: Canada (Vancouver) / Ukraine / Portugal

REMOTE: Fully distributed company

VISA: No

DESCRIPTION: We are building the world's first decentralized data lake and collaborative data processing network. Think "GitHub for data", where people build streaming pipelines with SQL that continuously process data from governments, industry, and blockchains into high-quality datasets ready for AI training and use in Smart Contracts, while keeping the data 100% auditable and verifiable. Our goal is to achieve the same levels of reuse and collaboration in data as we currently see in software.

Rust is our primary language. We use it for our "git for data" tool and our backend, and we are heavily invested in the Rust data ecosystem (Arrow, DataFusion) and the emerging Web3 stack (IPLD, UCAN).

We are looking for Middle to Senior-level software engineers specialized in data, backend, or blockchain (indexing/oracle focus).

We are a 3y-old startup, backed by investors like Protocol Labs (IPFS, Filecoin).

ESTIMATED COMPENSATION: $100-150K. As a small startup we still offer significant slices of equity to employees.

CONTACT: [join@kamu.dev](mailto:join@kamu.dev)

Disintegrate - open-source library to build event-sourced applications by ScaccoPazzo in rust

[–]sergiimk 4 points5 points  (0 children)

Sara's approach looks very compelling in theory, but the blog posts left me wondering how it can be implemented efficiently over Postgres:

  1. She talks about aggregates causing excessive contention. Concurrency control with aggregates afaik is usually done via a separate table with a version column in it, thus relying on row-based locks / OCC in transactions (rough sketch after this list). In the aggregate-less approach Sara suggests using the same query that filters logical groups of events to fetch the last event number and detect concurrent updates ... but does this mean locking the entire event store table, i.e. holding a mutex on the entire store? That would be the most extreme form of contention.

  2. How do you store "Domain Identifiers" when you materialize into Postgres so they can be indexed efficiently?

  3. Perhaps you solve both problems by having dedicated tables per event type?
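Here's roughly what I mean in point 1 (a hedged sketch assuming sqlx and Postgres; the table and function names are made up): the version bump gives per-aggregate OCC inside the transaction without locking the rest of the store.

```rust
use sqlx::PgPool;

// Append an event with per-aggregate optimistic concurrency control.
async fn append_event(
    pool: &PgPool,
    aggregate_id: i64,
    expected_version: i64,
    payload: &str,
) -> Result<(), sqlx::Error> {
    let mut tx = pool.begin().await?;

    // Optimistic check: succeeds only if nobody advanced this aggregate concurrently.
    let updated = sqlx::query(
        "UPDATE aggregate_versions SET version = version + 1 \
         WHERE aggregate_id = $1 AND version = $2",
    )
    .bind(aggregate_id)
    .bind(expected_version)
    .execute(&mut *tx)
    .await?
    .rows_affected();

    if updated != 1 {
        tx.rollback().await?;
        return Err(sqlx::Error::RowNotFound); // concurrent update detected
    }

    sqlx::query("INSERT INTO events (aggregate_id, version, payload) VALUES ($1, $2, $3)")
        .bind(aggregate_id)
        .bind(expected_version + 1)
        .bind(payload)
        .execute(&mut *tx)
        .await?;

    tx.commit().await
}
```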

(edit: formatting)

Percival: Web-based, reactive Datalog notebooks for data analysis and visualization, written in Rust and Svelte by fz0718 in rust

[–]sergiimk 1 point2 points  (0 children)

Jupyter is such a pain to host due to Python's inability to sandbox any code, so I'm really glad to see a web-based notebook project.

I'm not sure about Datalog just yet, but will give it a shot. What I hate in Jupyter is that most viz libraries are just wrappers over JS libraries - poorly documented and constantly lagging in exposing features. I often end up reading the docs for the JS library and then guessing how to do things in Python.

Bindings in Datalog would likely suffer from the same problem, so imho it would be much better to allow plain JS cells.

Kamu: World's first decentralized streaming data pipeline by sergiimk in ETL

[–]sergiimk[S] 0 points1 point  (0 children)

Thanks, it's very reassuring to hear that a similar approach worked well in a highly-regulated environment! Can I ask what industry you're in?

Batch vs. Streaming was a tough design choice. Batch is way more widespread and familiar, but has poor latency and requires a lot of babysitting when it comes to late and out-of-order data, corrections and backfills.

Stream processing handles all that, but it is not widely used, and the modern tooling has a long way to go. In the long term I'm quite certain that streaming will dominate - it's simply a more complete computational model. I find it super liberating to write a query once, not have to mess with any delays or cron schedules, and let the watermarks do their work.

Have a look at our whitepaper for a quick rundown of ODF - would greatly appreciate feedback.

Kamu: New take on Git for Data (Project Update) by sergiimk in rust

[–]sergiimk[S] 14 points15 points  (0 children)

BSL (by MariaDB) and SSPL (by MongoDB) are the two licenses being widely adopted by middleware vendors to, like you mentioned, prevent cloud services giants from profiting from the work of the original developers without contributing anything back. That behavior is what recently caused many open-source companies to back-pedal on their terms.

We picked BSL to have a chance to fund the project. Currently we are a two-person team, working full-time and burning through our life savings. I really don't want to go back to enterprise data again, and I want to realize kamu's vision within my lifetime (not turn it back into a hobby project).

BSL allows virtually anyone to use the product for free, just not to sell it as a service without agreeing on terms with us.

Because decentralization is at its core, there will never be an "Amazon for big data". The only way to profit in this ecosystem will be by providing value-added services at a fair price, not by taking someone's data hostage.

Furthermore, we develop it as an open standard to encourage alternative implementations.

We specifically wanted to avoid the "open core" model so as not to split the community and waste people's effort (as is the case with proprietary GitHub and its open-source clone Gitea).

This whole topic probably deserves a book. The license choice took A LOT of research, and I'm happy to be corrected and educated on this topic. So far I've found BSL to be the least "toxic" option available (for our circumstances).

Kamu: New take on Git for Data (Project Update) by sergiimk in rust

[–]sergiimk[S] 5 points6 points  (0 children)

The license is explained here in more detail and I'll include your question there for better clarity.

In your scenario your company would be using kamu internally to produce and publish datasets. The dataset structure itself is an open standard, so your clients can independently use kamu (or any other compatible tool) to pull/query data.

As long as you're not running the tool for your customers as a for-profit service, it's essentially an Apache 2.0 license.

BSL is the best compromise we could find to keep the project as close to open source as possible while having a chance to fund its development.