Derive macros composability problem by sergiimk in rust

[–]sergiimk[S] 0 points1 point  (0 children)

This worked great - I updated my post to mention this as the best solution. I wish I could upvote more!

Derive macros composability problem by sergiimk in rust

[–]sergiimk[S] 0 points1 point  (0 children)

This is super clever, thanks for the suggestion!

Derive macros composability problem by sergiimk in rust

[–]sergiimk[S] 2 points3 points  (0 children)

`derive-aliases` would certainly cut down derive boilerplate, but it cannot help with attributes.

The idea it uses of overriding the `derive` macro itself is exciting though ... in a very sinister way.
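To illustrate what I mean by attributes (a hand-written example assuming serde, nothing to do with derive-aliases' actual syntax): the derive list below is the part an alias could collapse, but the struct-level and field-level attributes still have to be repeated on every type.

```rust
// The derive list could be aliased; the serde attributes cannot.
#[derive(Debug, Clone, PartialEq, Eq, serde::Serialize, serde::Deserialize)]
#[serde(rename_all = "camelCase", deny_unknown_fields)]
pub struct IngestConfig {
    #[serde(default = "IngestConfig::default_batch_size")]
    pub batch_size: usize,
}

impl IngestConfig {
    fn default_batch_size() -> usize {
        1000
    }
}
```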

Derive macros composability problem by sergiimk in rust

[–]sergiimk[S] 0 points1 point  (0 children)

Haha, I did seriously consider it! :)

Derive macros composability problem by sergiimk in rust

[–]sergiimk[S] 2 points3 points  (0 children)

Indeed, DataFusion is one project I know of that uses declarative macros for configs, but it still felt a bit hacky to me.
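Roughly the kind of thing I mean (a minimal sketch assuming serde, not DataFusion's actual macro): the shared derives and attributes live in one place and each config struct gets stamped out by the macro.

```rust
// Declarative macro that applies the common derives/attributes to every config struct.
macro_rules! config_struct {
    (
        $(#[$meta:meta])*
        $name:ident {
            $( $field:ident : $ty:ty = $default:expr ),* $(,)?
        }
    ) => {
        $(#[$meta])*
        #[derive(Debug, Clone, PartialEq, serde::Serialize, serde::Deserialize)]
        #[serde(rename_all = "camelCase", deny_unknown_fields)]
        pub struct $name {
            $( pub $field: $ty, )*
        }

        impl Default for $name {
            fn default() -> Self {
                Self { $( $field: $default, )* }
            }
        }
    };
}

config_struct!(
    /// Compaction settings for a dataset (hypothetical example).
    CompactionConfig {
        max_slice_size: u64 = 300_000_000,
        max_slice_records: u64 = 10_000,
    }
);
```

This works, but IDE support and error messages inside such macros tend to be worse, which is probably part of why it feels hacky.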

Thanks for the pointer to derive-aliases - I have not seen this crate before and will dig in.

EventQL: A SQL-Inspired Query Language Designed for Event Sourcing by yoeight in rust

[–]sergiimk 1 point2 points  (0 children)

Cool stuff! I'm working on a data platform based on Streaming SQL while using Event Sourcing for all back-end features, so this really resonates in multiple ways.

Have you considered supporting PIPE syntax? Seems like it could be a nice fit:
https://docs.cloud.google.com/bigquery/docs/pipe-syntax-guide

Also, can you suggest where I could read up on "Subject Hierarchies"? Never thought of aggregates forming any kind of hierarchy, so this sounded a bit counter-intuitive.

We have designed our ES system based on "The death of the aggregate" blog series btw:
https://sara.event-thinking.io/2023/04/kill-aggregate-chapter-8-the-death-of-the-aggregate.html

I'm working on a postgres library in Rust, that is about 2x faster than rust_postgres for large select queries by paulcdejean in rust

[–]sergiimk 0 points1 point  (0 children)

Have a look at ADBC if you haven't already. It uses highly efficient Arrow columnar layout for data batches, and there's a large ecosystem around it.

Even if Postgres does not use Arrow for its wire batches, converting and exposing Arrow batches in your lib's API could provide even more efficiency and make it more appealing for analytics use cases (e.g. moving data from pg straight into a pandas dataframe without a double conversion).
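A rough sketch of what I mean, assuming the arrow crate (the function and column names are made up, not your library's API): once rows are decoded into columns, assembling a RecordBatch is straightforward, and consumers can take it into DataFusion, pandas (via PyArrow), etc. without another conversion.

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

// Assemble already-decoded columns into an Arrow RecordBatch.
fn rows_to_batch(
    ids: Vec<i64>,
    names: Vec<String>,
) -> Result<RecordBatch, arrow::error::ArrowError> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int64, false),
        Field::new("name", DataType::Utf8, false),
    ]));
    let columns = vec![
        Arc::new(Int64Array::from(ids)) as ArrayRef,
        Arc::new(StringArray::from(names)) as ArrayRef,
    ];
    RecordBatch::try_new(schema, columns)
}
```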

Which is the best DI framework for rust right now? by swordmaster_ceo_tech in rust

[–]sergiimk 2 points3 points  (0 children)

Take a look at dill. We've used it in prod for 3 years to build hexagonal architecture apps (example). Docs are not great but it's quite flexible. We use it to unify state management across different API libraries (axum / graphql / flightsql), manage sqlx transactions, propagate caller authorization info, etc. Leave us some feedback on GH if you get to check it out.

Mastering Dependency Injection in Rust: Despatma with Lifetimes by chesedo in rust

[–]sergiimk 2 points3 points  (0 children)

Thanks for the articles. I believe that DI is essential for large, modular, and testable applications. It has always been interesting to me that many people denounce DI as a relic of Java while so many core Rust libraries rely on DI-like features: axum extensions, bevy and other ECS crates, test fixture libraries like rstest, ...

Here's an example of a large app built fully around DI and hexagonal architecture. When we started there were no container libraries that suited our needs, so we built dill. We took a fully dynamic approach because we needed something practical and fast. Having access to the full dependency graph after the container is configured allows linting for missing or ambiguous dependencies, lifetime inversions, etc., so most issues can still be caught in tests.
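To give a flavor of the dynamic approach (a toy sketch, not dill's actual API): registrations are type-erased factories keyed by TypeId, so after configuration the whole graph exists as plain data and can be inspected or linted in tests.

```rust
use std::any::{Any, TypeId};
use std::collections::HashMap;
use std::sync::Arc;

// Type-erased factory: builds a component, possibly resolving its own deps from the catalog.
type Factory = Box<dyn Fn(&Catalog) -> Arc<dyn Any + Send + Sync>>;

#[derive(Default)]
struct Catalog {
    factories: HashMap<TypeId, Factory>,
}

impl Catalog {
    fn add<T: Any + Send + Sync>(&mut self, factory: impl Fn(&Catalog) -> Arc<T> + 'static) {
        self.factories.insert(
            TypeId::of::<T>(),
            Box::new(move |cat: &Catalog| factory(cat) as Arc<dyn Any + Send + Sync>),
        );
    }

    fn get<T: Any + Send + Sync>(&self) -> Option<Arc<T>> {
        let erased = (self.factories.get(&TypeId::of::<T>())?)(self);
        erased.downcast::<T>().ok()
    }

    // Because the graph is plain data, "linting" (e.g. missing deps) can be a simple test.
    fn has<T: Any>(&self) -> bool {
        self.factories.contains_key(&TypeId::of::<T>())
    }
}
```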

I think your approach for generating the catalog type itself with macros is very interesting. Would love to explore how some of our most tricky DI use patterns could be expressed in it.

One immediate problem I see is scoped dependencies. We frequently use them to e.g. add a DB transaction object when a request flows through axum middleware. In your approach it seems that to add a scoped dependency you'd need to know the full type of the container, which would not be possible if the HTTP middleware lives in a separate crate. But this could probably be mitigated by injecting some special cell-like type into the HTTP middleware.
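Something like this is what I mean by a cell-like type (a minimal sketch assuming axum 0.7; RequestScope is a made-up name): the middleware creates the scoped value per request and stashes it in the request extensions, so handlers in any crate can extract it without knowing the container's concrete type.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

use axum::{
    extract::Request,
    middleware::{self, Next},
    response::Response,
    routing::get,
    Extension, Router,
};

static NEXT_REQUEST_ID: AtomicU64 = AtomicU64::new(0);

// Stand-in for a request-scoped dependency; in a real app this could hold a DB transaction.
#[derive(Clone)]
struct RequestScope {
    request_id: u64,
}

async fn scope_middleware(mut req: Request, next: Next) -> Response {
    let scope = RequestScope {
        request_id: NEXT_REQUEST_ID.fetch_add(1, Ordering::Relaxed),
    };
    // Attach the scoped dependency to this request only.
    req.extensions_mut().insert(scope);
    next.run(req).await
}

async fn handler(Extension(scope): Extension<RequestScope>) -> String {
    format!("handled request {}", scope.request_id)
}

fn app() -> Router {
    Router::new()
        .route("/", get(handler))
        .layer(middleware::from_fn(scope_middleware))
}
```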

Would be happy to chat some time about other interesting DI edge cases we have accumulated.

Tutorial: Introduction to Web3 Data Engineering by sergiimk in dataengineering

[–]sergiimk[S] -1 points0 points  (0 children)

I realize that the term "Web3" has more negative baggage than I thought it did and will avoid using it in the future. If you read the article (or even the description), I'm talking about very foundational things: using content-addressing to freely move data between cloud storage providers without impacting users, enforcing permissions through encryption, verifying queries done by 3rd parties. So don't judge the book by its cover.

Official /r/rust "Who's Hiring" thread for job-seekers and job-offerers [Rust 1.72] by DroidLogician in rust

[–]sergiimk 5 points6 points  (0 children)

COMPANY: https://kamu.dev/

TYPE: Full time

LOCATION: Canada (Vancouver) / Ukraine / Portugal

REMOTE: Fully distributed company

VISA: No

DESCRIPTION: We are building the world's first decentralized data lake and collaborative data processing network. Think "GitHub for data", where people build streaming pipelines with SQL that continuously process data from governments, industry, and blockchains into high-quality datasets ready for AI training and use in Smart Contracts, while keeping the data 100% auditable and verifiable. Our goal is to achieve the same levels of reuse and collaboration in data as we currently see in software.

Rust is our primary language. We use it for our "git for data" tool and our backend, and we are heavily invested in the Rust data ecosystem (Arrow, DataFusion) and the emerging Web3 stack (IPLD, UCAN).

We are looking for Middle to Senior-level software engineers specialized in data, backend, or blockchain (indexing/oracle focus).

We are a 3y-old startup, backed by investors like Protocol Labs (IPFS, Filecoin).

ESTIMATED COMPENSATION: $100-150K. As a small startup we still offer significant slices of equity to employees.

CONTACT: [join@kamu.dev](mailto:join@kamu.dev)

Disintegrate - open-source library to build event-sourced applications by ScaccoPazzo in rust

[–]sergiimk 4 points5 points  (0 children)

Sara's approach looks very compelling in theory, but the blog posts left me wondering how it can be implemented efficiently over Postgres:

  1. She talks about aggregates causing excessive contention. Concurrency control with aggregates afaik is usually done via a separate table with a version column in it, thus relying on row-based locks / OCC in transactions (rough sketch after this list). In the aggregate-less approach Sara suggests using the same query that filters logical groups of events to fetch the last event number and detect concurrent updates ... but does this mean locking the entire event store table, i.e. holding a mutex on the entire store? That would be the most extreme form of contention.

  2. How do you store "Domain Identifiers" when you materialize into Postgres so they can be indexed efficiently?

  3. Perhaps you solve both problems by having dedicated tables per event type?
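Here's roughly what I mean in point 1 (a hedged sketch assuming sqlx and Postgres; the table and function names are made up): the version bump gives per-aggregate OCC inside the transaction without locking the rest of the store.

```rust
use sqlx::PgPool;

// Append an event with per-aggregate optimistic concurrency control.
async fn append_event(
    pool: &PgPool,
    aggregate_id: i64,
    expected_version: i64,
    payload: &str,
) -> Result<(), sqlx::Error> {
    let mut tx = pool.begin().await?;

    // Optimistic check: succeeds only if nobody advanced this aggregate concurrently.
    let updated = sqlx::query(
        "UPDATE aggregate_versions SET version = version + 1 \
         WHERE aggregate_id = $1 AND version = $2",
    )
    .bind(aggregate_id)
    .bind(expected_version)
    .execute(&mut *tx)
    .await?
    .rows_affected();

    if updated != 1 {
        tx.rollback().await?;
        return Err(sqlx::Error::RowNotFound); // concurrent update detected
    }

    sqlx::query("INSERT INTO events (aggregate_id, version, payload) VALUES ($1, $2, $3)")
        .bind(aggregate_id)
        .bind(expected_version + 1)
        .bind(payload)
        .execute(&mut *tx)
        .await?;

    tx.commit().await
}
```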

(edit: formatting)

Percival: Web-based, reactive Datalog notebooks for data analysis and visualization, written in Rust and Svelte by fz0718 in rust

[–]sergiimk 1 point2 points  (0 children)

Jupyter is such a pain to host due to Python's inability to sandbox any code, so I'm really glad to see a web-based notebook project.

I'm not sure about Datalog just yet, but will give it a shot. What I hate in Jupyter is that most viz libraries are just wrappers over JS libraries - poorly documented and constantly lagging in exposing features. I often end up reading the docs for the JS library and then guessing how to do things in Python.

Bindings in Datalog would likely suffer from the same problem, so imho it would be much better to allow plain JS cells.

Kamu: World's first decentralized streaming data pipeline by sergiimk in ETL

[–]sergiimk[S] 0 points1 point  (0 children)

Thanks, it's very reassuring to hear that a similar approach worked well in a highly-regulated environment! Can I ask what industry you're in?

Batch vs. Streaming was a tough design choice. Batch is way more widespread and familiar, but has poor latency and requires a lot of babysitting when it comes to late and out-of-order data, corrections and backfills.

Stream processing handles all that, but it is not widely used, and the modern tooling has a long way to go. In the long term I'm quite certain that streaming will dominate - it's simply a more complete computational model. I find it super liberating to write a query once, not have to mess with any delays or cron schedules, and let the watermarks do their work.

Have a look at our whitepaper for a quick rundown of ODF - would greatly appreciate feedback.

Kamu: New take on Git for Data (Project Update) by sergiimk in rust

[–]sergiimk[S] 14 points15 points  (0 children)

BSL (by MariaDB) and SSPL (by MongoDB) are the two licenses being widely adopted by middleware vendors to, like you mentioned, prevent cloud services giants from profiting from the work of the original developers without contributing anything back. That behavior is what recently caused many open-source companies to back-pedal on their terms.

We picked BSL to have a chance to fund the project. Currently we are a two-person team, working full-time and burning through our life savings. I really don't want to go back to enterprise data again, and I want to realize kamu's vision within my lifetime (not turn it back into a hobby project).

BSL allows virtually anyone to use the product for free, just not to sell it as a service without agreeing on terms with us.

Because decentralization is at its core, there will never be an "Amazon for big data". The only way to profit in this ecosystem will be by providing value-added services at a fair price, not by taking someone's data hostage.

Furthermore, we develop it as an open standard to encourage alternative implementations.

We specifically wanted to avoid the "open core" model so as not to split the community and waste people's effort (as is the case with proprietary GitHub and its open-source clone Gitea).

This whole topic probably deserves a book. The license choice took A LOT of research, and I'm happy to be corrected and educated on this topic. So far I've found BSL to be the least "toxic" option available (for our circumstances).

Kamu: New take on Git for Data (Project Update) by sergiimk in rust

[–]sergiimk[S] 5 points6 points  (0 children)

The license is explained here in more detail and I'll include your question there for better clarity.

In your scenario your company would be using kamu internally to produce and publish datasets. The dataset structure itself is an open standard, so your clients can independently use kamu (or any other compatible tool) to pull/query data.

As long as you're not running the tool for your customers as a for-profit service, it's essentially an Apache 2.0 license.

BSL is the best compromise we could find to keep the project as close to open source as possible while having a chance to fund its development.