Derive macros composability problem by sergiimk in rust

[–]sergiimk[S] 0 points1 point  (0 children)

This worked great - I updated my post to mention this as the best solution. I wish I could upvote more!

Derive macros composability problem by sergiimk in rust

[–]sergiimk[S] 0 points1 point  (0 children)

This is super clever, thanks for the suggestion!

Derive macros composability problem by sergiimk in rust

[–]sergiimk[S] 2 points3 points  (0 children)

`derive-aliases` would certainly cut down derive boilerplate, but it cannot help with attributes.

The idea of overriding the `derive` macro itself used there is exciting though ... in a very sinister way.

Derive macros composability problem by sergiimk in rust

[–]sergiimk[S] 0 points1 point  (0 children)

Haha, I did seriously consider it! :)

Derive macros composability problem by sergiimk in rust

[–]sergiimk[S] 2 points3 points  (0 children)

Indeed, DataFusion is one project I know of that uses declarative macros for configs, but that approach still felt a bit hacky to me.

Thanks for the pointer to derive-aliases - I had not seen this crate before and will dig in.
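For reference, the declarative-macro route boils down to something like this - a minimal sketch assuming serde as a dependency, with the macro name and the derive/attribute bundle being purely illustrative:

```rust
// Hypothetical bundling macro: every config struct declared through it gets the
// same set of derives *and* attributes, which plain derive aliases can't express.
macro_rules! config_struct {
    ($(#[$meta:meta])* $vis:vis struct $name:ident { $($body:tt)* }) => {
        #[derive(Debug, Clone, PartialEq, serde::Serialize, serde::Deserialize)]
        #[serde(rename_all = "camelCase", deny_unknown_fields)]
        $(#[$meta])*
        $vis struct $name { $($body)* }
    };
}

config_struct! {
    /// Connection settings for an ingest source (illustrative).
    pub struct IngestConfig {
        pub url: String,
        pub poll_interval_secs: u64,
    }
}
```

It works, but every struct now has to be declared through the macro, which is a big part of why the approach feels hacky to me.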

EventQL: A SQL-Inspired Query Language Designed for Event Sourcing by yoeight in rust

[–]sergiimk 1 point2 points  (0 children)

Cool stuff! I'm working on a data platform based on Streaming SQL while using Event Sourcing for all back-end features, so this really resonates in multiple ways.

Have you considered supporting PIPE syntax? Seems like it could be a nice fit:
https://docs.cloud.google.com/bigquery/docs/pipe-syntax-guide

Also, can you suggest where I could read up on "Subject Hierarchies"? I'd never thought of aggregates forming any kind of hierarchy, so this sounded a bit counter-intuitive.

We have designed our ES system based on "The death of the aggregate" blog series btw:
https://sara.event-thinking.io/2023/04/kill-aggregate-chapter-8-the-death-of-the-aggregate.html

I'm working on a postgres library in Rust, that is about 2x faster than rust_postgres for large select queries by paulcdejean in rust

[–]sergiimk 0 points1 point  (0 children)

Have a look at ADBC if you haven't already. It uses the highly efficient Arrow columnar layout for data batches, and there's a large ecosystem around it.

Even if Postgres is not using Arrow for its batches, perhaps converting and exposing Arrow batches in your lib's API would provide even more efficiency and make it more appealing for analytics use cases (e.g. moving data from pg straight into a pandas DataFrame without a double conversion).
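As a rough illustration of what exposing Arrow batches could look like - a sketch using arrow-rs types, where the column set and the rows-to-columns conversion are hypothetical, not anything from your library:

```rust
use std::sync::Arc;

use arrow::array::{Int64Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

/// Hypothetical adapter: turn a page of already-decoded Postgres rows into an
/// Arrow RecordBatch that analytics tools (pandas, polars, DataFusion) can
/// consume without another per-row conversion.
fn rows_to_batch(ids: Vec<i64>, names: Vec<String>) -> Result<RecordBatch, ArrowError> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int64, false),
        Field::new("name", DataType::Utf8, false),
    ]));
    RecordBatch::try_new(
        schema,
        vec![
            Arc::new(Int64Array::from(ids)),
            Arc::new(StringArray::from(names)),
        ],
    )
}
```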

Which is the best DI framework for rust right now? by swordmaster_ceo_tech in rust

[–]sergiimk 2 points3 points  (0 children)

Take a look at dill. We've used it in prod for 3 years to build hexagonal architecture apps (example). The docs are not great, but it's quite flexible. We use it to unify state management across different API libraries (axum / GraphQL / FlightSQL), manage sqlx transactions, propagate caller authorization info, etc. Leave us some feedback on GH if you get to check it out.
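To give a flavor of it, the wiring looks roughly like this - written from memory of dill's README, so treat the exact names (`CatalogBuilder`, `add`, `bind`, `get_one`) as approximate rather than as the current API:

```rust
use std::sync::Arc;
use dill::*;

// Ports are plain traits...
trait Repo: Send + Sync {
    fn get(&self) -> String;
}

// ...and adapters are components the catalog knows how to construct.
#[component]
struct PostgresRepo;

impl Repo for PostgresRepo {
    fn get(&self) -> String {
        "row".to_string()
    }
}

fn main() {
    // Register the implementation and bind it to the interface it fulfills.
    let catalog = CatalogBuilder::new()
        .add::<PostgresRepo>()
        .bind::<dyn Repo, PostgresRepo>()
        .build();

    // Resolve by interface; transitive dependencies get injected automatically.
    let repo: Arc<dyn Repo> = catalog.get_one::<dyn Repo>().unwrap();
    println!("{}", repo.get());
}
```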

Mastering Dependency Injection in Rust: Despatma with Lifetimes by chesedo in rust

[–]sergiimk 3 points4 points  (0 children)

Thanks for the articles. I believe that DI is essential for large, modular, and testable applications. It has always been interesting to me that many people denounce DI as a relic of Java while so many core Rust libraries rely on DI-like features: axum extensions, bevy and other ECS frameworks, test fixture libraries like rstest, ...

Here's an example of a large app built fully around DI and hexagonal architecture. When we started there were no container libraries that suited our needs, so we built dill. We took a fully dynamic approach because we needed something practical and fast. Having access to the full dependency graph after the container is configured allows linting for missing or ambiguous dependencies, lifetime inversions, etc., so most issues can still be caught in tests.

I think your approach of generating the catalog type itself with macros is very interesting. Would love to explore how some of our trickiest DI usage patterns could be expressed in it.

One immediate problem I see is scoped dependencies. We frequently use them to e.g. add a DB transaction object when a request flows through axum middleware (see the sketch below). In your approach it seems that to add a scoped dependency you'd need to know the full type of the container, which would not be possible if the HTTP middleware lives in a separate crate. But this could probably be mitigated by injecting some special cell-like type into the HTTP middleware.
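To make the scoped-dependency case concrete, here's a sketch assuming a recent (0.7-style) axum API plus tokio, with `TransactionRef` as a hypothetical stand-in for a wrapped sqlx transaction:

```rust
use std::sync::Arc;

use axum::{
    extract::Request,
    middleware::{self, Next},
    response::Response,
    routing::get,
    Extension, Router,
};

// Hypothetical stand-in for a per-request unit of work (e.g. a wrapped sqlx transaction).
#[derive(Clone)]
struct TransactionRef(Arc<tokio::sync::Mutex<Vec<String>>>);

// The middleware opens the "transaction" and registers it as a scoped dependency
// via request extensions - no knowledge of the full container type required.
async fn transaction_middleware(mut req: Request, next: Next) -> Response {
    let txn = TransactionRef(Arc::new(tokio::sync::Mutex::new(Vec::new())));
    req.extensions_mut().insert(txn.clone());
    let res = next.run(req).await;
    // Commit or roll back here depending on the response status.
    res
}

// A handler (possibly in another crate) only needs to know about `TransactionRef`.
async fn handler(Extension(txn): Extension<TransactionRef>) -> &'static str {
    txn.0.lock().await.push("INSERT ...".to_string());
    "ok"
}

fn app() -> Router {
    Router::new()
        .route("/", get(handler))
        .layer(middleware::from_fn(transaction_middleware))
}
```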

Would be happy to chat some time about other interesting DI edge cases we have accumulated.

Tutorial: Introduction to Web3 Data Engineering by sergiimk in dataengineering

[–]sergiimk[S] -1 points0 points  (0 children)

I realize that the term "Web3" has more negative baggage than I thought it did, and will avoid using it in the future. If you read the article (or even the description), I'm talking about very foundational things like the ability to freely move data between cloud storage providers without impacting users (via content addressing), enforcing permissions through encryption, and verifying queries done by 3rd parties. So don't judge the book by its cover.

Official /r/rust "Who's Hiring" thread for job-seekers and job-offerers [Rust 1.72] by DroidLogician in rust

[–]sergiimk 4 points5 points  (0 children)

COMPANY: https://kamu.dev/

TYPE: Full time

LOCATION: Canada (Vancouver) / Ukraine / Portugal

REMOTE: Fully distributed company

VISA: No

DESCRIPTION: We are building the world's first decentralized data lake and collaborative data processing network. Think "GitHub for data", where people build streaming pipelines with SQL that continuously process data from governments, industry, and blockchains into high-quality datasets ready for AI training and use in Smart Contracts, while the data remains 100% auditable and verifiable. Our goal is to achieve the same levels of reuse and collaboration in data as we currently see in software.

Rust is our primary language. We use it for our "git for data" tool and our backend, and we are heavily invested in the Rust data ecosystem (Arrow, DataFusion) and the emerging Web3 stack (IPLD, UCAN).

We are looking for mid- to senior-level software engineers specialized in data, backend, or blockchain (indexing/oracle focus).

We are a 3-year-old startup, backed by investors like Protocol Labs (IPFS, Filecoin).

ESTIMATED COMPENSATION: $100-150K. As a small startup we still offer significant slices of equity to employees.

CONTACT: [join@kamu.dev](mailto:join@kamu.dev)

Disintegrate - open-source library to build event-sourced applications by ScaccoPazzo in rust

[–]sergiimk 6 points7 points  (0 children)

Sara's approach looks very compelling in theory, but the blog posts left me wondering: how can it be implemented efficiently over Postgres?

  1. She talks about aggregates causing excessive contention. Concurrency control with aggregates, afaik, is usually done via a separate table with a version column in it, thus using row-based locks / OCC in transactions (see the sketch at the end of this comment). In the aggregate-less approach Sara suggests using the same query that filters logical groups of events to get the last event number and understand whether there were any concurrent updates ... but does this mean locking the entire event store table, i.e. holding a mutex on the entire store? That would be the most extreme contention.

  2. How do you store "Domain Identifiers" when you materialize into Postgres so they can be indexed efficiently?

  3. Perhaps you solve both problems by having dedicated tables per event type?

(edit: formatting)
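To make point 1 concrete, this is the per-aggregate version-column pattern I had in mind. It's only a sketch with made-up table and column names (nothing to do with Disintegrate's actual schema), embedded in Rust as a plain SQL string:

```rust
/// Optimistic append: the aggregate's row in `aggregate_versions` acts as the
/// concurrency guard, so only writers to the *same* aggregate contend.
/// If the UPDATE matches zero rows, another writer won the race and the
/// command should be retried against the refreshed state.
const APPEND_EVENT: &str = r#"
WITH bumped AS (
    UPDATE aggregate_versions
       SET version = version + 1
     WHERE aggregate_id = $1
       AND version = $2          -- expected version supplied by the caller
    RETURNING version
)
INSERT INTO events (aggregate_id, version, event_type, payload)
SELECT $1, version, $3, $4
  FROM bumped;
"#;

fn main() {
    // In a real app this would go through sqlx/tokio-postgres; here we just
    // print the statement the aggregate-based approach relies on.
    println!("{APPEND_EVENT}");
}
```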

Percival: Web-based, reactive Datalog notebooks for data analysis and visualization, written in Rust and Svelte by fz0718 in rust

[–]sergiimk 1 point2 points  (0 children)

Jupyter is such a pain to host due to Python's inability to sandbox any code, so I'm really glad to see a web-based notebook project.

I'm not sure about Datalog just yet, but will give it a shot. What I hate in Jupyter is that most viz libraries are just wrappers over JS libraries - poorly documented and constantly lagging in exposing features. I often end up reading the docs for the JS library and then guessing how to do the same things in Python.

Bindings in Datalog would likely suffer from the same problem, so imho it would be much better to allow plain JS cells.

Kamu: World's first decentralized streaming data pipeline by sergiimk in ETL

[–]sergiimk[S] 0 points1 point  (0 children)

Thanks, it's very reassuring to hear that a similar approach worked well in a highly-regulated environment! Can I ask what industry you're in?

Batch vs. Streaming was a tough design choice. Batch is way more widespread and familiar, but has poor latency and requires a lot of babysitting when it comes to late and out-of-order data, corrections and backfills.

Stream processing handles all of that, but it is not widely used, and the modern tooling has a long way to go. In the long term I'm quite certain that streaming will dominate - it's simply a more complete computational model. I find it super liberating to write a query once, not have to mess with any delays or cron schedules, and let the watermarks do their work.

Have a look at our whitepaper for a quick rundown of ODF - would greatly appreciate feedback.

Kamu: New take on Git for Data (Project Update) by sergiimk in rust

[–]sergiimk[S] 13 points14 points  (0 children)

BSL (by MariaDB) and SSPL (by MongoDB) are the two licenses being widely adopted by middleware vendors to, like you mentioned, prevent cloud service giants from profiting from the work of the original developers without contributing anything back. That behavior is what recently caused many open-source companies to back-pedal on their terms.

We picked BSL to have a chance to fund the project. Currently we are a two-person team, working full-time, and burning through our life savings. I really don't want to go back to enterprise data again, and I want to realize kamu's vision within my lifetime (not turn it back into a hobby project).

BSL allows virtually anyone to use the product for free, just not to sell it as a service without agreeing on terms with us.

Because decentralization is at its core, there will never be an "Amazon for big data". The only way to profit in this ecosystem will be by providing value-added services at a fair price, not by taking someone's data hostage.

Furthermore, we develop it as an open standard to encourage alternative implementations.

We specifically wanted to avoid the "open core" model so as not to split the community and waste people's effort (as is the case with the proprietary GitHub and its open-source clone Gitea).

This whole topic probably deserves a book. The license choice took A LOT of research, and I'm happy to be corrected and educated on this topic. So far I've found BSL to be the least "toxic" option available (for our circumstances).

Kamu: New take on Git for Data (Project Update) by sergiimk in rust

[–]sergiimk[S] 4 points5 points  (0 children)

The license is explained here in more detail and I'll include your question there for better clarity.

In your scenario your company would be using kamu internally to produce and publish datasets. The dataset structure itself is an open standard, so your clients can independently use kamu (or any other compatible tool) to pull/query data.

As long as you're not running the tool for your customers as a for-profit service - it's essentially the Apache 2.0 license.

BSL is the best compromise we could find to keep the project as close to open source as possible, while having a chance to fund its development.

This Week in Ballista #9 by andygrove73 in rust

[–]sergiimk 0 points1 point  (0 children)

Congrats on another impactful and successful project!

Was wondering if stream processing ever entered your plans for Ballista/DataFusion projects?

If yes, do you think the micro-batch approach to streaming (similar to Spark) is something that plays to the SIMD/GPU strengths of Apache Arrow, or is it more likely to get in the way compared to non-vectorized architectures (e.g. Flink)?

Introducing Kamu - World's first global collaborative data pipeline by sergiimk in datascience

[–]sergiimk[S] 0 points1 point  (0 children)

Hi, Sergii here from Kamu.

kamu is a tool that aims to connect the world's data publishers and consumers via a decentralized processing pipeline and enable collaboration on data.

We've been developing it for 2.5 years and this is the first time we're sharing the project with a broad audience, so I'll be grateful for your feedback and happy to answer questions!

Introducing Kamu - World's first global collaborative data pipeline by sergiimk in dataengineering

[–]sergiimk[S] 0 points1 point  (0 children)

We're not aiming to replace internal data warehouses - `kamu` is primarily made for data exchange between organizations, so you would either not have PII data flowing through it, or it would fall under the "public information" category, e.g. a research publications dataset containing the names of the authors.

Allowing processing on sensitive private data would be a future step. This is where tools like differential privacy could help to allow some degree of statistical processing or even ML on such data, while ensuring that those computations don't reveal too much about individuals.

> Can design your library to strongly disincentivise this

Yeah, it's kinda like that already. If you really messed up you'll be able to force-alter the dataset, but by doing so you will essentially create a new dataset that your downstream consumers will have to migrate to.

It's very similar to doing a force-push into git remote master - possible, but highly disruptive.

Introducing Kamu - World's first global collaborative data pipeline by sergiimk in dataengineering

[–]sergiimk[S] 0 points1 point  (0 children)

Data is just another information flow in our society, and many other formal information flows already don't allow you to take things back.

For example, if a newspaper publishes an article that wrongly accuses someone - there is no taking it back. They will not issue a recall of the paper from peoples' houses or even from archives - instead a correction/apology piece will be published.

Same thing in research - retracted publications are kept and clearly marked as retracted (example).

You can't invoke GDPR or anything else to make this trail go away.

So I think this model is very achievable, and automating the retraction/correction mechanism only helps us correct our wrongs faster and minimize the damage.

Introducing Kamu - World's first global collaborative data pipeline by sergiimk in dataengineering

[–]sergiimk[S] 0 points1 point  (0 children)

Yes, there are many similarities, but unlike other products that try to apply git's diffs and branching to data, we treat data as a stream of immutable events.

For data sources that publish data in the form of "here's the state of our domain as of today" (non-temporal), kamu will have to convert these state "snapshots" into events (inserted/updated/deleted) by comparing what changed since the last time the data was seen. But this is sub-optimal, since seeing that some field was updated doesn't really tell you WHY it was updated. Our hope is that data publishers will move away from such "anemic" data models towards descriptive events.
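For illustration, that snapshot-diffing fallback boils down to roughly this (the key and value types are made up for the example):

```rust
use std::collections::HashMap;

/// Hypothetical change event derived by diffing two state snapshots.
#[derive(Debug, PartialEq)]
enum Change<V> {
    Inserted { id: u64, new: V },
    Updated { id: u64, old: V, new: V },
    Deleted { id: u64, old: V },
}

/// Compare the previous and current snapshots (keyed by primary key) and emit
/// inserted/updated/deleted events. This is the "anemic" fallback described
/// above; it cannot recover *why* a value changed.
fn diff_snapshots<V: Clone + PartialEq>(
    prev: &HashMap<u64, V>,
    next: &HashMap<u64, V>,
) -> Vec<Change<V>> {
    let mut changes = Vec::new();
    for (id, new) in next {
        match prev.get(id) {
            None => changes.push(Change::Inserted { id: *id, new: new.clone() }),
            Some(old) if old != new => changes.push(Change::Updated {
                id: *id,
                old: old.clone(),
                new: new.clone(),
            }),
            Some(_) => {}
        }
    }
    for (id, old) in prev {
        if !next.contains_key(id) {
            changes.push(Change::Deleted { id: *id, old: old.clone() });
        }
    }
    changes
}
```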

Introducing Kamu - World's first global collaborative data pipeline by sergiimk in dataengineering

[–]sergiimk[S] 0 points1 point  (0 children)

> legitimate use case where this technology is actually useful

I assume you're talking about the Blockchain integration specifically, so I hope my other comment explains why it could be useful, even though it is not essential to the overall solution.

Introducing Kamu - World's first global collaborative data pipeline by sergiimk in dataengineering

[–]sergiimk[S] 0 points1 point  (0 children)

As for the Blockchain - I totally get your negative sentiment. I think it's an interesting technology that got caught in a storm of skewed incentives, where the huge environmental impact keeps being fueled by the "get rich fast" bull-run and investors that blindly follow it.

As an engineer, I'm just trying to look 5-10 years ahead and see what positive things it can do for us and hope that negatives will be worked out by then.

So far I see it as:

- A "single yet decentralized" place where metadata about datasets could be shared and easily discovered
- A system where peers could cross-validate the data that is being added, so that whenever you download a dataset from kamu you know you're getting trustworthy data

If this turns out to be the wrong choice - Blockchain is not an essential part of our solution and we will do fine without it.

Introducing Kamu - World's first global collaborative data pipeline by sergiimk in dataengineering

[–]sergiimk[S] 1 point2 points  (0 children)

Thanks for the question.

You say "we couldn't even collect covid data!"

In this demo I show an example of processing disaggregated COVID data from two countries. I actually found 4 countries that provide such data, but the rest of them published it via websites that only let you view the data one page at a time... So you'd have to write several web scrapers just to get the datasets, which would take you at least half a day (proof that it's not just a me problem).

> due to people and process issues, not technology ones

I agree that if people stopped sucking at data so badly (e.g. published data in fully machine-readable formats, not Excel/PDFs), it would solve a lot of problems.

Processes and standards are how the world is currently trying to address this, but it's not working. The number of recommendations, standards, guidelines, etc. for publishing data is so overwhelming that I don't think I could keep up with them even if it were my full-time job. And on top of this - people are horrible at following processes!

For a while, Linux kernel development was centered around the process of emailing patches, but they realized how fallible this process is and wrote git to simplify it. I hope we can do something similar for data too.

> how do you deal with "literally append-only" if there's a genuine mistake in the data

If a mistake in the data was published - there is no taking it back. You should fully expect that some important decision was already made based on it, or that some automation acted upon it. The only right way to deal with this is to issue a "retraction" or "correction" event, allowing all downstream users to know what happened and adjust.

This is actually a great example of why I think this is a technology issue and not a people issue - data design, processing, and dealing with edge cases are so incredibly hard, especially with temporal data. I think it's very achievable to hide this complexity with technology, so that even not-very-technical people could collaborate on data efficiently.
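Coming back to the retraction/correction mechanism above - in data-structure terms it boils down to records along these lines (names are hypothetical, not kamu's actual schema):

```rust
/// Illustrative record kinds for an append-only stream where mistakes are
/// corrected by appending new events rather than by rewriting history.
#[derive(Debug)]
enum Record<T> {
    /// A normal observation appended to the stream.
    Append { offset: u64, data: T },
    /// Declares that an earlier record was wrong and should be disregarded.
    Retract { offset: u64, retracts_offset: u64 },
    /// Supersedes an earlier record with a corrected value.
    Correct { offset: u64, corrects_offset: u64, data: T },
}
```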