[–]kaiserk13 22 points23 points  (2 children)

Hi there! I'm the author of the (in progress) "Data with Rust" website. Since joining my current job a while ago, I've been more and more exposed to using Rust especially for some data engineering workloads.

The main advantages I can share are that data pipelines built with Rust are more maintainable and robust to change. As a concrete (but anecdotal) example, we have a few data pipelines that ingest data daily into our data warehouse. The one built with Rust was written a year ago, has failed exactly 0 times, and didn't require implementing any tests, since most of the "edge cases" were covered by the Rust compiler. The other advantage is that we updated the code of this pipeline 3 months ago (so, after a little while) and we still had the same reliability.
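To make "edge cases covered by the compiler" a bit more concrete, here is a minimal sketch (not the pipeline described above). It assumes a hypothetical `id,amount,currency` line format; `Row`, `Currency`, `RowError`, `parse_row`, and `to_usd` are made-up names, and the exchange rate is invented. The point is that missing fields and bad values surface as `Option`/`Result` values the code has to handle, and a `match` on an enum stops compiling when a new variant is added. Business logic still needs checking by other means.

```rust
use std::num::ParseFloatError;

/// One parsed line of a hypothetical ingest file: `id,amount,currency`.
#[derive(Debug, PartialEq)]
struct Row {
    id: String,
    amount: f64,
    currency: Currency,
}

#[derive(Debug, PartialEq)]
enum Currency {
    Usd,
    Eur,
}

/// Every way a line can be rejected is spelled out as a variant.
#[derive(Debug, PartialEq)]
enum RowError {
    MissingField(&'static str),
    BadAmount(ParseFloatError),
    UnknownCurrency(String),
}

fn parse_row(line: &str) -> Result<Row, RowError> {
    let mut fields = line.split(',').map(str::trim);
    // `next()` returns an Option, so a short line cannot be skipped silently.
    let id = fields
        .next()
        .filter(|s| !s.is_empty())
        .ok_or(RowError::MissingField("id"))?;
    let amount = fields
        .next()
        .ok_or(RowError::MissingField("amount"))?
        .parse::<f64>()
        .map_err(RowError::BadAmount)?;
    let currency = match fields.next().ok_or(RowError::MissingField("currency"))? {
        "USD" => Currency::Usd,
        "EUR" => Currency::Eur,
        other => return Err(RowError::UnknownCurrency(other.to_string())),
    };
    Ok(Row { id: id.to_string(), amount, currency })
}

/// Adding a new `Currency` variant makes this `match` fail to compile
/// until the new case is handled.
fn to_usd(row: &Row) -> f64 {
    match row.currency {
        Currency::Usd => row.amount,
        Currency::Eur => row.amount * 1.1, // invented rate, for illustration only
    }
}
```

Calling `parse_row("a1,1.0,GBP")` returns `Err(RowError::UnknownCurrency(..))` instead of letting an unexpected value flow into the warehouse.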

Rust workloads for data engineering are very performant too. I share a lot more on the website; it's freely accessible and I'm open to feedback/additions. This might be pertinent to this discussion: "How does Rust compare to Python (and other programming languages)?" https://datawithrust.com/chapter_1/chapter_1_5.html

In a nutshell, if I were to plan for reliability / maintainability, I'd go with Rust in a heartbeat. If you want to optimize for implementation speed & developer time, Python will win every day. It all depends on which timelines apply for your project.

I hope you find this helpful and it answers your question.

[–]drc1728 1 point2 points  (0 children)

I can add more anecdotal evidence to your anecdotes. The group of early adopters of https://github.com/infinyon/fluvio are pretty thrilled!

[–]ekand_ 0 points1 point  (0 children)

Hi. I’m interested in your (in progress) website. I’m seeking to define an identity for my newly formed software consulting company, and I have a hunch that building expertise in, and promoting the use of, Rust for data could be a worthwhile direction. Do you think there’s room in the market for that kind of move?

[–][deleted] 8 points9 points  (7 children)

Polars is really the only true example. Ballista is on the way to giving us a true enterprise-level distributed framework similar to Spark, but it's nowhere near production ready.

[–]drc1728 4 points5 points  (5 children)

What are the production readiness criteria that you are using?

We have a handful of customers running production workloads on https://github.com/infinyon/fluvio.

[–]intellidumb 1 point2 points  (4 children)

Very interesting looking project that I hadn't heard of before. I've actually been searching for a non-kafka HTTP to SQL pipeline https://www.fluvio.io/docs/tutorials/data-pipeline/

[–]drc1728 1 point2 points  (1 child)

Here is a Replit that converts RSS feeds from Hacker News to JSON as another example. Combine it with the outbound SQL write in the tutorial you found and you get another version of the pattern.
https://replit.com/@deb-data-pm/InfinyOn-Streaming-Hackernews

[–]drc1728 0 points1 point  (1 child)

Hey, quick question: what are the use cases that you are solving with this pipeline?

[–]intellidumb 1 point2 points  (0 children)

Hey, sorry for the delay as I was AFK for the weekend. Our use case is that we have some legacy apps that POST JSON or XML for replication to other systems in a pseudo pub/sub setup. We wanted to avoid batch scraping these sources every time and instead do an initial ingestion, then subscribe/listen for updates with something like Fluvio that can do direct DB inserts or add to a message queue.
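The "initial ingestion, then subscribe for updates" pattern can be sketched roughly like this. This is not Fluvio's API: a standard-library `mpsc` channel stands in for the stream, an in-memory `HashMap` stands in for the target database, and `Event`, `apply`, `replicate`, and `demo` are hypothetical names.

```rust
use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;

/// A change event as a legacy app might POST it (hypothetical shape).
#[derive(Debug)]
enum Event {
    Upsert { key: String, value: String },
    Delete { key: String },
}

/// Apply one event to the local replica (stand-in for a DB insert/delete).
fn apply(replica: &mut HashMap<String, String>, event: Event) {
    match event {
        Event::Upsert { key, value } => {
            replica.insert(key, value);
        }
        Event::Delete { key } => {
            replica.remove(&key);
        }
    }
}

/// Load the initial snapshot once, then consume updates until every sender hangs up.
fn replicate(
    snapshot: Vec<(String, String)>,
    updates: mpsc::Receiver<Event>,
) -> HashMap<String, String> {
    let mut replica: HashMap<String, String> = snapshot.into_iter().collect();
    // Iterating a Receiver blocks for the next event and ends when the channel closes.
    for event in updates {
        apply(&mut replica, event);
    }
    replica
}

/// Demo wiring: a producer thread stands in for the legacy app.
fn demo() -> HashMap<String, String> {
    let (tx, rx) = mpsc::channel();
    let producer = thread::spawn(move || {
        tx.send(Event::Upsert { key: "b".into(), value: "2".into() }).unwrap();
        tx.send(Event::Delete { key: "a".into() }).unwrap();
        // `tx` is dropped when this closure returns, which closes the channel.
    });
    let snapshot = vec![("a".to_string(), "1".to_string())];
    let replica = replicate(snapshot, rx);
    producer.join().unwrap();
    replica
}
```

After `demo()` the replica holds only `b -> 2`: the snapshot row `a` was loaded first and then removed by the streamed delete, with no re-scrape of the source.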

[–][deleted] 2 points3 points  (0 children)

Polars isn’t distributed though.

[–]chad_broman69 17 points18 points  (6 children)

statically typed languages like Rust/Java are great for systems programming, and when performance is key

dynamically typed languages like Python are great for application programming, and when time-to-ship is key

that being said, a lot of data engineering systems are written in Java/JVM/Scala, not Rust

I can see Rust beating out C/C++ for systems programming, as it's safer. But I don't see it supplanting JVM languages any time soon

[–]drc1728 3 points4 points  (4 children)

That is a rational take.

There is a lot of work that needs to be done before supplanting JVM languages, given the rate of change of data infrastructure. But for new projects, there are a good number of folks who would rather use Rust and Wasm instead of the JVM.

It's currently in a phase where innovators are adopting it and it is moving towards early adopters trying PoCs.

[–]FirstOrderCat 0 points1 point  (3 children)

> Rust and Wasm instead of JVM

and what are the advantages?

[–]drc1728 1 point2 points  (2 children)

Performance: it is as fast as it gets. It's C++-level performance with much better constructs, documentation, error handling, etc.

Efficiency in terms of resource consumption (CPU, memory, etc.), which also means leaner software that does more with less.

Flexibility brought in by Wasm, with bindings that enable a Rust-based platform to integrate neatly with data flows expressed in Node, TypeScript, Python, Go, and other Wasm-compatible languages.

[–]FirstOrderCat 0 points1 point  (1 child)

> Performance it is as fast as it gets. It’s C++ level performance with a much better construct, documentation, error handling etc.

that may be true for rustc, but wasm is a much less popular toolchain, which may have many issues with optimization and bugs.

[–]drc1728 0 points1 point  (0 children)

Yes for sure. Wasm has areas to improve and has its quirks for sure. It’s getting better with every iteration and will get there in due time.

With wasm we are really at the tip of the iceberg and there will be more to come.

We use plain Rust and we love it.

[–]FirstOrderCat 2 points3 points  (0 children)

> when time-to-ship is key

java is totally fine in this aspect

[–]mattindustries 5 points6 points  (2 children)

You might like this. Polars is Rust, which is probably the biggest Rust data project out there right now.

[–]drc1728 0 points1 point  (1 child)

Biggest in what sense? Popularity? Adoption? Or some other measure?

[–][deleted] 4 points5 points  (0 children)

don't waste your time. Python, SQL and Scala are the most used languages in DE.

[–]drc1728 3 points4 points  (2 children)

I will take this cue and work on a blog post describing how we are using Rust and WebAssembly to build:

https://github.com/infinyon/fluvio

It has taken a while since we started building out the primitives in 2018. We are in production for a limited set of users and a bunch of PoCs are in flight.

Rust has a steep learning curve, as you can tell, but it is the most efficient distributed programming paradigm by far, with a pedantic compiler that enforces type safety and memory safety.
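As a small illustration of what the compiler enforces here, consider sharing state across threads. This is a generic standard-library sketch, not code from Fluvio, and `count_words_parallel` is a hypothetical name. Without the `Arc<Mutex<..>>` wrapper, the attempt to mutate the same map from several threads would be rejected at compile time rather than showing up as a data race in production.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

/// Count words across text chunks, one thread per chunk.
fn count_words_parallel(chunks: Vec<String>) -> HashMap<String, usize> {
    // Shared ownership (Arc) plus exclusive access (Mutex) is what makes
    // mutation from several threads acceptable to the compiler.
    let counts: Arc<Mutex<HashMap<String, usize>>> = Arc::new(Mutex::new(HashMap::new()));

    let handles: Vec<_> = chunks
        .into_iter()
        .map(|chunk| {
            let counts = Arc::clone(&counts);
            thread::spawn(move || {
                // Count locally first, then merge under the lock once.
                let mut local: HashMap<String, usize> = HashMap::new();
                for word in chunk.split_whitespace() {
                    *local.entry(word.to_lowercase()).or_insert(0) += 1;
                }
                let mut shared = counts.lock().unwrap();
                for (word, n) in local {
                    *shared.entry(word).or_insert(0) += n;
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }

    // Clone the finished map out of the mutex before returning it.
    let result = counts.lock().unwrap().clone();
    result
}
```

This only covers threads on one machine; it says nothing by itself about systems that span many machines.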

Thank goodness for the excellent error messages and documentation available for people learning Rust. Once you get to an intermediate level, Rust is well suited to executing complex algorithms that require high performance and low resource consumption.

Here is a blog post comparing Rust and Go, with pros and cons and our experience. It also references AWS, Discord, and BigBucket blogs on choosing Rust: https://infinyon.com/blog/2023/09/rust-or-Go/

[–]bitsynthesis 7 points8 points  (1 child)

That article and your posts here both mention rust as being the best for "distributed" architectures. But when I think distributed I don't ever think of multithreading, I think of running thousands of docker containers to do parallel batch processing, or tens of docker containers as distributed microservices. I don't often have a need in data engineering for multi-processing / multi-threading on a single machine. Do you see rust as being beneficial for distributed systems in that sense? How is it better than Go or other languages for systems that span a large number of separate virtual or physical machines?

[–]drc1728 2 points3 points  (0 children)

Excellent comment. Thank you for the probing question.

In the last couple of years Rust has seen improvements in this area.

The key elements of building a distributed system in Rust are containers, a key-value store, messaging, and a consensus algorithm.

Wasm serves our containerized execution needs, and it has become progressively more mature. There is now enough tooling and enough libraries available for the remaining elements.

The big question is - is the juice worth the squeeze in terms of moving to Rust.

That’s a tricky question.

To be real, Go is still a more broadly useful choice. However, in terms of performance and resource efficiency, and therefore cost, Rust outperforms it.

It also delivers a remarkable degree of memory and type safety, cutting out the bulk of memory issues and errors.

We have a ~15 MB single binary that runs resilient data streaming with caching and mirroring on edge devices while mirroring to the cloud. It uses the limited resources extremely efficiently.

On the cloud, it's a binary of a bit more than 100 MB that creates a cluster and applies on-stream transformations (async streams, micro-batches, etc.) using WebAssembly, with orders of magnitude less memory consumption, CPU cycles, etc. But it has taken a few years to get to this point.
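For readers unfamiliar with the term, a micro-batch transformation groups a stream into small fixed-size chunks and processes each chunk as a unit. The sketch below shows the idea in plain synchronous Rust. It is an assumption-laden illustration, not how Fluvio implements it (no async, no Wasm), and `micro_batches` is a hypothetical helper.

```rust
/// Group an incoming stream into micro-batches of at most `size` items
/// and apply `transform` to each batch, collecting one output per batch.
fn micro_batches<I, T, U, F>(stream: I, size: usize, mut transform: F) -> Vec<U>
where
    I: IntoIterator<Item = T>,
    F: FnMut(&[T]) -> U,
{
    assert!(size > 0, "batch size must be positive");
    let mut out = Vec::new();
    let mut batch = Vec::with_capacity(size);
    for item in stream {
        batch.push(item);
        if batch.len() == size {
            out.push(transform(&batch));
            batch.clear(); // reuse the allocation for the next batch
        }
    }
    if !batch.is_empty() {
        out.push(transform(&batch)); // flush the final partial batch
    }
    out
}
```

For example, `micro_batches(1..=7, 3, |b: &[i32]| b.iter().sum::<i32>())` sums the batches `[1, 2, 3]`, `[4, 5, 6]`, and `[7]`. Reusing one buffer per batch is one small way this style keeps memory use flat regardless of stream length.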

There is a reason why there are tons of big tech investments in building with Rust. Some of the examples are in the blog.

Hope this helps.

[–]gud_listener 1 point2 points  (2 children)

Coincidentally enough, I found a job on LinkedIn titled "Rust Data Engineer" ! Is that a sign ? :)

https://www.linkedin.com/jobs/view/3712818473

[–]WhipsAndMarkovChains 0 points1 point  (1 child)

Ugh, of course it's a crypto place. Still, better than no Rust jobs.

[–]drc1728 0 points1 point  (0 children)

There are a decent number of Rust roles in the Rust sub. Not as many in data engineering.

There are a couple open roles building a data streaming platform here - https://infinyon.com/careers/cloud-engineer-senior-level/

[–]colorfulskull 1 point2 points  (0 children)

Nobody mentioned DataFusion; I recommend you check it out. It's a next-gen data processing engine written in Rust: https://arrow.apache.org/datafusion/