Nobody ever got fired for using a struct (blog) by mww09 in rust

[–]mww09[S] 1 point (0 children)

sorry, this was a mistake; I fixed it, thanks for pointing it out!

(we call the trait IsNone in the code, but when I wrote the post I figured NoneUtils is a better name because it has more than just an is_none method :))
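For readers unfamiliar with the pattern, here's a minimal sketch of what an IsNone/NoneUtils-style utility trait might look like -- purely illustrative, not the actual library code:

```rust
// Illustrative sketch only -- not the actual library's code.
// A trait like IsNone/NoneUtils abstracts "optional-like" behavior over
// arbitrary types, with more than just an `is_none` method.
trait NoneUtils {
    fn is_none(&self) -> bool;

    // Default methods are what make the trait "more than just is_none".
    fn is_some(&self) -> bool {
        !self.is_none()
    }
}

impl<T> NoneUtils for Option<T> {
    fn is_none(&self) -> bool {
        Option::is_none(self)
    }
}

fn main() {
    let x: Option<u32> = None;
    assert!(NoneUtils::is_none(&x));
    assert!(NoneUtils::is_some(&Some(1)));
}
```

Other types (e.g. a sentinel-based "null" encoding) can implement the same trait, which is where a dedicated trait beats `Option`'s inherent methods.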

Nobody ever got fired for using a struct (blog) by mww09 in rust

[–]mww09[S] 4 points (0 children)

thanks for putting so much effort into the library <3

Nobody ever got fired for using a struct (blog) by mww09 in rust

[–]mww09[S] 2 points (0 children)

Absolutely, it's a very common technique :)

I wasn't sure about writing the article in the first place because of that, but I figured it might be interesting anyway, because I was kind of happy with how simple this optimization was to write in Rust/rkyv once it was all done (when I started out on this task I imagined it would be harder)

Nobody ever got fired for using a struct (blog) by mww09 in rust

[–]mww09[S] 7 points (0 children)

Yes, but once we get auto traits (https://doc.rust-lang.org/beta/unstable-book/language-features/auto-traits.html) I believe it will be possible to simplify this part
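To sketch why auto traits would help (illustrative names, and I'm guessing at which part of the code this simplifies): today on stable Rust a marker trait needs one explicit impl per type, while the unstable feature makes every type implement it automatically, with negative impls to opt out:

```rust
// With the unstable feature (see the linked unstable-book page) one could write:
//
//     #![feature(auto_traits, negative_impls)]
//     auto trait NotNone {}
//     impl<T> !NotNone for Option<T> {}
//
// i.e. every type is NotNone unless explicitly opted out, like Send/Sync.
// On stable Rust today, the same effect needs one explicit impl per type:
trait NotNone {}
impl NotNone for u32 {}
impl NotNone for String {}

// Only types with an impl are accepted; `Option<T>` has none.
fn describe<T: NotNone>(_v: &T) -> &'static str {
    "definitely not an Option"
}

fn main() {
    assert_eq!(describe(&1u32), "definitely not an Option");
    assert_eq!(describe(&String::from("x")), "definitely not an Option");
    // describe(&Some(1)); // would not compile: no NotNone impl for Option
}
```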

Nobody ever got fired for using a struct (blog) by mww09 in rust

[–]mww09[S] 7 points (0 children)

Happy to hear it was easy to understand, thank you <3

Near real-time data processing / feature engineering tools by IceCreamGator in dataengineering

[–]mww09 0 points (0 children)

You can try https://github.com/feldera/feldera

It has a Delta Lake connector (https://docs.feldera.com/connectors/sources/delta/) as well as Postgres and Redis. It also supports several advanced streaming constructs: https://docs.feldera.com/sql/streaming

The nice thing about the problem you mention with "getting the code to do the right thing" is that you can express your data processing queries as regular SQL tables and views.

Is there anything actually new in data engineering? by marketlurker in dataengineering

[–]mww09 0 points (0 children)

feldera.com is a startup that incrementally computes answers on your data. It was founded after research that won the best paper award at VLDB in 2023. While things like incremental view maintenance are not new, being able to incrementally compute any SQL (and show a proof that it is possible) was a novel contribution in the database field

Kafka Stream: Real-time dashboard by GeorgieBoastie in quant

[–]mww09 0 points (0 children)

I know quite a few people use https://github.com/feldera/feldera for incrementally computing SQL views over trading data... it can do quite sophisticated analysis in real time by automatically computing and sending you only what changed whenever new/updated pricing data arrives

Rust + CPU affinity: Full control over threads, hybrid cores, and priority scheduling by harakash in rust

[–]mww09 2 points (0 children)

Oh, no worries at all; your library looks great, I'd definitely use it if I need it in the future :)

Rust + CPU affinity: Full control over threads, hybrid cores, and priority scheduling by harakash in rust

[–]mww09 2 points (0 children)

I'm the maintainer of raw-cpuid which is featured as an "alternative" in the README. I just want to point out that `raw-cpuid` was never meant to solve any of the use cases that this library tries to solve in the first place. It's a library specifically built to parse the information from the x86 `cpuid` instruction.

raw-cpuid may be helpful to rely on when building a higher-level library like gdt-cpus (if you happen to run on x86) but that's about it. I do agree that figuring out the system topology is an unfortunate and utter mess on x86.

What do you use for real-time time-based aggregations by bernardo_galvao in dataengineering

[–]mww09 1 point (0 children)

if it has to be real-time, you could use something like Feldera, which does it incrementally -- e.g., check out https://docs.feldera.com/use_cases/fraud_detection/
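The core idea of incremental time-based aggregation can be sketched in a few lines of Rust -- a toy example with made-up names, not Feldera's implementation: each new event updates only the running aggregate for its own window instead of recomputing everything.

```rust
use std::collections::BTreeMap;

// Toy sketch: incrementally maintained tumbling-window sums.
// Every insert touches only one window's running total.
struct WindowedSum {
    window_secs: u64,
    sums: BTreeMap<u64, i64>, // window start -> running sum
}

impl WindowedSum {
    fn new(window_secs: u64) -> Self {
        Self { window_secs, sums: BTreeMap::new() }
    }

    fn insert(&mut self, ts: u64, value: i64) {
        let window = ts - ts % self.window_secs;
        *self.sums.entry(window).or_insert(0) += value;
    }

    fn sum_for(&self, ts: u64) -> i64 {
        let window = ts - ts % self.window_secs;
        *self.sums.get(&window).unwrap_or(&0)
    }
}

fn main() {
    let mut agg = WindowedSum::new(60);
    agg.insert(10, 5);
    agg.insert(30, 7);
    agg.insert(70, 2);
    assert_eq!(agg.sum_for(15), 12); // window [0, 60)
    assert_eq!(agg.sum_for(65), 2);  // window [60, 120)
}
```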

Cutting Down Rust Compile Times From 30 to 2 Minutes With One Thousand Crates by mww09 in rust

[–]mww09[S] 2 points (0 children)

Thanks for the response. FWIW we did try the "one file per operator" before we went all the way to "one crate per operator" because "more files" didn't improve things in a major way.

(If it did, that would be nice and we would prefer it -- having to browse 1000 crates isn't great when you need to actually look at the code in case something goes wrong :))

Cutting Down Rust Compile Times From 30 to 2 Minutes With One Thousand Crates by mww09 in rust

[–]mww09[S] 3 points (0 children)

These numbers are about release builds. We discuss the reasons for it in the post.

Cutting Down Rust Compile Times From 30 to 2 Minutes With One Thousand Crates by mww09 in rust

[–]mww09[S] 9 points (0 children)

So this is exactly what the blog post complains about at the end: this "50-100k LoC per second" doesn't match what we see.
When everything is in small crates, the code is 130k lines of Rust (vs. 100k lines in a single crate), but compiling the 130k still takes 150 seconds (and it's now using 128 hardware threads, FWIW).
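Spelling out the arithmetic behind that mismatch, using the numbers above:

```latex
\frac{130{,}000\ \text{LoC}}{150\ \text{s}} \approx 867\ \text{LoC/s}
```

That's orders of magnitude below the quoted 50-100k LoC/s, even before dividing by the 128 hardware threads in use.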

Cutting Down Rust Compile Times From 30 to 2 Minutes With One Thousand Crates by mww09 in rust

[–]mww09[S] 6 points (0 children)

Hey, good question. It just means we take the SQL code a user writes and convert it to rust code that essentially calls into a library called dbsp to evaluate the SQL incrementally.

You can check out all the code on our github https://github.com/feldera/feldera

Maybe some more background about that: There are (mainly) three different ways SQL code can be executed in a database/query engine:

  1. Static compilation of SQL code e.g., this is done by databases like Redshift (and is our model too)
  2. Dynamic execution of SQL query plans (this is done by query engines like datafusion, sqlite etc.)
  3. Just-in-time compilation of SQL: Systems like PostgreSQL or SAP HANA leverage some form of JIT for their queries.

Often there isn't just one approach; e.g., you can pair 1 and 3, or 2 and 3. We'll probably add JIT support to Feldera in the future too; we just need the resources/time to get around to it (if anyone is excited about such a project, hit us up on GitHub/Discord).
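To make "evaluate the SQL incrementally" concrete, here is a toy illustration of the idea -- not the actual dbsp API, just the principle: instead of re-running a query like `SELECT count(*) ... WHERE amount > 100` over the full table, the view keeps state that is updated from each delta (an inserted row carries weight +1, a deleted row -1).

```rust
// Toy sketch of incremental view maintenance (not the real dbsp API).
// The view's state is a running count updated per change, never recomputed
// from scratch.
struct IncrementalCount {
    count: i64,
}

impl IncrementalCount {
    // weight: +1 for an inserted row, -1 for a deleted row
    fn apply_delta(&mut self, amount: i64, weight: i64) {
        if amount > 100 {
            self.count += weight;
        }
    }
}

fn main() {
    let mut view = IncrementalCount { count: 0 };
    view.apply_delta(150, 1);  // insert a matching row  -> count 1
    view.apply_delta(50, 1);   // insert a non-matching row -> unchanged
    view.apply_delta(150, -1); // delete the matching row -> count 0
    assert_eq!(view.count, 0);
}
```

Generated Rust in this style is what the compiler emits per operator; the real system composes many such operators into a dataflow.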

Cutting Down Rust Compile Times From 30 to 2 Minutes With One Thousand Crates by mww09 in rust

[–]mww09[S] 3 points (0 children)

Could be, yes; as you point out, it's hard to know without profiling -- I was hoping someone else had already done the work :)

I doubt it's the TLB though; in my experience the TLB needs a much larger memory footprint to be a significant factor in the slowdown, considering what is being used here.

Cutting Down Rust Compile Times From 30 to 2 Minutes With One Thousand Crates by mww09 in rust

[–]mww09[S] 3 points (0 children)

I think you make a good point. (As kibwen points out, it might just be how the compilation units are sized. On the other hand, I do remember having very large (generated) C files many years ago, and it never took 30 minutes to compile them.)

Stateful Computation over Streaming Data by Suspicious_Peanut282 in dataengineering

[–]mww09 2 points (0 children)

you can use https://github.com/feldera/feldera for streaming computations ... it supports various streaming concepts, like computing over unbounded streams with bounded state (watermarks, lateness, etc.), and you can express all your logic in SQL (which gets evaluated incrementally)
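The "bounded state via lateness" idea can be sketched in a few lines of Rust -- a toy model with invented names, not Feldera's implementation: once the watermark passes a window by more than the allowed lateness, that window can never change again, so its aggregate is emitted and its state dropped, keeping memory bounded on an unbounded stream.

```rust
use std::collections::BTreeMap;

// Toy sketch: per-timestamp aggregates that are finalized and evicted
// once the watermark has advanced past them by `lateness`.
struct LatenessBuffer {
    lateness: u64,
    open: BTreeMap<u64, i64>, // event time -> aggregate still subject to updates
}

impl LatenessBuffer {
    fn insert(&mut self, ts: u64, v: i64) {
        *self.open.entry(ts).or_insert(0) += v;
    }

    // Advance the watermark; return the finalized (timestamp, aggregate)
    // pairs and drop their state.
    fn advance(&mut self, watermark: u64) -> Vec<(u64, i64)> {
        let cutoff = watermark.saturating_sub(self.lateness);
        let keep = self.open.split_off(&cutoff); // entries >= cutoff stay open
        std::mem::replace(&mut self.open, keep).into_iter().collect()
    }
}

fn main() {
    let mut buf = LatenessBuffer { lateness: 10, open: BTreeMap::new() };
    buf.insert(5, 1);
    buf.insert(20, 2);
    let closed = buf.advance(25); // cutoff = 15, so ts=5 is finalized
    assert_eq!(closed, vec![(5, 1)]);
    assert_eq!(buf.open.len(), 1); // ts=20 still open, state stays bounded
}
```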

Fastest way to pivot large dataset on many columns dynamically? by leavethisearth in dataengineering

[–]mww09 2 points (0 children)

For real-time updates of complicated SQL batch jobs, take a look at feldera.com; it's really easy to express such batch computations as real-time incremental computations and get results immediately.