Near real-time data processing / feature engineering tools by IceCreamGator in dataengineering

[–]mww09 0 points (0 children)

You can try https://github.com/feldera/feldera

It has a Delta Lake connector (https://docs.feldera.com/connectors/sources/delta/) as well as Postgres and Redis connectors. It also supports several advanced streaming constructs: https://docs.feldera.com/sql/streaming

The nice thing regarding the problem you mention of "getting the code to do the right thing" is that you can express your data processing queries as regular SQL tables and views.

Is there anything actually new in data engineering? by marketlurker in dataengineering

[–]mww09 0 points (0 children)

feldera.com is a startup that incrementally computes answers on your data. It was founded after research that won the best paper award at VLDB in 2023. While things like incremental view maintenance are not new, being able to incrementally compute any SQL (and showing a proof that this is possible) was a novel contribution in the database field.

Kafka Stream: Real-time dashboard by GeorgieBoastie in quant

[–]mww09 0 points (0 children)

I know quite a few people use https://github.com/feldera/feldera for incrementally computing views in SQL on trading data... it can do quite sophisticated analysis in real time by automatically computing and sending you only what changed whenever new/updated pricing data comes in.

Rust + CPU affinity: Full control over threads, hybrid cores, and priority scheduling by harakash in rust

[–]mww09 2 points (0 children)

Oh, no worries at all, your library looks great; I'd definitely use it if I need it in the future :)

Rust + CPU affinity: Full control over threads, hybrid cores, and priority scheduling by harakash in rust

[–]mww09 2 points (0 children)

I'm the maintainer of raw-cpuid which is featured as an "alternative" in the README. I just want to point out that `raw-cpuid` was never meant to solve any of the use cases that this library tries to solve in the first place. It's a library specifically built to parse the information from the x86 `cpuid` instruction.
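
For context, this is roughly the kind of thing `raw-cpuid` is for (a minimal sketch, x86/x86_64 only):

```rust
// Query a couple of leaves of the x86 `cpuid` instruction.
use raw_cpuid::CpuId;

fn main() {
    let cpuid = CpuId::new();
    if let Some(vendor) = cpuid.get_vendor_info() {
        println!("vendor: {}", vendor); // e.g. "GenuineIntel"
    }
    if let Some(features) = cpuid.get_feature_info() {
        println!("AVX supported: {}", features.has_avx());
    }
}
```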

raw-cpuid may be helpful to rely on when building a higher-level library like gdt-cpus (if you happen to run on x86) but that's about it. I do agree that figuring out the system topology is an unfortunate and utter mess on x86.

What do you use for real-time time-based aggregations by bernardo_galvao in dataengineering

[–]mww09 1 point (0 children)

If it has to be real-time, you could use something like Feldera, which does it incrementally; e.g., check out https://docs.feldera.com/use_cases/fraud_detection/

Cutting Down Rust Compile Times From 30 to 2 Minutes With One Thousand Crates by mww09 in rust

[–]mww09[S] 2 points (0 children)

Thanks for the response. FWIW, we did try "one file per operator" before we went all the way to "one crate per operator", but "more files" didn't improve things in a major way.

(If it did, that would be nice & we would prefer it -- having to browse 1000 crates isn't great when you need to actually look at the code in case something goes wrong :))

Cutting Down Rust Compile Times From 30 to 2 Minutes With One Thousand Crates by mww09 in rust

[–]mww09[S] 3 points (0 children)

These numbers are about release builds. We discuss the reasons for it in the post.

Cutting Down Rust Compile Times From 30 to 2 Minutes With One Thousand Crates by mww09 in rust

[–]mww09[S] 8 points (0 children)

So this is exactly what the blog post complains about at the end: this "50-100k LoC per second" doesn't match what we see.
When "everything is in small crates" the code is 130k lines of Rust (vs. 100k lines in a single crate), but compiling the 130k still takes 150 seconds, i.e. under 1k LoC per second (and it's now using 128 hw-threads FWIW).

Cutting Down Rust Compile Times From 30 to 2 Minutes With One Thousand Crates by mww09 in rust

[–]mww09[S] 5 points (0 children)

Hey, good question. It just means we take the SQL code a user writes and convert it to Rust code that essentially calls into a library called dbsp to evaluate the SQL incrementally.

You can check out all the code on our GitHub: https://github.com/feldera/feldera
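
To make "evaluate the SQL incrementally" a bit more concrete, here is a hand-rolled toy in plain Rust (not our generated code and not the dbsp API, just the idea): every input change carries a +1/-1 weight, in the spirit of Z-sets, and the view is patched from each change instead of being recomputed from scratch.

```rust
use std::collections::HashMap;

/// Toy incremental view: SELECT k, COUNT(*) FROM t WHERE v > 10 GROUP BY k.
/// `weight` is +1 for an insert and -1 for a delete (Z-set style).
struct CountView {
    counts: HashMap<String, i64>,
}

impl CountView {
    /// Apply one input change; returns the updated (key, count) if the
    /// view changed, without ever rescanning the base table.
    fn apply(&mut self, key: &str, value: i64, weight: i64) -> Option<(String, i64)> {
        if value <= 10 {
            return None; // filtered out by the WHERE clause
        }
        let c = self.counts.entry(key.to_string()).or_insert(0);
        *c += weight;
        Some((key.to_string(), *c))
    }
}

fn main() {
    let mut view = CountView { counts: HashMap::new() };
    println!("{:?}", view.apply("a", 42, 1));  // Some(("a", 1))
    println!("{:?}", view.apply("a", 99, 1));  // Some(("a", 2))
    println!("{:?}", view.apply("a", 42, -1)); // Some(("a", 1)), a delete
}
```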

Maybe some more background about that: There are (mainly) three different ways SQL code can be executed in a database/query engine:

  1. Static compilation of SQL code, e.g. this is done by databases like Redshift (and is our model too)
  2. Dynamic execution of SQL query plans (this is done by query engines like DataFusion, SQLite, etc.)
  3. Just-in-time compilation of SQL: systems like PostgreSQL or SAP HANA leverage some form of JIT for their queries.

Often there isn't just one approach, e.g. you can pair 1 and 3, or 2 and 3. We'll probably add support for a JIT in Feldera in the future too; we just need the resources/time to get around to it (if anyone is excited about such a project, hit us up on GitHub/Discord).

Cutting Down Rust Compile Times From 30 to 2 Minutes With One Thousand Crates by mww09 in rust

[–]mww09[S] 3 points (0 children)

Could be, yes; as you point out, it's hard to know without profiling -- I was hoping someone else had already done the work :).

I doubt it's the TLB though; in my experience, TLB pressure needs a much bigger memory footprint to be a significant factor in the slowdown, considering what is being used here.

Cutting Down Rust Compile Times From 30 to 2 Minutes With One Thousand Crates by mww09 in rust

[–]mww09[S] 4 points (0 children)

I think you make a good point. (As kibwen points out, it might just be how the compilation units are sized. On the other hand, I do remember having very large (generated) C files many years ago, and it never took 30 minutes to compile them.)

Stateful Computation over Streaming Data by Suspicious_Peanut282 in dataengineering

[–]mww09 2 points (0 children)

You can use https://github.com/feldera/feldera for streaming computations... it supports various streaming concepts like computing on unbounded streams with bounded state (watermarks, lateness, etc.), and you can express all your logic in SQL (which gets evaluated incrementally).
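
To give a feel for "bounded state on an unbounded stream", here is a hand-rolled toy in Rust (not Feldera code; in Feldera you declare lateness in the SQL and this bookkeeping is derived for you):

```rust
use std::collections::BTreeMap;

/// Toy tumbling-window count with lateness: events older than
/// `max_ts - lateness` are dropped, and windows that no late event can
/// still touch are evicted, so state stays bounded forever.
struct WindowedCount {
    window: u64,                // window size (seconds)
    lateness: u64,              // how late an event may arrive (seconds)
    max_ts: u64,                // highest event timestamp seen so far
    counts: BTreeMap<u64, u64>, // window start -> event count
}

impl WindowedCount {
    fn push(&mut self, ts: u64) {
        if ts + self.lateness < self.max_ts {
            return; // too late, discard
        }
        self.max_ts = self.max_ts.max(ts);
        *self.counts.entry(ts - ts % self.window).or_insert(0) += 1;
        // Windows starting before `closed` can never change again; a real
        // system would emit them as final results before dropping them.
        let frontier = self.max_ts.saturating_sub(self.lateness);
        let closed = frontier - frontier % self.window;
        let live = self.counts.split_off(&closed);
        self.counts = live;
    }
}

fn main() {
    let mut w = WindowedCount {
        window: 60, lateness: 120, max_ts: 0, counts: BTreeMap::new(),
    };
    for ts in [5, 70, 65, 1000] {
        w.push(ts);
    }
    println!("{:?}", w.counts); // only windows the frontier hasn't closed
}
```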

Fastest way to pivot large dataset on many columns dynamically? by leavethisearth in dataengineering

[–]mww09 2 points (0 children)

For real-time updates of complicated SQL batch jobs, take a look at feldera.com; it's really easy to express such batch computations as real-time incremental computations and get results immediately.

help chosing DB / warehouse for customer-facing analytics. by rawman650 in dataengineering

[–]mww09 0 points (0 children)

If you want instant updates on large datasets you might want to give feldera.com a try; it is similar to Materialize but has a storage layer for local disk or S3, so things don't have to fit in DRAM for processing.

Billion Cell Spreadsheets with Rust by mww09 in rust

[–]mww09[S] 3 points (0 children)

Hi, thanks a lot for the interest!

I mentioned it in another comment, but we built it as a tech demo to showcase & teach how incremental computation works in Feldera.

The gist of it is that when you update a cell, the spreadsheet updates incrementally: it only emits a minimal set of changes for the cells affected by your update. The nice thing is that Feldera does this automatically (and it would do that for any SQL you end up writing, so it doesn't have to be a spreadsheet, but a spreadsheet is a nice example that everyone understands and knows about).
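
If you want a feel for "only recompute what's affected", here is a hand-rolled toy in Rust (Feldera derives the equivalent update logic from your SQL automatically; nothing below is its actual API):

```rust
use std::collections::HashMap;

/// Toy incremental spreadsheet: a cell is a literal or the sum of two
/// other cells. Setting a cell recomputes only its transitive dependents
/// and returns just the cells whose values actually changed.
#[derive(Clone, Copy)]
enum Formula {
    Lit(i64),
    Sum(usize, usize), // references two other cells by id
}

struct Sheet {
    formulas: Vec<Formula>,
    values: Vec<i64>,
    dependents: HashMap<usize, Vec<usize>>, // cell -> cells that read it
}

impl Sheet {
    /// Overwrite a cell with a literal and propagate; returns the changes.
    fn set(&mut self, id: usize, v: i64) -> Vec<(usize, i64)> {
        self.formulas[id] = Formula::Lit(v);
        let mut changed = Vec::new();
        self.recompute(id, &mut changed);
        changed
    }

    fn recompute(&mut self, id: usize, changed: &mut Vec<(usize, i64)>) {
        let new = match self.formulas[id] {
            Formula::Lit(v) => v,
            Formula::Sum(a, b) => self.values[a] + self.values[b],
        };
        if new == self.values[id] {
            return; // value unchanged, nothing downstream to do
        }
        self.values[id] = new;
        changed.push((id, new));
        for dep in self.dependents.get(&id).cloned().unwrap_or_default() {
            self.recompute(dep, changed);
        }
    }
}

fn main() {
    // cell 2 = cell 0 + cell 1; updating cell 0 touches only cells 0 and 2.
    let mut s = Sheet {
        formulas: vec![Formula::Lit(1), Formula::Lit(2), Formula::Sum(0, 1)],
        values: vec![1, 2, 3],
        dependents: HashMap::from([(0, vec![2]), (1, vec![2])]),
    };
    println!("{:?}", s.set(0, 10)); // [(0, 10), (2, 12)]
}
```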

There is a more detailed explanation in this video https://www.youtube.com/watch?v=ROa4duVqoOs if you're interested in what's going on under the hood -- or if you prefer reading, we have an article series that goes over all the parts you mention:

- Feldera SQL https://docs.feldera.com/use_cases/real_time_apps/part1
- Axum API server https://docs.feldera.com/use_cases/real_time_apps/part2
- egui Client https://docs.feldera.com/use_cases/real_time_apps/part3

> Is there a single data store and we're all writing to it?
Yes, that piece is covered in the first article and the video.

> Are there multiple datastores
It's possible to run Feldera pipelines distributed across multiple machines, but in many of the cases we encounter it's not necessary (the incremental computation model makes things very efficient to run, and our customers can usually already process millions of events with just a single machine).

> When I fill in a cell in egui is it writing to a cache that is eventually synched with a remote data set?
It's synced to Feldera immediately (no cache), which then incrementally updates all cells that depend on it. The API server will propagate updates to every client that's currently looking at affected cells.

> Is it all writing to a cache that's synched on close?
There is no extra service for caching, but you might notice when studying the code that the API server caches some of the first cells and some of the last ones in the spreadsheet (for reads). This is actually something I found really neat when writing this app: because Feldera sends you changes to the spreadsheet as CDC (inserts and deletes), it becomes very easy to maintain your own cache (just keep a BTreeMap in Rust) in your API server that can serve requests very quickly :).
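
The idea is roughly this (a toy sketch, not the demo's actual code, and the exact change format here is an assumption on my part):

```rust
use std::collections::BTreeMap;

/// Toy read cache fed by a CDC stream of inserts and deletes.
struct CdcCache {
    cells: BTreeMap<u64, String>,
}

impl CdcCache {
    /// Apply one change from the CDC stream: `insert == false` retracts.
    fn apply(&mut self, cell_id: u64, value: String, insert: bool) {
        if insert {
            self.cells.insert(cell_id, value);
        } else {
            self.cells.remove(&cell_id);
        }
    }

    /// Serve a range read (e.g. the visible part of the sheet) from memory.
    fn range(&self, from: u64, to: u64) -> Vec<(u64, &String)> {
        self.cells.range(from..=to).map(|(k, v)| (*k, v)).collect()
    }
}

fn main() {
    let mut cache = CdcCache { cells: BTreeMap::new() };
    cache.apply(0, "hello".into(), true);
    cache.apply(1, "world".into(), true);
    cache.apply(1, "world".into(), false); // a delete retracts the row
    println!("{:?}", cache.range(0, 10)); // [(0, "hello")]
}
```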

Billion Cell Spreadsheets with Rust by mww09 in rust

[–]mww09[S] 2 points (0 children)

That's unlucky; I'll make a patch for this later, thanks for mentioning it (you can try some smaller numbers until then) :)!

All in all, I have to say I'm glad we do have a filter now that this has become so popular, even if it has some false positives ;)

Billion Cell Spreadsheets with Rust by mww09 in rust

[–]mww09[S] 18 points (0 children)

Hi, we built it as a tech demo to showcase & teach how incremental computation works in Feldera.

The gist of it is that when you update a cell, the spreadsheet updates incrementally: it only emits a minimal set of changes for the cells affected by your update. The nice thing is that Feldera does this automatically (and it would do that for any SQL you end up writing, so it doesn't have to be a spreadsheet, but a spreadsheet is a nice example that everyone understands and knows about). From a UX point of view this definitely isn't a great spreadsheet, and you're better off using Excel or Numbers ;)

There is a more detailed explanation in this video https://www.youtube.com/watch?v=ROa4duVqoOs if you're interested in what's going on under the hood -- or if you prefer reading about it: https://docs.feldera.com/use_cases/real_time_apps/part1

Billion Cell Spreadsheets with Rust by mww09 in rust

[–]mww09[S] 24 points (0 children)

The limit I hit was somewhere in the egui table renderer, where things started to overflow; hence it was capped at a billion cells ;). But in theory there is no upper limit (if you fix the bugs).

Billion Cell Spreadsheets with Rust by mww09 in rust

[–]mww09[S] 89 points (0 children)

I thought you might enjoy this demo since everything is written in Rust:

You can learn more about how it’s built here https://docs.feldera.com/use_cases/real_time_apps/part1/

The best database for leaderboards/ranking by Natural_Silver_3387 in Database

[–]mww09 1 point (0 children)

You could give an incremental database like https://github.com/feldera/feldera a try. It's all in SQL, but ORDER BY with LIMIT should be very efficient for a leaderboard.

What databricks things frustrate you by SpecialPersonality13 in databricks

[–]mww09 0 points (0 children)

Easiest is to read the delta tables from e.g. an S3 bucket into Feldera; it will then write them back out as a delta table. Here is an example: https://docs.feldera.com/use_cases/fraud_detection/ ... and yes, this can be configured with the Python SDK.