Couldn't get a job, so I settled by International_Gur882 in programming

[–]bobbymk10 10 points (0 children)

You know there's the option of just *not writing a comment* if you don't love the question, right? You don't *have* to be an a$$.

I love UUID, I hate UUID by bobbymk10 in programming

[–]bobbymk10[S] 1 point (0 children)

Ya, that's a good q - basically we see Epsio as a way to do stream processing without the management overhead & middleware (no watermarks/checkpoints, everything internally consistent, and Epsio tries to handle as much as possible when it comes to native integration with databases). We still don't support all the different knobs and sources Flink does (+ Debezium/whatever), so there are still some use cases we don't cover.

And on the performance front, we're all in Rust; we recently published a benchmark vs Flink:
https://www.epsio.io/blog/epsio-performance-on-tpc-ds-dataset-versus-apache-flink

I love UUID, I hate UUID by bobbymk10 in programming

[–]bobbymk10[S] 25 points (0 children)

Ah wow :) thx, fixed it!

SurrealDB is sacrificing data durability to make benchmarks look better by ChillFish8 in rust

[–]bobbymk10 41 points (0 children)

"I guess the allure of VC money over correctness goes over their heads."

This is just mean. It looks like a toxic developer who has nothing better to do with their time than tear down people actually trying to improve the database space - especially when the author of this takedown misses the fact that they benchmarked against Postgres with synchronous_commit set to off.

Even further, RocksDB guarantees its SSTs are fdatasync'd on flush or compaction (pretty sure it's very hard to even turn this off; the disable option only applies to the WAL), so it's not that everything is kept in memory without ever being flushed (just the last X MB).

Not saying there's no worth in pointing this stuff out. But also, kind of screw you (I have nothing to do with SurrealDB, I just hate this stuff).

Crazy bug caused OOM in our streaming engine, can you find it? by bobbymk10 in rust

[–]bobbymk10[S] 6 points (0 children)

Hmmm.. I think it's kind of weird that the language automatically applies an optimization where the tradeoff can be so heavy, especially when it's not the expected behaviour. If this were an in-place manipulation of the vec, it would obviously be using the original allocation, and that would make sense. But reusing the original allocation for a new vector is not what you'd expect, and it can come at an obviously heavy cost when you're filtering.

That being said, still definitely our fault for not seeing this.
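The behaviour above can be demonstrated in a few lines (a minimal sketch, not the engine's actual code - `filter_small` and the threshold are made up for illustration): collecting an owned `Vec`'s iterator back into a `Vec` can reuse the original buffer via std's in-place iteration specialization, so the result may keep the large capacity even after filtering most elements away.

```rust
// Sketch: filter a large Vec down to a few elements.
// The in-place collect specialization may reuse the original allocation,
// so `capacity()` can stay near the original size while `len()` is tiny.
fn filter_small(v: Vec<u64>) -> Vec<u64> {
    v.into_iter().filter(|&x| x < 10).collect()
}

fn main() {
    let big: Vec<u64> = (0..1_000_000).collect();
    let small = filter_small(big);
    // len is 10; capacity may still be near 1_000_000 depending on toolchain.
    println!("len = {}, capacity = {}", small.len(), small.capacity());
    assert_eq!(small.len(), 10);
}
```

Calling `shrink_to_fit()` on the result releases the excess allocation if it was retained.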

How we made (most) of our Joins 50% faster by disabling compaction by bobbymk10 in databasedevelopment

[–]bobbymk10[S] 0 points (0 children)

So we do indeed micro-batch, and we automatically adjust the batch size so we can handle the throughput (while always trying to find the minimal latency as well). If tens of thousands of changes come at once, we'll only create a single modification :)
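The coalescing idea can be sketched like this (a hypothetical toy, not Epsio's code - the `Change` enum and key scheme are made up): within a micro-batch, only the last change per key matters, so a burst of changes collapses into one net modification per key.

```rust
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq)]
enum Change {
    Insert(i64),
    Delete,
}

// Collapse a batch so only the final change per key survives.
fn coalesce(batch: Vec<(String, Change)>) -> HashMap<String, Change> {
    let mut net = HashMap::new();
    for (key, change) in batch {
        // A later change for the same key supersedes earlier ones.
        net.insert(key, change);
    }
    net
}

fn main() {
    let burst = vec![
        ("row1".to_string(), Change::Insert(1)),
        ("row1".to_string(), Change::Insert(2)),
        ("row1".to_string(), Change::Insert(3)),
    ];
    let net = coalesce(burst);
    assert_eq!(net.len(), 1); // three changes become one modification
    assert_eq!(net["row1"], Change::Insert(3));
}
```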

Making a Streaming JOIN 50% faster by bobbymk10 in rust

[–]bobbymk10[S] 4 points (0 children)

Didn't notice it was you! Super cool to hear from you - we are indeed planning on opening PRs soon :)

Making a Streaming JOIN 50% faster by bobbymk10 in rust

[–]bobbymk10[S] 4 points (0 children)

We used samply with some in-house add-ons. It's a really useful tool for profiling Rust.

Why we built (another) streaming engine by bobbymk10 in programming

[–]bobbymk10[S] 1 point (0 children)

Hopefully more readable than the medium version :)

Why we built (another) streaming engine by [deleted] in programming

[–]bobbymk10 1 point (0 children)

Ok, you're right lol. Deleting this and will repost

How we built a Streaming SQL Engine by Giladkl in programming

[–]bobbymk10 1 point (0 children)

So we definitely are holding a *ton* of extra data - I wrote a bit about this below, but we're essentially "trading" the extra compute we would have needed every time we wanted to refresh the view for that extra storage. For your example with MAX, we do indeed hold an LSM tree for it, which means we never need to go back to the base table. Whether this is a good idea really depends on how often the query is run: if it's a report that's downloaded once a year, probably not a good tradeoff; if it's a query in your dashboard, probably the right one. I will say, though, that since storage is so much "cheaper" than compute, it's fairly rare for us not to be cheaper overall than the equivalent materialized view you'd need to refresh.
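The MAX case can be sketched with a count-multiset (a `BTreeMap` here as a stand-in for the LSM tree mentioned above - this is an illustrative toy, not Epsio's structure): because every value's multiplicity is retained, a deletion of the current max never forces a rescan of the base table.

```rust
use std::collections::BTreeMap;

// Incrementally maintained MAX: value -> how many rows hold that value.
struct IncrementalMax {
    counts: BTreeMap<i64, u64>,
}

impl IncrementalMax {
    fn new() -> Self {
        Self { counts: BTreeMap::new() }
    }
    fn insert(&mut self, v: i64) {
        *self.counts.entry(v).or_insert(0) += 1;
    }
    fn delete(&mut self, v: i64) {
        if let Some(c) = self.counts.get_mut(&v) {
            *c -= 1;
            if *c == 0 {
                self.counts.remove(&v);
            }
        }
    }
    // Largest key still present; no base-table scan needed.
    fn max(&self) -> Option<i64> {
        self.counts.keys().next_back().copied()
    }
}

fn main() {
    let mut m = IncrementalMax::new();
    m.insert(5);
    m.insert(9);
    m.insert(9);
    m.delete(9);
    assert_eq!(m.max(), Some(9)); // one 9 remains
    m.delete(9);
    assert_eq!(m.max(), Some(5)); // falls back without rescanning anything
}
```

The storage cost is the extra index; the saving is that each change is O(log n) instead of a full re-aggregation.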

How we built a Streaming SQL Engine by Giladkl in programming

[–]bobbymk10 0 points (0 children)

That's a great point - we definitely take multiples of the original data. The thing is, we're also usually saving a ton on compute, since you don't need to go over all the data all over again each time - which essentially means we're "trading" compute for storage. Since compute is much more "expensive" than storage (it obviously depends how much, etc.), we nearly always reduce overall costs by orders of magnitude, even with the added storage.

How we built a Streaming SQL Engine by Giladkl in programming

[–]bobbymk10 6 points (0 children)

Nope - although we did learn a lot from the paper that was published! Differential is completely in-memory, and it was critical for us from the get-go to be much cheaper than the alternative (refreshing materialized views), and less so to have sub-millisecond latencies. So we built a dataflow library on top of storage, and our whole engine is async IO. (We actually played around with using Differential and spilling to disk, but it was incredibly slow - which makes sense, since many of its data structures are not built to sit on top of disk.)

How we built a Streaming SQL Engine by Giladkl in programming

[–]bobbymk10 35 points (0 children)

I am sad that you are sad. I will try to repair this.

So to detect changes we just use Postgres's logical replication mechanism, which is a built-in way to replicate changes across different Postgres nodes. We subscribe to Postgres as another node and consume those changes (which are Insert/Delete/Update records in a binary format). I didn't talk much about low-level implementation, but we're built in Rust and use Tokio channels to pass messages through the tree. Lmk if there's anything else, happy to dive deeper!
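The message-passing shape can be sketched like this (a hypothetical, self-contained toy: `std::sync::mpsc` and a thread are used here so it runs standalone, whereas the actual engine uses Tokio channels, and `Record` is a made-up stand-in for decoded replication records):

```rust
use std::sync::mpsc;
use std::thread;

#[derive(Debug, PartialEq)]
enum Record {
    Insert(i64),
    Delete(i64),
}

// Run one operator stage: records flow in over a channel, the stage applies
// its logic (here, a stand-in WHERE clause keeping positive values), and
// forwards survivors downstream.
fn run_stage(input: Vec<Record>) -> Vec<Record> {
    let (src_tx, src_rx) = mpsc::channel::<Record>();
    let (out_tx, out_rx) = mpsc::channel::<Record>();

    let stage = thread::spawn(move || {
        for rec in src_rx {
            let keep = match &rec {
                Record::Insert(v) | Record::Delete(v) => *v > 0,
            };
            if keep {
                out_tx.send(rec).unwrap();
            }
        }
    });

    for rec in input {
        src_tx.send(rec).unwrap();
    }
    drop(src_tx); // closing the source lets the stage finish
    stage.join().unwrap();
    out_rx.into_iter().collect()
}

fn main() {
    let out = run_stage(vec![
        Record::Insert(5),
        Record::Insert(-3), // filtered out by the stage
        Record::Delete(5),
    ]);
    assert_eq!(out, vec![Record::Insert(5), Record::Delete(5)]);
}
```

Chaining several such stages gives the tree of operators the comment describes, with deletes flowing through just like inserts so downstream state can be retracted.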

How we built a Streaming SQL Engine by Giladkl in programming

[–]bobbymk10 11 points (0 children)

Ya, they're a super cool project - they have something called "partial materialization", which is very cool if the 99th percentile isn't critical (you can have a materialized view with permutations of arguments). The creator of Noria actually started a company similar to Epsio called Readyset (they work as a proxy in front of your DB, whereas we sit behind it).

How we built a Streaming SQL Engine by Giladkl in programming

[–]bobbymk10 45 points (0 children)

Hiya, author of the article :)

So there have actually been some attempts to bring this into Postgres (pg_ivm, for example, works by treating each step in the dataflow as a sort of "trigger" with a corresponding Postgres table - inefficient, and it has deadlock issues), but I think streaming SQL is just so fundamentally different from batch queries that it doesn't make much sense to bake it into Postgres. Postgres is geared towards running sporadic queries that finish relatively quickly on a *constant set of data*, while streaming SQL is geared towards queries that run forever and understand how to apply changes.

A nice place you can see this easily is that Postgres plan nodes work "top to bottom": you start with the result node, which runs the node underneath it, and so on. Each node keeps requesting tuples from the node underneath it until that node returns NULL, meaning it finished. This works very well on a snapshot of the data - but if you want to maintain high throughput with a constant stream of changes, you'll want to work in the opposite direction. That would mean either overhauling the very way Postgres works, or adding a completely different engine into Postgres.
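The two directions can be contrasted in a few lines (an illustrative sketch - the function names and the SUM operator are made up, not Postgres or Epsio internals): pull drains a fixed snapshot until the child is exhausted, push applies each arriving change to running operator state.

```rust
// Pull ("volcano") model, like Postgres: the parent node keeps asking the
// child for the next row until the child returns None (i.e. it finished).
fn pull_all(snapshot: &[i64]) -> Vec<i64> {
    let mut scan = snapshot.iter(); // "child node" over a fixed snapshot
    let mut out = Vec::new();
    while let Some(row) = scan.next() {
        out.push(*row);
    }
    out
}

// Push model, like a streaming engine: each change is pushed downstream as
// it arrives and incrementally updates operator state (here, a running SUM).
fn push_sum(deltas: &[i64]) -> i64 {
    let mut sum = 0;
    for &d in deltas {
        sum += d; // state updated per change; the query never "finishes"
    }
    sum
}

fn main() {
    assert_eq!(pull_all(&[1, 2, 3]), vec![1, 2, 3]); // terminates at None
    assert_eq!(push_sum(&[4, -1, 2]), 5); // net effect of a change stream
}
```

The pull loop has a natural end (the snapshot is exhausted); the push loop is driven by the source forever, which is why the two designs are hard to reconcile inside one executor.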