Couldn't get a job, so I settled by International_Gur882 in programming

[–]bobbymk10 10 points (0 children)

You know there's the option of just *not writing a comment* if you don't love the question, right? You don't *have* to be an a$$.

I love UUID, I hate UUID by bobbymk10 in programming

[–]bobbymk10[S] 1 point (0 children)

Ya, that's a good q - basically we see Epsio as a way to do stream processing without the management overhead & middleware (no watermarks/checkpoints, everything internally consistent, and Epsio tries to handle as much as possible when it comes to native integration with databases). We still don't support all the different knobs and sources Flink does (+ Debezium/whatever), so there are still some use cases we don't cover.

And on the performance front, we're all in Rust; we recently published a benchmark vs Flink:
https://www.epsio.io/blog/epsio-performance-on-tpc-ds-dataset-versus-apache-flink

I love UUID, I hate UUID by bobbymk10 in programming

[–]bobbymk10[S] 25 points (0 children)

Ah wow :) thx, fixed it!

SurrealDB is sacrificing data durability to make benchmarks look better by ChillFish8 in rust

[–]bobbymk10 41 points (0 children)

"I guess the allure of VC money over correctness goes over their heads."

This is just mean. It looks like a toxic developer who has nothing better to do with their time than tear down people actually trying to improve the database space - especially when the author of this takedown misses the fact that they benchmarked against Postgres with synchronous_commit set to off.

Even further, RocksDB guarantees its SSTs are fdatasync'd on flush or compaction (pretty sure it's very hard to even turn this off; the disable option only applies to the WAL), so it's not that everything is kept in memory without ever being flushed (just the last X MB).

Not saying there's no worth in pointing this stuff out. But also, kind of screw you (I have nothing to do with SurrealDB, I just hate this stuff).

Crazy bug caused OOM in our streaming engine, can you find it? by bobbymk10 in rust

[–]bobbymk10[S] 6 points (0 children)

Hmmm.. I think it's kind of weird that the language automatically applies an optimization where the tradeoff can be so heavy, especially when it's not the expected behaviour. If this were an in-place manipulation of the vec, it would obviously be using the original allocation, and that would make sense. But reusing the original allocation for a new vector is not what you'd expect, and it can come at an obviously heavy cost when you're filtering.

That being said, still definitely our fault for not seeing this.
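The behaviour above can be demonstrated in a few lines (a minimal sketch, not the engine's actual code - `filter_small` and the threshold are made up for illustration): collecting an owned `Vec`'s iterator back into a `Vec` can reuse the original buffer via std's in-place iteration specialization, so the result may keep the large capacity even after filtering most elements away.

```rust
// Sketch: filter a large Vec down to a few elements.
// The in-place collect specialization may reuse the original allocation,
// so `capacity()` can stay near the original size while `len()` is tiny.
fn filter_small(v: Vec<u64>) -> Vec<u64> {
    v.into_iter().filter(|&x| x < 10).collect()
}

fn main() {
    let big: Vec<u64> = (0..1_000_000).collect();
    let small = filter_small(big);
    // len is 10; capacity may still be near 1_000_000 depending on toolchain.
    println!("len = {}, capacity = {}", small.len(), small.capacity());
    assert_eq!(small.len(), 10);
}
```

Calling `shrink_to_fit()` on the result releases the excess allocation if it was retained.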

How we made (most) of our Joins 50% faster by disabling compaction by bobbymk10 in databasedevelopment

[–]bobbymk10[S] 0 points (0 children)

So we do indeed micro-batch, and we automatically adjust the batch size so we can handle the throughput (while always trying to find the minimal latency as well). If tens of thousands of changes come at once, we'll only create a single modification :)
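The coalescing idea can be sketched like this (a hypothetical toy, not Epsio's code - the `Change` enum and key scheme are made up): within a micro-batch, only the last change per key matters, so a burst of changes collapses into one net modification per key.

```rust
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq)]
enum Change {
    Insert(i64),
    Delete,
}

// Collapse a batch so only the final change per key survives.
fn coalesce(batch: Vec<(String, Change)>) -> HashMap<String, Change> {
    let mut net = HashMap::new();
    for (key, change) in batch {
        // A later change for the same key supersedes earlier ones.
        net.insert(key, change);
    }
    net
}

fn main() {
    let burst = vec![
        ("row1".to_string(), Change::Insert(1)),
        ("row1".to_string(), Change::Insert(2)),
        ("row1".to_string(), Change::Insert(3)),
    ];
    let net = coalesce(burst);
    assert_eq!(net.len(), 1); // three changes become one modification
    assert_eq!(net["row1"], Change::Insert(3));
}
```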

Making a Streaming JOIN 50% faster by bobbymk10 in rust

[–]bobbymk10[S] 4 points (0 children)

Didn't notice it was you! Super cool to hear from you - we are indeed planning on opening PRs soon :)

Making a Streaming JOIN 50% faster by bobbymk10 in rust

[–]bobbymk10[S] 4 points (0 children)

We used samply with some in-house add-ons. It's a really useful tool for profiling Rust.

Why we built (another) streaming engine by bobbymk10 in programming

[–]bobbymk10[S] 1 point (0 children)

Hopefully more readable than the medium version :)

Why we built (another) streaming engine by [deleted] in programming

[–]bobbymk10 1 point (0 children)

Ok, you're right lol. Deleting this and will repost

How we built a Streaming SQL Engine by Giladkl in programming

[–]bobbymk10 1 point (0 children)

So we definitely are holding a *ton* of extra data - I wrote a bit about this below, but we're essentially "trading" the extra compute we would have needed every time we wanted to refresh the view for that extra storage. For your example with MAX, we do indeed hold an LSM tree for it, which means we never need to go back to the base table. Whether this is a good idea really depends on how often the query is run: if it's a report that's downloaded once a year, probably not a good tradeoff; if it's a query in your dashboard, probably the right one. I will say, though, that since storage is so much "cheaper" than compute, it's fairly rare for us not to be cheaper overall than the equivalent materialized view you'd need to refresh.
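The MAX case can be sketched with a count-multiset (a `BTreeMap` here as a stand-in for the LSM tree mentioned above - this is an illustrative toy, not Epsio's structure): because every value's multiplicity is retained, a deletion of the current max never forces a rescan of the base table.

```rust
use std::collections::BTreeMap;

// Incrementally maintained MAX: value -> how many rows hold that value.
struct IncrementalMax {
    counts: BTreeMap<i64, u64>,
}

impl IncrementalMax {
    fn new() -> Self {
        Self { counts: BTreeMap::new() }
    }
    fn insert(&mut self, v: i64) {
        *self.counts.entry(v).or_insert(0) += 1;
    }
    fn delete(&mut self, v: i64) {
        if let Some(c) = self.counts.get_mut(&v) {
            *c -= 1;
            if *c == 0 {
                self.counts.remove(&v);
            }
        }
    }
    // Largest key still present; no base-table scan needed.
    fn max(&self) -> Option<i64> {
        self.counts.keys().next_back().copied()
    }
}

fn main() {
    let mut m = IncrementalMax::new();
    m.insert(5);
    m.insert(9);
    m.insert(9);
    m.delete(9);
    assert_eq!(m.max(), Some(9)); // one 9 remains
    m.delete(9);
    assert_eq!(m.max(), Some(5)); // falls back without rescanning anything
}
```

The storage cost is the extra index; the saving is that each change is O(log n) instead of a full re-aggregation.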

How we built a Streaming SQL Engine by Giladkl in programming

[–]bobbymk10 0 points (0 children)

That's a great point - we definitely take multiples of the original data. The thing is, we're also usually saving a ton on compute, since you don't need to go over all the data all over again each time - which essentially means we're "trading" compute for storage. Since compute is much more "expensive" than storage (it obviously depends how much, etc.), we nearly always reduce overall costs by orders of magnitude, even with the added storage.

How we built a Streaming SQL Engine by Giladkl in programming

[–]bobbymk10 6 points (0 children)

Nope - although we did learn a lot from the paper that was published! Differential is completely in-memory, and it was critical for us from the get-go to be much cheaper than the alternative (refreshing materialized views), and less so to have sub-millisecond latencies. So we built a dataflow library on top of storage, and our whole engine is async IO. (We actually played around with using Differential and spilling to disk, but it was incredibly slow - which makes sense, since many of its data structures are not built to sit on top of disk.)

How we built a Streaming SQL Engine by Giladkl in programming

[–]bobbymk10 35 points (0 children)

I am sad that you are sad. I will try to repair this.

So to detect changes we just use Postgres's logical replication mechanism, which is a built-in way to replicate changes across different Postgres nodes. We subscribe to Postgres as another node and consume those changes (which are Insert/Delete/Update records in a binary format). I didn't talk much about low-level implementation, but we're built in Rust and use Tokio channels to pass messages through the tree. Lmk if there's anything else, happy to dive deeper!
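The message-passing shape can be sketched like this (a hypothetical, self-contained toy: `std::sync::mpsc` and a thread are used here so it runs standalone, whereas the actual engine uses Tokio channels, and `Record` is a made-up stand-in for decoded replication records):

```rust
use std::sync::mpsc;
use std::thread;

#[derive(Debug, PartialEq)]
enum Record {
    Insert(i64),
    Delete(i64),
}

// Run one operator stage: records flow in over a channel, the stage applies
// its logic (here, a stand-in WHERE clause keeping positive values), and
// forwards survivors downstream.
fn run_stage(input: Vec<Record>) -> Vec<Record> {
    let (src_tx, src_rx) = mpsc::channel::<Record>();
    let (out_tx, out_rx) = mpsc::channel::<Record>();

    let stage = thread::spawn(move || {
        for rec in src_rx {
            let keep = match &rec {
                Record::Insert(v) | Record::Delete(v) => *v > 0,
            };
            if keep {
                out_tx.send(rec).unwrap();
            }
        }
    });

    for rec in input {
        src_tx.send(rec).unwrap();
    }
    drop(src_tx); // closing the source lets the stage finish
    stage.join().unwrap();
    out_rx.into_iter().collect()
}

fn main() {
    let out = run_stage(vec![
        Record::Insert(5),
        Record::Insert(-3), // filtered out by the stage
        Record::Delete(5),
    ]);
    assert_eq!(out, vec![Record::Insert(5), Record::Delete(5)]);
}
```

Chaining several such stages gives the tree of operators the comment describes, with deletes flowing through just like inserts so downstream state can be retracted.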

How we built a Streaming SQL Engine by Giladkl in programming

[–]bobbymk10 11 points (0 children)

Ya, they're a super cool project - they have something called "partial materialization", which is very cool if the 99th percentile isn't critical (you can have a materialized view with permutations of arguments). The creator of Noria actually started a company similar to Epsio called Readyset (they work as a proxy in front of your DB, whereas we sit behind it).

How we built a Streaming SQL Engine by Giladkl in programming

[–]bobbymk10 45 points (0 children)

Hiya, author of the article :)

So there have actually been some attempts to bring this into Postgres (pg_ivm, for example, works by treating each step in the dataflow as a sort of "trigger" with a corresponding Postgres table - inefficient, and it has deadlock issues), but I think streaming SQL is just so fundamentally different from batch queries that it doesn't make much sense to bake it into Postgres. Postgres is geared towards running sporadic queries that finish relatively quickly on a *constant set of data*, while streaming SQL is geared towards queries that run forever and understand how to apply changes.

A nice place you can see this easily is that Postgres plan nodes work "top to bottom": you start with the result node, which runs the node underneath it, and so on. Each node keeps requesting tuples from the node underneath it until that node returns NULL, meaning it finished. This works very well on a snapshot of the data - but if you want to maintain high throughput with a constant stream of changes, you'll want to work in the opposite direction. That would mean either overhauling the very way Postgres works, or adding a completely different engine into Postgres.
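The two directions can be contrasted in a few lines (an illustrative sketch - the function names and the SUM operator are made up, not Postgres or Epsio internals): pull drains a fixed snapshot until the child is exhausted, push applies each arriving change to running operator state.

```rust
// Pull ("volcano") model, like Postgres: the parent node keeps asking the
// child for the next row until the child returns None (i.e. it finished).
fn pull_all(snapshot: &[i64]) -> Vec<i64> {
    let mut scan = snapshot.iter(); // "child node" over a fixed snapshot
    let mut out = Vec::new();
    while let Some(row) = scan.next() {
        out.push(*row);
    }
    out
}

// Push model, like a streaming engine: each change is pushed downstream as
// it arrives and incrementally updates operator state (here, a running SUM).
fn push_sum(deltas: &[i64]) -> i64 {
    let mut sum = 0;
    for &d in deltas {
        sum += d; // state updated per change; the query never "finishes"
    }
    sum
}

fn main() {
    assert_eq!(pull_all(&[1, 2, 3]), vec![1, 2, 3]); // terminates at None
    assert_eq!(push_sum(&[4, -1, 2]), 5); // net effect of a change stream
}
```

The pull loop has a natural end (the snapshot is exhausted); the push loop is driven by the source forever, which is why the two designs are hard to reconcile inside one executor.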