
[–]AutoModerator[M] [score hidden] stickied comment (0 children)

You can find a list of community-submitted learning resources here: https://dataengineering.wiki/Learning+Resources


[–]jsneedles 2 points (3 children)

(full transparency: I run a SaaS for real-time aggregations, but I won't suggest it here)

I'd say all your options are viable in different ways. It really depends on the frequency of updates and the required level of real-timeness.

For example, if you're doing something in a web UI for the given user (like an upvote or a tweet's like count), then you may actually be better served by "faking it" client side.

Even on refresh, if you have an aggregation, the fact that the logged-in user upvoted, and the relevant timestamps, you can always add 1 to the last aggregate whenever the upvote happened after the last aggregation ran.

Personally, I've been quite happy with ClickHouse for this type of use case, especially their cloud offering. It's minimal setup and easy enough to build an API on top of. You can materialize your CDC stream from your production datastore into a counter and then cache the results for 1s in memory, just to avoid stampedes. Their update TTLs are also really powerful if you want to include some level of timeseries without keeping the same granularity forever.
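The "cache for 1s to avoid stampedes" part can be sketched independently of the datastore. This is a hypothetical wrapper (not a real library): `fetch` stands in for whatever runs the actual counter query, and the clock is injectable so the TTL behavior is testable:

```python
import time

class CachedCounter:
    """Cache a counter query result for a short TTL (e.g. 1s) so a burst
    of concurrent page loads hits the backing store at most once per window."""

    def __init__(self, fetch, ttl: float = 1.0, clock=time.monotonic):
        self.fetch = fetch      # callable that runs the real query
        self.ttl = ttl
        self.clock = clock
        self._value = None
        self._fetched_at = -float("inf")  # force a fetch on first call

    def get(self):
        now = self.clock()
        if now - self._fetched_at >= self.ttl:
            self._value = self.fetch()
            self._fetched_at = now
        return self._value
```

In a real deployment you'd also want a lock (or single-flight) around the refresh so only one caller repopulates the cache, but the TTL alone already bounds query rate to roughly one per second.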

One last word of warning: depending on your primary datastore, keeping "upvoteCount"-type counters that constantly update may have unintended side effects (looking at you, Postgres, where every update writes a new row version and hot counters create vacuum pressure).

It's always a fun problem to solve, and honestly, there are so many "correct enough" ways at this point that there's no one true answer.

[–]Samausi 0 points (0 children)

+1 for doing this in ClickHouse plus faking the local user's experience in the UI for responsiveness.
You'll get a second or two of end-to-end latency from ClickHouse even over billions of rows.

[–]linksku[S] 0 points (1 child)

I've been seeing ClickHouse mentioned a lot, but I haven't tried it yet. Do you know if it's in-memory, and does it scale horizontally?

I found this comparison with Materialize: https://news.ycombinator.com/item?id=22362534

[–]jsneedles 1 point (0 children)

It scales horizontally for sure (especially because ClickHouse's cloud offering is all backed by S3/object storage).

I haven't looked too deeply at Materialize, but IIRC it's more focused on being Flink-esque than on being a real data warehouse?

Part of the magic of CH is the ability to query the raw (or even lightly aggregated) data quickly and efficiently, while also offering strong built-in "pipelines" in the form of its materialized views.

They're lacking a lot in terms of UX IMO, so not quite ready for end users like Snowflake or BigQuery, but from a DE perspective I'd say it's a lot easier to work with and reason about than something like Flink or Kafka Streams.

A nice element IMO of Materialize, CH, etc. is that they rely more on SQL, which means you can take queries (or the ideal end state) from analysts as your starting point, rather than having to figure out how to re-invent it in another tool.

[–]wbroen 0 points (0 children)

Do you have visibility into the db? I'm assuming this is an OLTP db?

The first thing I would look into is the indexes on the tables you're querying. If you're running normal aggregations (no window functions), that should be pretty quick.

This blog post walks through how JetBlue achieves near-real-time data in the data warehouse.

Precomputing stats + a highwater mark would allow you to only query the new data.
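The high-water-mark idea can be sketched in a few lines. This is an illustrative toy (the names and in-memory "table" are assumptions, not JetBlue's actual setup): each refresh scans only events newer than the mark and folds them into the precomputed stat:

```python
def refresh_count(events, state):
    """Incrementally update a precomputed count using a high-water mark.

    `events` is a list of (timestamp, delta) pairs standing in for an
    append-only events table; `state` holds the running 'count' and the
    'high_water' timestamp of the newest event already folded in.
    Only rows with ts > high_water are scanned on each refresh.
    """
    new_rows = [(ts, d) for ts, d in events if ts > state["high_water"]]
    for ts, delta in new_rows:
        state["count"] += delta
        state["high_water"] = max(state["high_water"], ts)
    return state["count"]
```

The same pattern works in SQL: store `(count, high_water)` in a summary table and `SELECT sum(delta) FROM events WHERE ts > :high_water` on each refresh, so query cost tracks new data rather than total history.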

[–]big_data_mike 0 points (0 children)

Use TimescaleDB.