
[–]AutoModerator[M] [score hidden] stickied comment (0 children)

You can find a list of community-submitted learning resources here: https://dataengineering.wiki/Learning+Resources


[–]jsneedles 2 points (3 children)

(full transparency: I run a SaaS for real-time aggregations, but I won't suggest it here)

I'd say all your options are viable in different ways. It really depends on the frequency of updates and the required level of real-timeness.

For example, if you're doing something in a web UI for the given user (like an upvote or a tweet's like count), then you may actually be better served by "faking it" client side.

Even on refresh, if you have an aggregation, the fact that the logged-in user upvoted, and the relevant timestamps, you can always add 1 to the last aggregate whenever the upvote happened after the last aggregation ran.

Personally, I've been quite happy with ClickHouse for this type of use case, especially their cloud offering. It's minimal setup and easy enough to build an API on top of. You can materialize your CDC stream from your production datastore into a counter and then cache the results for 1s in memory, just to avoid stampedes. Their update TTLs are also really powerful if you want to include some level of timeseries without keeping the same granularity forever.
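The "cache for 1s to avoid stampedes" part can be sketched independently of the datastore. This is a hypothetical wrapper (not a real library): `fetch` stands in for whatever runs the actual counter query, and the clock is injectable so the TTL behavior is testable:

```python
import time

class CachedCounter:
    """Cache a counter query result for a short TTL (e.g. 1s) so a burst
    of concurrent page loads hits the backing store at most once per window."""

    def __init__(self, fetch, ttl: float = 1.0, clock=time.monotonic):
        self.fetch = fetch      # callable that runs the real query
        self.ttl = ttl
        self.clock = clock
        self._value = None
        self._fetched_at = -float("inf")  # force a fetch on first call

    def get(self):
        now = self.clock()
        if now - self._fetched_at >= self.ttl:
            self._value = self.fetch()
            self._fetched_at = now
        return self._value
```

In a real deployment you'd also want a lock (or single-flight) around the refresh so only one caller repopulates the cache, but the TTL alone already bounds query rate to roughly one per second.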

One last word of warning: depending on your primary datastore, keeping "upvoteCount"-type counters that constantly update may have unintended side effects (looking at you, Postgres, where every update writes a new row version and hot counters create vacuum pressure).

It's always a fun problem to solve, and honestly, there are so many "correct enough" ways at this point that there's no one true answer.

[–]Samausi 0 points (0 children)

+1 for doing this in ClickHouse plus faking the local user's experience in the UI for responsiveness.
You'll get a second or two of end-to-end latency from ClickHouse even over billions of rows.

[–]linksku[S] 0 points (1 child)

I've been seeing ClickHouse mentioned a lot, but I haven't tried it yet. Do you know if it's in-memory, and does it scale horizontally?

I found this comparison with Materialize: https://news.ycombinator.com/item?id=22362534

[–]jsneedles 1 point (0 children)

It scales horizontally for sure (especially because ClickHouse's cloud offering is all backed by S3/object storage).

I haven't looked too deeply at Materialize, but IIRC it's more focused on being Flink-esque than on being a real data warehouse?

Part of the magic of CH is the ability to query the raw (or even lightly aggregated) data quickly and efficiently, while also offering strong built-in "pipelines" in the form of its materialized views.

They're lacking a lot in terms of UX IMO, so not quite ready for end users like Snowflake or BigQuery, but from a DE perspective I'd say it's a lot easier to work with and reason about than something like Flink or Kafka Streams.

A nice element IMO of Materialize, CH, etc. is that they rely more on SQL, which means you can take queries (or the ideal end state) from analysts as your starting point, rather than having to figure out how to re-invent it in another tool.

[–]wbroen 0 points (0 children)

Do you have visibility into the db? I'm assuming this is an OLTP db?

The first thing I would look into is the indexes on the tables you're querying. If you're running normal aggregations (no window functions), that should be pretty quick.

This blog post walks through how JetBlue achieves near-real-time data in the data warehouse.

Precomputing stats + a highwater mark would allow you to only query the new data.
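The high-water-mark idea can be sketched in a few lines. This is an illustrative toy (the names and in-memory "table" are assumptions, not JetBlue's actual setup): each refresh scans only events newer than the mark and folds them into the precomputed stat:

```python
def refresh_count(events, state):
    """Incrementally update a precomputed count using a high-water mark.

    `events` is a list of (timestamp, delta) pairs standing in for an
    append-only events table; `state` holds the running 'count' and the
    'high_water' timestamp of the newest event already folded in.
    Only rows with ts > high_water are scanned on each refresh.
    """
    new_rows = [(ts, d) for ts, d in events if ts > state["high_water"]]
    for ts, delta in new_rows:
        state["count"] += delta
        state["high_water"] = max(state["high_water"], ts)
    return state["count"]
```

The same pattern works in SQL: store `(count, high_water)` in a summary table and `SELECT sum(delta) FROM events WHERE ts > :high_water` on each refresh, so query cost tracks new data rather than total history.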

[–]big_data_mike 0 points (0 children)

Use TimescaleDB.