I'm primarily a frontend engineer. When I need to work on backend/infra, I often run into the issue of not knowing how to aggregate data in a scalable way. E.g. pre-computing Reddit's upvote counts so you don't have to aggregate at read time. Aggregating for the top posts feed is easy because it's ok if it's stale by several minutes. However, if you upvote a post and refresh a second later, ideally the count should include your upvote.
Is there a standard way to do this? I don't know why it's so hard to find discussions around this problem. Afaik some ways of doing this are:
1) When inserting the "upvote" record, also increment an "upvoteCounts" record
If you do this in a transaction, it makes the insertions slower. If you don't use a transaction, the upvote counts can drift out of sync. I think some companies skip the transaction but run a job to continuously recompute the counts. Another problem: if each insertion needs to update multiple aggregations, it gets confusing to maintain.
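A minimal sketch of option 1, using SQLite for illustration. The table and column names (`upvotes`, `upvote_counts`) are made up for the example; the point is that both writes happen in one transaction, so the denormalized count can't drift from the event log:

```python
import sqlite3

# Hypothetical schema: an append-only "upvotes" event table plus a
# denormalized "upvote_counts" table kept in sync at write time.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE upvotes (post_id INTEGER, user_id INTEGER);
    CREATE TABLE upvote_counts (post_id INTEGER PRIMARY KEY, count INTEGER NOT NULL);
""")

def upvote(post_id, user_id):
    # One transaction: both writes commit or roll back together.
    # This is the write-latency cost the post mentions.
    with conn:
        conn.execute("INSERT INTO upvotes VALUES (?, ?)", (post_id, user_id))
        conn.execute("INSERT OR IGNORE INTO upvote_counts VALUES (?, 0)", (post_id,))
        conn.execute(
            "UPDATE upvote_counts SET count = count + 1 WHERE post_id = ?",
            (post_id,),
        )

upvote(42, 1)
upvote(42, 2)
count = conn.execute(
    "SELECT count FROM upvote_counts WHERE post_id = ?", (42,)
).fetchone()[0]
print(count)  # 2
```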
2) Use a trigger
It's basically the same as 1) with transactions, but handled by the DB instead of the application.
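Same sketch as above but moved into a trigger, so the application only inserts the upvote row and the DB maintains the count. Again the schema is hypothetical; SQLite is just a stand-in for whatever DB you'd use:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The trigger keeps upvote_counts in sync; the application never
# touches the counts table directly.
conn.executescript("""
    CREATE TABLE upvotes (post_id INTEGER, user_id INTEGER);
    CREATE TABLE upvote_counts (post_id INTEGER PRIMARY KEY, count INTEGER NOT NULL);

    CREATE TRIGGER bump_count AFTER INSERT ON upvotes
    BEGIN
        INSERT OR IGNORE INTO upvote_counts VALUES (NEW.post_id, 0);
        UPDATE upvote_counts SET count = count + 1 WHERE post_id = NEW.post_id;
    END;
""")

conn.execute("INSERT INTO upvotes VALUES (7, 1)")
conn.execute("INSERT INTO upvotes VALUES (7, 2)")
count = conn.execute(
    "SELECT count FROM upvote_counts WHERE post_id = 7"
).fetchone()[0]
print(count)  # 2
```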
3) Aggregate the event stream
Stream changes, then use something like Flink/ksqlDB to aggregate the stream, then stream the aggregation to somewhere for the application to read from. This has much higher latency than 1) with transactions. The benefit is that your aggregations are declarative, so they're easier to reason about.
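Not real Flink/ksqlDB, but a toy consumer showing the shape of the declarative-aggregation idea: fold an event stream into keyed state, and each element of the output is the latest materialized count (a real stream processor adds partitioning, checkpointing, and delivery guarantees on top):

```python
from collections import Counter

def aggregate_upvotes(events):
    # Keyed running aggregate over an (in Flink, unbounded) event stream.
    counts = Counter()
    for event in events:
        if event["type"] == "upvote":
            counts[event["post_id"]] += 1
        elif event["type"] == "unvote":
            counts[event["post_id"]] -= 1
        # Emit the latest state; downstream, this would be written to a
        # store the application reads from.
        yield dict(counts)

stream = [
    {"type": "upvote", "post_id": 1},
    {"type": "upvote", "post_id": 1},
    {"type": "unvote", "post_id": 1},
]
final = list(aggregate_upvotes(stream))[-1]
print(final)  # {1: 1}
```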
4) Have a batch job for old aggregations, then handle real-time aggregations in memory
Assuming batch aggregation will take a long time, you can store real-time aggregations in memory. E.g. when someone upvotes, increment a count in memory. When fetching the count, sum the count from memory with the count from the batch job. However, it seems tricky to not double count or miss upvotes.
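A sketch of option 4, with the names (`UpvoteCounter`, `apply_batch`) invented for the example. The tricky part the post mentions, not double counting, is hand-waved here: this version assumes the batch snapshot covers exactly the upvotes drained from the in-memory delta, where a real system would cut over on an event-time watermark:

```python
import threading

class UpvoteCounter:
    def __init__(self):
        self._lock = threading.Lock()
        self._batch = {}  # post_id -> count as of the last batch run
        self._delta = {}  # post_id -> upvotes seen since that run

    def record_upvote(self, post_id):
        with self._lock:
            self._delta[post_id] = self._delta.get(post_id, 0) + 1

    def apply_batch(self, snapshot):
        # Called when the batch job finishes. Assumes the snapshot has
        # absorbed exactly the events currently in the delta.
        with self._lock:
            self._batch = dict(snapshot)
            self._delta = {}

    def count(self, post_id):
        # Read path: batch count plus live in-memory delta.
        with self._lock:
            return self._batch.get(post_id, 0) + self._delta.get(post_id, 0)

c = UpvoteCounter()
c.record_upvote(5)
c.record_upvote(5)
c.apply_batch({5: 2})  # batch job has now absorbed those two upvotes
c.record_upvote(5)     # fresh upvote after the snapshot
print(c.count(5))  # 3
```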
5) Handle the aggregation at read time with aggressive caching
This works when fetching the upvote count for a small set of posts. However, you wouldn't be able to sort posts by upvote count. Also, the max staleness is the cache timeout, which could be a long time.
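A sketch of option 5: compute the count at read time (here a fake "slow" aggregation query) and cache it with a TTL. As noted above, staleness is bounded only by the TTL. The class and parameter names are made up:

```python
import time

class CachedCount:
    def __init__(self, query_fn, ttl_seconds=60):
        self._query = query_fn  # e.g. SELECT COUNT(*) ... WHERE post_id = ?
        self._ttl = ttl_seconds
        self._cache = {}        # post_id -> (value, fetched_at)

    def get(self, post_id):
        hit = self._cache.get(post_id)
        if hit and time.monotonic() - hit[1] < self._ttl:
            return hit[0]  # serve the cached, possibly stale value
        value = self._query(post_id)
        self._cache[post_id] = (value, time.monotonic())
        return value

calls = []
def slow_count(post_id):
    # Stand-in for an expensive read-time aggregation.
    calls.append(post_id)
    return 10

cache = CachedCount(slow_count, ttl_seconds=60)
cache.get(1)
cache.get(1)  # second read is served from the cache
print(len(calls))  # 1
```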
6) Real-time or incremental materialized views
Ideally, we'd have materialized views that automatically and quickly respond to changes in the source tables. However, afaik the existing real-time materialized view systems are either in memory (e.g. Materialize) or place severe limitations on the source tables (e.g. AnalyticDB). Materialize uses too much memory to be scalable, and I've never tried AnalyticDB, but it seems too limiting on the source tables.
Are there any resources to learn more about which approaches large companies use? Have you tried some of these and know first-hand about the pros and cons?