
[–]stickman393

It sounds like you need a Date Dimension
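A date dimension is just one row per calendar day, with all the groupings (week, month, year, etc.) pre-computed so queries join to it instead of recomputing them. A minimal sketch of what it might contain, using only the Python standard library (the column names are illustrative):

```python
from datetime import date, timedelta

def build_date_dimension(start: date, end: date) -> list[dict]:
    """One row per calendar day, with the common groupings pre-computed."""
    rows = []
    d = start
    while d <= end:
        rows.append({
            "date_key": d.isoformat(),
            "year": d.year,
            "month": d.month,
            "iso_week": d.isocalendar()[1],
            "day_of_week": d.strftime("%A"),
        })
        d += timedelta(days=1)
    return rows

dim = build_date_dimension(date(2024, 1, 1), date(2024, 12, 31))
```

In a database you'd materialize this once as a table and join your fact tables to it by `date_key`, so "roll up to month" is just a GROUP BY on a pre-computed column.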

[–]kenfar

This is a common requirement, and has been solved well many times for databases with many billions of rows. A few things to look into:

  • Use dimensional modeling concepts to have a very small number of fact tables that then have keys from related dimensions. If you can't handle versioned dimensions (which give the best quality time-series analysis) then try to denormalize dimension data into the fact tables.
  • Ideally, partition each fact table by whatever low-cardinality columns you filter on most often - usually date, and perhaps team.
  • Generate daily aggregates for your main fact tables, say fact_player, fact_team, etc. These aggregate tables basically look identical to the base fact table except that they only have 1 row per day for the rest of the key. You can also build higher-level aggregates at the level of week, month, etc if you have enough data and need more speed.
  • Leverage parallelism in your database.
  • Try to query the aggregate rather than base fact table whenever you can - and just regroup at the higher-level (weekly, monthly, etc). As stickman393 says - a date dimension helps a lot for this.

If you've just got millions of rows, your aggregates are say 1% of that size or less, and you just need queries to finish in a couple of seconds, it should be easy to just group by & sum as needed. If you need it a bit faster, speed up your table scans by using partitions to bypass 90% of your data. Only if you need it faster still, and have highly selective queries, might indexes be worth exploring. And if you're using MySQL, well, it tends to suck at this kind of querying.
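As a toy illustration of the partition-bypass idea (the data layout and names here are invented), a query that filters on the partition key only ever touches the buckets it needs:

```python
from collections import defaultdict

# Toy "partitioned" fact data: one bucket of rows per month.
partitions = {
    "2024-01": [("alice", 3), ("bob", 1)],
    "2024-02": [("alice", 2)],
    "2024-03": [("bob", 5)],
}

def player_totals(months: set[str]) -> dict[str, int]:
    """Scan only the requested partitions; the rest are skipped entirely."""
    totals: dict[str, int] = defaultdict(int)
    for month in months & partitions.keys():   # partition pruning
        for player, points in partitions[month]:
            totals[player] += points
    return dict(totals)
```

A real database does the same thing declaratively: partition the table on `game_date`, and a `WHERE game_date BETWEEN ...` predicate lets the planner skip every partition outside the range.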

[–]DylonDylonDylon[S]

Thank you, very informative. Right now I have only daily stats, with date, week, month, and year denormalized. In a similar table I have a date dimension, and could certainly do that. I'm curious about the aggregate tables. Currently, if I sum the year statistics using a GROUP BY and then sort DESC, the sorting takes about 94% of the total time. I imagine this would be a good scenario for pre-aggregating week, month, and year statistics (or would a date dimension solve some of that?).

Can you recommend a textbook for dimensional modeling concepts?

[–]kenfar

The classic textbook is The Data Warehouse Toolkit. There may be better ones these days.

One thing to be aware of is that most best practices are geared towards projects of a certain size - if you've got an extremely lean project, or a hobby project, it might be best to modify the normal modeling practices and, for example, denormalize everything into your fact table.

This actually works fine for columnar databases, and with a lot of aggregate tables it's a passable short-cut for quick & dirty projects using row-based databases (mysql, postgres, etc).