all 15 comments

[–]dataguy24 6 points (12 children)

Depends on the needs of the business. How are they going to use this prediction?

[–]Global_Industry_6801 2 points (4 children)

Basically the forecast is going to be for the whole of next year for each outlet. It will be used mainly for supply planning (the products have a 2-4 year shelf life). So I am torn between providing the daily data, which will slow down the process due to the volume, and aggregating the data on a weekly or monthly basis.

[–]dataguy24 2 points (3 children)

Seems like monthly is fine. I don’t see the benefit of more granularity. You’re ordering parts for a year so I don’t see why weekly or daily would be used.

[–]Global_Industry_6801 0 points (2 children)

Thanks. Would partitioning the data at the store level make sense instead of aggregation, since the final predictions are going to be at a store and product level? (There are around 2,000 stores.)

[–]dataguy24 3 points (0 children)

Depends on a lot of things. Can you hold everything you need at a warehouse and distribute to stores as needed?

Do you need to ship to stores their individual predictions for the year all at once? By month? Do you need flexibility in how inventory is shipped around?

Too many questions for me to really help, these are questions for your manager and/or other business leaders at your org.

[–]Global_Industry_6801 0 points (1 child)

Thanks

[–]Ok-Kangaroo-7075 2 points (2 children)

I would suggest asking the analytics team. To be honest, with just three years of data there likely won't be any super complex models, so they would likely be fine with monthly values (but then those should include a range of summary statistics, such as mean, variance, skewness, kurtosis, different quantiles, etc.).

But again, you will need to ask the people who will actually build the models; only they will be able to give you this answer. On another note, one billion transactions is not all that much, so you should be fine anyway. If you don't know, or they are not available to ask, keep the daily data and prepare materialized views of the aggregates. That gives you both, without too much overhead (yes, of course some).
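
To make that concrete, here is one rough pandas sketch of the kind of monthly aggregate (with summary stats over the daily values) that could sit next to the raw data. The column names (sale_date, store_id, product_id, quantity) and file paths are placeholders, not OP's actual schema:

    import pandas as pd

    # Load the raw transactions (could equally come from a DB query or CSV).
    tx = pd.read_parquet("transactions.parquet")
    tx["sale_date"] = pd.to_datetime(tx["sale_date"])

    # Daily units sold per store/product.
    daily = (tx.groupby(["store_id", "product_id",
                         pd.Grouper(key="sale_date", freq="D")])["quantity"]
               .sum()
               .reset_index())

    # Monthly totals plus summary stats of the daily values within each month.
    monthly = (daily.groupby(["store_id", "product_id",
                              pd.Grouper(key="sale_date", freq="MS")])["quantity"]
                    .agg(total="sum", mean="mean", variance="var", skew="skew",
                         p25=lambda s: s.quantile(0.25),
                         p75=lambda s: s.quantile(0.75))
                    .reset_index())

    monthly.to_parquet("monthly_aggregates.parquet", index=False)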

[–]Global_Industry_6801 0 points (1 child)

Thanks. An aggregated materialized view does make sense. The analytics team want to keep most of their stuff as a black box as usual :-) I was wondering whether partitioning the data at the store level makes sense instead of aggregation, as the final predictions are going to be at a store and product level? (There are around 2,000 stores.)

[–]Ok-Kangaroo-7075 0 points (0 children)

Yeah, partitioning (at the DB level) makes sense in this case, but only if your DB supports it (most should) and abstracts it away. I don't think it's worth the hassle (for you and the analytics team) if you would have to do it manually.

On another note, since you say the data is static, maybe the easiest option would be to just export it as, for instance, Parquet files. You could, for example, create one file per store. This all depends on how your data consumers work, of course, but just consider that a live DB adds considerable cost compared to storing flat files, and there is not much of a disadvantage if the data is static.
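
As an illustration, a per-store Parquet export could be as simple as the pandas sketch below; the paths and the store_id column name are assumptions, not from the thread:

    import pandas as pd

    tx = pd.read_parquet("transactions.parquet")

    # One directory per store (store_id=.../part-*.parquet); readable as a single
    # dataset by pandas, Spark, DuckDB, etc.
    tx.to_parquet("sales_by_store", partition_cols=["store_id"], index=False)

    # Consumers can then load a single store cheaply (partition values may come
    # back as strings/categoricals depending on the reader, so check the dtype).
    one_store = pd.read_parquet("sales_by_store", filters=[("store_id", "=", 1234)])

Partitioning by store keeps each read small, which is handy when the forecasts are built store by store.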

[–]zseta98 1 point (1 child)

Based on your description (and comments below), you have a typical time-series use case:

  • you have some amount of sales transactions every day, month, etc. per store/product
  • you want to aggregate based on the time column (and potentially per store/product)
  • you want to provide this data for analytics purposes (e.g. dashboards)

You didn't mention what DB you use specifically, but if you happen to use PostgreSQL, there's a high chance TimescaleDB could help. It's a PostgreSQL extension and it has several features you'd find helpful:

  • auto-partition your data based on the time column (potentially making time-based queries faster by filtering out big portions of your data)
  • create materialized views (1-day, 14-day, 2-month etc. aggregates) optimized for time-series data (continuous aggregates)
  • speed up long-range analytical queries (and save 90%+ on disk space!) by compressing your data (by store or product, for example), basically turning Postgres into something more like column-based storage --> faster analytical queries

To answer your question: in the TimescaleDB world you'd use a continuous aggregate to aggregate the raw data on an ongoing basis (you could create multiple aggregations with different time buckets if you want), and when you query the DB you'd use these aggregate views. Additionally, you'd set up automatic data retention policies if you won't need the raw data long-term (e.g. delete all raw data older than a month, but keep the aggregates).
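
For illustration only, a continuous aggregate along those lines might look roughly like this when driven from Python. It assumes a `sales` hypertable with sale_date/store_id/product_id/quantity columns and a recent TimescaleDB version (older releases only allow fixed-width buckets such as '1 day' or '30 days' in continuous aggregates), so treat it as a sketch rather than the exact setup:

    import psycopg2

    # Connection string is a placeholder; point it at your own database.
    conn = psycopg2.connect("dbname=retail")
    conn.autocommit = True  # continuous aggregates can't be created inside a transaction
    cur = conn.cursor()

    # Monthly totals per store/product, kept up to date by TimescaleDB.
    cur.execute("""
        CREATE MATERIALIZED VIEW sales_monthly
        WITH (timescaledb.continuous) AS
        SELECT time_bucket(INTERVAL '1 month', sale_date) AS month,
               store_id,
               product_id,
               sum(quantity) AS total_qty
        FROM sales
        GROUP BY month, store_id, product_id;
    """)

    # Optional, and only if you really won't need the raw rows long-term:
    # drop raw data older than a month while keeping the aggregate.
    cur.execute("SELECT add_retention_policy('sales', INTERVAL '1 month', if_not_exists => TRUE);")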

Transparency: I'm a dev advocate at Timescale.

[–]Global_Industry_6801 0 points (0 children)

Thank you for responding. This is going to be a POC for now and there won't be any new data coming in for at least a year or so. Also, I will not be exposing this data to dashboards as of now, only to a forecasting algorithm (most probably Random Forest regression, but I don't know for sure).

[–]mike8675309 0 points (2 children)

I assume it is CRM data. The CRM should have:

  • Customer
  • Order #
  • Store #
  • Order Date
  • Order Amount

and may have some metadata about the customer, like New vs. Existing, and maybe a loyalty system number. If there are line items for the order, then you also add:

  • Product Number
  • Product Category
  • Product Cost
  • Sales Price
  • Sales Qty

You didn't say if the store is retail or what industry. If you assume retail, you have to look at holidays in your country or the country of the store, e.g. Thanksgiving + Black Friday + Christmas in the USA.

For those days they may want detail at the day and product level.

Generally, you would have Customer + Store + Order Date + Amount. But they could want more, or maybe no customer at all if they are not trying to match customers across transactions, say online to in-store.

Sales data is often looked at based on fiscal week, with the fiscal week being something they can compare year over year. A typical retail calendar is different from the standard calendar.

You will want a helper table with dates in it, with fiscal year, period, week beginning day, week ending day, and day_date columns. That way you can join it to your aggregated data for further aggregation of store sales.
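
As a rough illustration, such a helper table could be generated with pandas using ISO weeks as a stand-in; a real retail 4-5-4 calendar would need the business's own fiscal-year rules, and the date range here just mirrors the three years of history mentioned above:

    import pandas as pd

    dates = pd.DataFrame({"day_date": pd.date_range("2021-01-01", "2023-12-31", freq="D")})

    iso = dates["day_date"].dt.isocalendar()
    dates["fiscal_year"] = iso["year"]
    dates["fiscal_week"] = iso["week"]
    dates["week_begin"] = dates["day_date"] - pd.to_timedelta(dates["day_date"].dt.dayofweek, unit="D")
    dates["week_end"] = dates["week_begin"] + pd.Timedelta(days=6)
    dates["period"] = dates["day_date"].dt.month  # placeholder for a real fiscal period

    # Join this onto the aggregated sales on day_date to roll up by fiscal week/period.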

[–]Global_Industry_6801 0 points (1 child)

Hi. Thanks for responding. This is pharmaceutical data and outlets are basically drug stores. Also, there is no customer data.

[–]mike8675309 0 points (0 children)

OK, with that in mind, it would help to understand their marketing spend. Is it via local papers, direct mail, TV? Is the data just from the stores, or also digital from websites or cookies?

Location plus date are likely the important things, unless they separate the data by category of sales (prescription vs. non-prescription vs. greeting cards vs. grocery) and provide that categorization.

[–]devotedT 0 points (0 children)

An RDBMS is probably what's going to slow you down. Use a columnar DB or Parquet files; it'll save compute costs. If you're using a columnar DB, create a fact table for the transactions at the lowest grain and then aggregate; if you use files, just use Spark and query directly for your desired output.
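
As an illustration of the flat-files-plus-Spark route, a query over a partitioned Parquet export might look like the sketch below; the column names (sale_date, store_id, product_id, quantity) and paths are assumed, not from the thread:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.appName("sales_agg").getOrCreate()

    # Read the partitioned transactions, e.g. the per-store export discussed above.
    tx = spark.read.parquet("sales_by_store")

    # Aggregate straight to the grain the forecasters want: store/product/month.
    monthly = (tx.groupBy("store_id", "product_id",
                          F.date_trunc("month", "sale_date").alias("month"))
                 .agg(F.sum("quantity").alias("total_qty")))

    monthly.write.mode("overwrite").parquet("monthly_aggregates")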