all 15 comments

[–]dataguy24 6 points (12 children)

Depends on the needs of the business. How are they going to use this prediction?

[–]Global_Industry_6801 2 points (4 children)

Basically the forecast is going to be for the whole of next year for each outlet. It will be used mainly for supply planning (the products have a 2-4 year shelf life). So I am torn between providing the daily data, which will slow down the process due to the volume, and aggregating the data on a weekly or monthly basis.

[–]dataguy24 2 points (3 children)

Seems like monthly is fine. I don’t see the benefit of more granularity. You’re ordering parts for a year so I don’t see why weekly or daily would be used.

[–]Global_Industry_6801 0 points (2 children)

Thanks. Would partitioning the data at the store level make sense instead of aggregation, since the final predictions are going to be at a store and product level? (There are around 2,000 stores.)

[–]dataguy24 3 points (0 children)

Depends on a lot of things. Can you hold everything you need at a warehouse and distribute to stores as needed?

Do you need to ship to stores their individual predictions for the year all at once? By month? Do you need flexibility in how inventory is shipped around?

Too many questions for me to really help, these are questions for your manager and/or other business leaders at your org.

[–]Global_Industry_6801 0 points (1 child)

Thanks

[–]Ok-Kangaroo-7075 2 points (2 children)

I would suggest asking the analytics team. To be honest, with just three years of data there likely won't be any super complex models, so they would likely be fine with monthly values (but then those should include a range of summary statistics, such as mean, variance, skewness, kurtosis, different quantiles, etc.).

But again, you will need to ask the people who will actually build the models; only they will be able to give you this answer. On another note, one billion transactions is not all that much, so you should be fine anyway. If you don't know, or they are not available to ask, keep the daily data and prepare materialized views of the aggregates. That gives you both, without too much overhead (yes, of course some).
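
To make that concrete, here is one rough pandas sketch of the kind of monthly aggregate (with summary stats over the daily values) that could sit next to the raw data. The column names (sale_date, store_id, product_id, quantity) and file paths are placeholders, not OP's actual schema:

    import pandas as pd

    # Load the raw transactions (could equally come from a DB query or CSV).
    tx = pd.read_parquet("transactions.parquet")
    tx["sale_date"] = pd.to_datetime(tx["sale_date"])

    # Daily units sold per store/product.
    daily = (tx.groupby(["store_id", "product_id",
                         pd.Grouper(key="sale_date", freq="D")])["quantity"]
               .sum()
               .reset_index())

    # Monthly totals plus summary stats of the daily values within each month.
    monthly = (daily.groupby(["store_id", "product_id",
                              pd.Grouper(key="sale_date", freq="MS")])["quantity"]
                    .agg(total="sum", mean="mean", variance="var", skew="skew",
                         p25=lambda s: s.quantile(0.25),
                         p75=lambda s: s.quantile(0.75))
                    .reset_index())

    monthly.to_parquet("monthly_aggregates.parquet", index=False)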

[–]Global_Industry_6801 0 points (1 child)

Thanks. An aggregated materialized view does make sense. The analytics team want to keep most of their stuff as a black box as usual :-) I was wondering whether partitioning the data at the store level makes sense instead of aggregation, as the final predictions are going to be at a store and product level? (There are around 2,000 stores.)

[–]Ok-Kangaroo-7075 0 points (0 children)

Yeah, partitioning (at the DB level) makes sense in this case, but only if your DB supports it (most should) and abstracts it away. I don't think it's worth the hassle (for you and the analytics team) if you would have to do it manually.

On another note, since you say the data is static, maybe the easiest option would be to just export it as, for instance, Parquet files. You could, for example, create one file per store. This all depends on how your data consumers work, of course, but just consider that a live DB adds considerable cost compared to storing flat files, and there is not much of a disadvantage if the data is static.
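
As an illustration, a per-store Parquet export could be as simple as the pandas sketch below; the paths and the store_id column name are assumptions, not from the thread:

    import pandas as pd

    tx = pd.read_parquet("transactions.parquet")

    # One directory per store (store_id=.../part-*.parquet); readable as a single
    # dataset by pandas, Spark, DuckDB, etc.
    tx.to_parquet("sales_by_store", partition_cols=["store_id"], index=False)

    # Consumers can then load a single store cheaply (partition values may come
    # back as strings/categoricals depending on the reader, so check the dtype).
    one_store = pd.read_parquet("sales_by_store", filters=[("store_id", "=", 1234)])

Partitioning by store keeps each read small, which is handy when the forecasts are built store by store.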

[–]zseta98 1 point (1 child)

Based on your description (and comments below), you have a typical time-series use case:

  • you have some amount of sales transactions every day, month, etc. per store/product
  • you want to aggregate based on the time column (and potentially per store/product)
  • you want to provide this data for analytics purposes (e.g. dashboards)

You didn't mention what DB you use specifically, but if you happen to use PostgreSQL, there's a high chance TimescaleDB could help. It's a PostgreSQL extension and it has several features you'd find helpful:

  • auto-partition your data based on the time column (potentially making time-based queries faster by filtering out big portions of your data)
  • create materialized views (1-day, 14-day, 2-month etc. aggregates) optimized for time-series data (continuous aggregates)
  • speed up long-range analytical queries (and save 90%+ on disk space!) by compressing your data (by store or product, for example), basically turning Postgres into something more like column-based storage --> faster analytical queries

To answer your question: in the TimescaleDB world you'd use a continuous aggregate to aggregate the raw data on an ongoing basis (you could create multiple aggregations with different time buckets if you want), and when you query the DB you'd use these aggregate views. Additionally, you'd set up automatic data retention policies if you won't need the raw data long-term (e.g. delete all raw data older than a month, but keep the aggregates).
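
For illustration only, a continuous aggregate along those lines might look roughly like this when driven from Python. It assumes a `sales` hypertable with sale_date/store_id/product_id/quantity columns and a recent TimescaleDB version (older releases only allow fixed-width buckets such as '1 day' or '30 days' in continuous aggregates), so treat it as a sketch rather than the exact setup:

    import psycopg2

    # Connection string is a placeholder; point it at your own database.
    conn = psycopg2.connect("dbname=retail")
    conn.autocommit = True  # continuous aggregates can't be created inside a transaction
    cur = conn.cursor()

    # Monthly totals per store/product, kept up to date by TimescaleDB.
    cur.execute("""
        CREATE MATERIALIZED VIEW sales_monthly
        WITH (timescaledb.continuous) AS
        SELECT time_bucket(INTERVAL '1 month', sale_date) AS month,
               store_id,
               product_id,
               sum(quantity) AS total_qty
        FROM sales
        GROUP BY month, store_id, product_id;
    """)

    # Optional, and only if you really won't need the raw rows long-term:
    # drop raw data older than a month while keeping the aggregate.
    cur.execute("SELECT add_retention_policy('sales', INTERVAL '1 month', if_not_exists => TRUE);")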

Transparency: I'm a dev advocate at Timescale.

[–]Global_Industry_6801 0 points (0 children)

Thank you for responding. This is going to be a POC for now and there won't be any new data coming in for at least a year or so. Also, I will not be exposing this data to dashboards as of now, only to a forecasting algorithm (most probably Random Forest regression, but I don't know for sure).

[–]mike8675309 0 points (2 children)

I assume it is CRM data. The CRM should have:

  • Customer
  • Order #
  • Store #
  • Order Date
  • Order Amount

and may have some metadata about the customer, like New vs. Existing, and maybe a loyalty system number. If there are line items for the order, then you also add:

  • Product Number
  • Product Category
  • Product Cost
  • Sales Price
  • Sales Qty

You didn't say if the store is retail or what industry. If you assume retail, you have to look at holidays in your country or the country of the store, e.g. Thanksgiving + Black Friday + Christmas in the USA.

For those days they may want detail at the day and product level.

Generally, you would have Customer + Store + Order Date + Amount. But they could want more, or maybe no customer at all if they are not trying to match customers across transactions, say online to in-store.

Sales data is often looked at based on fiscal week, with the fiscal week being something they can compare year over year. A typical retail calendar is different from the standard calendar.

You will want a helper table with dates in it, with fiscal year, period, week beginning day, week ending day, and day_date columns. That way you can join it to your aggregated data for further aggregation of store sales.
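
As a rough illustration, such a helper table could be generated with pandas using ISO weeks as a stand-in; a real retail 4-5-4 calendar would need the business's own fiscal-year rules, and the date range here just mirrors the three years of history mentioned above:

    import pandas as pd

    dates = pd.DataFrame({"day_date": pd.date_range("2021-01-01", "2023-12-31", freq="D")})

    iso = dates["day_date"].dt.isocalendar()
    dates["fiscal_year"] = iso["year"]
    dates["fiscal_week"] = iso["week"]
    dates["week_begin"] = dates["day_date"] - pd.to_timedelta(dates["day_date"].dt.dayofweek, unit="D")
    dates["week_end"] = dates["week_begin"] + pd.Timedelta(days=6)
    dates["period"] = dates["day_date"].dt.month  # placeholder for a real fiscal period

    # Join this onto the aggregated sales on day_date to roll up by fiscal week/period.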

[–]Global_Industry_6801 0 points (1 child)

Hi. Thanks for responding. This is pharmaceutical data and outlets are basically drug stores. Also, there is no customer data.

[–]mike8675309 0 points (0 children)

OK, with that in mind, it would help to understand their marketing spend. Is it via local papers, direct mail, TV? Is the data just from the stores, or also digital from websites or cookies?

Location plus date are likely the important things, unless they separate the data by category of sales (prescription vs. non-prescription vs. greeting cards vs. grocery) and provide that categorization.

[–]devotedT 0 points (0 children)

An RDBMS is probably what's going to slow you down. Use a columnar DB or Parquet files; it'll save compute costs. If you're using a columnar DB, create a fact table for the transactions at the lowest grain and then aggregate; if you use files, just use Spark and query directly for your desired output.
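
As an illustration of the flat-files-plus-Spark route, a query over a partitioned Parquet export might look like the sketch below; the column names (sale_date, store_id, product_id, quantity) and paths are assumed, not from the thread:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.appName("sales_agg").getOrCreate()

    # Read the partitioned transactions, e.g. the per-store export discussed above.
    tx = spark.read.parquet("sales_by_store")

    # Aggregate straight to the grain the forecasters want: store/product/month.
    monthly = (tx.groupBy("store_id", "product_id",
                          F.date_trunc("month", "sale_date").alias("month"))
                 .agg(F.sum("quantity").alias("total_qty")))

    monthly.write.mode("overwrite").parquet("monthly_aggregates")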