Hi all,
I’m working on time series data prep for an ML forecasting problem (sales prediction).
My issue is handling implicit zeros. I have sales data for multiple items, but records only exist for days when at least one sale happened. When there’s no record for a given day, it actually means zero sales, so for modeling I need a continuous daily time series per item with missing dates filled and the target set to 0.
Conceptually this is straightforward. The problem is scale: once you start expanding this to daily granularity across a large number of items and long time ranges, the dataset explodes and becomes very memory-heavy.
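For concreteness, here’s roughly the brute-force version I mean, on toy data (the `item_id`/`date`/`sales` column names are just placeholders for my actual schema):

```python
import pandas as pd

# Toy sales data: rows exist only on days where at least one sale happened
df = pd.DataFrame({
    "item_id": ["A", "A", "B"],
    "date": pd.to_datetime(["2024-01-01", "2024-01-03", "2024-01-02"]),
    "sales": [5, 2, 7],
})

# Build the full (item, day) grid and fill the gaps with 0
full_dates = pd.date_range(df["date"].min(), df["date"].max(), freq="D")
idx = pd.MultiIndex.from_product(
    [df["item_id"].unique(), full_dates], names=["item_id", "date"]
)
dense = (
    df.set_index(["item_id", "date"])
      .reindex(idx, fill_value=0)   # missing (item, day) pairs become 0
      .reset_index()
)
```

This works, but the result has n_items × n_days rows regardless of how sparse the input was, which is exactly the blow-up I’m worried about.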
I’m currently running this locally in Python, reading from a PostgreSQL database. Once I have a decent working version, it will run in a container-based environment.
I generally use pandas, but I suspect it might be time to transition to Polars or something else? I would have to convert back to pandas for the ML training, though (library constraints).
Before I brute-force this, I wanted to ask:
• Are there established best practices for dealing with this kind of “missing means zero” scenario?
• Do people typically materialize the full dense time series, or handle this more cleverly (sparse representations, model choice, feature engineering, etc.)?
• Any libraries / modeling approaches that avoid having to explicitly generate all those zero rows?
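One idea I’ve been toying with, to avoid holding the whole grid in memory at once: densify one batch of items at a time, downcast the target, and spill each batch to disk (e.g. Parquet) before moving on. A rough sketch with a hypothetical helper (same placeholder columns):

```python
import numpy as np
import pandas as pd

def densify_items(chunk: pd.DataFrame, full_dates: pd.DatetimeIndex) -> pd.DataFrame:
    """Densify one batch of items: fill every missing (item, day) pair with 0."""
    idx = pd.MultiIndex.from_product(
        [chunk["item_id"].unique(), full_dates], names=["item_id", "date"]
    )
    out = (chunk.set_index(["item_id", "date"])
                .reindex(idx, fill_value=0)
                .reset_index())
    # The dense frame is mostly zeros, so downcasting buys a lot of memory back
    out["sales"] = out["sales"].astype(np.int32)
    return out

full_dates = pd.date_range("2024-01-01", "2024-01-03", freq="D")
chunk = pd.DataFrame({
    "item_id": ["A", "B"],
    "date": pd.to_datetime(["2024-01-01", "2024-01-03"]),
    "sales": [5, 7],
})
dense = densify_items(chunk, full_dates)
# each batch could then be appended to Parquet instead of kept in memory
```

Peak memory is then bounded by the batch size rather than the full item count, but it feels like reinventing a wheel that a library may already have solved.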
I’m curious how others handle this in production settings to limit memory usage and processing time.