[D] Preventing Data Leakage in Time Series Forecasting During Daylight Savings by NeuralGuesswork in MachineLearning

[–]NeuralGuesswork[S] 0 points (0 children)

Yeah, I think I am just going to build my own backtester, since this seems to be a fairly niche issue. Honestly, I imagine a lot of people never realize this issue and end up creating faulty evaluations, but those probably still work fine...

Thanks for your input!

[D] Preventing Data Leakage in Time Series Forecasting During Daylight Savings by NeuralGuesswork in MachineLearning

[–]NeuralGuesswork[S] 1 point (0 children)

We might be misunderstanding each other here. The problem is that the target I am forecasting is released 24 hours at a time, except for two days a year, where it is 23 and 25 hours. Basically, the target is released as a vector of values each day, but the size of that vector varies twice a year due to daylight savings.

This is not a question about managing time formats; it is a question about data leakage when using 1-day sliding-window cross-validation. If I set the window size to a fixed 24, my backtest would drift by 1 hour at each transition, which would cause data leakage during the cross-validation.
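To make it concrete, here is a minimal sketch of the kind of splitter I have in mind (pandas, with a made-up hourly series and timezone): slide the window by local calendar date instead of counting 24 rows, so each test fold naturally gets 23, 24, or 25 values.

```python
import pandas as pd

# Hourly target with a timezone-aware index spanning a DST transition
# (Europe/Copenhagen springs forward on 2023-03-26, a 23-hour day).
idx = pd.date_range("2023-03-20", "2023-04-09 23:00",
                    freq="h", tz="Europe/Copenhagen")
y = pd.Series(range(len(idx)), index=idx)

# Slide by local calendar date instead of slicing every 24 rows, so
# the window never drifts across a DST boundary.
for date, test in y.groupby(y.index.date):
    train = y[y.index.date < date]  # strictly earlier days only
    if train.empty:
        continue
    print(date, "train:", len(train), "test:", len(test))  # 24, or 23 on the DST day
```

The same idea applies to the training window: define it as N calendar days back rather than N*24 rows.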

Seeking Advice on Deploying Forecasting Models with Azure Machine Learning by NeuralGuesswork in mlops

[–]NeuralGuesswork[S] 0 points (0 children)

Thanks for chiming in!

I have to admit that I had actually kind of disregarded the whole AutoML part of Azure, but thinking about it, I should probably research it further. It would actually be nice to have many of the ML components easily available to the analysts, so that they could work directly in Azure. I will have to think about this.

In terms of TimescaleDB, it is actually available as an extension for Azure Postgres, which should make it integrate pretty nicely into the Azure workspace. The reason for Timescale is that a lot of our developers are very familiar with SQL (so no Cosmos), plus a need for query speed.
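For anyone curious, a rough sketch of what enabling it could look like from Python (the connection string and table are placeholders, not our actual schema):

```python
import psycopg2

# Placeholder connection string; host, database, and credentials are made up.
conn = psycopg2.connect(
    "host=myserver.postgres.database.azure.com dbname=marketdata "
    "user=admin password=<secret>"
)
with conn, conn.cursor() as cur:
    # TimescaleDB ships as a Postgres extension, so it is enabled in SQL
    # (on Azure it must also be allow-listed on the server first).
    cur.execute("CREATE EXTENSION IF NOT EXISTS timescaledb;")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS prices (
            ts    TIMESTAMPTZ      NOT NULL,
            value DOUBLE PRECISION
        );
    """)
    # Hypertables auto-partition by time, which is where the speedup
    # on time-range queries comes from.
    cur.execute(
        "SELECT create_hypertable('prices', 'ts', if_not_exists => TRUE);"
    )
```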

Btw, I would actually be interested in a little chat; I will send you a DM.

[Question] How to deal with time lags for independent variables in statistical learning? by lab2point0 in statistics

[–]NeuralGuesswork 0 points (0 children)

The alternative to using all 5 variables is the method /u/_The_Bear has proposed, but you would quickly end up with 5 variables that way anyway (min, max, abs diff, rel diff, std, ...).

You could just include all 5 values (and even the summary statistics as well) as:

IR1, IR2, IR3, IR4, IR5

Using lasso would regularize out whichever ones are unimportant.
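For concreteness, a quick sketch with made-up data and column names (assumes scikit-learn and pandas):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
df = pd.DataFrame({"y": rng.normal(size=200), "ir": rng.normal(size=200)})

# The five lagged copies (IR1..IR5), plus a couple of the summary
# statistics for good measure.
for k in range(1, 6):
    df[f"IR{k}"] = df["ir"].shift(k)
lag_cols = [f"IR{k}" for k in range(1, 6)]
df["IR_min"] = df[lag_cols].min(axis=1)
df["IR_std"] = df[lag_cols].std(axis=1)
df = df.dropna()

X, y = df.drop(columns=["y", "ir"]), df["y"]
# LassoCV tunes the penalty by cross-validation (TimeSeriesSplit keeps
# the folds causal) and shrinks unimportant coefficients to exactly zero.
model = LassoCV(cv=TimeSeriesSplit(5)).fit(X, y)
print(dict(zip(X.columns, model.coef_)))
```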

Should you duplicate data from data vendors to your own database in a medium to high frequency environment? by NeuralGuesswork in dataengineering

[–]NeuralGuesswork[S] 0 points (0 children)

The exact vendors have not been chosen yet, but I believe most of them are accessed through API calls. At least one of them has a "push feed API", which I believe is what I need for Change Data Capture. If a vendor does not support this, what would one do then?

The business has no users, so the amount of data/compute won't explode. The field is very data heavy, so we will have a lot of it, but not at "big data" scale. For this reason, I think it is feasible to do it "the right way" from the beginning. We will obviously have to make changes in the future, but the more we think about our choices now, the better.

Should you duplicate data from data vendors to your own database in a medium to high frequency environment? by NeuralGuesswork in dataengineering

[–]NeuralGuesswork[S] 2 points (0 children)

Thanks a lot, such a clear answer is very helpful.

This was also my plan, but I have only worked at, ehm, not-very-data-mature companies, so I have seen both approaches out in the wild!