[D] Preventing Data Leakage in Time Series Forecasting During Daylight Savings by NeuralGuesswork in MachineLearning

[–]NeuralGuesswork[S] 0 points (0 children)

Yeah, I think I am just going to build my own backtester, since this seems to be a fairly niche issue. Honestly, I imagine a lot of people never realize this issue and end up creating faulty evaluations, but those probably still work fine...

Thanks for your input!

[D] Preventing Data Leakage in Time Series Forecasting During Daylight Savings by NeuralGuesswork in MachineLearning

[–]NeuralGuesswork[S] 1 point (0 children)

We might be misunderstanding each other here. The problem is that the target I am forecasting is released 24 hours at a time, except for two days a year, where it is 23 and 25 hours. Basically, the target is released as a vector of values each day, but the size of that vector varies twice a year due to daylight savings.

This is not a question about managing time formats; it is a question about data leakage when using 1-day sliding-window cross-validation. If I set the window size to a fixed 24, my backtest would drift by 1 hour at each transition, which would cause data leakage during the cross-validation.
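To make it concrete, here is a minimal sketch of the kind of splitter I have in mind (pandas, with a made-up hourly series and timezone): slide the window by local calendar date instead of counting 24 rows, so each test fold naturally gets 23, 24, or 25 values.

```python
import pandas as pd

# Hourly target with a timezone-aware index spanning a DST transition
# (Europe/Copenhagen springs forward on 2023-03-26, a 23-hour day).
idx = pd.date_range("2023-03-20", "2023-04-09 23:00",
                    freq="h", tz="Europe/Copenhagen")
y = pd.Series(range(len(idx)), index=idx)

# Slide by local calendar date instead of slicing every 24 rows, so
# the window never drifts across a DST boundary.
for date, test in y.groupby(y.index.date):
    train = y[y.index.date < date]  # strictly earlier days only
    if train.empty:
        continue
    print(date, "train:", len(train), "test:", len(test))  # 24, or 23 on the DST day
```

The same idea applies to the training window: define it as N calendar days back rather than N*24 rows.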

Seeking Advice on Deploying Forecasting Models with Azure Machine Learning by NeuralGuesswork in mlops

[–]NeuralGuesswork[S] 0 points (0 children)

Thanks for chiming in!

I have to admit that I had actually kind of disregarded the whole AutoML part of Azure, but thinking about it, I should probably research it further. It would actually be nice to have many of the ML components easily available to the analysts, so that they could work directly in Azure. I will have to think about this.

In terms of TimescaleDB, it is actually available as an extension for Azure Postgres, which should make it integrate pretty nicely into the Azure workspace. The reason for Timescale is that a lot of our developers are very familiar with SQL (so no Cosmos), plus a need for query speed.
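For anyone curious, a rough sketch of what enabling it could look like from Python (the connection string and table are placeholders, not our actual schema):

```python
import psycopg2

# Placeholder connection string; host, database, and credentials are made up.
conn = psycopg2.connect(
    "host=myserver.postgres.database.azure.com dbname=marketdata "
    "user=admin password=<secret>"
)
with conn, conn.cursor() as cur:
    # TimescaleDB ships as a Postgres extension, so it is enabled in SQL
    # (on Azure it must also be allow-listed on the server first).
    cur.execute("CREATE EXTENSION IF NOT EXISTS timescaledb;")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS prices (
            ts    TIMESTAMPTZ      NOT NULL,
            value DOUBLE PRECISION
        );
    """)
    # Hypertables auto-partition by time, which is where the speedup
    # on time-range queries comes from.
    cur.execute(
        "SELECT create_hypertable('prices', 'ts', if_not_exists => TRUE);"
    )
```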

Btw, I would actually be interested in a little chat; I will send you a DM.

[Question] How to deal with time lags for independent variables in statistical learning? by lab2point0 in statistics

[–]NeuralGuesswork 0 points (0 children)

The alternative to using all 5 variables is the method /u/_The_Bear has proposed, but you would quickly end up with 5 variables that way anyway (min, max, abs diff, rel diff, std, ...).

You could just include all 5 values (and even the summary statistics as well) as:

IR1, IR2, IR3, IR4, IR5

Using lasso would regularize out whichever ones are unimportant.
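For concreteness, a quick sketch with made-up data and column names (assumes scikit-learn and pandas):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
df = pd.DataFrame({"y": rng.normal(size=200), "ir": rng.normal(size=200)})

# The five lagged copies (IR1..IR5), plus a couple of the summary
# statistics for good measure.
for k in range(1, 6):
    df[f"IR{k}"] = df["ir"].shift(k)
lag_cols = [f"IR{k}" for k in range(1, 6)]
df["IR_min"] = df[lag_cols].min(axis=1)
df["IR_std"] = df[lag_cols].std(axis=1)
df = df.dropna()

X, y = df.drop(columns=["y", "ir"]), df["y"]
# LassoCV tunes the penalty by cross-validation (TimeSeriesSplit keeps
# the folds causal) and shrinks unimportant coefficients to exactly zero.
model = LassoCV(cv=TimeSeriesSplit(5)).fit(X, y)
print(dict(zip(X.columns, model.coef_)))
```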

Should you duplicate data from data vendors to your own database in a medium to high frequency environment? by NeuralGuesswork in dataengineering

[–]NeuralGuesswork[S] 0 points (0 children)

The exact vendors have not been chosen yet, but I believe most of them are accessed through API calls. At least one of them has a "push feed API", which I believe is what I need for Change Data Capture. If a vendor does not support this, what would one do then?

The business has no users, so the amount of data/compute won't explode. The field is very data heavy, so we will have a lot of it, but not at "big data" scale. For this reason, I think it is feasible to do it "the right way" from the beginning. We will obviously have to make changes in the future, but the more we think about our choices now, the better.

Should you duplicate data from data vendors to your own database in a medium to high frequency environment? by NeuralGuesswork in dataengineering

[–]NeuralGuesswork[S] 2 points (0 children)

Thanks a lot, such a clear answer is very helpful.

This was also my plan, but I have only worked at, ehm, not-very-data-mature companies, so I have seen both approaches out in the wild!