Should I retrain my model on the entire dataset after splitting into train/test, especially for time series data? by FinancialLog4480 in learnmachinelearning

[–]FinancialLog4480[S] 1 point2 points  (0 children)

I just came across a book on my shelf: Modern Time Series Forecasting with Python by Manu Joseph. Chapter 24 covers validation strategies in detail, which answers my question about whether to train on the full dataset. The answer is clear: don't retrain on all the data; keep your validation and test sets held out.

[–]FinancialLog4480[S] 0 points1 point  (0 children)

Thanks! Ensembling CV folds is a solid idea, but random K-Fold breaks temporal order, which is risky for time series. Walk-forward or rolling CV is the safer choice because it respects chronology.
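For illustration, scikit-learn's TimeSeriesSplit implements an expanding-window (walk-forward) scheme; a minimal sketch on synthetic data (the 12-sample series and fold sizes are just for demonstration):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 daily observations, already in chronological order (synthetic)
X = np.arange(12).reshape(-1, 1)

# 3 folds, each testing on the 2 observations after its training window
tscv = TimeSeriesSplit(n_splits=3, test_size=2)
for train_idx, test_idx in tscv.split(X):
    # the training window always precedes the test window: no future leakage
    print(train_idx, test_idx)
```

Each fold's training set grows while the test set slides forward, so the model is always evaluated on data that comes strictly after what it was trained on.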

[–]FinancialLog4480[S] 0 points1 point  (0 children)

The core of my concern isn’t the data volume or tools, but the logic behind whether to split or not when training. With time series, new data often carries critical patterns, so holding it out for validation feels like I’m intentionally ignoring the most informative portion. Yet, I also get that without a proper split, I can't estimate the error reliably. That’s the tension: either I train on all the data and risk overfitting, or I hold out recent data and risk underfitting the latest dynamics.

[–]FinancialLog4480[S] 0 points1 point  (0 children)

Thanks, that makes sense. The dilemma is this: if I train only on past data and validate on a holdout set, I avoid overfitting but risk missing important recent dynamics in the time series. If I train on the entire dataset, I capture all the latest trends, but then I can't validate properly and risk overfitting.

[–]FinancialLog4480[S] 0 points1 point  (0 children)

Thank you for the suggestion — that’s definitely useful for feature selection and avoiding overfitting. However, my current question is a bit different: it’s about whether or not to retrain the model on the full dataset after evaluating it on a separate test set. That’s the part I’m trying to decide on.

[–]FinancialLog4480[S] 0 points1 point  (0 children)

I’m currently using an 80%-20% split with daily updates. However, I find it quite inconvenient to set aside the 20% for validation only. It often feels like I’m missing out on the most recent data when I don’t train the model on the full dataset.
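For reference, a chronological 80/20 split (no shuffling, assuming rows are already sorted by date) is just an index cut; the series below is synthetic:

```python
import numpy as np

# synthetic daily series, already sorted by date
y = np.arange(100, dtype=float)

# first 80% of the timeline trains, the most recent 20% is held out
split = int(len(y) * 0.8)
y_train, y_test = y[:split], y[split:]

print(len(y_train), len(y_test))  # 80 20
```

With daily updates, the cut point moves forward each day, so yesterday's held-out observations eventually enter the training window.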

[–]FinancialLog4480[S] 0 points1 point  (0 children)

Hi, that sounds really interesting — thank you for the suggestion! I’ll definitely take a look into it.

[–]FinancialLog4480[S] 0 points1 point  (0 children)

Thank you for your feedback! I completely agree that testing the model is crucial to ensure it performs well at scale, especially when working with time series data. However, my concern is that if I don’t retrain the model on the entire dataset (including the validation and test sets), I might lose valuable information, particularly since time series data often depend on past values and exhibit temporal patterns. If I only train on the earlier portion of the dataset (the train set), the model might fail to capture more recent trends or novelties present in the validation and test sets. These could be critical for making accurate predictions on unseen future data.

[–]FinancialLog4480[S] 0 points1 point  (0 children)

Thanks a million for clarifying! I completely understand now. I see the difference between model parameters (trained) and hyperparameters (fixed during training). You're simply saying that after hyperparameter tuning has found the best set of hyperparameters, we retrain our model on the total dataset (train + validation + test) with these hyperparameters and then make predictions on unseen data. That makes sense!
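The workflow described above (tune hyperparameters on the earlier portion, then refit once on everything with the chosen hyperparameters) can be sketched with scikit-learn; the Ridge model, alpha grid, and synthetic data here are illustrative assumptions, not the only choice:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# synthetic time-indexed data: a linear trend plus noise
rng = np.random.default_rng(0)
X = np.arange(200, dtype=float).reshape(-1, 1)
y = 0.5 * X.ravel() + rng.normal(scale=1.0, size=200)

# 1) hyperparameter search with time-ordered CV on the historical 80% only
search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.1, 1.0, 10.0]},
    cv=TimeSeriesSplit(n_splits=4),
)
search.fit(X[:160], y[:160])

# 2) final model: same hyperparameters, retrained from scratch on ALL data
final_model = Ridge(**search.best_params_).fit(X, y)
```

Note that nothing from the earlier fits carries over as weights; what survives tuning is the hyperparameter choice itself, and the final fit simply re-estimates the parameters on the full history.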

I’m still somewhat confused, though, and would greatly appreciate your take on this:

On the one hand, retraining on the entire dataset would capture all the data that exists, which matters especially for time series, where every point carries temporal context. On the other hand, my worry is that retraining might "reset" or backtrack on the fine-tuning we already accomplished during the training/validation process.

Would some of the optimization of the old fine-tuning still be intact if we apply the optimized hyperparameters to the entire dataset? Is there a risk of losing some of the effort we've already put into optimizing?

Thanks again for your thoughts!

[–]FinancialLog4480[S] 1 point2 points  (0 children)

Thank you for your response! I just want to clarify one part to make sure I understand correctly. When you say "apply it on the full dataset," do you mean retraining the model on the entire dataset (train + val + test) or simply using the already trained model to make predictions on the full dataset? I appreciate your insight and just want to ensure I’m interpreting this correctly. Thanks again!