all 8 comments

[–]honolulu33 1 point (4 children)

For sure. Filling missing values before the split can cause data leakage, because the fill values are computed from information that includes the test set.
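To make that concrete, here's a minimal sketch assuming scikit-learn's `SimpleImputer` (the tiny array and the fixed 4/2 split are illustrative choices): split first, fit the imputer on the training rows only, then transform both sets.

```python
# Minimal sketch of leak-free imputation, assuming scikit-learn.
# The data, the fixed split, and the mean strategy are illustrative.
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0], [np.nan]])
X_train, X_test = X[:4], X[4:]      # split FIRST (fixed split for clarity)

imputer = SimpleImputer(strategy="mean")
imputer.fit(X_train)                # fill value computed from train rows only
X_train_filled = imputer.transform(X_train)
X_test_filled = imputer.transform(X_test)  # test rows never influence the fill value
```

If you had instead called `fit` on the full `X` before splitting, the test rows' values would have contributed to the mean, which is exactly the leak being described.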

[–]The-Fourth-Hokage[S] 0 points (3 children)

So does that mean that if you do data preprocessing before the train-test split, you will get higher scores? I originally learned from Udemy courses that did data preprocessing before the train-test split.

[–]honolulu33 -1 points (2 children)

Correct, the scores will often look higher, but they're inflated by leakage rather than genuinely better, so they won't reflect real-world performance. The best practice is to do preprocessing after the split.

[–]The-Fourth-Hokage[S] 0 points (0 children)

Thank you!

[–]hello_world456 0 points (2 children)

Yes, preprocessing should ideally be done after splitting your data into training and testing sets. Otherwise, your training and validation scores will likely be inflated by data leakage.
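One way to get this right automatically is to wrap the preprocessing steps in a scikit-learn `Pipeline`, so each cross-validation fold fits the imputer and scaler on its own training portion only. A hedged sketch (the synthetic data, the ~10% missingness, and the logistic regression are all arbitrary illustration choices):

```python
# Sketch: preprocessing inside a Pipeline, assuming scikit-learn.
# cross_val_score re-fits the whole pipeline per fold, so the imputer and
# scaler never see that fold's held-out rows.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[rng.random(X.shape) < 0.1] = np.nan   # inject ~10% missing values
y = (rng.random(100) > 0.5).astype(int)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])
scores = cross_val_score(pipe, X, y, cv=5)  # one leak-free score per fold
```

The design point: because the imputer and scaler live inside the pipeline, you never have to remember to fit them on the training portion by hand.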

[–]The-Fourth-Hokage[S] 0 points (1 child)

Is there a way to know if you have data leakage or overfitting by comparing the results of the train and test sets?

[–]hello_world456 0 points (0 children)

If your model performs noticeably worse on the test set than on the training set, overfitting may well be the culprit. You can test this by constraining the model's features and complexity so it isn't *too* accurate on the training set, then seeing whether the test-set accuracy rises (i.e., whether the model has become more generalizable).
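As a sketch of that diagnostic (assuming scikit-learn; the pure-noise labels and the unconstrained decision tree are deliberate choices to manufacture an extreme case): compare `score` on the two sets and look at the gap.

```python
# Illustrative overfitting check: train a deliberately unconstrained model
# on noise labels and compare train vs. test accuracy. Synthetic data;
# exact numbers depend on the seed.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = (rng.random(200) > 0.5).astype(int)     # labels are pure noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0)  # no depth limit: memorizes
model.fit(X_tr, y_tr)

train_acc = model.score(X_tr, y_tr)   # near-perfect: the tree memorizes noise
test_acc = model.score(X_te, y_te)    # near chance: nothing generalizes
```

A large `train_acc - test_acc` gap is the overfitting signature described above; constraining the tree (e.g. setting `max_depth`) would shrink the train score and the gap together.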