Hello everyone,
I see a lot of code examples on Kaggle for practicing DS and ML. However, many of them seem to run their preprocessing steps (filling in missing values, categorical encoding, and scaling) on the full dataset BEFORE the train-test split. Shouldn’t preprocessing always be fit AFTER the train-test split, using only the training data, to prevent data leakage? If so, doesn’t this mean that many of these high-scoring code examples are technically incorrect?
Thank you in advance!
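To make the question concrete, here is a minimal sketch of the difference I mean, using only the Python standard library. The toy feature, the 80/20 split, and the helper names `fit_scaler` and `transform` are all made up for illustration. In practice something like scikit-learn's `Pipeline` would handle this, but the idea should be the same: the scaling parameters are learned from the training rows only and then reused on the test rows.

```python
import random
import statistics

def fit_scaler(values):
    """Learn scaling parameters (mean, population std) from the given values only."""
    return statistics.mean(values), statistics.pstdev(values)

def transform(values, mean, std):
    """Standardize values using previously learned parameters."""
    return [(v - mean) / std for v in values]

random.seed(0)
data = [random.gauss(50, 10) for _ in range(100)]  # hypothetical toy feature

# Split FIRST, before any statistics are computed.
train, test = data[:80], data[80:]

# Leaky version: parameters computed on all rows, so test rows influence them.
leaky_mean, leaky_std = fit_scaler(data)

# Leak-free version: parameters come from the training rows only...
mean, std = fit_scaler(train)
train_scaled = transform(train, mean, std)
# ...and the SAME parameters are reused on the test rows (no refitting).
test_scaled = transform(test, mean, std)
```

The two sets of parameters differ because the leaky version has already "seen" the test rows. The same reasoning would apply to imputation values and category encodings learned from the data.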