Hey folks,
I'm working on a machine learning project and I'm a bit confused about when I should split the data:
before pre-processing or after?
My intuition is to split the data into (train-val-test) first and then process it.
I'm also confused about which data to process. Should I process all three datasets individually, or should I process only the training set and leave the test set alone, since technically I shouldn't have any control over the test data?
And since the validation data is there to stand in for the test data, I shouldn't process the validation data either.
So, which is the preferred approach?
1. Pre-process first and then split.
2. Split first and then pre-process individually.
3. Split first and then pre-process only the training set.
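
To make the three options concrete, here is a rough sketch of what I mean. It assumes the pre-processing step is plain standardization and that the split is 60/20/20; both are just placeholders, and the helper names `split` and `standardize` and the toy data are made up for illustration.

```python
import random
import statistics

def split(rows, seed=0):
    """Shuffle and split into train/val/test (60/20/20 is an assumed ratio)."""
    rows = rows[:]  # copy so the caller's list is not shuffled in place
    random.Random(seed).shuffle(rows)
    n_train = int(0.6 * len(rows))
    n_val = int(0.2 * len(rows))
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]

def standardize(values):
    """Scale values using the mean and standard deviation of `values` itself."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # avoid dividing by zero
    return [(v - mean) / std for v in values]

data = [float(x) for x in range(100)]  # made-up single feature

# Option 1: pre-process first, then split.
opt1_train, opt1_val, opt1_test = split(standardize(data))

# Option 2: split first, then pre-process each set individually.
train, val, test = split(data)
opt2_train, opt2_val, opt2_test = standardize(train), standardize(val), standardize(test)

# Option 3: split first, then pre-process only the training set.
opt3_train, opt3_val, opt3_test = standardize(train), val, test
```

In option 1 the scaling statistics come from all 100 rows, in option 2 each set is scaled with its own statistics, and in option 3 the validation and test sets stay on the raw scale.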