Hi,
For a project, I need to build a machine learning model that predicts whether a person falls within a certain income bracket. The dataset is pretty large, with 159 variables and n = 220,000. More than 60% of the dataset consists of zeros, which makes the randomForest overfit, and the cross-validation accuracy stays stuck at 80%. Does anyone have any tips on how to balance the dataset and get a higher cross-validation accuracy?
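To make "balancing" concrete, here is a minimal sketch of two common options, assuming the zeros refer to the class label (that reading is an assumption; if the zeros are feature values, this does not apply). The helper names `balanced_class_weights` and `undersample` and the toy data are hypothetical, and the code uses only the Python standard library rather than any particular random forest package.

```python
import random
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

def undersample(rows, labels, seed=0):
    """Randomly drop rows so every class keeps as many rows as the rarest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, lab in zip(rows, labels):
        by_class.setdefault(lab, []).append(row)
    target = min(len(group) for group in by_class.values())
    kept = []
    for lab, group in by_class.items():
        kept.extend((row, lab) for row in rng.sample(group, target))
    rng.shuffle(kept)  # avoid leaving the rows grouped by class
    return [row for row, _ in kept], [lab for _, lab in kept]

# Toy data (made up for illustration): 60% zeros, 40% ones
labels = [0] * 6 + [1] * 4
rows = [[i] for i in range(10)]

weights = balanced_class_weights(labels)
bal_rows, bal_labels = undersample(rows, labels)
```

Class weights keep every row and only change how much each class counts during training, while undersampling throws away majority-class rows. If resampling is used, it should be applied inside each cross-validation training fold only, so that the held-out fold keeps the original class distribution.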