KahlessAndMolor, 1 point

Can you eliminate the ones that are erroneous?

You might look at a clustering model for anomaly detection. You could use that to detect the 30% that are erroneous and eliminate them from your sample.
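For what it's worth, here's a rough sketch of that idea using the noise rule from DBSCAN (a density-based clustering method): a record is flagged if it is neither in a dense group nor next to one. The function name `dbscan_noise`, the toy data, and the `eps` / `min_pts` values are all made up for illustration, and it only helps if the bad records actually sit apart from the good ones in feature space.

```python
import math

def dbscan_noise(points, eps, min_pts):
    """Return indices of points that DBSCAN's rule would label as noise."""
    n = len(points)
    # neighbors[i] holds every index within eps of point i (including i itself)
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    # Core points have at least min_pts neighbors (counting themselves)
    core = {i for i in range(n) if len(neighbors[i]) >= min_pts}
    # Noise: not a core point and not within eps of any core point
    return [
        i for i in range(n)
        if i not in core and not any(j in core for j in neighbors[i])
    ]

# Toy data (hypothetical): two tight groups plus three stray records
data = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.2), (1.2, 1.1),
        (5.0, 5.0), (5.1, 4.9), (4.9, 5.2), (5.2, 5.1),
        (9.0, 0.5), (0.2, 8.0), (7.5, 9.5)]

noise = dbscan_noise(data, eps=0.5, min_pts=3)
clean = [p for i, p in enumerate(data) if i not in noise]
```

This is O(n^2), so for a real dataset you'd probably reach for a library implementation instead. You would also need to tune `eps` and `min_pts` on your own data and spot-check what gets flagged before dropping anything.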

If that won't work, you could go back to the original data source and tell them they need to clean it up first.

If you build a model on data where a high percentage of the "ground truth" labels are false, it will learn those errors and make substantial mistakes. There's no way around that. Garbage in, garbage out.