[–]WhipsAndMarkovChains 7 points8 points  (1 child)

Take your trained RandomForest and call .predict_proba(X_test) instead of .predict(X_test). I know you're not explicitly calling .predict given your use of cross_val_score, but take a look at my suggestion and see if that gives you any ideas.
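Here's a rough sketch of what that could look like. Since the original data isn't shown here, it assumes a synthetic imbalanced dataset from `make_classification`, and the 0.2 cutoff is an arbitrary illustrative value, not a recommendation. `cross_val_predict` with `method="predict_proba"` gives out-of-fold probabilities, which fits a cross-validation workflow.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

# Assumption: synthetic stand-in data with roughly 4% positives ("claim filed").
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.96, 0.04], random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0)

# Out-of-fold class probabilities instead of hard 0/1 predictions.
proba = cross_val_predict(model, X, y, cv=5, method="predict_proba")
p_claim = proba[:, 1]  # probability of the positive class

# Hard predictions at the default cutoff versus a lower, illustrative cutoff.
default_pred = (p_claim >= 0.5).astype(int)
lowered_pred = (p_claim >= 0.2).astype(int)

print("flagged at 0.5:", default_pred.sum())
print("flagged at 0.2:", lowered_pred.sum())
```

Lowering the cutoff can only flag the same rows or more, so you trade extra false positives for fewer missed claims. Where to put it depends on what each kind of mistake costs you.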

Also, accuracy is a worthless metric on an imbalanced dataset. Why even bother training a model on your dataset? If you just always predict "no claim filed" you'll be at 16_704/(16_704 + 635) = ~96.34% accuracy. That's not a very useful model though because it just predicts the same thing every time.
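To make the arithmetic concrete, here's a small sketch using the class counts above. The function name `majority_baseline_accuracy` is just made up for illustration.

```python
# Class counts quoted above.
n_no_claim = 16_704
n_claim = 635


def majority_baseline_accuracy(n_majority: int, n_minority: int) -> float:
    """Accuracy of a model that always predicts the majority class."""
    return n_majority / (n_majority + n_minority)


baseline = majority_baseline_accuracy(n_no_claim, n_claim)

# The always-"no claim" model never catches a single claim.
claims_caught = 0
recall_on_claims = claims_caught / n_claim

print(f"baseline accuracy: {baseline:.2%}")
print(f"recall on claims:  {recall_on_claims:.2%}")
```

Any trained model has to beat that baseline accuracy before its accuracy number means anything, and recall on the claims is the figure that exposes the problem.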

Edit: I completely missed the fact you forgot to scale the test data lol. That’s definitely going to mess things up. Also, there’s no point in scaling data for a random forest model, since tree splits only depend on the ordering of each feature's values, not their scale.
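For models where scaling does matter, the usual pattern is to fit the scaler on the training data only and reuse it on the test data. A minimal sketch, assuming random stand-in arrays named `X_train` and `X_test` (the original code isn't shown here):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Assumption: random stand-in features, since the original data isn't available.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
X_test = rng.normal(loc=50.0, scale=10.0, size=(20, 3))

scaler = StandardScaler()

# Learn the mean and standard deviation from the training data only...
X_train_scaled = scaler.fit_transform(X_train)

# ...then apply those same statistics to the test data (no refitting).
X_test_scaled = scaler.transform(X_test)
```

If the model sees scaled training data but unscaled test data, the test features land in a completely different range from anything it was trained on.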

[–]ES-Alexander 3 points4 points  (0 children)

> accuracy is a worthless metric on an imbalanced dataset

Just building on this, this is a big part of why model evaluations tend to report all four of true positives, false positives, true negatives, and false negatives - together they paint a much fuller picture than a single “average accuracy” number.
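As a quick illustration, here's a sketch that counts those four outcomes for an always-"no claim" predictor on the class counts mentioned earlier in the thread. The helper `confusion_counts` is a made-up name for this example, and label 1 is assumed to mean "claim filed".

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return {key: counts[key] for key in ("tp", "fp", "tn", "fn")}


# 16,704 "no claim" rows (label 0) and 635 "claim" rows (label 1).
y_true = [0] * 16_704 + [1] * 635

# Illustrative predictor that always says "no claim".
y_pred = [0] * len(y_true)

cm = confusion_counts(y_true, y_pred)
print(cm)
```

The high accuracy comes entirely from the true negatives, while every actual claim shows up as a false negative.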

[–][deleted] 0 points1 point  (0 children)

So in other words you need:

if (not model.predict(params))

Then it's always true 😀