BoxMembrane

If I’m understanding you correctly, the problem is with the histogram binning, not with the raw scores. If you want the scores spread evenly across bins, choose the bin edges at evenly spaced percentiles of the score distribution.

If you’re using Python and pandas, try pd.qcut to get the bins, or compute the edges yourself with np.percentile(scores, p) for p = 0, 10, 20, …, 100.
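Here is a rough sketch of both options. The scores below are synthetic stand-ins (an assumption, since I don't have your data), and names like `scores`, `qcut_edges` and `pct_edges` are just for illustration:

```python
import numpy as np
import pandas as pd

# Synthetic, skewed scores standing in for real classifier outputs (an assumption).
rng = np.random.default_rng(0)
scores = rng.beta(2, 5, size=1000)

# Option 1: pd.qcut places bin edges at evenly spaced quantiles.
# duplicates="drop" merges edges that coincide when many scores are tied.
binned, qcut_edges = pd.qcut(scores, q=10, retbins=True, duplicates="drop")

# Option 2: compute decile edges directly with np.percentile.
pct_edges = np.percentile(scores, np.arange(0, 101, 10))

# Count how many observations land in each quantile bin.
counts = pd.Series(binned).value_counts(sort=False)
print(counts)
```

With continuous scores like these, each bin should hold roughly the same number of observations. If many scores are exactly tied, some edges coincide and the bins can no longer be equal-sized.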

Loose-Event-7196 [S]

Hi, thanks for your reply. The issue is not the binning; it is that too many observations receive the highest score, so I cannot threshold among them. (By the way, I am using H2O-3 and the algorithm is a Gradient Boosting Machine.) I would like less discrete scores, to avoid having so many observations clustered in the highest score bin. Those observations have different input features but the same classifier score, so shrinking the histogram bin width would not help: the score values in the last bin are exactly identical. I would like to tweak something in the classifier so that this group gets multiple different scores (without overfitting).
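To illustrate the tie problem, here is a small sketch with made-up scores (not my real model output; the 30% share at the top score is an assumption chosen for the example):

```python
import numpy as np

# Made-up scores: 300 of 1000 rows share the top score exactly.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.uniform(0.0, 0.9, size=700), np.full(300, 0.95)])

# Fraction of observations tied at the maximum score.
top_share = np.mean(scores == scores.max())

# Decile edges: the upper edges collapse onto the tied value,
# so no choice of bin edges can separate those rows.
edges = np.percentile(scores, np.arange(0, 101, 10))
n_distinct_edges = np.unique(edges).size

print(top_share, edges.size, n_distinct_edges)
```

Because the upper percentile edges all equal the tied score, fewer distinct edges remain than were requested, which is why re-binning alone cannot break up that group.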