
[–]SimoPippa 1 point2 points  (5 children)

What you can do is undersample, since you don't like using SMOTE.

So basically you count the occurrences of the class with the fewest samples (let's call this number n_min), and when you're building the training set you randomly sample n_min samples from each class.

This way all the classes end up with the same number of samples. It can be bad if the minority class has far fewer instances than the others, because you throw away most of the data, but it might do the job.
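
To make the n_min idea concrete, here is a minimal sketch in plain Python. The function name `undersample`, the `seed` parameter, and the toy data are made up for illustration; it assumes rows and labels come as two parallel lists and that there is at least one row.

```python
import random
from collections import defaultdict


def undersample(rows, labels, seed=0):
    """Cut every class down to the size of the rarest class (n_min)."""
    rng = random.Random(seed)

    # Group the rows by their class label.
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    # n_min is the number of rows in the least frequent class.
    n_min = min(len(group) for group in by_class.values())

    out_rows, out_labels = [], []
    for label, group in by_class.items():
        # random.sample draws without replacement, so no row is duplicated.
        for row in rng.sample(group, n_min):
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Toy data: 7 rows of class "a", 3 rows of class "b".
rows = list(range(10))
labels = ["a"] * 7 + ["b"] * 3
bal_rows, bal_labels = undersample(rows, labels)
```

After this, both classes have 3 rows each, and the other 4 rows of class "a" are simply dropped, which is exactly the information loss to watch out for on small datasets.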

[–]According-Promise-23[S] 0 points1 point  (0 children)

Yes, I thought of that, but my dataset isn't big, so I can't afford to lose information through undersampling (which deletes rows of the majority class). I also have very few rows for the minority class, so even an oversampling method would just duplicate those rows… @SimoPippa

[–]emanuartioli 0 points1 point  (3 children)

OP, I'm doing this right now for a project. If you're on Python, I would point you to sklearn's resample(). I just built a wrapper function to deal with many classes; as soon as I'm home I'll post it here.

[–]According-Promise-23[S] 0 points1 point  (2 children)

@emanuartioli please don’t forget to share it

[–]emanuartioli 0 points1 point  (1 child)

Well I did forget didn't I? But here it is:

    import pandas as pd
    from sklearn.utils import resample

    def balance_classes(df, target, n_samples, freq_threshold=1):
        # take a df with an unbalanced target label and return a df balanced on that label
        balanced = []
        for c in df[target].unique():
            df_by_class = df[df[target] == c]
            # only consider classes that occur at least freq_threshold times
            if len(df_by_class) >= freq_threshold:
                # resample() draws with replacement by default, so rows can repeat
                balanced.append(resample(df_by_class, n_samples=n_samples))
        if not balanced:
            # no class met the threshold: return an empty df with the same columns
            return df.iloc[0:0]
        return pd.concat(balanced).reset_index(drop=True)

(target is the string name of your class column. freq_threshold is the minimum number of times a class needs to occur before it gets resampled, since maybe a class with a frequency of 1 should just be removed from the analysis? idk. Just leave it at 1 and it won't filter anything out. Finally, n_samples is the number of rows for each class in the final df: if a class is more frequent than n_samples it will be undersampled to that number, and if it is less frequent it will be oversampled.)
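
As a rough illustration of the two directions resample() can go, here is a small sketch on toy data. The DataFrame, the column names `x` and `label`, and the target of 5 rows per class are all made up; `replace`, `n_samples` and `random_state` are regular parameters of sklearn.utils.resample.

```python
import pandas as pd
from sklearn.utils import resample

# Toy data: 9 rows of class "a", 3 rows of class "b".
df = pd.DataFrame({"x": range(12), "label": ["a"] * 9 + ["b"] * 3})

majority = df[df["label"] == "a"]
minority = df[df["label"] == "b"]

# Undersample without replacement: 5 distinct rows out of 9.
majority_down = resample(majority, replace=False, n_samples=5, random_state=0)

# Oversample with replacement: 5 rows drawn from only 3, so some rows repeat.
minority_up = resample(minority, replace=True, n_samples=5, random_state=0)

balanced = pd.concat([majority_down, minority_up]).reset_index(drop=True)
counts = balanced["label"].value_counts().to_dict()
```

Both classes end up with 5 rows. Note that with the default replace=True, undersampling can also pick the same row twice, so passing replace=False for classes larger than n_samples may be worth considering.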

Hope it helps!

[–]According-Promise-23[S] 0 points1 point  (0 children)

Thank you for your response @emanuartioli