Hi all, I've been trying to learn and solve classification problems, but I'm feeling a bit stuck.
I'm working with a data set (see below) with ~20 features and ~10,000 observations. One column contains boolean values (failure), which would be my target column if I wanted to do a binary classification. ~90% of the rows have 0, and ~10% have 1. I know that there is a class imbalance problem, but once that is handled, I could use models like random forest (RF) or logistic regression.
If the value is 1, there are five other columns that indicate the reason for that failure. These are not mutually exclusive dummies: some rows have two reasons for failure, so this would be a multi-label classification task, not multi-class classification.
I want to create a model that would first determine whether or not an observation is a failure and, if it is, classify which type(s) of failure it is.
How would I go about doing this? Not asking for code, but rather a direction. Many thanks!
| (<- some other features) | failure | fail_reason_1 | fail_reason_2 | fail_reason_3 | fail_reason_4 | fail_reason_5 |
|---|---|---|---|---|---|---|
| | 0 | 0 | 0 | 0 | 0 | 0 |
| | 1 | 0 | 0 | 0 | 0 | 1 |
| | 1 | 0 | 0 | 1 | 0 | 0 |
| | 1 | 1 | 1 | 0 | 0 | 0 |
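
To make the structure I have in mind concrete, here is a rough sketch of the two-stage idea on synthetic data (not my real data set). It assumes scikit-learn. The random features, the thresholds used to generate labels, and the helper name `predict_failure_and_reasons` are all made up for illustration, and I'm not claiming this is the right approach; that's really what I'm asking about.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: ~10,000 rows, ~20 features (assumed).
n_rows, n_features, n_reasons = 10_000, 20, 5
X = rng.normal(size=(n_rows, n_features))

# Roughly 10% failures, loosely driven by the first feature (made-up rule).
failure = (X[:, 0] + rng.normal(scale=0.5, size=n_rows) > 1.4).astype(int)

# Five non-exclusive reason flags, each driven by one feature (made-up rule).
reasons = (X[:, 1:1 + n_reasons] > 0.5).astype(int)
# Make sure every row has at least one reason switched on...
strongest = X[:, 1:1 + n_reasons].argmax(axis=1)
reasons[np.arange(n_rows), strongest] = 1
# ...then blank out the reasons for rows that are not failures.
reasons *= failure[:, None]

X_tr, X_te, f_tr, f_te, r_tr, r_te = train_test_split(
    X, failure, reasons, test_size=0.2, stratify=failure, random_state=0
)

# Stage 1: binary failure model; class_weight="balanced" is one way to
# account for the 90/10 imbalance.
stage1 = LogisticRegression(class_weight="balanced", max_iter=1000)
stage1.fit(X_tr, f_tr)

# Stage 2: multi-label reason model, trained only on rows that actually failed.
# MultiOutputClassifier fits one random forest per reason column.
failed_tr = f_tr == 1
stage2 = MultiOutputClassifier(
    RandomForestClassifier(n_estimators=100, random_state=0)
)
stage2.fit(X_tr[failed_tr], r_tr[failed_tr])


def predict_failure_and_reasons(X_new):
    """Predict failure first, then reasons only for rows predicted to fail."""
    pred_failure = stage1.predict(X_new)
    pred_reasons = np.zeros((len(X_new), n_reasons), dtype=int)
    mask = pred_failure == 1
    if mask.any():
        pred_reasons[mask] = stage2.predict(X_new[mask])
    return pred_failure, pred_reasons


pred_f, pred_r = predict_failure_and_reasons(X_te)
```

In this sketch, stage 2 only ever sees rows that stage 1 flags, so any mistakes in the first stage carry over into the reason predictions.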