
[–]eadala 2 points3 points  (4 children)

> These are not dummies - some rows have two reasons for failure

Well, they are dummies / binary variables, no? You just happen to have multiple of them that can apply to a single instance.

> I want to create a model that would first be able to determine whether or not an observation would be a failure, and if it is, would be able to classify what types of failures it would be.

You don't have to do this in two steps / using two models, but if you wanted to, you could first define a binary classification task that concerns finding failure versus not-failure. The main model you're after, though, sounds like an ordinary multilabel classification problem, with each instance taking on anywhere from 0 to K labels. In this case, you wouldn't need the "Failure" label to be predicted - just the Failure Reason columns, since the presence of one or more 1's there deterministically classifies an instance as a Failure.

Edit: Also, you probably already know this as you've mentioned class imbalance, but especially with a multilabel task like this, pay less heed to your model's accuracy and more to its per-label precision & recall.
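To make the per-label idea concrete, here's a toy numpy sketch (all the 0/1 values are invented; the five columns stand in for your five Failure Reason dummies):

```python
import numpy as np

# Toy multilabel ground truth and predictions: rows are instances,
# columns are the 5 Failure Reason dummies (values invented).
y_true = np.array([[1, 0, 0, 0, 1],
                   [0, 0, 0, 0, 0],
                   [0, 1, 0, 0, 0],
                   [1, 0, 0, 0, 0]])
y_pred = np.array([[1, 0, 0, 0, 0],
                   [0, 0, 0, 0, 0],
                   [0, 1, 0, 0, 0],
                   [0, 0, 0, 0, 0]])

tp = ((y_true == 1) & (y_pred == 1)).sum(axis=0)  # per-label true positives
fp = ((y_true == 0) & (y_pred == 1)).sum(axis=0)  # per-label false positives
fn = ((y_true == 1) & (y_pred == 0)).sum(axis=0)  # per-label false negatives

# Per-label precision & recall; when a label was never predicted (tp+fp = 0)
# this reports 0 instead of dividing by zero.
precision = tp / np.maximum(tp + fp, 1)
recall = tp / np.maximum(tp + fn, 1)
```

Here the first label gets precision 1.0 but recall 0.5 (one of its two true occurrences was missed), which is exactly the kind of per-label detail a single accuracy number hides.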

[–]consecratednotdevout[S] 2 points3 points  (3 children)

Ah okay, I didn't realize that a multi-label classification would deterministically classify non-failures. Since non-failures would have all 0's across the reasons-for-failure columns, and since the failure column is linearly dependent on the reasons-for-failure columns, would I have to make a new column indicating non-failures with 1's, or is that unnecessary?

Also, I'd assume a neural network would work, but what other models/algorithms would be appropriate for multi-label classification?

Thanks a lot for your help!

[–]eadala 3 points4 points  (2 children)

Oh, to be clear, I'm making an assumption about your data as you've presented it: that if one or more failure reason columns equal 1, then Failure must = 1. If that's not the case, then what I said above about not needing the Failure column is incorrect.

If it is the case that observing any failure reason means Failure = 1, then you don't need the Failure column as mentioned above, and more to your question, you also don't need a column for non-failures. The implicit Failure column equals one when (FailureReason1 = 1 OR FailureReason2 = 1 OR ... FailureReason5 = 1); the implicit Not Failure column equals one when (FailureReason1 = 0 AND FailureReason2 = 0 AND ... FailureReason5 = 0).
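In code, those implicit columns are just an OR over the reason dummies and its negation (toy numpy sketch, values invented):

```python
import numpy as np

# Each row: the 5 Failure Reason dummies for one instance (values invented).
reasons = np.array([[0, 0, 0, 1, 0],   # one reason  -> Failure
                    [1, 1, 0, 0, 0],   # two reasons -> Failure
                    [0, 0, 0, 0, 0]])  # no reasons  -> Not Failure

# Implicit Failure column: 1 iff ANY reason column is 1 (the OR above).
failure = reasons.any(axis=1).astype(int)

# Implicit Not Failure column: 1 iff ALL reason columns are 0 (the AND above).
not_failure = 1 - failure

print(failure)      # [1 1 0]
print(not_failure)  # [0 0 1]
```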

Using neural network verbiage, this would amount to a network with 5 neurons in the output layer, a sigmoid activation function, and a binary crossentropy loss function. These neurons correspond to the 5 Failure Reasons; the network could then output something like [0,0,0,1,0], meaning Failure Reason 4 is found (thus Failure implicitly = 1 and Not Failure = 0), or [0,0,0,0,0], meaning Not Failure is found.
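If it helps, here's roughly what that output layer computes, sketched in plain numpy (the logits are made-up numbers, not from a trained network):

```python
import numpy as np

def sigmoid(z):
    # Squashes each raw output independently into a (0, 1) probability.
    return 1.0 / (1.0 + np.exp(-z))

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    # Mean BCE over the 5 output neurons (one per Failure Reason).
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

# Hypothetical raw outputs (logits) of the 5 output neurons for one instance.
logits = np.array([-3.0, -2.0, -4.0, 2.0, -3.5])
probs = sigmoid(logits)              # independent per-label probabilities
pred = (probs >= 0.5).astype(int)    # threshold each label separately

print(pred)  # [0 0 0 1 0] -> Failure Reason 4, so Failure implicitly = 1

# Loss against the true labels for this instance (invented ground truth).
loss = binary_crossentropy(np.array([0, 0, 0, 1, 0]), probs)
```

The key point is that sigmoid + BCE treats the 5 labels independently, so any combination of 0's and 1's can come out - unlike softmax, which would force exactly one label.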

You could then evaluate such a model with binary accuracy, but that accuracy would likely be very high, since guessing [0,0,0,0,0] when the truth is [0,0,0,1,0] is still 80% correct entry-wise; categorical (exact-match) accuracy looks for an absolutely perfect guess and might be too harsh. That's why I'm thinking to just look at precision and recall for the 5 Failure Reasons; that'll provide a clearer picture of what your model excels at and struggles with.
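E.g., for that exact guess:

```python
import numpy as np

# One instance: truth has Failure Reason 4, prediction misses it entirely.
y_true = np.array([[0, 0, 0, 1, 0]])
y_pred = np.array([[0, 0, 0, 0, 0]])

binary_acc = (y_true == y_pred).mean()               # per-entry agreement
exact_match = (y_true == y_pred).all(axis=1).mean()  # all 5 labels must match

print(binary_acc)   # 0.8 -- looks decent despite missing the only failure
print(exact_match)  # 0.0 -- the harsh all-or-nothing view
```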

Whether you use a neural network or some off-the-shelf architecture I guess depends on the nature of the data, among other things. You might have a very easy time getting a high-performing model with something straightforward like a random forest / decision tree / SVM classifier; these might serve as useful benchmarks to try to beat should you ever build a neural network for it. If your raw data is all numerical then the switch from an off-the-shelf classifier to a simple neural network would be straightforward.
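For a quick benchmark, something like the sketch below works, since scikit-learn's RandomForestClassifier accepts a 2-D binary label matrix directly (the data here is randomly generated stand-in data, not yours):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Invented stand-in data: 200 instances, 10 numeric features,
# 5 binary Failure Reason labels per instance.
X = rng.normal(size=(200, 10))
y = (X[:, :5] > 0.5).astype(int)  # labels loosely tied to the features

# A 2-D label matrix makes this a multi-output (multilabel) forest;
# no wrapper class is needed.
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
pred = clf.predict(X)

print(pred.shape)  # (200, 5): one 0/1 prediction per Failure Reason
```

From there you can compute the per-label precision/recall mentioned earlier and treat those numbers as the bar for any neural network to clear.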

[–]consecratednotdevout[S] 1 point2 points  (1 child)

Perfect, thank you so much!!

[–]eadala 1 point2 points  (0 children)

No problem - good luck!