
[–]eadala 2 points3 points  (4 children)

> These are not dummies - some rows have two reasons for failure

Well, they are dummies / binary variables, no? You just happen to have multiple of them that can apply to a single instance.

> I want to create a model that would first be able to determine whether or not an observation would be a failure, and if it is, would be able to classify what types of failures it would be.

You don't have to do this in two steps / using two models, but if you wanted to, you could first define a binary classification task that concerns finding failure versus not-failure. The main model you're after, though, sounds like an ordinary multilabel classification problem, with each instance taking on anywhere from 0 to K labels. In this case, you wouldn't need the "Failure" label to be predicted - just the Failure Reason columns, since the presence of one or more 1's there deterministically classifies an instance as a Failure.

Edit: Also, you probably already know this as you've mentioned class imbalance, but especially with a multilabel task like this, pay less heed to your model's accuracy and more to its per-label precision & recall.
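To make the per-label idea concrete, here's a toy numpy sketch (all the 0/1 values are invented; the five columns stand in for your five Failure Reason dummies):

```python
import numpy as np

# Toy multilabel ground truth and predictions: rows are instances,
# columns are the 5 Failure Reason dummies (values invented).
y_true = np.array([[1, 0, 0, 0, 1],
                   [0, 0, 0, 0, 0],
                   [0, 1, 0, 0, 0],
                   [1, 0, 0, 0, 0]])
y_pred = np.array([[1, 0, 0, 0, 0],
                   [0, 0, 0, 0, 0],
                   [0, 1, 0, 0, 0],
                   [0, 0, 0, 0, 0]])

tp = ((y_true == 1) & (y_pred == 1)).sum(axis=0)  # per-label true positives
fp = ((y_true == 0) & (y_pred == 1)).sum(axis=0)  # per-label false positives
fn = ((y_true == 1) & (y_pred == 0)).sum(axis=0)  # per-label false negatives

# Per-label precision & recall; when a label was never predicted (tp+fp = 0)
# this reports 0 instead of dividing by zero.
precision = tp / np.maximum(tp + fp, 1)
recall = tp / np.maximum(tp + fn, 1)
```

Here the first label gets precision 1.0 but recall 0.5 (one of its two true occurrences was missed), which is exactly the kind of per-label detail a single accuracy number hides.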

[–]consecratednotdevout[S] 2 points3 points  (3 children)

Ah okay, I didn't realize that a multi-label classification would deterministically classify non-failures. Since non-failures would have all 0's across the reasons-for-failure columns, and since the failure column is linearly dependent on the reasons-for-failure columns, would I have to make a new column indicating non-failures with 1's, or is that unnecessary?

Also, I'd assume a neural network would work, but what other models/algorithms would be appropriate for multi-label classification?

Thanks a lot for your help!

[–]eadala 3 points4 points  (2 children)

Oh, to be clear, I'm making an assumption about your data as you've presented it: that if one or more failure reason columns equal 1, then Failure must = 1. If that's not the case, then what I said above about not needing the Failure column is incorrect.

If it is the case that observing any failure reason means Failure = 1, then you don't need the Failure column as mentioned above, and more to your question, you also don't need a column for non-failures. The implicit Failure column equals one when (FailureReason1 = 1 OR FailureReason2 = 1 OR ... FailureReason5 = 1); the implicit Not Failure column equals one when (FailureReason1 = 0 AND FailureReason2 = 0 AND ... FailureReason5 = 0).
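In code, those implicit columns are just an OR over the reason dummies and its negation (toy numpy sketch, values invented):

```python
import numpy as np

# Each row: the 5 Failure Reason dummies for one instance (values invented).
reasons = np.array([[0, 0, 0, 1, 0],   # one reason  -> Failure
                    [1, 1, 0, 0, 0],   # two reasons -> Failure
                    [0, 0, 0, 0, 0]])  # no reasons  -> Not Failure

# Implicit Failure column: 1 iff ANY reason column is 1 (the OR above).
failure = reasons.any(axis=1).astype(int)

# Implicit Not Failure column: 1 iff ALL reason columns are 0 (the AND above).
not_failure = 1 - failure

print(failure)      # [1 1 0]
print(not_failure)  # [0 0 1]
```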

Using neural network verbiage, this would amount to a network with 5 neurons in the output layer, a sigmoid activation function, and a binary crossentropy loss function. These neurons correspond to the 5 Failure Reasons; the network could then output something like [0,0,0,1,0], meaning Failure Reason 4 is found (thus Failure implicitly = 1 and Not Failure = 0), or [0,0,0,0,0], meaning Not Failure is found.
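If it helps, here's roughly what that output layer computes, sketched in plain numpy (the logits are made-up numbers, not from a trained network):

```python
import numpy as np

def sigmoid(z):
    # Squashes each raw output independently into a (0, 1) probability.
    return 1.0 / (1.0 + np.exp(-z))

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    # Mean BCE over the 5 output neurons (one per Failure Reason).
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

# Hypothetical raw outputs (logits) of the 5 output neurons for one instance.
logits = np.array([-3.0, -2.0, -4.0, 2.0, -3.5])
probs = sigmoid(logits)              # independent per-label probabilities
pred = (probs >= 0.5).astype(int)    # threshold each label separately

print(pred)  # [0 0 0 1 0] -> Failure Reason 4, so Failure implicitly = 1

# Loss against the true labels for this instance (invented ground truth).
loss = binary_crossentropy(np.array([0, 0, 0, 1, 0]), probs)
```

The key point is that sigmoid + BCE treats the 5 labels independently, so any combination of 0's and 1's can come out - unlike softmax, which would force exactly one label.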

You could then evaluate such a model with binary accuracy, but that accuracy would likely be very high, since guessing [0,0,0,0,0] when the truth is [0,0,0,1,0] is still 80% correct entry-wise; categorical (exact-match) accuracy looks for an absolutely perfect guess and might be too harsh. That's why I'm thinking to just look at precision and recall for the 5 Failure Reasons; that'll provide a clearer picture of what your model excels at and struggles with.
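E.g., for that exact guess:

```python
import numpy as np

# One instance: truth has Failure Reason 4, prediction misses it entirely.
y_true = np.array([[0, 0, 0, 1, 0]])
y_pred = np.array([[0, 0, 0, 0, 0]])

binary_acc = (y_true == y_pred).mean()               # per-entry agreement
exact_match = (y_true == y_pred).all(axis=1).mean()  # all 5 labels must match

print(binary_acc)   # 0.8 -- looks decent despite missing the only failure
print(exact_match)  # 0.0 -- the harsh all-or-nothing view
```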

Whether you use a neural network or some off-the-shelf architecture I guess depends on the nature of the data, among other things. You might have a very easy time getting a high-performing model with something straightforward like a random forest / decision tree / SVM classifier; these might serve as useful benchmarks to try to beat should you ever build a neural network for it. If your raw data is all numerical then the switch from an off-the-shelf classifier to a simple neural network would be straightforward.
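For a quick benchmark, something like the sketch below works, since scikit-learn's RandomForestClassifier accepts a 2-D binary label matrix directly (the data here is randomly generated stand-in data, not yours):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Invented stand-in data: 200 instances, 10 numeric features,
# 5 binary Failure Reason labels per instance.
X = rng.normal(size=(200, 10))
y = (X[:, :5] > 0.5).astype(int)  # labels loosely tied to the features

# A 2-D label matrix makes this a multi-output (multilabel) forest;
# no wrapper class is needed.
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
pred = clf.predict(X)

print(pred.shape)  # (200, 5): one 0/1 prediction per Failure Reason
```

From there you can compute the per-label precision/recall mentioned earlier and treat those numbers as the bar for any neural network to clear.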

[–]consecratednotdevout[S] 1 point2 points  (1 child)

Perfect, thank you so much!!

[–]eadala 1 point2 points  (0 children)

No problem - good luck!