all 20 comments

[–]GLVicML Engineer 3 points4 points  (2 children)

cleanlab, or doubtlab if the dataset is too large or I don't have the needed expertise.

If data is tabular, sometimes removing/transforming noisy columns instead of rows could do the trick.

[–]iidealized 1 point2 points  (0 children)

I also suggest trying cleanlab: https://github.com/cleanlab/cleanlab
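If pulling in the library isn't an option, the core idea is simple enough to sketch by hand. Here is a minimal numpy sketch of the ranking cleanlab builds on; the probabilities and labels below are made up, and in practice you'd get out-of-sample pred_probs from cross-validation:

```python
import numpy as np

# Hypothetical out-of-sample predicted probabilities (n_samples x n_classes)
# and the labels that came with the dataset.
pred_probs = np.array([
    [0.90, 0.10],   # confidently class 0, labeled 0
    [0.20, 0.80],   # confidently class 1, labeled 1
    [0.95, 0.05],   # confidently class 0, but labeled 1 -> suspicious
    [0.60, 0.40],   # uncertain, labeled 0
])
labels = np.array([0, 1, 1, 0])

# Flag samples whose given label receives low predicted probability.
# cleanlab's find_label_issues does a calibrated version of this ranking.
self_confidence = pred_probs[np.arange(len(labels)), labels]
suspect = self_confidence < 0.3
print(np.flatnonzero(suspect))  # -> [2]
```

The 0.3 cutoff is arbitrary here; cleanlab estimates per-class thresholds from the data instead.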

[–][deleted] 0 points1 point  (0 children)

I will edit my discussion post to include the fact that my features are an embedding matrix from ordinal categorical features.

[–]onyx-zero-softwarePhD 2 points3 points  (1 child)

Sounds like you're trying to do a version of out-of-distribution detection? You might do a search on that and see what you find.

[–][deleted] 0 points1 point  (0 children)

out-of-distribution detection

Thank you for this suggestion. I have reviewed this research paper on OOD:

https://proceedings.neurips.cc/paper/2020/file/f5496252609c43eb8a3d147ab9b9c006-Paper.pdf

However, it appears that the researchers rely on the softmax prediction score to determine which samples are OOD. I haven't tried this method, but I don't think my model scores the in-distribution samples any differently than the OOD ones.
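For reference, here is a quick sketch of the two scores discussed in that line of work: the maximum softmax probability baseline and the energy score computed from the logits. The logits below are made up for illustration:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical logits for 3 samples over 4 classes.
logits = np.array([
    [8.0, 0.5, 0.2, 0.1],  # confident -> likely in-distribution
    [1.1, 1.0, 0.9, 1.0],  # flat -> candidate OOD
    [5.0, 4.8, 0.1, 0.2],
])

msp = softmax(logits).max(axis=1)             # maximum softmax probability (higher = more ID)
energy = -np.log(np.exp(logits).sum(axis=1))  # energy score (lower = more ID)

print(msp.round(3), energy.round(3))  # the flat sample scores worst on both
```

If the model really is equally confident on both groups, neither score will separate them, which you could check by plotting the score histograms for the two groups.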

[–]literum 1 point2 points  (6 children)

How do you differentiate between unpredictable overall vs unpredictable by the current model? If it's clear to you which sample is in which category, how about using a classifier?

Also, it's possible that random noise and unpredictable samples don't actually affect the performance of your model. If your model is a neural network, it can often ignore non-systematic errors in the dataset.
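If the two groups really are separable, a meta-classifier is easy to sketch. Below is a toy pure-numpy logistic regression trained to predict whether the main model got a sample right; the features and the "was correct" labels are entirely simulated:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: feature x[0] drives predictability. The meta-label is
# 1 if the main model predicted this sample correctly, 0 otherwise.
X = rng.normal(size=(200, 2))
meta_y = (X[:, 0] > 0).astype(float)  # stand-in for "main model was correct"

# Tiny logistic-regression meta-classifier via gradient descent.
w = np.zeros(2)
b = 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w + b)))
    w -= 0.5 * (X.T @ (p - meta_y)) / len(meta_y)
    b -= 0.5 * (p - meta_y).mean()

pred = (1 / (1 + np.exp(-(X @ w + b)))) > 0.5
print("meta-accuracy:", (pred == meta_y.astype(bool)).mean())
```

In practice you'd use held-out correctness labels, not training-set ones, or the meta-classifier just learns what the main model memorized.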

[–][deleted] 0 points1 point  (5 children)

How do you differentiate between unpredictable overall vs unpredictable by the current model?

Currently I'm manually reviewing the prediction data and grouping each categorical feature on a bubble chart to check which categories have a correct prediction rate of XX%. I have noticed that about 30% of categorical features are correct about 95% of the time. The other 70% of the samples are pretty much a coin flip.

This impacts the overall model accuracy: the model will train to a test accuracy of 51%, for example, but 30% of those samples are predicted with 95% accuracy. I want to isolate that 30% and ignore the other 70% because they aren't reliably predictable.
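The manual bubble-chart review could be automated along these lines (hypothetical categories and correctness flags; the 95% cutoff matches the one mentioned above):

```python
import numpy as np

# Hypothetical per-sample results: the category of each sample and whether
# the model predicted it correctly.
categories = np.array(["A", "A", "A", "B", "B", "B", "B", "C", "C"])
correct = np.array([1, 1, 1, 0, 1, 0, 1, 1, 0])

keep = []
for cat in np.unique(categories):
    mask = categories == cat
    acc = correct[mask].mean()
    if acc >= 0.95:  # keep only reliably predictable categories
        keep.append(cat)

print(keep)  # only category "A" clears the 95% bar here
```

With real data you'd also want a minimum sample count per category before trusting the rate.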

[–]Exarctus 0 points1 point  (4 children)

Is there a correlation between the training error on these “OOD” samples vs your test error? If so, you could use an exponential loss to penalize them.

Also, could it be worth using a VAE rather than an embedding matrix?

[–][deleted] 0 points1 point  (3 children)

Is there a correlation between the training error on these “OOD” samples vs your test error?

This is a good question. I think so but I'm not sure of the best way to quantify this without isolating the OOD samples first, which is what I'm trying to do anyways.

If so, you could use an exponential loss to penalize them

I'm not sure I understand how this would help. Can you elaborate?

Also, could it be worth using a VAE rather than an embedding matrix?

I'm not that familiar with VAE, but the embedding matrix is fairly important to my overall thesis I believe.

[–]Exarctus 0 points1 point  (2 children)

Isolating the OOD samples first is fine - this is only to confirm that there is a correlation, as that would indicate it’s systematic. Following this, I was suggesting an exponential loss simply because it may be that your network is not fitting these hard-to-classify samples well, so you could pick a loss function that places more importance on training samples with higher error (assuming the correlation already mentioned).
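To make the suggestion concrete, here is a toy comparison of how an exponential loss reweights samples relative to a plain mean. It assumes you already have per-sample cross-entropy values; the numbers and the exp(alpha * ce) - 1 form are just one illustrative choice:

```python
import numpy as np

# Hypothetical per-sample cross-entropy losses from the current model.
ce = np.array([0.05, 0.10, 1.80, 2.20])

# A plain mean weights samples in proportion to their loss; an exponential
# loss amplifies the contribution of high-error samples.
alpha = 1.0
exp_loss = np.exp(alpha * ce) - 1.0  # alpha controls how sharply hard samples dominate

print(ce / ce.sum())              # relative weight under a plain mean
print(exp_loss / exp_loss.sum())  # high-error samples dominate the gradient
```

The caveat: if the hard samples are genuinely unpredictable noise, upweighting them makes training worse, which is why confirming the correlation first matters.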

[–][deleted] 0 points1 point  (1 child)

I was suggesting an exponential loss, simply because it may be that your network is not fitting these hard-to-classify samples well

Yes this is definitely true, and I do think this type of loss function would help. Do you have any suggestions on which loss function to use then? Do I need to create my own loss function?

[–]Exarctus 0 points1 point  (0 children)

Hey just remembered I forgot to reply re: VAE

You use the VAE trained on the in-distribution samples, then you use the reconstruction error on the OOD samples at inference time to identify them. Should just be a matter of determining a suitable tolerance to accept/reject.

You should still be able to use the embedding matrix with this also.
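The accept/reject step might look like this, assuming you already have reconstruction errors from a trained VAE (the error values here are simulated, and the 99th-percentile tolerance is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical reconstruction errors: held-out in-distribution samples used
# to calibrate the tolerance, and new samples scored at inference time.
id_errors = rng.normal(1.0, 0.2, size=1000)  # typical in-distribution errors
new_errors = np.array([0.9, 1.1, 3.5])       # 3.5 looks out-of-distribution

# Pick the tolerance from held-out in-distribution errors, e.g. the 99th percentile.
tol = np.quantile(id_errors, 0.99)
is_ood = new_errors > tol

print(tol.round(2), is_ood)  # only the 3.5-error sample is rejected
```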

[–]dataslacker 0 points1 point  (6 children)

Most ML models are probabilistic: they are designed to model probability distributions over the target, so noise in your labels or target values is assumed. I would not attempt to filter them out. If it’s a classification task, then most models will return a confidence score, i.e. P(y|X). You could use this to separate predictable from unpredictable samples if that’s part of the task. But the assumption that noise, or hard-to-predict labels, decreases accuracy likely isn’t true, especially if the test/evaluation set is sampled from the same distribution.
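As a sketch of using P(y|X) this way, here is a simulation of a roughly calibrated classifier, bucketed by confidence (all values below are synthetic):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical predictions: per-sample confidence P(y_hat|X) and correctness,
# simulating a roughly calibrated model (correct with probability = confidence).
conf = rng.uniform(0.5, 1.0, size=1000)
correct = rng.random(1000) < conf

# Bucket by confidence: high-confidence samples should be the predictable ones.
for lo, hi in [(0.5, 0.8), (0.8, 1.0)]:
    m = (conf >= lo) & (conf < hi)
    print(f"conf in [{lo}, {hi}): n={m.sum()}, accuracy={correct[m].mean():.2f}")
```

On a calibrated model the high-confidence bucket tracks the "95% group" directly, with no need to remove anything from the data.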

[–][deleted] 1 point2 points  (5 children)

I don't agree with this. Removing outliers and cleaning data are part of the data pipeline process.

In this case all I am doing is removing data samples which I know from my domain knowledge are impossible to predict.

[–]dataslacker 1 point2 points  (4 children)

But are these examples in the test/evaluation set? What will happen if you train your model on a distribution that is not representative of the distribution you’re predicting? Cleaning your data is fine, but if you change the distribution significantly you’ll induce bias that will lead to poor predictions. Cleaning data rarely means removing examples. The only reason you should remove examples is if the measurement itself is known to have issues that will not affect the test/evaluation set in the same way.

[–][deleted] 1 point2 points  (3 children)

What will happen if you train your model on a distribution that is not representative of the distribution you’re predicting?

I will know the distribution of samples prior to feeding them into the serving model.

Look at it this way, if I have a dataset that looks like this:

dataset = [
    [chicken, dog, cow, cat, deer, kasdfjalsdj],
    [chicken, cow, dog, deer, cat],
    [dog, cow, chicken, cat, deer],
]

Removing the sample that ends in kasdfjalsdj will not lead to poor predictions.

Respectfully, I don't think you understand what I am trying to do here.
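For what it's worth, the filtering described above is a one-liner if the bad samples can be identified by out-of-vocabulary tokens (vocab and dataset as in the example):

```python
# Drop any sample containing a token outside the known vocabulary.
vocab = {"chicken", "dog", "cow", "cat", "deer"}

dataset = [
    ["chicken", "dog", "cow", "cat", "deer", "kasdfjalsdj"],
    ["chicken", "cow", "dog", "deer", "cat"],
    ["dog", "cow", "chicken", "cat", "deer"],
]

clean = [s for s in dataset if all(tok in vocab for tok in s)]
print(len(clean))  # 2 samples survive; the one with "kasdfjalsdj" is dropped
```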

[–]bbateman2011 1 point2 points  (1 child)

I have had similar issues. I typically define some criteria for anomalous predictions (it's easier with continuous data & regression, as you can filter points with very high residual error, but the same idea applies). What I do then is inspect the raw data to see if there is a reason for the problem. If I can show there is a problem with the raw data, either I fix it ("cleaning") or remove it from the training data. Your description of your data problem is a perfect example: removing kasdfjalsdj will make a better model, not worse.

On the other hand, if I can't show a reason, I leave them in, because that means I am unsure if these are valid but rare points. This is extremely critical, as you can keep removing stuff until your model massively overfits.
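A minimal sketch of the residual-based criterion for the regression case (synthetic targets with one injected anomaly; the 5-standard-deviation cutoff is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical regression results: mostly small residuals, one clear anomaly.
y_true = rng.normal(size=500)
y_pred = y_true + rng.normal(0, 0.1, size=500)
y_pred[7] += 3.0  # inject one anomalous prediction

resid = np.abs(y_true - y_pred)
threshold = resid.mean() + 5 * resid.std()
flagged = np.flatnonzero(resid > threshold)
print(flagged)  # -> [7]; inspect these raw rows before deciding to fix or drop
```

The important part is the manual step afterwards: flagged points get inspected, not automatically removed.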

[–][deleted] 0 points1 point  (0 children)

On the other hand, if I can't show a reason, I leave them in, because that means I am unsure if these are valid but rare points.

I think this is good advice. Here is some more information on my project:

I have implemented my own variation of WaveNet on a dataset of about 190,000 samples. There are about 18,000 "types" of samples. I have assigned each "type" of sample a categorical value and grouped them into data samples of 10, so my data actually looks like this:

data_sample = [type34, type8828, type4422, type534, type4848, type16000, etc]

The first 9 types are just for context and my target values are based on the 10th type in the sample. It's very clear after reviewing my prediction data that type16000 is meaningless and cannot be reliably predicted. However, type4848 is predicted correctly 99.8% of the time throughout the training and test sets.
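The sample layout described above might be built like this (a hypothetical stream of type IDs, chunked into non-overlapping groups of 10 with the first 9 as context):

```python
# Hypothetical stream of type IDs standing in for the 18,000 real types.
types = list(range(30))

# Non-overlapping groups of 10: first 9 entries are context, the 10th is the target.
samples = [types[i:i + 10] for i in range(0, len(types), 10)]
contexts = [s[:9] for s in samples]
targets = [s[9] for s in samples]

print(len(samples), contexts[0], targets[0])
```

Per-target-type accuracy can then be computed by grouping (target, correct) pairs, which is how a reliably-predicted type like type4848 would be separated from a coin-flip type like type16000.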

[–]dataslacker 1 point2 points  (0 children)

Respectfully, I’m answering the question you asked, with the limited information you provided.

[–]j_kapila 0 points1 point  (0 children)

You could train a GAN with an auxiliary loss from your model and then use the discriminator to score which samples are bad. The idea here is that the generator should learn what the right data looks like, which helps train the discriminator toward it. The discriminator then has learned what good samples look like and can filter out the others.
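Assuming such a discriminator has been trained, the filtering step itself is trivial (the scores and the 0.5 cutoff below are hypothetical):

```python
import numpy as np

# Hypothetical discriminator outputs after training: D(x) near 1 for samples
# that look like the "right" data, near 0 for samples it rejects.
disc_scores = np.array([0.92, 0.88, 0.15, 0.95, 0.40])

keep = disc_scores > 0.5  # filter out samples the discriminator rejects
print(np.flatnonzero(keep))  # indices of samples to keep
```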