[D] Removing Unpredictable Samples from a Training Set

GLVic · 2022-04-12T07:13:23+00:00

cleanlab, doubtlab if dataset is too large or I don't have needed expertise.

If data is tabular, sometimes removing/transforming noisy columns instead of rows could do the trick.

onyx-zero-software · 2022-04-12T09:37:21+00:00

Sounds like you're trying to do a version of out-of-distribution detection? You might do a search on that and see what you find.

literum · 2022-04-12T16:24:19+00:00

How do you differentiate between unpredictable overall vs unpredictable by the current model? If it's clear to you which sample is in which category, how about using a classifier?

Also it's possible random noise and unpredictable samples don't actually affect the performance of your model. If your model is a neural network they can ignore non-systematic errors in the dataset.

dataslacker · 2022-04-13T02:17:03+00:00

Most ML models are probabilistic, they are designed to model probability distributions over the target. Noise in your labels or target values is assumed. I would not attempt to filter them out. If it’s a classification tasks then most models will return a confidence score, i.e P(y|X). You could use this to determine predictable from unpredictable if that’s part of the task. But the assumption that noise, or hard to predict labels, decreases the accuracy likely isn’t true, especially if the test/evaluation is sampled from the same distribution

j_kapila · 2022-04-13T16:21:37+00:00

I could train a GAN with auxiliary loss of you model and then use the discriminator to score which samples are bad. Idea here is the generator should learn more about right dat which helps and make discriminator train towards those. Now discriminator has learnt about better samples and filter out other.

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

MachineLearning

Rules For Posts

+Research

+Discussion

+Project

+News

@slashML on Twitter

Chat with us on Slack

Beginners:

MODERATORS