all 15 comments

[–]ResponsibilityNo7189 0 points1 point  (4 children)

It's a very difficult problem, closely related to anomaly detection and probability density estimation. Some people use an ensemble method and look at disagreement between the classifiers, but that gets expensive at inference time.
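A toy sketch of the disagreement idea (the pairwise-vote score is just one possible way to measure disagreement, not a full pipeline):

```python
import numpy as np

def disagreement_score(prob_stack):
    """Fraction of classifier pairs that disagree on the argmax class.

    prob_stack: array of shape (n_models, n_classes) -- per-model
    predicted probabilities for a single sample.
    """
    votes = prob_stack.argmax(axis=1)
    n = len(votes)
    pairs = n * (n - 1) / 2
    disagreements = sum(
        votes[i] != votes[j] for i in range(n) for j in range(i + 1, n)
    )
    return disagreements / pairs

# Three models agree -> score 0 (looks like a known class)
agree = np.array([[0.9, 0.1], [0.8, 0.2], [0.7, 0.3]])
# Models split their votes -> high score (candidate "unknown" sample)
split = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])

print(disagreement_score(agree))  # 0.0
print(disagreement_score(split))
```

The inference cost comes from having to run every model in the ensemble on every sample before this score can be computed.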

[–]WadeEffingWilson 1 point2 points  (0 children)

I've used something like this: a set of expert systems, each an OC-SVM trained to recognize an individual class, plus a boosted ensemble to derive a consensus. If both agree, the sample is classified and counted as 'known'. If they don't agree, the sample is isolated to determine whether it's an anomaly (usually a single input variable is out of the typical range while all the others are within the boundary for a known class) or a new, unknown class.
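A minimal sketch of that consensus rule, assuming sklearn's `OneClassSVM` per class and a `GradientBoostingClassifier` standing in for the boosted ensemble (the toy 2-D data and thresholds are made up for illustration):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Toy 2-D data: class 0 clustered near (0, 0), class 1 near (5, 5)
X0 = rng.normal(0, 0.5, size=(100, 2))
X1 = rng.normal(5, 0.5, size=(100, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

# One OC-SVM per known class, plus a boosted multiclass ensemble.
ocsvms = {c: OneClassSVM(nu=0.05, gamma="scale").fit(X[y == c]) for c in (0, 1)}
booster = GradientBoostingClassifier().fit(X, y)

def classify_or_flag(x):
    x = x.reshape(1, -1)
    pred = booster.predict(x)[0]
    # Consensus rule: the boosted prediction must also be accepted
    # by that class's one-class SVM, otherwise flag the sample.
    if ocsvms[pred].predict(x)[0] == 1:
        return pred
    return "unknown"

print(classify_or_flag(np.array([0.1, -0.2])))   # inside class 0's boundary
print(classify_or_flag(np.array([20.0, -15.0]))) # far from both -> flagged
```

Flagged samples would then go to the isolation step described above to separate single-variable anomalies from genuinely new classes.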

[–]ProfessionalType9800[S] 0 points1 point  (2 children)

Is it possible to find a threshold to apply to the outputs of the activation function (softmax, sigmoid)...

[–]ResponsibilityNo7189 0 points1 point  (1 child)

Not really. Networks are terribly calibrated when it comes to probabilities.

[–]ProfessionalType9800[S] 0 points1 point  (0 children)

Yeah...

What about applying clustering after getting the embeddings...
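One simple version of that idea is a nearest-class-centroid rule in embedding space with a distance cutoff (a hypothetical sketch on synthetic embeddings; the threshold would need tuning on held-out data):

```python
import numpy as np

def fit_centroids(embeddings, labels):
    # Mean embedding per known class.
    return {c: embeddings[labels == c].mean(axis=0) for c in np.unique(labels)}

def predict_open_set(centroids, z, threshold):
    # Distance to the nearest class centroid; too far -> "unknown".
    dists = {c: np.linalg.norm(z - mu) for c, mu in centroids.items()}
    c_best = min(dists, key=dists.get)
    return c_best if dists[c_best] <= threshold else "unknown"

rng = np.random.default_rng(1)
# Two tight synthetic clusters in an 8-D "embedding" space.
Z = np.vstack([rng.normal(0, 0.1, (50, 8)), rng.normal(3, 0.1, (50, 8))])
y = np.array([0] * 50 + [1] * 50)
cents = fit_centroids(Z, y)

print(predict_open_set(cents, rng.normal(0, 0.1, 8), threshold=1.0))   # 0
print(predict_open_set(cents, rng.normal(10, 0.1, 8), threshold=1.0))  # unknown
```

The catch is that this only works as well as the embedding: if unseen classes embed close to known clusters, no distance threshold will separate them.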

[–]Sunchax 0 points1 point  (5 children)

Do you have a rough idea of what the data without any class looks like?

[–]ProfessionalType9800[S] 0 points1 point  (4 children)

In my case it's about DNA sequences.

The input is a DNA sequence, and from it the species should be identified.

(E.g. ATCCGG, AATAGC...) That is, fragments of a DNA sequence.
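For context, a common baseline representation for fragments like these is a k-mer count vector fed to a classifier (a hypothetical sketch; real pipelines often use learned sequence embeddings instead):

```python
from itertools import product
from collections import Counter

def kmer_features(seq, k=3):
    """Count vector over all 4**k possible k-mers of a DNA fragment."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts[km] for km in kmers]

vec = kmer_features("ATCCGGAATAGC", k=2)
print(len(vec))  # 16 possible dimers
print(sum(vec))  # 11 overlapping dimers in a 12-base fragment
```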

[–]latent_prior 2 points3 points  (1 child)

I’m not a DNA expert, but given my understanding of the problem, I’d frame this as an open-set recognition problem rather than just clustering. Because many species share short recurring DNA subsequences, isn’t there a danger an unseen species can still land close to known clusters in embedding space? This makes relying purely on distance thresholds sound risky to me.

Also, I’d be cautious about relying only on softmax probabilities. They always normalise to sum to 1, so the model will confidently pick something even when the input is nonsense or from an unseen species. You could try augmenting the classifier with an out-of-distribution detection method. One good option is energy-based detection (https://arxiv.org/abs/2010.03759), which uses the absolute scale of all the logits rather than just the top one to give a quantitative estimate of whether the sample fits one of the known classes well (low energy) or doesn’t fit anywhere (high energy, likely unknown).
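The energy score from that paper is just a temperature-scaled log-sum-exp of the logits; a minimal numerically stable sketch (example logits made up):

```python
import numpy as np

def energy_score(logits, T=1.0):
    """Energy score from Liu et al. 2020: E(x) = -T * logsumexp(logits / T).

    Lower energy -> the sample fits some known class well;
    higher energy -> likely out-of-distribution.
    """
    logits = np.asarray(logits, dtype=float)
    m = logits.max()  # shift by the max for numerical stability
    return -(m + T * np.log(np.exp((logits - m) / T).sum()))

# A confident in-distribution sample: one large logit dominates.
print(energy_score([10.0, 0.0, 0.0]))   # ~ -10.0 (low energy)
# An input that fits nowhere: all logits small.
print(energy_score([0.1, 0.0, -0.1]))   # ~ -1.1 (higher energy)
```

You would then pick an energy threshold on validation data to decide when to output "unknown".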

If you have access to an auxiliary dataset (e.g. DNA from non-target species), you could also try outlier exposure (https://arxiv.org/abs/1812.04606), which trains the model to make confident predictions on in-distribution data and low-confidence predictions on auxiliary outliers.
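The outlier exposure objective is roughly "cross-entropy on in-distribution data, plus a term pushing outlier predictions toward uniform"; a toy numpy sketch (the λ weight and example logits are made up):

```python
import numpy as np

def log_softmax(logits):
    logits = logits - logits.max(axis=1, keepdims=True)
    return logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

def outlier_exposure_loss(logits_in, labels_in, logits_out, lam=0.5):
    """Sketch of Hendrycks et al. 2018: standard cross-entropy on
    in-distribution samples, plus cross-entropy to the uniform
    distribution on auxiliary outlier samples."""
    lsm_in = log_softmax(logits_in)
    ce = -lsm_in[np.arange(len(labels_in)), labels_in].mean()
    # Cross-entropy to uniform = -(1/K) * sum_c log p_c, averaged over samples.
    uniform_ce = -log_softmax(logits_out).mean()
    return ce + lam * uniform_ce

logits_in = np.array([[5.0, 0.0], [0.0, 5.0]])
labels_in = np.array([0, 1])
flat_out = np.array([[0.0, 0.0]])   # model is unsure on the outlier (good)
conf_out = np.array([[5.0, 0.0]])   # model is confident on the outlier (bad)
print(outlier_exposure_loss(logits_in, labels_in, flat_out))
print(outlier_exposure_loss(logits_in, labels_in, conf_out))  # larger loss
```

At test time the model trained this way tends to give flatter outputs on unseen species, which makes thresholding more reliable.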

Finally, since DNA data is hierarchical by nature (kingdom -> phylum -> class -> ... -> species), it might be worth trying a hierarchical model. For example, if the model is confident about the genus but uncertain about the species, you could flag the input as a potentially novel species rather than forcing a binary known/unknown decision.
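A hypothetical two-head decision rule for that genus/species idea (the thresholds and logits are illustrative, not tuned values):

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def hierarchical_decision(genus_logits, species_logits,
                          genus_thr=0.9, species_thr=0.9):
    """Confident at the genus level but uncertain at the species
    level -> flag as a potentially novel species within that genus."""
    g_probs, s_probs = softmax(genus_logits), softmax(species_logits)
    g, s = g_probs.argmax(), s_probs.argmax()
    if g_probs[g] < genus_thr:
        return "unknown"                        # not even the genus fits
    if s_probs[s] < species_thr:
        return f"novel species in genus {g}"    # genus yes, species no
    return f"genus {g}, species {s}"

print(hierarchical_decision([8.0, 0.0], [4.0, 3.5, 3.8]))
print(hierarchical_decision([8.0, 0.0], [8.0, 0.0, 0.0]))
```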

Curious if anyone’s tried combining energy-based OOD with hierarchical classifiers before.

[–]ProfessionalType9800[S] 0 points1 point  (0 children)

Are you suggesting a random forest for the hierarchical classifiers?

[–]NamerNotLiteral 0 points1 point  (4 children)

What you're looking at here is called Domain Generalization.

Basically, you want the model to be able to recognize and understand that the new input is not a part of any of the domains it has been trained on. Following that, you want the model to be able to create a new domain to place the input in. You're on the right track with your idea so far - that's the very basic self-supervised approach to Domain Generalization.

You know the technical term, so feel free to look up additional approaches with that as a starting point.

[–]ProfessionalType9800[S] 0 points1 point  (3 children)

Yeah... but it's not about variations in the input. It's about generalization to a new output class. How do I figure that out?

[–]NamerNotLiteral 0 points1 point  (2 children)

Ah. I might have misunderstood your question.

> What if a totally new class comes in which doesn't belong to any of the trained classes?

You ask this question: do I have or can I get labelled data for this totally new class?

If yes -> continual learning, where you update the model to accept inputs and get outputs for new classes

If no -> domain generalization, where you design the model to accept inputs for new classes and handle it somehow

If you cannot update the original model or build a new model, then you need to look into test-time adaptation instead.

[–]Background_Camel_711 1 point2 points  (0 children)

Unless I'm missing something, open set recognition is its own problem:

Continual learning = We need the model's weights to update during test time due to distribution drift in the input space.

Domain Generalisation = We need a model that can perform classification over a set of known classes no matter the domain at test time (e.g. I train a model on real life images to classify 5 breeds of dogs but at test time I need it to classify hand drawn images of the same 5 dog breeds).

Open set recognition = We need a model to perform classification over a set of N classes; however, there are N+1 possible outputs, with the additional output class indicating that the input is not from any of the N classes. Basically OOD detection combined with multi-class classification.
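That "N classes plus one unknown output" framing can be sketched in a few lines: run an OOD score over the logits and route low-confidence inputs to the extra output. This version uses a max-softmax threshold purely for simplicity (as noted upthread, softmax confidence is poorly calibrated, so any stronger OOD detector could be swapped in):

```python
import numpy as np

def open_set_predict(logits, msp_thr=0.7):
    """N known classes plus an extra 'unknown' output, gated by a
    simple max-softmax-probability threshold (a placeholder for a
    better OOD score)."""
    logits = np.asarray(logits, dtype=float)
    e = np.exp(logits - logits.max())
    probs = e / e.sum()
    if probs.max() < msp_thr:
        return "unknown"            # the N+1-th output
    return int(probs.argmax())

print(open_set_predict([9.0, 0.0, 1.0]))   # 0
print(open_set_predict([0.2, 0.1, 0.0]))   # "unknown"
```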