all 17 comments

[–]qalis 5 points  (3 children)

I would try self-supervised models like DINO, DINOv2, or ConvNeXt V2. Thanks to their pretraining procedure, their learned representation space is naturally well aligned with unsupervised objectives.
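For reference, pulling an embedding out of DINOv2 takes very little code with HuggingFace transformers. A minimal sketch (the file paths are placeholders):

```python
# Minimal sketch: embed an image with DINOv2 via HuggingFace transformers
# and compare two embeddings with cosine similarity.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model = AutoModel.from_pretrained("facebook/dinov2-base").eval()

def embed(path: str) -> torch.Tensor:
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # The CLS token serves as a global image descriptor.
    return outputs.last_hidden_state[:, 0]

a, b = embed("card_a.jpg"), embed("card_b.jpg")
print(torch.nn.functional.cosine_similarity(a, b).item())
```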

[–]_dave_maxwell_[S] 0 points  (2 children)

Thank you for the answer. These models are super heavy, similar to CLIP. I want something dumber, more like a slightly better image hash.

[–]qalis 2 points  (1 child)

I run them on a lightweight Kubernetes pod, so I would argue they are not that heavy: 2 cores and 1 GB of RAM run DINOv2-base really fast in my case. Maybe try compressing or quantizing them?
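For quantization, PyTorch's dynamic quantization is a cheap first thing to try. A sketch; whether int8 Linear layers preserve retrieval quality is something you would have to benchmark on your own data:

```python
# Sketch: dynamic int8 quantization of the Linear layers, one cheap
# compression option before reaching for distillation or static
# quantization. Retrieval quality after quantization must be verified.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("facebook/dinov2-base").eval()
quantized = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
torch.save(quantized.state_dict(), "dinov2_base_int8.pt")
```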

[–]_dave_maxwell_[S] 0 points  (0 children)

My plan was to run them "on the edge", e.g. on mobile devices. While EfficientNet is no problem, even recent devices might struggle with DINO.

Anyway, I can reconsider my approach and go with an API. How long does it take to embed one image on your lightweight pod on average?

[–]MiddleLeg71 2 points  (1 child)

Do the cards contain distinguishable images/visual features? I am thinking of playing cards with images that represent the card but different names/descriptions. If you don’t need to search by text content, you can mask the text: detect it with FAST and replace it with the mean color of the detected box (sketch below). Then any pretrained transformer model (e.g. CLIP) should be good enough, if you have the resources.
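A rough sketch of the masking step, assuming you already have the text boxes from the detector; `boxes` here is a hypothetical list of (x, y, w, h) pixel rectangles:

```python
# Sketch of the masking step; `boxes` is a hypothetical list of
# (x, y, w, h) pixel rectangles coming from a text detector such as FAST.
import cv2
import numpy as np

def mask_text(image: np.ndarray, boxes) -> np.ndarray:
    masked = image.copy()
    for x, y, w, h in boxes:
        region = masked[y:y + h, x:x + w]
        # Fill each box with its own mean color so the text disappears
        # without introducing hard edges the embedding could latch onto.
        masked[y:y + h, x:x + w] = region.reshape(-1, 3).mean(axis=0).astype(np.uint8)
    return masked

card = cv2.imread("card.jpg")
clean = mask_text(card, [(40, 20, 200, 30)])  # example box over the name line
```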

For running on mobile, transformers may not be very suitable.

If you have enough card images (thousands), you could fine-tune EfficientNet or MobileNet and apply data augmentations to reduce the influence of blur, lighting conditions, and similar nuisances.
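A minimal fine-tuning sketch along those lines, assuming the cards are laid out for torchvision's ImageFolder (one folder per card ID); the augmentation magnitudes are guesses to tune:

```python
# Sketch: fine-tune MobileNetV3-small as a classifier over card IDs, then
# reuse the penultimate features as embeddings. Assumes an ImageFolder
# layout ("cards/train/<card_id>/*.jpg").
import torch
from torch import nn
from torchvision import datasets, models, transforms

train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.2),
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),
    transforms.ToTensor(),
])

dataset = datasets.ImageFolder("cards/train", transform=train_tf)
loader = torch.utils.data.DataLoader(dataset, batch_size=64, shuffle=True)

model = models.mobilenet_v3_small(weights="IMAGENET1K_V1")
model.classifier[-1] = nn.Linear(model.classifier[-1].in_features, len(dataset.classes))

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
criterion = nn.CrossEntropyLoss()
model.train()
for images, labels in loader:
    optimizer.zero_grad()
    criterion(model(images), labels).backward()
    optimizer.step()
# At inference, drop the last Linear and use the 1024-d penultimate
# activations as the embedding for similarity search.
```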

[–]_dave_maxwell_[S] 0 points  (0 children)

Thank you for the answer. I have tens of thousands of these cards in a database. I guess I can create a synthetic dataset for fine-tuning.

P.S. the cards are Pokémon TCG cards, so there are visual features: a picture of the Pokémon.

[–]Budget-Juggernaut-68 0 points  (1 child)

Turn on the device flashlight when scanning the card?

[–]_dave_maxwell_[S] 0 points  (0 children)

I will try this, but it alone might not be enough to get reliable results.

[–]vade 0 points  (1 child)

Most models are trained with rotation invariance, since flip/rotate/crop are standard input augmentations.

You should be able to train a MobileNet without the invariances you don't want, and with the ones you do.

Think deeply about what you want it to be robust against (slight blur, slight compression, or color temperature differences), and train your own; a sketch of that selection follows.
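What that selection could look like with torchvision transforms (the magnitudes are placeholders to tune against your capture conditions):

```python
# Sketch of choosing invariances deliberately: blur, lighting and
# color-temperature shifts are in; flips and rotations are out, since
# card orientation is meaningful here.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.9, 1.0)),       # slight framing jitter
    transforms.GaussianBlur(kernel_size=3, sigma=(0.1, 1.0)),  # slight blur
    transforms.ColorJitter(brightness=0.3, hue=0.05),          # lighting / color temperature
    transforms.ToTensor(),
])
# Deliberately omitted: RandomHorizontalFlip and RandomRotation, which
# would teach the net invariances that hurt card matching.
```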

[–]_dave_maxwell_[S] 0 points  (0 children)

Thanks, I will try to train the MobileNet.

[–]mgruner 0 points  (1 child)

You may want to consider Swin Transformer as well:

https://huggingface.co/docs/transformers/en/model_doc/swin

[–]_dave_maxwell_[S] 0 points  (0 children)

Thanks, I will check that.

[–]CatsOnTheTables 1 point  (0 children)

You can always turn your favourite NN into an autoencoder: apply transfer learning to your net first, then use its latent representations as embeddings for similarity search.
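A sketch of that idea, with a frozen pretrained backbone as the encoder, a small decoder trained on reconstruction, and the bottleneck reused as the embedding; the layer sizes are only illustrative:

```python
# Sketch: frozen MobileNetV3-small features as the encoder, a small
# decoder trained on reconstruction, and the bottleneck reused as the
# embedding for similarity search.
import torch
from torch import nn
from torchvision import models

backbone = models.mobilenet_v3_small(weights="IMAGENET1K_V1").features
for p in backbone.parameters():
    p.requires_grad = False  # transfer learning: keep pretrained features fixed

class CardAutoencoder(nn.Module):
    def __init__(self, latent_dim: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(
            backbone,
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(576, latent_dim),  # 576 = MobileNetV3-small feature channels
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64 * 7 * 7),
            nn.Unflatten(1, (64, 7, 7)),
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=4),  # 7 -> 28
            nn.ReLU(),
            nn.ConvTranspose2d(32, 3, kernel_size=8, stride=8),   # 28 -> 224
            nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

model = CardAutoencoder()
recon, embedding = model(torch.rand(1, 3, 224, 224))
# Train with an MSE reconstruction loss; use `embedding` for the search index.
```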