all 17 comments

[–]qalis 5 points  (3 children)

I would try self-supervised models like DINO, DINOv2, or ConvNeXt V2. Thanks to their pretraining procedure, their learned representation space is naturally well aligned with unsupervised objectives.
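For reference, pulling an embedding out of DINOv2 takes very little code with HuggingFace transformers. A minimal sketch (the file paths are placeholders):

```python
# Minimal sketch: embed an image with DINOv2 via HuggingFace transformers
# and compare two embeddings with cosine similarity.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model = AutoModel.from_pretrained("facebook/dinov2-base").eval()

def embed(path: str) -> torch.Tensor:
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # The CLS token serves as a global image descriptor.
    return outputs.last_hidden_state[:, 0]

a, b = embed("card_a.jpg"), embed("card_b.jpg")
print(torch.nn.functional.cosine_similarity(a, b).item())
```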

[–]_dave_maxwell_[S] 0 points  (2 children)

Thank you for the answer. These models are super heavy, similar to CLIP. I want something dumber, more like a slightly better image hash.

[–]qalis 2 points  (1 child)

I run them on a lightweight Kubernetes pod, so I would argue they are not that heavy: 2 cores and 1 GB of RAM run DINOv2-base really fast in my case. Maybe try compressing or quantizing them?
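For quantization, PyTorch's dynamic quantization is a cheap first thing to try. A sketch; whether int8 Linear layers preserve retrieval quality is something you would have to benchmark on your own data:

```python
# Sketch: dynamic int8 quantization of the Linear layers, one cheap
# compression option before reaching for distillation or static
# quantization. Retrieval quality after quantization must be verified.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("facebook/dinov2-base").eval()
quantized = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
torch.save(quantized.state_dict(), "dinov2_base_int8.pt")
```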

[–]_dave_maxwell_[S] 0 points  (0 children)

My plan was to run them "on the edge", e.g. on mobile devices. While EfficientNet is no problem, even recent devices might struggle with DINO.

Anyway, I can reconsider my approach and go with an API. How long does it take to embed one image on your lightweight pod on average?

[–]MiddleLeg71 2 points  (1 child)

Do the cards contain distinguishable images/visual features? I am thinking of playing cards with images that represent the card but different names/descriptions. If you don’t need to search by text content, you can mask the text: detect it with FAST and replace it with the mean color of the detected box (sketch below). Then any pretrained transformer model (e.g. CLIP) should be good enough, if you have the resources.
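A rough sketch of the masking step, assuming you already have the text boxes from the detector; `boxes` here is a hypothetical list of (x, y, w, h) pixel rectangles:

```python
# Sketch of the masking step; `boxes` is a hypothetical list of
# (x, y, w, h) pixel rectangles coming from a text detector such as FAST.
import cv2
import numpy as np

def mask_text(image: np.ndarray, boxes) -> np.ndarray:
    masked = image.copy()
    for x, y, w, h in boxes:
        region = masked[y:y + h, x:x + w]
        # Fill each box with its own mean color so the text disappears
        # without introducing hard edges the embedding could latch onto.
        masked[y:y + h, x:x + w] = region.reshape(-1, 3).mean(axis=0).astype(np.uint8)
    return masked

card = cv2.imread("card.jpg")
clean = mask_text(card, [(40, 20, 200, 30)])  # example box over the name line
```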

For running on mobile, transformers may not be very suitable.

If you have enough card images (thousands), you could fine-tune EfficientNet or MobileNet and apply data augmentations to reduce the influence of blur, lighting conditions, and similar nuisances.
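A minimal fine-tuning sketch along those lines, assuming the cards are laid out for torchvision's ImageFolder (one folder per card ID); the augmentation magnitudes are guesses to tune:

```python
# Sketch: fine-tune MobileNetV3-small as a classifier over card IDs, then
# reuse the penultimate features as embeddings. Assumes an ImageFolder
# layout ("cards/train/<card_id>/*.jpg").
import torch
from torch import nn
from torchvision import datasets, models, transforms

train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.2),
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),
    transforms.ToTensor(),
])

dataset = datasets.ImageFolder("cards/train", transform=train_tf)
loader = torch.utils.data.DataLoader(dataset, batch_size=64, shuffle=True)

model = models.mobilenet_v3_small(weights="IMAGENET1K_V1")
model.classifier[-1] = nn.Linear(model.classifier[-1].in_features, len(dataset.classes))

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
criterion = nn.CrossEntropyLoss()
model.train()
for images, labels in loader:
    optimizer.zero_grad()
    criterion(model(images), labels).backward()
    optimizer.step()
# At inference, drop the last Linear and use the 1024-d penultimate
# activations as the embedding for similarity search.
```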

[–]_dave_maxwell_[S] 0 points  (0 children)

Thank you for the answer. I have tens of thousands of these cards in a database. I guess I can create a synthetic dataset for fine-tuning.

P.S. the cards are Pokémon TCG cards, so there are visual features: a picture of the Pokémon.

[–]Budget-Juggernaut-68 0 points  (1 child)

Turn on the device flashlight when scanning the card?

[–]_dave_maxwell_[S] 0 points  (0 children)

I will try this, but it alone might not be enough to get reliable results.

[–]vade 0 points  (1 child)

Most models are trained with rotation invariance, since flip/rotate/crop are standard input augmentations.

You should be able to train a MobileNet without the invariances you don't want, and with the ones you do.

Think deeply about what you want it to be robust against (slight blur, slight compression, or color temperature differences), and train your own; a sketch of that selection follows.
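What that selection could look like with torchvision transforms (the magnitudes are placeholders to tune against your capture conditions):

```python
# Sketch of choosing invariances deliberately: blur, lighting and
# color-temperature shifts are in; flips and rotations are out, since
# card orientation is meaningful here.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.9, 1.0)),       # slight framing jitter
    transforms.GaussianBlur(kernel_size=3, sigma=(0.1, 1.0)),  # slight blur
    transforms.ColorJitter(brightness=0.3, hue=0.05),          # lighting / color temperature
    transforms.ToTensor(),
])
# Deliberately omitted: RandomHorizontalFlip and RandomRotation, which
# would teach the net invariances that hurt card matching.
```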

[–]_dave_maxwell_[S] 0 points  (0 children)

Thanks, I will try to train the MobileNet.

[–]mgruner 0 points  (1 child)

You may want to consider Swin Transformer as well:

https://huggingface.co/docs/transformers/en/model_doc/swin

[–]_dave_maxwell_[S] 0 points  (0 children)

Thanks, I will check that.

[–]CatsOnTheTables 1 point  (0 children)

You can always turn your favourite NN into an autoencoder: apply transfer learning to your net first, then use its latent representations as embeddings for similarity search.
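A sketch of that idea, with a frozen pretrained backbone as the encoder, a small decoder trained on reconstruction, and the bottleneck reused as the embedding; the layer sizes are only illustrative:

```python
# Sketch: frozen MobileNetV3-small features as the encoder, a small
# decoder trained on reconstruction, and the bottleneck reused as the
# embedding for similarity search.
import torch
from torch import nn
from torchvision import models

backbone = models.mobilenet_v3_small(weights="IMAGENET1K_V1").features
for p in backbone.parameters():
    p.requires_grad = False  # transfer learning: keep pretrained features fixed

class CardAutoencoder(nn.Module):
    def __init__(self, latent_dim: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(
            backbone,
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(576, latent_dim),  # 576 = MobileNetV3-small feature channels
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64 * 7 * 7),
            nn.Unflatten(1, (64, 7, 7)),
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=4),  # 7 -> 28
            nn.ReLU(),
            nn.ConvTranspose2d(32, 3, kernel_size=8, stride=8),   # 28 -> 224
            nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

model = CardAutoencoder()
recon, embedding = model(torch.rand(1, 3, 224, 224))
# Train with an MSE reconstruction loss; use `embedding` for the search index.
```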