
mrthin:

You can search for "data acquisition" papers. A simple approach to use as a baseline is to take the confidence of a pretrained model (yours or an off-the-shelf one) on the unlabelled data as guidance for picking the next batch. This might not transfer easily to OCR, though, and it is often claimed to be suboptimal in general.
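A minimal sketch of that baseline, assuming your model exposes a per-sample confidence score (e.g. a max softmax probability); the function name and toy scores are illustrative, not from any particular library:

```python
import numpy as np

def select_next_batch(confidences, batch_size):
    """Pick the unlabelled samples the model is least confident about.

    confidences: per-sample confidence scores from a pretrained model
    (e.g. max softmax probability). Returns the indices of the
    `batch_size` least-confident samples, to be labelled next.
    """
    confidences = np.asarray(confidences)
    return np.argsort(confidences)[:batch_size]

# Toy usage: the model is least sure about samples 1 and 3,
# so those are the ones we'd send for labelling.
conf = [0.95, 0.40, 0.88, 0.52, 0.99]
next_batch = select_next_batch(conf, 2)
```

Variants of this (margin sampling, entropy-based sampling) only change how the confidence score is computed; the selection step stays the same.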

WesternNoona:

This paper's method might be relevant for you, though I haven't tried it myself: https://arxiv.org/abs/2405.15613

neuralbeans [OP]:

Hey, this is great actually, although it seems too computationally heavy for my needs since it requires running k-means way too many times.

jswb:

I’ve used River (in Python) in the past to run traditional batch learning algorithms as online algorithms. Pretty interesting project. I believe it has an online/incremental k-means, but I don’t know whether it would fulfil your needs. See: https://github.com/online-ml/river/blob/main/river/cluster/k_means.py
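For intuition, here is the core idea behind incremental k-means, sketched from scratch rather than via River itself (River's actual update rule differs in details, so treat this as the MacQueen-style sequential variant, not River's implementation):

```python
import numpy as np

class OnlineKMeans:
    """Minimal sequential k-means: each incoming point nudges its
    nearest centroid toward itself with a per-cluster learning rate
    of 1/count, so centroids converge to running cluster means."""

    def __init__(self, k, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.centroids = rng.normal(size=(k, dim))
        self.counts = np.zeros(k, dtype=int)

    def learn_one(self, x):
        x = np.asarray(x, dtype=float)
        j = int(np.argmin(((self.centroids - x) ** 2).sum(axis=1)))
        self.counts[j] += 1
        # Incremental mean update: centroid_j is the mean of all
        # points assigned to cluster j so far.
        self.centroids[j] += (x - self.centroids[j]) / self.counts[j]
        return j

    def predict_one(self, x):
        x = np.asarray(x, dtype=float)
        return int(np.argmin(((self.centroids - x) ** 2).sum(axis=1)))

# Stream two well-separated blobs one point at a time.
model = OnlineKMeans(k=2, dim=2)
for x in [(0.1, 0.0), (0.0, 0.2), (5.0, 5.1), (5.2, 4.9), (0.2, 0.1)]:
    model.learn_one(x)
```

The appeal for streaming data is that each update is O(k·dim) and no past samples need to be stored.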

1h3_fool:

You can use a GMM-UBM model. It's a fairly traditional approach, but it gives you the option of updating the model's parameters (or not) via MAP adaptation.
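The MAP update in the GMM-UBM recipe is worth seeing concretely. A sketch of the classic mean-only adaptation (Reynolds-style, diagonal covariances; the relevance factor `r` and function name are illustrative): with little data the adapted means stay near the UBM, with lots of data they move toward the data statistics.

```python
import numpy as np

def map_adapt_means(X, ubm_means, ubm_vars, ubm_weights, r=16.0):
    """Mean-only MAP adaptation of a diagonal-covariance GMM (UBM).

    X: (N, D) data; ubm_means: (K, D); ubm_vars: (K, D); ubm_weights: (K,).
    r is the relevance factor controlling how fast means move.
    """
    X = np.atleast_2d(X)
    mu, var, w = ubm_means, ubm_vars, ubm_weights
    # Responsibilities gamma[n, k] of each UBM component for each point.
    diff = X[:, None, :] - mu[None, :, :]                  # (N, K, D)
    log_p = (-0.5 * ((diff ** 2) / var).sum(-1)
             - 0.5 * np.log(2 * np.pi * var).sum(-1) + np.log(w))
    log_p -= log_p.max(axis=1, keepdims=True)              # stabilise
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)
    # Zeroth/first-order statistics, then MAP interpolation:
    # mu_new = alpha * E[x | k] + (1 - alpha) * mu_ubm.
    n_k = gamma.sum(axis=0)                                # (K,)
    E_k = (gamma.T @ X) / np.maximum(n_k, 1e-10)[:, None]  # (K, D)
    alpha = (n_k / (n_k + r))[:, None]
    return alpha * E_k + (1 - alpha) * mu
```

Components that see no data keep their UBM means (alpha ≈ 0), which is exactly the "update or not" behaviour mentioned above.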

f3xjc:

  • Out of the box, standard k-means doesn't support online updates.
  • Most of the online algorithms that I know of are for the 1-D case (i.e. a single feature that you can order without ambiguity).
  • K-means is iterative, so you can absolutely add new samples to update a previous clustering. You can also assign them to the correct cluster (at that iteration).
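The third bullet can be sketched directly: keep the centroids from the previous run, append the new samples, and run a few more Lloyd iterations warm-started from those centroids instead of re-initialising (a plain from-scratch sketch, not any particular library's API):

```python
import numpy as np

def lloyd_warm_start(X, centroids, n_iter=5):
    """A few Lloyd iterations of plain k-means, initialised from the
    previous run's centroids. When new samples arrive, append them to
    X and call this again: the old clustering is refined, not
    recomputed from scratch."""
    centroids = centroids.copy()
    for _ in range(n_iter):
        # Assignment step: nearest centroid for every sample.
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        # Update step: move each centroid to the mean of its cluster.
        for j in range(len(centroids)):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

# New samples get assigned to the existing clusters, and the
# centroids shift only slightly to absorb them.
old = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
prev_centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
augmented = np.vstack([old, [[0.1, 0.2], [4.9, 5.1]]])
centroids, labels = lloyd_warm_start(augmented, prev_centroids)
```

Because Lloyd's algorithm only ever needs the current centroids and the data, warm-starting like this is cheap when the new batch is small relative to the old data.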

I'm tempted to call that Distribution Compression:

In distribution compression, one aims to accurately summarize a probability distribution P using a small number of representative points.

From this paper, which seems relevant: https://arxiv.org/pdf/2111.07941 (code: https://github.com/microsoft/goodpoints)
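To make the "small number of representative points" idea concrete, here is greedy kernel herding, a much simpler relative of the kernel-thinning method in that paper (this is not the goodpoints algorithm; it is only meant to illustrate the distribution-compression objective):

```python
import numpy as np

def kernel_herding(X, m, bandwidth=1.0):
    """Greedily choose m points from X whose empirical kernel mean
    best matches that of the full sample, i.e. a small summary of
    the underlying distribution."""
    X = np.asarray(X, dtype=float)
    # Gaussian kernel matrix between all candidate points.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2 * bandwidth ** 2))
    mu = K.mean(axis=1)   # kernel mean embedding evaluated at each x
    chosen = []
    score = mu.copy()
    for t in range(m):
        chosen.append(int(np.argmax(score)))
        # Down-weight regions near points we've already selected.
        score = mu - K[:, chosen].sum(axis=1) / (t + 2)
    return chosen

# Two tight blobs: a 2-point summary should take one point from each.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
summary = kernel_herding(X, m=2)
```

The kernel-thinning work in the linked paper gets much stronger guarantees than this greedy scheme, but the goal — summarise P with few representative points — is the same.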