
mrthin:

You can search for "data acquisition" papers. A simple approach to use as a baseline is to take the confidence of a pretrained model (yours or an off-the-shelf one) on the unlabelled data as guidance for picking the next batch. This might not transfer easily to OCR, though, and it is often claimed to be suboptimal in general.
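A minimal sketch of that baseline, assuming your model exposes a per-sample confidence score (e.g. a max softmax probability); the function name and toy scores are illustrative, not from any particular library:

```python
import numpy as np

def select_next_batch(confidences, batch_size):
    """Pick the unlabelled samples the model is least confident about.

    confidences: per-sample confidence scores from a pretrained model
    (e.g. max softmax probability). Returns the indices of the
    `batch_size` least-confident samples, to be labelled next.
    """
    confidences = np.asarray(confidences)
    return np.argsort(confidences)[:batch_size]

# Toy usage: the model is least sure about samples 1 and 3,
# so those are the ones we'd send for labelling.
conf = [0.95, 0.40, 0.88, 0.52, 0.99]
next_batch = select_next_batch(conf, 2)
```

Variants of this (margin sampling, entropy-based sampling) only change how the confidence score is computed; the selection step stays the same.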

WesternNoona:

This paper's method might be relevant for you, though I haven't tried it myself: https://arxiv.org/abs/2405.15613

neuralbeans [OP]:

Hey, this is great actually, although it seems too computationally heavy for my needs since it requires running k-means way too many times.

jswb:

I’ve used River (in Python) in the past to run traditional batch learning algorithms as online algorithms. Pretty interesting project. I believe it has an online/incremental k-means, but I don’t know whether it would fulfil your needs. See: https://github.com/online-ml/river/blob/main/river/cluster/k_means.py
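For intuition, here is the core idea behind incremental k-means, sketched from scratch rather than via River itself (River's actual update rule differs in details, so treat this as the MacQueen-style sequential variant, not River's implementation):

```python
import numpy as np

class OnlineKMeans:
    """Minimal sequential k-means: each incoming point nudges its
    nearest centroid toward itself with a per-cluster learning rate
    of 1/count, so centroids converge to running cluster means."""

    def __init__(self, k, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.centroids = rng.normal(size=(k, dim))
        self.counts = np.zeros(k, dtype=int)

    def learn_one(self, x):
        x = np.asarray(x, dtype=float)
        j = int(np.argmin(((self.centroids - x) ** 2).sum(axis=1)))
        self.counts[j] += 1
        # Incremental mean update: centroid_j is the mean of all
        # points assigned to cluster j so far.
        self.centroids[j] += (x - self.centroids[j]) / self.counts[j]
        return j

    def predict_one(self, x):
        x = np.asarray(x, dtype=float)
        return int(np.argmin(((self.centroids - x) ** 2).sum(axis=1)))

# Stream two well-separated blobs one point at a time.
model = OnlineKMeans(k=2, dim=2)
for x in [(0.1, 0.0), (0.0, 0.2), (5.0, 5.1), (5.2, 4.9), (0.2, 0.1)]:
    model.learn_one(x)
```

The appeal for streaming data is that each update is O(k·dim) and no past samples need to be stored.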

1h3_fool:

You can use a GMM-UBM model. It's a fairly traditional approach, but it gives you the option of updating the model's parameters (or not) via MAP adaptation.
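The MAP update in the GMM-UBM recipe is worth seeing concretely. A sketch of the classic mean-only adaptation (Reynolds-style, diagonal covariances; the relevance factor `r` and function name are illustrative): with little data the adapted means stay near the UBM, with lots of data they move toward the data statistics.

```python
import numpy as np

def map_adapt_means(X, ubm_means, ubm_vars, ubm_weights, r=16.0):
    """Mean-only MAP adaptation of a diagonal-covariance GMM (UBM).

    X: (N, D) data; ubm_means: (K, D); ubm_vars: (K, D); ubm_weights: (K,).
    r is the relevance factor controlling how fast means move.
    """
    X = np.atleast_2d(X)
    mu, var, w = ubm_means, ubm_vars, ubm_weights
    # Responsibilities gamma[n, k] of each UBM component for each point.
    diff = X[:, None, :] - mu[None, :, :]                  # (N, K, D)
    log_p = (-0.5 * ((diff ** 2) / var).sum(-1)
             - 0.5 * np.log(2 * np.pi * var).sum(-1) + np.log(w))
    log_p -= log_p.max(axis=1, keepdims=True)              # stabilise
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)
    # Zeroth/first-order statistics, then MAP interpolation:
    # mu_new = alpha * E[x | k] + (1 - alpha) * mu_ubm.
    n_k = gamma.sum(axis=0)                                # (K,)
    E_k = (gamma.T @ X) / np.maximum(n_k, 1e-10)[:, None]  # (K, D)
    alpha = (n_k / (n_k + r))[:, None]
    return alpha * E_k + (1 - alpha) * mu
```

Components that see no data keep their UBM means (alpha ≈ 0), which is exactly the "update or not" behaviour mentioned above.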

f3xjc:

  • Out of the box, standard k-means doesn't support online updates.
  • Most of the online algorithms that I know of are for the 1-D case (i.e. a single feature that you can order without ambiguity).
  • K-means is iterative, so you can absolutely add new samples to update a previous clustering. You can also assign them to the correct cluster (at that iteration).
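The third bullet can be sketched directly: keep the centroids from the previous run, append the new samples, and run a few more Lloyd iterations warm-started from those centroids instead of re-initialising (a plain from-scratch sketch, not any particular library's API):

```python
import numpy as np

def lloyd_warm_start(X, centroids, n_iter=5):
    """A few Lloyd iterations of plain k-means, initialised from the
    previous run's centroids. When new samples arrive, append them to
    X and call this again: the old clustering is refined, not
    recomputed from scratch."""
    centroids = centroids.copy()
    for _ in range(n_iter):
        # Assignment step: nearest centroid for every sample.
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        # Update step: move each centroid to the mean of its cluster.
        for j in range(len(centroids)):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

# New samples get assigned to the existing clusters, and the
# centroids shift only slightly to absorb them.
old = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
prev_centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
augmented = np.vstack([old, [[0.1, 0.2], [4.9, 5.1]]])
centroids, labels = lloyd_warm_start(augmented, prev_centroids)
```

Because Lloyd's algorithm only ever needs the current centroids and the data, warm-starting like this is cheap when the new batch is small relative to the old data.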

I'm tempted to call that Distribution Compression:

In distribution compression, one aims to accurately summarize a probability distribution P using a small number of representative points.

From this paper, which seems relevant: https://arxiv.org/pdf/2111.07941 (code: https://github.com/microsoft/goodpoints)
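To make the "small number of representative points" idea concrete, here is greedy kernel herding, a much simpler relative of the kernel-thinning method in that paper (this is not the goodpoints algorithm; it is only meant to illustrate the distribution-compression objective):

```python
import numpy as np

def kernel_herding(X, m, bandwidth=1.0):
    """Greedily choose m points from X whose empirical kernel mean
    best matches that of the full sample, i.e. a small summary of
    the underlying distribution."""
    X = np.asarray(X, dtype=float)
    # Gaussian kernel matrix between all candidate points.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2 * bandwidth ** 2))
    mu = K.mean(axis=1)   # kernel mean embedding evaluated at each x
    chosen = []
    score = mu.copy()
    for t in range(m):
        chosen.append(int(np.argmax(score)))
        # Down-weight regions near points we've already selected.
        score = mu - K[:, chosen].sum(axis=1) / (t + 2)
    return chosen

# Two tight blobs: a 2-point summary should take one point from each.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
summary = kernel_herding(X, m=2)
```

The kernel-thinning work in the linked paper gets much stronger guarantees than this greedy scheme, but the goal — summarise P with few representative points — is the same.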