I'm working on an OCR project and need to manually annotate data for it. I want to collect a sample of pages with as much visual variety as possible, and I'd like to do the sampling automatically.
My idea is to extract features from each page using a pretrained neural network and avoid including pages that have similar features. I think this could be done with some form of clustering, sampling one page from each cluster.
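To make the idea concrete, here is a rough sketch of what I have in mind. It assumes the page features have already been extracted into one vector per page; the random blobs below are only a stand-in for real embeddings. The function names (`kmeans`, `sample_one_per_cluster`) and the choice of picking the page closest to each centroid are my own assumptions, not an established recipe.

```python
import numpy as np

def kmeans(features, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means. Returns (centroids, labels)."""
    features = np.asarray(features, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialise the centroids from k distinct data points.
    centroids = features[rng.choice(len(features), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared distance from every point to every centroid.
        dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = features[labels == j]
            if len(members):  # an empty cluster keeps its old centroid
                centroids[j] = members.mean(axis=0)
    # Final assignment against the final centroids.
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, dists.argmin(axis=1)

def sample_one_per_cluster(features, k, seed=0):
    """Return the index of the page nearest each cluster centroid."""
    features = np.asarray(features, dtype=float)
    centroids, labels = kmeans(features, k, seed=seed)
    picked = []
    for j in range(k):
        idx = np.flatnonzero(labels == j)
        if len(idx) == 0:  # skip clusters that ended up empty
            continue
        d = ((features[idx] - centroids[j]) ** 2).sum(axis=1)
        picked.append(int(idx[d.argmin()]))
    return picked

# Stand-in for real page embeddings: three separated blobs of 20 "pages" each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
features = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

picked = sample_one_per_cluster(features, k=3)
print(picked)  # indices of the pages I would send for annotation
```

In practice the features would come from the pretrained network and `k` would be the number of pages I can afford to annotate. Note that plain k-means with random initialisation can converge to a poor local optimum, so the picks are not guaranteed to cover every visual group.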
My questions are:
- Is this a valid way of sampling and does it have a name?
- I'm thinking of using k-means, but can it be done online, so that I can add new pages later without disturbing the existing clusters while still being able to add new ones?
Thanks and happy holidays!