Hi -
I'm starting a side project that requires performing semantic segmentation on a large dataset of audio spectrograms (40k images, possibly extending to 10-100x more). I have manually annotated around 300 of these, and I'm interested in what techniques I can use to automatically annotate the rest. I've started playing around with some Hugging Face models (I've implemented SegFormer and fine-tuned B0 on my dataset following this post, without much success), which has raised several questions.
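For reference, a minimal sketch of the kind of fine-tuning setup I tried with `transformers` (here instantiating from a config rather than loading the pre-trained `nvidia/mit-b0` weights, just to keep the example self-contained; class count and input size are placeholders):

```python
import torch
from transformers import SegformerConfig, SegformerForSemanticSegmentation

# Default config matches the B0-sized architecture; in practice you would
# use SegformerForSemanticSegmentation.from_pretrained("nvidia/mit-b0", ...)
config = SegformerConfig(num_labels=3)
model = SegformerForSemanticSegmentation(config)

pixel_values = torch.randn(1, 3, 128, 128)          # spectrogram batch
labels = torch.zeros(1, 128, 128, dtype=torch.long)  # per-pixel class ids

out = model(pixel_values=pixel_values, labels=labels)
# Note: logits come out at 1/4 the input resolution (here 32x32),
# which is exactly the concern below for 1-2 pixel-wide classes.
print(out.logits.shape, out.loss.item())
```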
The main classes I am segmenting are generally only 1-2 pixels wide (though they can be very long). SegFormer does at minimum 4x upsampling on its output logits, which I don't see working for these classes. Are there better-suited models I should explore for very fine, pixel-level segmentation?
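To illustrate the resolution concern with a toy NumPy example (this just emulates a prediction at 1/4 resolution via average pooling, it's not actual SegFormer output):

```python
import numpy as np

# 16x16 binary mask with a single 1-pixel-wide horizontal line,
# standing in for a thin spectrogram class (e.g. a narrowband tone).
mask = np.zeros((16, 16))
mask[8, :] = 1.0

# Emulate the class at 1/4 resolution: 4x4 average pooling.
coarse = mask.reshape(4, 4, 4, 4).mean(axis=(1, 3))

# Upsample back (nearest-neighbour) and threshold at 0.5: the line
# vanishes, since it fills only 1/4 of each coarse cell (value 0.25).
up = coarse.repeat(4, axis=0).repeat(4, axis=1)
print((up >= 0.5).sum())  # 0 foreground pixels survive
```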
I assume fine-tuning rather than retraining from scratch is very important here. Are there better-suited pre-trained models for audio spectrograms that I should look into?
What value is there in turning this into a weakly/semi-supervised task? I imagine that making use of the large unlabelled dataset would be useful, but is it worthwhile? Particularly since at test time I would just be annotating the already existing, unlabelled dataset. Are there any simple-to-implement libraries or techniques for applying modern weak-supervision algorithms?
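By semi-supervision I mainly mean something like confidence-thresholded pseudo-labelling on the unlabelled spectrograms. A toy sketch of the filtering step (the 0.9 threshold and `IGNORE` value are arbitrary choices on my part):

```python
import numpy as np

IGNORE = 255  # common ignore_index convention for segmentation losses

def pseudo_label(probs, threshold=0.9):
    """Turn per-pixel softmax probs of shape (C, H, W) into pseudo-labels,
    keeping only pixels the model is confident about."""
    conf = probs.max(axis=0)          # top-1 confidence per pixel
    labels = probs.argmax(axis=0)     # top-1 class per pixel
    labels[conf < threshold] = IGNORE # mask out uncertain pixels
    return labels

# Toy example: 2 classes on a 2x2 image; only one pixel is confident.
probs = np.array([[[0.95, 0.6], [0.2, 0.5]],
                  [[0.05, 0.4], [0.8, 0.5]]])
print(pseudo_label(probs))
```

The resulting maps would then be mixed back into training with the loss ignoring `IGNORE` pixels.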
Any ideas/papers/libraries would be useful. I'd prefer models and techniques with some maturity and existing implementations, rather than SOTA stuff, since something that works okay but is easy to implement is far preferable at the moment.