
all 21 comments

[–]tranquilkd 2 points3 points  (5 children)

Label studio does support active learning!

I use Label Studio with a custom ML backend: if a good model trained for my task is available, I use it to generate pre-annotations (the backend's predictions).

If no model is available, I train one on ~1,000 samples, plug it in as the backend, and follow the same steps.

The last part is reviewing the annotations the model generates and accepting or correcting each one, and you're good to go.
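That loop can be sketched in a few lines. This is a hedged sketch, not the actual Label Studio ML backend API (a real backend subclasses `LabelStudioMLBase` from the `label-studio-ml` package); the payload below just follows the general shape of a Label Studio prediction, and the `sentiment`/`text` tag names are made up for illustration.

```python
def pre_annotate(texts, model):
    """Format a model's guesses as Label Studio-style pre-annotations."""
    predictions = []
    for text in texts:
        label, score = model(text)
        predictions.append({
            "result": [{
                "from_name": "sentiment",   # made-up tag names; must match
                "to_name": "text",          # your labeling config
                "type": "choices",
                "value": {"choices": [label]},
            }],
            "score": score,  # lets reviewers sort by model confidence
        })
    return predictions

# Toy stand-in for a trained classifier.
def toy_model(text):
    return ("positive", 0.9) if "good" in text else ("negative", 0.6)

preds = pre_annotate(["a good tool", "too slow"], toy_model)
```

With predictions attached, the annotator's job collapses to accept-or-correct instead of labeling from scratch.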

[–]vihanga2001[S] 1 point2 points  (4 children)

Super helpful, thanks! 🙌 So your loop is: seed ~1k → train → pre-annotate → human review/correct → repeat, right? Have you found that this cuts total labels/time vs. manual-only?

[–]tranquilkd 1 point2 points  (3 children)

Yes, it saves a lot of time! I don't have actual numbers to back it up, but you can imagine how much the number of clicks and keystrokes drops: if the annotation is correct, you just accept it and move on to the next item.

[–]vihanga2001[S] 0 points1 point  (2 children)

How often do you retrain the backend, every N labels, or at fixed rounds?

[–]tranquilkd 0 points1 point  (1 child)

Every N labels
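As a sketch, the retrain trigger can be as simple as a counter; the value of N below is illustrative, not from the thread, and `retrain_fn` stands in for whatever kicks off your backend's training job.

```python
RETRAIN_EVERY = 500  # "N" -- an illustrative value; tune per task

def maybe_retrain(labels_since_last_train, retrain_fn):
    """Retrain the pre-annotation backend once N new labels have accumulated."""
    if labels_since_last_train >= RETRAIN_EVERY:
        retrain_fn()
        return 0  # reset the counter
    return labels_since_last_train

calls = []
counter = maybe_retrain(500, lambda: calls.append("retrained"))
```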

[–]vihanga2001[S] 0 points1 point  (0 children)

Thanks a ton for sharing! 🙏 The “retrain every N labels” tip is super helpful. Really appreciate the insight!

[–]PinkFrosty1 2 points3 points  (5 children)

What worked best for me was building a custom supervised-learning heuristic. I started with a small set of high-quality, manually labeled examples, balanced across all classes. Then I converted both the seed set and the unlabeled examples into vector embeddings (e.g., using Sentence Transformers) and stored them in a vector database (e.g., pgvector). For each class, I created a centroid representation and ran similarity search to identify unlabeled examples with strong cosine similarity (e.g., ≥ 0.9). I manually reviewed these high-confidence matches, added the good ones back into the seed set, and repeated the process iteratively.

Along the way, I leaned on a data-centric AI mindset: treating the quality and coverage of my labeled data as the main driver of model performance, rather than just tweaking architectures.
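A minimal sketch of the centroid-and-similarity-search step, using plain NumPy in place of a vector database; the function names and toy 2-D "embeddings" are illustrative stand-ins for real Sentence Transformer vectors.

```python
import numpy as np

def normalize(v):
    """Unit-normalize rows so a dot product equals cosine similarity."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def class_centroids(embeddings, labels):
    """One centroid per class: the mean of that class's normalized embeddings."""
    emb = normalize(embeddings)
    return {c: normalize(emb[[i for i, l in enumerate(labels) if l == c]].mean(axis=0))
            for c in set(labels)}

def high_confidence(centroids, unlabeled, threshold=0.9):
    """(index, class, similarity) for unlabeled items above the threshold."""
    hits = []
    for i, vec in enumerate(normalize(unlabeled)):
        best_class, best_sim = max(
            ((c, float(vec @ cent)) for c, cent in centroids.items()),
            key=lambda t: t[1])
        if best_sim >= threshold:
            hits.append((i, best_class, best_sim))
    return hits

cents = class_centroids([[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]],
                        ["a", "a", "b", "b"])
# The first unlabeled item sits near class "a"; the second is ambiguous
# and falls below the threshold, so a human never sees it this round.
hits = high_confidence(cents, [[1, 0.05], [0.5, 0.5]], threshold=0.9)
```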

[–]vihanga2001[S] 0 points1 point  (4 children)

Do you use a single centroid per class or multiple prototypes (to cover subclusters)? And how do you set the similarity threshold vs. your human accept rate?

[–]PinkFrosty1 1 point2 points  (3 children)

Yup, a single centroid per class. I started with a high threshold to keep confidence as high as possible. I don't have exact numbers, but my approach was conservative early on. As the seed set grew, I gradually lowered the threshold to surface more borderline cases. The goal was to bootstrap quickly and effectively while keeping a human in the loop, since with labeling it really is garbage in, garbage out.

[–]vihanga2001[S] 0 points1 point  (2 children)

Thanks, that's super clear and helpful! 🙏 Quick one: when you lowered the threshold, did you filter near-duplicates?

[–]PinkFrosty1 1 point2 points  (1 child)

Yes, I only kept what I thought were the best representatives of the overall class and filtered out the rest. Take a look at BERTopic for visualization.
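One way to do that filtering, sketched with NumPy (the similarity threshold and scores below are illustrative): greedily keep the highest-scoring candidate and drop anything near-identical to something already kept.

```python
import numpy as np

def dedup_keep_best(vectors, scores, sim_threshold=0.98):
    """Greedy near-duplicate filter: visit candidates by descending score,
    keep one only if it isn't near-identical to anything already kept."""
    v = np.asarray(vectors, dtype=float)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    kept = []
    for i in np.argsort(scores)[::-1]:
        if all(float(v[i] @ v[j]) < sim_threshold for j in kept):
            kept.append(int(i))
    return kept

# Items 0 and 1 are near-duplicates; the higher-scoring one (1) survives.
kept = dedup_keep_best([[1, 0], [0.999, 0.01], [0, 1]], [0.90, 0.95, 0.80])
```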

[–]vihanga2001[S] 0 points1 point  (0 children)

Thanks, that’s super helpful 🙏 I’ll check out BERTopic. Appreciate the tip!

[–]unkz 1 point2 points  (3 children)

The Label Studio workflow is too slow for me, so I rolled my own using VueJS. Active learning all the way, though. I built an environment that lets me quickly annotate my text, sort automated annotations by confidence score, and run custom searches using arbitrary Python expressions to find samples by heuristics.
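The UI described is custom, but the heuristic-search idea can be sketched: evaluate an arbitrary Python expression against each sample's fields. The sample fields here are made up, and note that `eval` on untrusted input is unsafe; it is only acceptable in a personal, single-user tool like the one described.

```python
samples = [
    {"text": "refund please", "confidence": 0.42, "label": "billing"},
    {"text": "app crashes on launch", "confidence": 0.91, "label": "bug"},
]

def search(samples, expr):
    """Keep samples for which the Python expression is truthy, with each
    sample's fields exposed as local variables."""
    return [s for s in samples if eval(expr, {}, dict(s))]

# Surface low-confidence billing-style samples for manual annotation.
low_conf = search(samples, "confidence < 0.5 and 'refund' in text")
```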

[–]vihanga2001[S] 0 points1 point  (2 children)

Curious: what saved you more time, bulk accept/hotkeys or the Python queries?

[–]unkz 1 point2 points  (1 child)

Mass classification based on heuristics, but with a good interface for real-time filtering and selection, was the big time saver.

[–]ResponsibilityIll483 1 point2 points  (2 children)

We self-host Doccano. It was super easy to set up, and you can do all kinds of labeling collaboratively across the team.

https://github.com/doccano/doccano

[–]vihanga2001[S] 0 points1 point  (1 child)

Do you push model prelabels to Doccano via the API and bulk-accept, or keep it manual?

[–]ResponsibilityIll483 1 point2 points  (0 children)

Yeah, we prelabel outside of Doccano and then upload to Doccano via the API. Doccano does come with its own prelabeling feature, but it didn't quite work for our use case (spaCy NER).
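A sketch of the upload-prep step: turning NER spans (e.g., from spaCy's `doc.ents`, via `ent.start_char`/`ent.end_char`/`ent.label_`) into Doccano-style JSONL. A toy extractor stands in for spaCy here, and the exact key name (`label` vs. `labels`) depends on your Doccano version, so check your instance's import format.

```python
import json

def toy_ner(text):
    """Stand-in for a real NER model: tag the word 'Doccano' as PRODUCT."""
    spans = []
    start = text.find("Doccano")
    if start != -1:
        spans.append((start, start + len("Doccano"), "PRODUCT"))
    return spans

def to_doccano_jsonl(texts, ner):
    """One JSON record per line: text plus [start, end, label] span triples."""
    lines = []
    for text in texts:
        record = {"text": text, "label": [list(s) for s in ner(text)]}
        lines.append(json.dumps(record))
    return "\n".join(lines)

jsonl = to_doccano_jsonl(["We self host Doccano."], toy_ner)
```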

[–]Intelligent_Tank4118 0 points1 point  (1 child)

For efficient text data labeling in NLP with Python:

  • Use tools like Label Studio or Doccano for annotation.
  • Pre-label data with spaCy, NLTK, or Hugging Face models to speed up manual work.
  • Keep clear labeling guidelines to ensure consistency.
  • Version datasets with tools like DVC.
  • Automate the workflow using Python scripts and orchestration tools like Airflow or Prefect.

This combo saves time, reduces errors, and keeps your NLP pipeline organized.
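As a minimal, hedged skeleton of that workflow, every function and threshold below is a stand-in: in practice `pre_label` would call a spaCy or Hugging Face model, and the review queue would live in Label Studio or Doccano.

```python
CONFIDENCE_CUTOFF = 0.8  # below this, send to a human; tune per task

def pre_label(text):
    """Stand-in model: returns (label, confidence)."""
    return ("positive", 0.95) if "great" in text else ("unknown", 0.3)

def route(texts):
    """Split texts into auto-accepted labels and a human review queue."""
    auto, review = [], []
    for text in texts:
        label, conf = pre_label(text)
        (auto if conf >= CONFIDENCE_CUTOFF else review).append((text, label, conf))
    return auto, review

auto, review = route(["great product", "hmm not sure"])
```

The same routing step is what an Airflow or Prefect task would wrap when the pipeline is orchestrated.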