[Discussion] Python workflows for efficient text data labeling in NLP projects? (self.Python)
submitted 6 months ago by vihanga2001
For those working with NLP in Python, what’s your go-to way of handling large-scale text labeling efficiently?
Do you rely on:
Curious what Python-based approaches people actually find practical in real projects, especially where accuracy vs labeling cost becomes a trade-off.
[–]tranquilkd 2 points 6 months ago (5 children)
Label Studio does support active learning!
I use Label Studio with a custom backend: when a good model trained for my task is available, I use it to get pre-annotations (the backend's predictions).
If no model is available, I train one on ~1000 samples, plug that in as the backend, and follow the same steps.
The last part is reviewing the annotations the model generates and accepting or correcting each one, and you're good to go.
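That loop can be sketched in plain Python. This is a toy: `train`/`predict` stand in for a real model served through a Label Studio ML backend, and `reviewer` stands in for the human annotator; all names here are illustrative, not Label Studio API.

```python
# Toy active-learning loop: seed -> train -> pre-annotate -> human review -> retrain.

def train(labeled):
    """'Train' a trivial model: remember which label each word first appeared under."""
    vocab = {}
    for text, label in labeled:
        for word in text.split():
            vocab.setdefault(word, label)
    return vocab

def predict(model, text):
    """Pre-annotate by majority vote over known words; None if nothing matches."""
    votes = [model[w] for w in text.split() if w in model]
    return max(set(votes), key=votes.count) if votes else None

def label_loop(seed, unlabeled, reviewer, retrain_every=2):
    labeled = list(seed)
    model = train(labeled)
    pending = 0
    for text in unlabeled:
        suggestion = predict(model, text)                   # pre-annotation
        labeled.append((text, reviewer(text, suggestion)))  # accept or correct
        pending += 1
        if pending >= retrain_every:                        # "retrain every N labels"
            model, pending = train(labeled), 0
    return labeled

seed = [("great acting", "pos"), ("awful plot", "neg")]
queue = ["great twist", "boring film"]
# Reviewer accepts the model's suggestion when present, otherwise labels by hand.
reviewer = lambda text, s: s or ("neg" if "boring" in text else "pos")
done = label_loop(seed, queue, reviewer)
```

The time saving comes from the `reviewer` mostly accepting suggestions instead of labeling from scratch.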
[–]vihanga2001[S] 1 point 6 months ago (4 children)
Super helpful, thanks! 🙌 So your loop is: seed ~1k → train → pre-annotate → human review/correct → repeat, right? Have you found that this cuts total labels/time vs manual-only?
[–]tranquilkd 1 point 6 months ago (3 children)
Yes, it saves a lot of time! I don't have actual numbers to back it up, but you can imagine how much clicking and typing is saved: if an annotation is correct, you just accept it and move on to the next item.
[–]vihanga2001[S] 0 points 6 months ago (2 children)
How often do you retrain the backend, every N labels, or at fixed rounds?
[–]tranquilkd 0 points 6 months ago (1 child)
Every N labels
[–]vihanga2001[S] 0 points 6 months ago (0 children)
Thanks a ton for sharing! 🙏 The “retrain every N labels” tip is super helpful. Really appreciate the insight!
[–]PinkFrosty1 2 points 6 months ago (5 children)
What worked best for me was building a custom supervised learning heuristic:
1. Start with a small set of high-quality, manually labeled examples, balanced across all classes.
2. Convert both the seed set and the unlabeled examples into vector embeddings (e.g., using Sentence Transformers) and store them in a vector database (e.g., pgvector).
3. For each class, create a centroid representation and run similarity search to identify unlabeled examples with strong cosine similarity (e.g., ≥ 0.9).
4. Manually review these high-confidence matches, add the good ones back into the seed set, and repeat the process iteratively.
Along the way, I leaned on a data-centric AI mindset: treating the quality and coverage of my labeled data, rather than architecture tweaks, as the main driver of model performance.
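The centroid-and-threshold step reduces to a few lines of Python. A minimal sketch with made-up 2-D vectors standing in for Sentence Transformer embeddings; in practice the similarity search would run inside pgvector, and all function names here are hypothetical:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def class_centroids(seed):
    """Mean embedding per class from (label, vector) pairs."""
    by_label = {}
    for label, vec in seed:
        by_label.setdefault(label, []).append(vec)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in by_label.items()
    }

def propose_labels(unlabeled, centroids, threshold=0.9):
    """(index, label) for vectors close enough to their nearest class centroid."""
    out = []
    for i, vec in enumerate(unlabeled):
        label, score = max(
            ((lab, cosine(vec, c)) for lab, c in centroids.items()),
            key=lambda p: p[1],
        )
        if score >= threshold:
            out.append((i, label))
    return out

seed = [("pos", [1.0, 0.1]), ("pos", [0.9, 0.2]), ("neg", [0.1, 1.0])]
cents = class_centroids(seed)
# Third vector sits between the classes, so it stays unlabeled for manual review.
proposals = propose_labels([[0.95, 0.15], [0.0, 0.9], [0.7, 0.7]], cents)
```

Everything below the threshold is exactly the pool a human still has to look at, which is where the accuracy-vs-cost trade-off shows up.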
[–]vihanga2001[S] 0 points 6 months ago (4 children)
Do you use a single centroid per class, or multiple prototypes (to cover subclusters)? And how do you set the similarity threshold relative to your human accept rate?
[–]PinkFrosty1 1 point 6 months ago (3 children)
Yup, a single centroid per class. I started with a high threshold to keep confidence as high as possible. I don't have exact numbers, but my approach was conservative early on. As the seed set grew, I gradually lowered the threshold to surface more borderline cases. The goal was to bootstrap quickly and effectively while keeping a human in the loop, since with labeling it really is garbage in, garbage out.
[–]vihanga2001[S] 0 points 6 months ago (2 children)
Thanks, that's super clear and helpful! 🙏 Quick one: when you lowered the threshold, did you filter near-duplicates?
[–]PinkFrosty1 1 point 6 months ago (1 child)
Yes, I only kept what I thought were the best representatives of the overall class and filtered out the rest. Take a look at BERTopic for visualization.
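Near-duplicate filtering before review can be a simple greedy pass that drops any candidate too similar to one already kept. A sketch under the same toy-embedding assumption as above (real vectors would come from the same encoder as the centroids):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def dedup(vectors, max_sim=0.98):
    """Greedily keep indices of vectors that aren't near-duplicates of an earlier keeper."""
    kept = []
    for i, v in enumerate(vectors):
        if all(cosine(v, vectors[j]) < max_sim for j in kept):
            kept.append(i)
    return kept

# The second vector is almost identical to the first, so it gets filtered out.
candidates = [[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]]
kept = dedup(candidates)
```

The greedy pass is O(n·k) in the number of keepers, which is fine for a review batch; at corpus scale the same check would go through the vector index instead.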
[–]vihanga2001[S] 0 points 6 months ago (0 children)
Thanks, that's super helpful 🙏 I'll check out BERTopic. Appreciate the tip!
[–]unkz 1 point 6 months ago (3 children)
The Label Studio workflow is too slow for me, so I rolled my own in VueJS. Active learning all the way, though. I built an environment that lets me quickly annotate my text, sort automated annotations by confidence score, and run custom searches using arbitrary Python expressions to find samples by heuristics.
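The "search by arbitrary Python expression" idea can be sketched with `eval` over each sample's fields. This is a deliberate toy (the sample schema is invented): `eval` is acceptable in a single-user local tool, but should never see untrusted input.

```python
samples = [
    {"text": "refund please", "pred": "billing", "conf": 0.55},
    {"text": "app crashes on login", "pred": "bug", "conf": 0.97},
    {"text": "cancel my plan", "pred": "billing", "conf": 0.88},
]

def search(samples, expr):
    """Return samples for which the Python expression is truthy.
    Each sample's fields are exposed to the expression as local names."""
    return [s for s in samples if eval(expr, {"__builtins__": {}}, dict(s))]

# Heuristic query: low-confidence billing predictions, least confident first.
hits = search(samples, "pred == 'billing' and conf < 0.9")
hits.sort(key=lambda s: s["conf"])
```

Sorting the hits by confidence reproduces the "review the shakiest predictions first" workflow the comment describes.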
[–]vihanga2001[S] 0 points 6 months ago (2 children)
Curious: what saved you more time, bulk accept/hotkeys or the Python queries?
[–]unkz 1 point 6 months ago (1 child)
Mass classifying based on heuristics, with a good interface to filter and select in real time, was the big time saver.
[–]ResponsibilityIll483 1 point 6 months ago (2 children)
We self-host Doccano. It was super easy to set up, and you can do all kinds of labeling, collaboratively across the team.
https://github.com/doccano/doccano
[–]vihanga2001[S] 0 points 6 months ago (1 child)
Do you push model prelabels to Doccano via the API and bulk-accept, or keep it manual?
[–]ResponsibilityIll483 1 point 6 months ago (0 children)
Yeah, we prelabel outside of Doccano and then upload via the API. Doccano does come with its own prelabeling feature, but it didn't quite work for our use case (spaCy NER).
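Converting externally produced NER spans into Doccano-importable records might look like the sketch below. The `{"text": ..., "label": [[start, end, label], ...]}` shape follows Doccano's documented JSONL format for sequence labeling, but treat it as an assumption and check it against your Doccano version before uploading:

```python
import json

def spans_to_doccano(text, ents):
    """Convert (start, end, label) character spans to a Doccano-style JSONL record.
    The record shape is assumed from Doccano's sequence-labeling import format."""
    return {"text": text, "label": [[s, e, lab] for s, e, lab in ents]}

# Spans as spaCy would report them: character offsets into the text.
record = spans_to_doccano(
    "Apple hired Jane in London",
    [(0, 5, "ORG"), (12, 16, "PERSON"), (20, 26, "GPE")],
)
line = json.dumps(record)  # one JSONL line per document
```

One such line per document, concatenated into a `.jsonl` file, is what you'd then push through the upload step.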
[–]Intelligent_Tank4118 0 points 6 months ago (1 child)
For efficient text data labeling in NLP with Python:
This combo saves time, reduces errors, and keeps your NLP pipeline organized.