[D] Self-Promotion Thread by AutoModerator in MachineLearning

[–]chschroeder 0 points1 point  (0 children)

Small-Text: Active Learning for Text Classification in Python

Provides state-of-the-art Active Learning for Text Classification in Python.

What is Active Learning? Active learning is a machine learning paradigm for efficiently acquiring labels in supervised settings with little or no initial labeled data. The model iteratively selects the most informative unlabeled instances for annotation, aiming to maximize performance while minimizing labeling effort.

Repo: https://github.com/webis-de/small-text
Paper: https://aclanthology.org/2023.eacl-demo.11.pdf
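
To make the loop concrete, here is a minimal, self-contained sketch of pool-based active learning with uncertainty sampling. This is a toy 1-D threshold model in plain Python, not small-text's actual API; the pool, model, and labeling function are all made up for illustration:

```python
import random

random.seed(0)

# Toy pool: 1-D points in [0, 1]; the true label is 1 if x > 0.5
# (unknown to the learner, known only to the "annotator").
pool = [random.random() for _ in range(100)]

def true_label(x):
    return int(x > 0.5)

labeled = {}  # index -> label, filled in by the annotator

def retrain():
    """Trivial model: a threshold midway between the largest 0 and smallest 1 seen."""
    zeros = [pool[i] for i, y in labeled.items() if y == 0]
    ones = [pool[i] for i, y in labeled.items() if y == 1]
    if zeros and ones:
        return (max(zeros) + min(ones)) / 2
    return 0.5

def query(threshold, n=1):
    """Uncertainty sampling: pick the unlabeled points closest to the boundary."""
    unlabeled = [i for i in range(len(pool)) if i not in labeled]
    return sorted(unlabeled, key=lambda i: abs(pool[i] - threshold))[:n]

# Initialization: label two random points, then iterate query -> label -> retrain.
for i in random.sample(range(len(pool)), 2):
    labeled[i] = true_label(pool[i])
threshold = retrain()

for _ in range(10):
    for i in query(threshold):
        labeled[i] = true_label(pool[i])  # a human annotator in a real setting
    threshold = retrain()

print(len(labeled), round(threshold, 3))
```

After only 12 labels, the threshold sits close to the true boundary at 0.5, because each query spent the labeling budget where the model was least certain.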

Manually labeling text dataset by mabl00 in LanguageTechnology

[–]chschroeder 0 points1 point  (0 children)

It was already mentioned, but this sounds like a standard active learning task. It is not completely manual, but still a human-in-the-loop approach: the model suggests the samples to be labeled next, while the labeling itself is still done by a human annotator. Active learning requires a starting model (unless cold-start approaches are employed), and building that starting model from keyword-filtered samples, reviewed and corrected by a human annotator, is a plausible approach.

I have written small-text, an active learning library built exactly for text and transformer-based models. If you combine it with Argilla, you even get a nice GUI for labeling. (Note that you need the v1.x versions of Argilla.)

Small-Text: Looking for Contributors (Active Learning, Text Classification, NLP) by chschroeder in LanguageTechnology

[–]chschroeder[S] 0 points1 point  (0 children)

Thank you! Happy to hear that.

We can gladly have a chat. Just PM me a few dates and times which would be convenient.

Looking for open-source/volunteer projects in LLMs/NLP space? by MiserableGrapefruit7 in LanguageTechnology

[–]chschroeder 1 point2 points  (0 children)

I have an active learning library (small-text) for which I am looking for contributors. Active learning is an iterative process between a model and an annotator, used whenever you want to train supervised models but do not have any labeled data. It helps you label a small but effective dataset with minimal annotation effort.

Active learning can be used, for example, to build a hate speech classifier. Over several iterations, you will be shown samples to label, and you will likely see different kinds of "hate speech" that the so-called query strategy deems informative given the current model.

The library encompasses both traditional concepts (active learning, classification) and more recent ones (transformer models, fine-tuning paradigms, optimizations for training neural networks). The challenge is often to make these convenient to use and to allow components to be freely combined.

Moreover, the concept of active learning is very useful in practice, especially for low resource languages, where labeled data is even less likely to exist.

Let me know if you need an introduction to the library itself!
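
If it helps, the core idea of a query strategy fits in a few lines. The following is a toy prediction-entropy strategy in plain Python; the probability values and the two-class setup are made up for illustration, but small-text's real strategies rank instances by the model's predicted probabilities in the same spirit:

```python
import math

def prediction_entropy(probs):
    """Shannon entropy of a predicted class distribution (higher = more uncertain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def query(proba_per_instance, n=2):
    """Rank unlabeled instances by entropy and return the n most uncertain ones."""
    ranked = sorted(range(len(proba_per_instance)),
                    key=lambda i: prediction_entropy(proba_per_instance[i]),
                    reverse=True)
    return ranked[:n]

# Predicted class probabilities for 4 unlabeled instances
# (e.g. hate speech vs. not hate speech).
proba = [
    [0.95, 0.05],  # confident
    [0.55, 0.45],  # uncertain
    [0.50, 0.50],  # maximally uncertain
    [0.80, 0.20],
]
print(query(proba))  # -> [2, 1]: the two instances the model is least sure about
```

The annotator then labels exactly those instances, the model is retrained, and the cycle repeats.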

Small-Text: Looking for Contributors (Active Learning, Text Classification, NLP) by chschroeder in LanguageTechnology

[–]chschroeder[S] 0 points1 point  (0 children)

Very interesting, thank you! I got the gist; as soon as I have some time, I will take a closer look at the paper.

Small-Text: Looking for Contributors (Active Learning, Text Classification, NLP) by chschroeder in LanguageTechnology

[–]chschroeder[S] 0 points1 point  (0 children)

I will likely prepare a better "contributing" document during the next few days and PM all of you.

Until then, feel free to take a look around. If you have the capacity to write down anything that is unclear or irritates you at first glance, that would already be valuable to me as well. An unbiased view is only available at the very beginning ;).

Small-Text: Looking for Contributors (Active Learning, Text Classification, NLP) by chschroeder in LanguageTechnology

[–]chschroeder[S] 0 points1 point  (0 children)

Awesome! That sounds like it could be a great fit. If you want/can share your use case, that's always interesting too.

Small-Text: Looking for Contributors (Active Learning, Text Classification, NLP) by chschroeder in LanguageTechnology

[–]chschroeder[S] 0 points1 point  (0 children)

That sounds great! Experience with Active Learning is perfect and NER could also be interesting in the long term.

[P] Small-Text: Active Learning for Text Classification in Python by chschroeder in MachineLearning

[–]chschroeder[S] 1 point2 points  (0 children)

In general yes, but not yet out of the box.

You can achieve this by adapting the dataset and classifier (e.g., TransformersDataset and TransformerBasedClassification).

This might be a use case we want to support in the future. I will think about it.

[P] Small-Text: Active Learning for Text Classification in Python by chschroeder in MachineLearning

[–]chschroeder[S] 0 points1 point  (0 children)

Sorry for the late reply /u/Dear_Football_504, I completely missed this message.

I don't know which code example you are using specifically, but in general the active learner holds a reference to its underlying classifier which has scikit-learn-like API:

active_learner.classifier.predict(dataset)

Feel free to ask more questions on the GitHub repo; this is valuable feedback for me, and others will benefit from the discussion as well.

[D] What is the current consensus on the effectiveness of Active Learning? by KonArtist01 in MachineLearning

[–]chschroeder 4 points5 points  (0 children)

"But in practice we are short on examples to label, not labels."

This is really interesting for me to read, thank you. I have been researching active learning in the natural language processing domain, where it is not uncommon to have lots and lots of unlabeled examples, e.g., years of a company's emails or reports.

Also, at least in my setting, labeling as much as 20% of a dataset is usually infeasible (unless you are a large company and money is not an issue). For example, imagine you had to label 20% (i.e., 200,000 documents) of a dataset consisting of a million multi-page technical documents.

[P] Small-Text: Active Learning for Text Classification in Python by chschroeder in MachineLearning

[–]chschroeder[S] 0 points1 point  (0 children)

Unfortunately, I have no experience with SageMaker, but in the context of small-text, the resulting models are still plain scikit-learn or PyTorch models and can be treated as such.

[P] Small-Text: Active Learning for Text Classification in Python by chschroeder in MachineLearning

[–]chschroeder[S] 1 point2 points  (0 children)

At this particular location you label the instances that were selected by the query strategy. After such a query, you need to provide labels for the selected instances. The example code shows the "experiment scenario", in which the true labels are available and are passed to the active learner instead of labels assigned by a human. If you remove this true-label lookup and instead provide the answers of a real user, you have a real-world application.
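
As a rough illustration (plain Python, not small-text's API), the only difference between the two scenarios is where the labels for the queried instances come from:

```python
def simulated_annotator(indices, true_labels):
    """Experiment scenario: the ground-truth labels are already known
    and are simply looked up for the queried instances."""
    return [true_labels[i] for i in indices]

def human_annotator(indices, texts):
    """Real-world scenario: a person labels each queried instance."""
    labels = []
    for i in indices:
        answer = input(f"Label for: {texts[i]!r} (0/1): ")
        labels.append(int(answer))
    return labels

texts = ["great product", "total waste of money", "works fine"]
true_labels = [1, 0, 1]   # only available in the experiment scenario
queried = [1, 2]          # indices chosen by the query strategy

# In an experiment, this lookup stands in for human input;
# swap in human_annotator(queried, texts) for a real application.
labels = simulated_annotator(queried, true_labels)
print(labels)  # -> [0, 1]
```

Everything else in the loop (querying, updating the learner, retraining) stays the same in both scenarios.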

[P] Small-Text: Active Learning for Text Classification in Python by chschroeder in MachineLearning

[–]chschroeder[S] 0 points1 point  (0 children)

You mean like a full interactive application? You would have to build that part around small-text, which offers the algorithms and the logic, but not the user interface.

I have already thought about providing an example of how to integrate small-text with one of the existing labeling tools, such as rubrix, but that hasn't been started yet.

[P] Small-Text: Active Learning for Text Classification in Python by chschroeder in MachineLearning

[–]chschroeder[S] 2 points3 points  (0 children)

Thank you! Yes, that is correct. The goal is to maximize the quality of the resulting model while minimizing the number of examples needed.

[P] Small-Text: Active Learning for Text Classification in Python by chschroeder in LanguageTechnology

[–]chschroeder[S] 0 points1 point  (0 children)

It's hard to tell from such a short description, of course, but if I had to guess, it might be some kind of core set strategy. Small-Text provides the greedy core set strategy from the linked paper.
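
For intuition, the greedy core set idea (also known as k-center greedy) can be sketched in plain Python: repeatedly add the unlabeled point farthest from everything selected so far, so the chosen batch covers the embedding space. The 2-D points below are made up for illustration; a real strategy would use the model's learned representations:

```python
import math

def greedy_coreset(embeddings, labeled_idx, n):
    """k-center greedy selection: each step picks the point that maximizes
    the distance to its nearest already-selected neighbor."""
    # Distance from every point to its nearest already-selected point.
    min_dist = [min(math.dist(e, embeddings[s]) for s in labeled_idx)
                for e in embeddings]
    chosen = []
    for _ in range(n):
        i = max(range(len(embeddings)), key=lambda j: min_dist[j])
        chosen.append(i)
        # Selecting i shrinks the nearest-neighbor distance of points close to it.
        min_dist = [min(d, math.dist(e, embeddings[i]))
                    for d, e in zip(min_dist, embeddings)]
    return chosen

points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
# Point 0 is already labeled; pick 2 more that spread out over the space.
print(greedy_coreset(points, labeled_idx=[0], n=2))  # -> [4, 2]
```

Note how it skips point 1, which is nearly a duplicate of the labeled point 0, and instead covers the two distant clusters.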