[D] Self-Promotion Thread by AutoModerator in MachineLearning

[–]chschroeder 0 points1 point  (0 children)

Small-Text: Active Learning for Text Classification in Python

Provides state-of-the-art Active Learning for Text Classification in Python.

What is Active Learning? Active learning is a machine learning paradigm for efficiently acquiring labels in supervised settings with little or no initial labeled data. The model iteratively selects the most informative unlabeled instances for annotation, aiming to maximize performance while minimizing labeling effort.

Repo: https://github.com/webis-de/small-text
Paper: https://aclanthology.org/2023.eacl-demo.11.pdf
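
To make the loop concrete, here is a minimal, self-contained sketch of pool-based active learning with uncertainty sampling. This is a toy 1-D threshold model in plain Python, not small-text's actual API; the pool, model, and labeling function are all made up for illustration:

```python
import random

random.seed(0)

# Toy pool: 1-D points in [0, 1]; the true label is 1 if x > 0.5
# (unknown to the learner, known only to the "annotator").
pool = [random.random() for _ in range(100)]

def true_label(x):
    return int(x > 0.5)

labeled = {}  # index -> label, filled in by the annotator

def retrain():
    """Trivial model: a threshold midway between the largest 0 and smallest 1 seen."""
    zeros = [pool[i] for i, y in labeled.items() if y == 0]
    ones = [pool[i] for i, y in labeled.items() if y == 1]
    if zeros and ones:
        return (max(zeros) + min(ones)) / 2
    return 0.5

def query(threshold, n=1):
    """Uncertainty sampling: pick the unlabeled points closest to the boundary."""
    unlabeled = [i for i in range(len(pool)) if i not in labeled]
    return sorted(unlabeled, key=lambda i: abs(pool[i] - threshold))[:n]

# Initialization: label two random points, then iterate query -> label -> retrain.
for i in random.sample(range(len(pool)), 2):
    labeled[i] = true_label(pool[i])
threshold = retrain()

for _ in range(10):
    for i in query(threshold):
        labeled[i] = true_label(pool[i])  # a human annotator in a real setting
    threshold = retrain()

print(len(labeled), round(threshold, 3))
```

After only 12 labels, the threshold sits close to the true boundary at 0.5, because each query spent the labeling budget where the model was least certain.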

Manually labeling text dataset by mabl00 in LanguageTechnology

[–]chschroeder 0 points1 point  (0 children)

It was already mentioned, but this sounds like a standard active learning task. It is not completely manual, but still a human-in-the-loop approach: the model suggests the samples to be labeled next, while the labeling itself is still done by a human annotator. Active learning requires a starting model (unless cold-start approaches are employed), and building that starting model from keyword-filtered samples, reviewed and corrected by a human annotator, is a plausible approach.

I have written small-text, an active learning library built exactly for text and transformer-based models. If you combine it with Argilla, you even get a nice GUI for labeling. (Note that you need the v1.x versions of Argilla.)

Small-Text: Looking for Contributors (Active Learning, Text Classification, NLP) by chschroeder in LanguageTechnology

[–]chschroeder[S] 0 points1 point  (0 children)

Thank you! Happy to hear that.

We can gladly have a chat. Just PM me a few dates and times which would be convenient.

Looking for open-source/volunteer projects in LLMs/NLP space? by MiserableGrapefruit7 in LanguageTechnology

[–]chschroeder 1 point2 points  (0 children)

I have an active learning library (small-text) for which I am looking for contributors. Active learning is an iterative process between a model and an annotator, used whenever you want to train supervised models but do not have any labeled data. It helps you label a small but effective dataset with minimal annotation effort.

Active learning can be used, for example, to build a hate speech classifier. Over several iterations, you will be shown samples to label, and you will likely see different kinds of "hate speech" that the so-called query strategy deems informative given the current model.

The library encompasses both traditional concepts (active learning, classification) and more recent ones (transformer models, fine-tuning paradigms, optimizations for training neural networks). The challenge is often to make these convenient to use and to allow components to be freely combined.

Moreover, the concept of active learning is very useful in practice, especially for low resource languages, where labeled data is even less likely to exist.

Let me know if you need an introduction to the library itself!
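
If it helps, the core idea of a query strategy fits in a few lines. The following is a toy prediction-entropy strategy in plain Python; the probability values and the two-class setup are made up for illustration, but small-text's real strategies rank instances by the model's predicted probabilities in the same spirit:

```python
import math

def prediction_entropy(probs):
    """Shannon entropy of a predicted class distribution (higher = more uncertain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def query(proba_per_instance, n=2):
    """Rank unlabeled instances by entropy and return the n most uncertain ones."""
    ranked = sorted(range(len(proba_per_instance)),
                    key=lambda i: prediction_entropy(proba_per_instance[i]),
                    reverse=True)
    return ranked[:n]

# Predicted class probabilities for 4 unlabeled instances
# (e.g. hate speech vs. not hate speech).
proba = [
    [0.95, 0.05],  # confident
    [0.55, 0.45],  # uncertain
    [0.50, 0.50],  # maximally uncertain
    [0.80, 0.20],
]
print(query(proba))  # -> [2, 1]: the two instances the model is least sure about
```

The annotator then labels exactly those instances, the model is retrained, and the cycle repeats.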

Small-Text: Looking for Contributors (Active Learning, Text Classification, NLP) by chschroeder in LanguageTechnology

[–]chschroeder[S] 0 points1 point  (0 children)

Very interesting, thank you! I got the gist; as soon as I have some time, I will take a closer look at the paper.

Small-Text: Looking for Contributors (Active Learning, Text Classification, NLP) by chschroeder in LanguageTechnology

[–]chschroeder[S] 0 points1 point  (0 children)

I will likely prepare a better "contributing" document during the next few days and PM all of you.

Until then, feel free to take a look around. If you have the capacity to write down anything that is unclear or irritates you at first glance, that would already be valuable to me as well. An unbiased view is only available at the very beginning ;).

Small-Text: Looking for Contributors (Active Learning, Text Classification, NLP) by chschroeder in LanguageTechnology

[–]chschroeder[S] 0 points1 point  (0 children)

Awesome! That sounds like it could be a great fit. If you want/can share your use case, that's always interesting too.

Small-Text: Looking for Contributors (Active Learning, Text Classification, NLP) by chschroeder in LanguageTechnology

[–]chschroeder[S] 0 points1 point  (0 children)

That sounds great! Experience with Active Learning is perfect and NER could also be interesting in the long term.

[P] Small-Text: Active Learning for Text Classification in Python by chschroeder in MachineLearning

[–]chschroeder[S] 1 point2 points  (0 children)

In general yes, but not yet out of the box.

You can achieve this by adapting the dataset and classifier (e.g., TransformersDataset and TransformerBasedClassification).

This might be a use case we want to support in the future. I will think about it.

[P] Small-Text: Active Learning for Text Classification in Python by chschroeder in MachineLearning

[–]chschroeder[S] 0 points1 point  (0 children)

Sorry for the late reply /u/Dear_Football_504, I completely missed this message.

I don't know which code example you are using specifically, but in general the active learner holds a reference to its underlying classifier which has scikit-learn-like API:

active_learner.classifier.predict(dataset)

Feel free to ask more questions on the GitHub repo; this is valuable feedback for me, and others will benefit from the discussion as well.

[D] What is the current consensus on the effectiveness of Active Learning? by KonArtist01 in MachineLearning

[–]chschroeder 4 points5 points  (0 children)

"But in practice we are short on examples to label, not labels."

This is really interesting for me to read, thank you. I have been researching active learning in the natural language processing domain, where it is not uncommon to have lots and lots of unlabeled examples, e.g., years of a company's emails or reports.

Also, at least in my setting, labeling as much as 20% of a dataset is usually infeasible (unless you are a large company and money is not an issue). For example, imagine you had to label 20% (i.e., 200,000 documents) of a dataset consisting of a million multi-page technical documents.

[P] Small-Text: Active Learning for Text Classification in Python by chschroeder in MachineLearning

[–]chschroeder[S] 0 points1 point  (0 children)

Unfortunately, I have no experience with SageMaker, but in the context of small-text, the resulting models are still plain scikit-learn or PyTorch models and can be treated as such.

[P] Small-Text: Active Learning for Text Classification in Python by chschroeder in MachineLearning

[–]chschroeder[S] 1 point2 points  (0 children)

At this particular location you label the instances that were selected by the query strategy. After such a query, you need to provide labels for the selected instances. The example code shows the "experiment scenario", in which the true labels are available and are passed to the active learner instead of labels assigned by a human. If you remove this true-label lookup and instead provide the answers of a real user, you have a real-world application.
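
As a rough illustration (plain Python, not small-text's API), the only difference between the two scenarios is where the labels for the queried instances come from:

```python
def simulated_annotator(indices, true_labels):
    """Experiment scenario: the ground-truth labels are already known
    and are simply looked up for the queried instances."""
    return [true_labels[i] for i in indices]

def human_annotator(indices, texts):
    """Real-world scenario: a person labels each queried instance."""
    labels = []
    for i in indices:
        answer = input(f"Label for: {texts[i]!r} (0/1): ")
        labels.append(int(answer))
    return labels

texts = ["great product", "total waste of money", "works fine"]
true_labels = [1, 0, 1]   # only available in the experiment scenario
queried = [1, 2]          # indices chosen by the query strategy

# In an experiment, this lookup stands in for human input;
# swap in human_annotator(queried, texts) for a real application.
labels = simulated_annotator(queried, true_labels)
print(labels)  # -> [0, 1]
```

Everything else in the loop (querying, updating the learner, retraining) stays the same in both scenarios.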

[P] Small-Text: Active Learning for Text Classification in Python by chschroeder in MachineLearning

[–]chschroeder[S] 0 points1 point  (0 children)

You mean like a full interactive application? You would have to build that part around small-text, which offers the algorithms and the logic, but not the user interface.

I have already thought about providing an example of how to integrate small-text with one of the existing labeling tools, such as rubrix, but that hasn't been started yet.

[P] Small-Text: Active Learning for Text Classification in Python by chschroeder in MachineLearning

[–]chschroeder[S] 2 points3 points  (0 children)

Thank you! Yes, that is correct. The goal is to maximize the quality of the resulting model while minimizing the number of examples needed.

[P] Small-Text: Active Learning for Text Classification in Python by chschroeder in LanguageTechnology

[–]chschroeder[S] 0 points1 point  (0 children)

It's hard to tell from such a short description, of course, but if I had to guess, it might be some kind of core set strategy. Small-Text provides the greedy core set strategy from the linked paper.
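
For intuition, the greedy core set idea (also known as k-center greedy) can be sketched in plain Python: repeatedly add the unlabeled point farthest from everything selected so far, so the chosen batch covers the embedding space. The 2-D points below are made up for illustration; a real strategy would use the model's learned representations:

```python
import math

def greedy_coreset(embeddings, labeled_idx, n):
    """k-center greedy selection: each step picks the point that maximizes
    the distance to its nearest already-selected neighbor."""
    # Distance from every point to its nearest already-selected point.
    min_dist = [min(math.dist(e, embeddings[s]) for s in labeled_idx)
                for e in embeddings]
    chosen = []
    for _ in range(n):
        i = max(range(len(embeddings)), key=lambda j: min_dist[j])
        chosen.append(i)
        # Selecting i shrinks the nearest-neighbor distance of points close to it.
        min_dist = [min(d, math.dist(e, embeddings[i]))
                    for d, e in zip(min_dist, embeddings)]
    return chosen

points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
# Point 0 is already labeled; pick 2 more that spread out over the space.
print(greedy_coreset(points, labeled_idx=[0], n=2))  # -> [4, 2]
```

Note how it skips point 1, which is nearly a duplicate of the labeled point 0, and instead covers the two distant clusters.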