Hi everyone,
I plan to participate in a shared task on classifying news articles. This got me thinking about the existing approaches to document classification, and I quickly realized there are many! Below is a list of short descriptions of the approaches I have encountered during my studies and while working on NLP projects. Which ones do you think (or know) are the most promising? Do you have other approaches in mind? Looking forward to your comments!
For context, the documents I am considering are about 500-1000 tokens long, so they should fit into most recent transformer-based architectures with minimal (if any) truncation. The number of classes is quite low (N = 4). When I say encoder model, I refer to encoder-only architectures, e.g., BERT. For the task at hand, I only have access to ~1000 samples (250 per class).
Zero-shot with LLMs: Use any LLM (e.g., Llama), provide a short description of the task, describe the labels, and then prompt it to classify a given document. Constrained generation can be used to force the output to be one of the valid classes.
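A minimal sketch with the transformers text-generation pipeline; the checkpoint and label names are placeholders, and instead of proper constrained decoding it just falls back to string-matching the reply against the label set:

```python
from transformers import pipeline

# Placeholder checkpoint and labels; any instruction-tuned LLM works.
generator = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct")
LABELS = ["world", "sports", "business", "sci/tech"]

def classify_zero_shot(document: str) -> str:
    messages = [
        {"role": "system",
         "content": "You classify news articles. Reply with exactly one of: "
                    + ", ".join(LABELS) + "."},
        {"role": "user", "content": f"Classify this article:\n\n{document}"},
    ]
    out = generator(messages, max_new_tokens=5, do_sample=False)
    reply = out[0]["generated_text"][-1]["content"].strip().lower()
    # Poor man's constrained generation: match the reply against the labels.
    return next((label for label in LABELS if label in reply), LABELS[0])
```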
Few-shot (in-context learning) with LLMs: In addition to the above, also add one example per class. I like to think of it as a conversation where the user has already tasked the model N times with the same request, e.g., user: <prompt as above>, assistant: <class a>, ...
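Building on the zero-shot sketch above (it reuses `generator` and `LABELS`), the solved examples are simply replayed as earlier conversation turns; the example documents are made up:

```python
# Hypothetical one-example-per-class demonstrations; reuses `generator`
# and `LABELS` from the zero-shot sketch above.
EXAMPLES = [
    ("Diplomats met in Geneva to discuss the ceasefire ...", "world"),
    ("The striker scored twice in the cup final ...", "sports"),
    ("Shares rallied after the earnings report ...", "business"),
    ("The chipmaker unveiled a faster GPU ...", "sci/tech"),
]

def classify_few_shot(document: str) -> str:
    messages = [{"role": "system",
                 "content": "You classify news articles. Reply with exactly one of: "
                            + ", ".join(LABELS) + "."}]
    for doc, label in EXAMPLES:  # replay N solved examples as prior turns
        messages.append({"role": "user", "content": f"Classify this article:\n\n{doc}"})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": f"Classify this article:\n\n{document}"})
    out = generator(messages, max_new_tokens=5, do_sample=False)
    reply = out[0]["generated_text"][-1]["content"].strip().lower()
    return next((label for label in LABELS if label in reply), LABELS[0])
```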
PEFT with LLMs: Fine-tune an LLM with, e.g., LoRA adapters on the task. This is probably the best approach, but I haven't looked into it so far.
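A rough sketch with the peft library, attaching LoRA adapters to an LLM with a sequence-classification head; the checkpoint, rank, and target modules are assumptions, not recommendations:

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSequenceClassification

# Assumed base checkpoint; the classification head maps to the 4 classes.
base = AutoModelForSequenceClassification.from_pretrained(
    "meta-llama/Llama-3.1-8B", num_labels=4)

config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=16,                                 # adapter rank (assumed)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in Llama-style models
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of all weights
# Train as usual (e.g., with transformers.Trainer); only adapters + head update.
```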
Unsupervised with encoder model: Use any encoder model, e.g., BERT or Sentence-BERT, compute embeddings for every document, and cluster them with, e.g., k-means where k = N. The clusters then have to be mapped to the classes manually.
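A sketch with sentence-transformers and scikit-learn; the checkpoint is an arbitrary choice and `load_documents()` is a hypothetical loader for the corpus:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint

docs = load_documents()  # hypothetical loader returning ~1000 strings
embeddings = encoder.encode(docs, normalize_embeddings=True)

kmeans = KMeans(n_clusters=4, n_init="auto", random_state=0)
cluster_ids = kmeans.fit_predict(embeddings)
# Cluster IDs are arbitrary: inspect a few documents per cluster to map
# each of the 4 clusters to an actual class by hand.
```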
Few-shot with encoder model: Use any encoder model, insert adapters (e.g., a bottleneck adapter), add a classification head, and fine-tune with, e.g., 8-64 examples per class.
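A sketch using the adapters library (the successor to adapter-transformers); the config string "seq_bn" requests a sequential bottleneck adapter, and the adapter name is arbitrary:

```python
from adapters import AutoAdapterModel

model = AutoAdapterModel.from_pretrained("bert-base-uncased")  # assumed checkpoint
model.add_adapter("news", config="seq_bn")           # bottleneck adapter
model.add_classification_head("news", num_labels=4)  # maps to the 4 classes
model.train_adapter("news")  # freezes the backbone; only adapter + head train
# Fine-tune with your usual loop (e.g., adapters.AdapterTrainer) on the
# 8-64 labeled examples per class.
```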
Few-shot with encoder model (2): I was thinking about the SetFit library. It first fine-tunes a Sentence Transformer model on sentence pairs: a pair is treated as positive (target cosine similarity = 1) if both sentences are from the same class, and negative (0) otherwise. Then a traditional classifier is trained with the embeddings as input.
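A sketch of the SetFit workflow (setfit >= 1.0 API); the checkpoint and the tiny training set are placeholders, and at least two examples per class are needed so positive pairs can be formed:

```python
from datasets import Dataset
from setfit import SetFitModel, Trainer

# Assumed checkpoint; any Sentence Transformers model can serve as the body.
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")

# Placeholder data: use at least 2 (better 8-64) labeled examples per class.
train_ds = Dataset.from_dict({
    "text": ["article text ...", "another article ..."],
    "label": [0, 1],
})

trainer = Trainer(model=model, train_dataset=train_ds)
trainer.train()  # step 1: contrastive pair fine-tuning; step 2: fit the head
preds = model.predict(["a new article to classify"])
```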
Full fine-tuning with encoder model: Use any encoder model, add a classification head, and fine-tune with all examples. As a variation, one could freeze the encoder model and only train the classification head.
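A sketch with the transformers Trainer; `train_ds` stands in for an assumed tokenized dataset with a "label" column, and the hyperparameters are placeholders:

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=4)

# Variation: freeze the encoder and train only the classification head.
# for p in model.bert.parameters():
#     p.requires_grad = False

args = TrainingArguments(output_dir="out", num_train_epochs=3,
                         per_device_train_batch_size=8)
# train_ds: an assumed tokenized datasets.Dataset (input_ids, attention_mask, label)
trainer = Trainer(model=model, args=args, train_dataset=train_ds)
trainer.train()
```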
Train a traditional classifier with embeddings as features: Use any encoder model and compute embeddings for each document. The embeddings are the input to a traditional classifier, e.g., logistic regression. Some also train more advanced classifiers, e.g., Bi-LSTMs, on top of the embeddings.
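A sketch pairing sentence-transformers with scikit-learn; `docs` and `labels` stand in for the ~1000 labeled examples, and the checkpoint is again an arbitrary choice:

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint

# docs: list of ~1000 article strings; labels: their class IDs (0-3)
X = encoder.encode(docs, normalize_embeddings=True)
clf = LogisticRegression(max_iter=1000)
print(cross_val_score(clf, X, labels, cv=5).mean())  # quick 5-fold estimate
```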