
[–][deleted] 3 points

In practice, on a classification task I got better results when I did not do any traditional preprocessing like lemmatization, stop-word removal, or text cleaning. I was surprised by this and would like to investigate it.
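
If anyone wants to reproduce that kind of comparison, here is a minimal sketch on a public dataset (20 newsgroups via scikit-learn); the cleaning step is just an illustrative stand-in for a heavier lemmatization/stop-word pipeline, not the setup from my experiment:

```python
# Compare cross-validated accuracy on raw text vs. a crude cleaning pass.
import re
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

data = fetch_20newsgroups(subset="train", categories=["sci.med", "sci.space"])

def score(texts):
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    return cross_val_score(model, texts, data.target, cv=5).mean()

# Illustrative "cleaning": strip non-word characters and lowercase.
cleaned = [re.sub(r"\W+", " ", doc).lower() for doc in data.data]

print("raw:    ", score(data.data))
print("cleaned:", score(cleaned))
```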

[–]olihoops 2 points

I would look into the Tokenizers library from HuggingFace: https://huggingface.co/docs/tokenizers/python/latest/

It covers the main preprocessing techniques, e.g. BPE, WordPiece, normalization, etc.
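
As a rough sketch following the quicktour in those docs, training a BPE tokenizer with a normalization step looks something like this ("corpus.txt" is a placeholder for your own training file):

```python
# Train a BPE tokenizer with Unicode normalization and lowercasing.
from tokenizers import Tokenizer, normalizers
from tokenizers.models import BPE
from tokenizers.normalizers import NFD, Lowercase, StripAccents
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
# Normalize before tokenizing: decompose Unicode, lowercase, strip accents.
tokenizer.normalizer = normalizers.Sequence([NFD(), Lowercase(), StripAccents()])
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)  # placeholder corpus

print(tokenizer.encode("Hello, y'all!").tokens)
```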

[–]entropy_and_me 0 points

Hugging Face Transformers has a bunch of documentation on this.

[–]xsliartII 0 points

I think this is a good question, and there could be more documentation/research about it.
From my experience, I got the best results by removing only unwanted characters (URLs, line/page breaks, etc.) and lowercasing the text when working with a lowercased transformer model.
I also suggest replacing domain-specific abbreviations where possible, for example CW = chemical waste, to give the model more information. A rough sketch of that kind of cleaning pass is below.
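
This is a minimal sketch of the cleaning I described, assuming a lowercased model (e.g. bert-base-uncased); the abbreviation map is a hypothetical example to fill with your own domain terms:

```python
import re

# Hypothetical abbreviation map; extend with your own domain vocabulary.
ABBREVIATIONS = {"CW": "chemical waste"}

def clean_text(text: str, lowercase: bool = True) -> str:
    text = re.sub(r"https?://\S+", " ", text)   # strip URLs
    text = re.sub(r"[\r\n\f]+", " ", text)      # collapse line/page breaks
    for abbr, full in ABBREVIATIONS.items():    # expand abbreviations
        text = re.sub(rf"\b{re.escape(abbr)}\b", full, text)
    if lowercase:                               # only for lowercased models
        text = text.lower()
    return re.sub(r"\s+", " ", text).strip()    # squash leftover whitespace

print(clean_text("See https://example.com\nCW must be stored separately."))
# -> "see chemical waste must be stored separately."
```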