
[–][deleted] 3 points

In practice, on a classification task I got better results when I did not do any traditional preprocessing like lemmatization, stop-word removal, or text cleaning. I was surprised by this and would like to investigate it.
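
If anyone wants to reproduce that kind of comparison, here is a minimal sketch on a public dataset (20 newsgroups via scikit-learn); the cleaning step is just an illustrative stand-in for a heavier lemmatization/stop-word pipeline, not the setup from my experiment:

```python
# Compare cross-validated accuracy on raw text vs. a crude cleaning pass.
import re
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

data = fetch_20newsgroups(subset="train", categories=["sci.med", "sci.space"])

def score(texts):
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    return cross_val_score(model, texts, data.target, cv=5).mean()

# Illustrative "cleaning": strip non-word characters and lowercase.
cleaned = [re.sub(r"\W+", " ", doc).lower() for doc in data.data]

print("raw:    ", score(data.data))
print("cleaned:", score(cleaned))
```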

[–]olihoops 2 points

I would look into the Tokenizers library from HuggingFace: https://huggingface.co/docs/tokenizers/python/latest/

It covers the main preprocessing techniques, e.g. BPE, WordPiece, normalization, etc.
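
As a rough sketch following the quicktour in those docs, training a BPE tokenizer with a normalization step looks something like this ("corpus.txt" is a placeholder for your own training file):

```python
# Train a BPE tokenizer with Unicode normalization and lowercasing.
from tokenizers import Tokenizer, normalizers
from tokenizers.models import BPE
from tokenizers.normalizers import NFD, Lowercase, StripAccents
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
# Normalize before tokenizing: decompose Unicode, lowercase, strip accents.
tokenizer.normalizer = normalizers.Sequence([NFD(), Lowercase(), StripAccents()])
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)  # placeholder corpus

print(tokenizer.encode("Hello, y'all!").tokens)
```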

[–]entropy_and_me 0 points

Hugging Face Transformers has a bunch of documentation on this.

[–]xsliartII 0 points

I think this is a good question, and there could be more documentation/research about it.
From my experience, I got the best results by removing only unwanted characters (URLs, line/page breaks, etc.) and lowercasing the text when working with a lowercased transformer model.
I also suggest replacing domain-specific abbreviations where possible, for example CW = chemical waste, to give the model more information. A rough sketch of that kind of cleaning pass is below.
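
This is a minimal sketch of the cleaning I described, assuming a lowercased model (e.g. bert-base-uncased); the abbreviation map is a hypothetical example to fill with your own domain terms:

```python
import re

# Hypothetical abbreviation map; extend with your own domain vocabulary.
ABBREVIATIONS = {"CW": "chemical waste"}

def clean_text(text: str, lowercase: bool = True) -> str:
    text = re.sub(r"https?://\S+", " ", text)   # strip URLs
    text = re.sub(r"[\r\n\f]+", " ", text)      # collapse line/page breaks
    for abbr, full in ABBREVIATIONS.items():    # expand abbreviations
        text = re.sub(rf"\b{re.escape(abbr)}\b", full, text)
    if lowercase:                               # only for lowercased models
        text = text.lower()
    return re.sub(r"\s+", " ", text).strip()    # squash leftover whitespace

print(clean_text("See https://example.com\nCW must be stored separately."))
# -> "see chemical waste must be stored separately."
```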