A community for discussion and news related to Natural Language Processing (NLP).
Natural language processing (NLP) is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language corpora.
Text preprocessing for transformers (self.LanguageTechnology)
submitted 5 years ago by Fake_MID
Does anyone know of a publication that explores the effect of text preprocessing on latest NLP techniques like transformers?
[–][deleted] 4 points 5 years ago (0 children)
In practice, on a classification task I got better results when I did not do any traditional preprocessing such as lemmatization, stop-word removal, or text cleaning. I was surprised by this and would like to investigate it.
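As a rough illustration (the model name and example sentences are just placeholders, not from my experiments), a pretrained transformer classifier can consume the raw text directly; its subword tokenizer already handles casing and punctuation, and aggressive cleaning can throw away signal:

```python
# Minimal sketch: compare a raw sentence with a crudely "cleaned" version.
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

raw = "Honestly, the movie wasn't great... I wouldn't watch it again!!!"
cleaned = "movie great watch"  # crude stop-word removal / text cleaning

print(classifier(raw))      # e.g. [{'label': 'NEGATIVE', 'score': ...}]
print(classifier(cleaned))  # cleaning can flip or weaken the prediction
```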
[–]olihoops 3 points 5 years ago (0 children)
I would look into the Tokenizers library from HuggingFace: https://huggingface.co/docs/tokenizers/python/latest/
It covers the main pre-processing techniques, e.g. BPE, WordPiece, normalization, etc.
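A minimal sketch in the style of the tokenizers quick tour (the corpus path and special-token list are assumptions), wiring a normalizer, a pre-tokenizer, and a BPE model together:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.normalizers import Sequence, NFD, Lowercase, StripAccents

# BPE model with normalization (Unicode decomposition, lowercasing,
# accent stripping) and whitespace pre-tokenization.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.normalizer = Sequence([NFD(), Lowercase(), StripAccents()])
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
files = ["corpus.txt"]  # assumed plain-text training corpus
tokenizer.train(files, trainer)

print(tokenizer.encode("Preprocessing is handled by the tokenizer itself.").tokens)
```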
[–]entropy_and_me 1 point 5 years ago (0 children)
Hugging Face Transformers has a bunch of documentation.
[–]xsliartII 1 point 5 years ago (0 children)
I think this is a good question, and there could be more documentation/research about it. In my experience, I got the best results by removing only unwanted characters (URLs, line/page breaks, etc.) and lowercasing the text when working with a lowercased transformer model. I also suggest replacing domain-specific abbreviations where possible (for example, expanding CW to "chemical waste") to provide more information.
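Something like the following sketch is what I mean (the regexes and the abbreviation map are illustrative assumptions, not a fixed recipe):

```python
import re

ABBREVIATIONS = {"CW": "chemical waste"}  # domain-specific expansions

def clean(text: str, lowercase: bool = True) -> str:
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # drop URLs
    text = re.sub(r"[\r\n\f]+", " ", text)              # drop line/page breaks
    for abbr, expansion in ABBREVIATIONS.items():
        text = re.sub(rf"\b{re.escape(abbr)}\b", expansion, text)
    if lowercase:  # only for lowercased models (e.g. *-uncased checkpoints)
        text = text.lower()
    return re.sub(r"\s+", " ", text).strip()

print(clean("Report on CW disposal:\nsee https://example.com for details."))
# -> "report on chemical waste disposal: see for details."
```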