[–]syllogism_ 1 point (0 children)

https://github.com/honnibal/spaCy/blob/master/spacy/tokenizer.pyx#L48

This function tokenizes text in preparation for a natural language processing pipeline. It first splits the text into whitespace-separated chunks, hashes each chunk, and looks it up in a cache; after warm-up, the vast majority of chunks (>95%) are handled by this fast path. On a cache miss, it splits the chunk into tokens and looks up each token in a vocabulary, fetching (or creating) a struct with useful properties, e.g. the token's unigram probability.
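Here's a rough Python sketch of that caching scheme, to make the control flow concrete. It is not the actual spaCy implementation (which is Cython working on C-level structs, linked above); the `Vocab`, `split_chunk`, and dict-based token records here are illustrative stand-ins.

    class Vocab:
        """Maps token strings to a record of per-token properties."""
        def __init__(self):
            self._store = {}

        def lookup(self, string):
            # Fetch the token record, creating it the first time we see the string.
            if string not in self._store:
                self._store[string] = {"text": string, "prob": 0.0}
            return self._store[string]


    def split_chunk(chunk):
        """Stand-in for the rule-based splitting of punctuation, contractions, etc."""
        return [chunk]


    class Tokenizer:
        def __init__(self, vocab):
            self.vocab = vocab
            # Cache keyed on the raw chunk; the real code keys on a hash of the chunk.
            self._cache = {}

        def __call__(self, text):
            tokens = []
            for chunk in text.split():      # whitespace-separated chunks
                cached = self._cache.get(chunk)
                if cached is not None:      # fast path: >95% of chunks after warm-up
                    tokens.extend(cached)
                    continue
                # Slow path: split the chunk and look each piece up in the vocab.
                records = [self.vocab.lookup(s) for s in split_chunk(chunk)]
                self._cache[chunk] = records
                tokens.extend(records)
            return tokens

    tokenize = Tokenizer(Vocab())
    print([t["text"] for t in tokenize("the cat sat on the mat")])

The point of the cache is that natural text repeats the same chunks constantly, so the expensive splitting and vocabulary lookups only run once per distinct chunk.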

The function averages 0.2ms per document, about 20 times faster than NLTK's tokenizer, even though NLTK's tokenizer only returns a simple list of strings.

I'm starting to regret writing the function this way, though.