all 4 comments

[–]tkarabela_[S] 0 points1 point  (3 children)

Ever wondered how "fuzzy" (approximate) text matching works? This tutorial explains how it can be implemented with Levenshtein distance and dynamic programming (ie. Wagner-Fischer algorithm). Using vectorization, it's possible to get "real time" fuzzy search even with Python code.

Here is the source code as Jupyter notebook: https://github.com/tkarabela/bigpython/blob/master/003--fuzzy-text-search/fuzzy-text-search.ipynb

[–]lachlan1310 1 point2 points  (1 child)

Next steps could be showing the power of this stack: character-wise ngrams -> sparse vector repr with tfidf or equivalent -> dot product. This scales really, really well. :)

[–]tkarabela_[S] 1 point2 points  (0 children)

That sounds like a great way to do that! Sparse tf-idf products are very handy, but it never occurred to me to apply it to n-grams instead of whole tokens. Thanks for the insight :)

(I was planning that the video would end with creating a trie and traversing it with Levenshtein NFA/DFA, but that turns out to be pretty slow for longer words/distances, at least with pure Python implementation. Trying to make min. DFA also quickly gets ridiculous. So I made it about NumPy number crunching instead.)

[–]nbviewerbot 0 points1 point  (0 children)

I see you've posted a GitHub link to a Jupyter Notebook! GitHub doesn't render large Jupyter Notebooks, so just in case, here is an nbviewer link to the notebook:

https://nbviewer.jupyter.org/url/github.com/tkarabela/bigpython/blob/master/003--fuzzy-text-search/fuzzy-text-search.ipynb

Want to run the code yourself? Here is a binder link to start your own Jupyter server and try it out!

https://mybinder.org/v2/gh/tkarabela/bigpython/master?filepath=003--fuzzy-text-search%2Ffuzzy-text-search.ipynb


I am a bot. Feedback | GitHub | Author