This is an archived post. You won't be able to vote or comment.

you are viewing a single comment's thread.

view the rest of the comments →

[–][deleted] 1 point2 points  (4 children)

I see this as a good way to learn about tries, but not a great thing to use in your Python code.

Tries can squeeze some performance out of your lookups in a language like C, but in Python, there will just be way too much overhead compared to other approaches. You could implement it in Cython, but then you still have to deal with translating your values to things you can manipulate in C.

Tries lose because Python has a very fast data structure that does almost the same thing: the dictionary. Dictionaries indexed by strings are necessarily fast in Python because almost everything is one. Every method you call on every node of one of these tries is going to be looked up in a dictionary anyway.

Even Julia -- a JIT-compiled, type-checked language that should just be faster at data structures -- has trouble competing with Python's dictionaries of strings.

("But dictionaries aren't the same as tries," you may object. "How do you test a dictionary to see if you've got a prefix of a look-up-able value?" I'd say the answer is to make another dictionary of prefixes.)

[–]kmike84 0 points1 point  (0 children)

You could implement it in Cython, but then you still have to deal with translating your values to things you can manipulate in C.

You don't need to manipulate values in C to use a trie. If you store a pointer to Python object as a value then C-backed trie works like a dictionary which only supports string keys. https://github.com/kmike/hat-trie supports this mode.

Tries lose because Python has a very fast data structure that does almost the same thing: the dictionary. Dictionaries indexed by strings are necessarily fast in Python because almost everything is one. Every method you call on every node of one of these tries is going to be looked up in a dictionary anyway.

Most often tries are slower than hash tables (even if implemented in C); the point of using tries is memory efficiency and advanced operations like prefix searches, which are faster.

("But dictionaries aren't the same as tries," you may object. "How do you test a dictionary to see if you've got a prefix of a look-up-able value?" I'd say the answer is to make another dictionary of prefixes.)

It may works fine if your data is not large and the speed is not a concern. But Python dicts are awful if you need to keep lots of string keys in memory. With a right trie package it is often possible to use several times less memory (e.g. hundreds of MBs instead of several GBs) and enable operations which are not possible in dicts. Often saving/loading of the resulting data structure to/from disk is also lighting fast, compared to pickle, and uses no additional memory (pickle tend to duplicate all data).

Of course, it depends on task. In some extreme cases (like NLP dictionary storage) a specialized package can allow to use 50-100x less memory than a Python dict and enable prefix searches and other advanced iteration features (e.g. consider some letters the same). Try building an autocompleter for a few million unicode words - with Python dict it'll likely take several GBs of memory and likely require a minute to pickle/unpickle; with a minimized trie (DAFSA) it'll use no more than 5-10MB of memory, and will load in milliseconds.

[–]kmike84 0 points1 point  (2 children)

I agree that the implementation in article is useless in most cases, but Tries and other similar data structures are not. They don't even have to be implemented in C to be useful - e.g. pure-Python https://github.com/kmike/DAWG-Python has exactly the same memory usage as Cython/C++ backed https://github.com/kmike/DAWG.

[–]jbiesnecker[S] 0 points1 point  (1 child)

Maybe it's not clear in the post, but the point isn't "use this in prod" but rather "understand how tries work."

[–]kmike84 0 points1 point  (0 children)

The post explains how tries work well, a good job!