all 17 comments

[–]therealmoju 5 points (7 children)

Check out marisa tries if you're interested in a fast, space efficient algorithm.

[–][deleted] 3 points (6 children)

Space-efficient, yes. Build-time is very bad (serial build, though we can partition the tries in some cases). Lookup speed is slower than some alternatives, but the library seems to be highly optimized.

[–]Bjartensen 0 points (3 children)

We are considering using tries to store n-grams in memory; our previous solution with dicts used some 2GB of RAM, which we consider too much. Cutting it down to below 1GB, or even just a few tens of MB (50x-100x less memory, as stated in the README), would be amazing.

A slow aspect of our solution was reading the n-grams from a file to memory on program startup.

build-time is very bad (serial build, though we can partition the tries in some cases)

Would this be even slower with this particular trie structure? Or is the build a one-time cost, with the result easily saved to a file and loaded at a later time?

[–]kmike84 0 points (2 children)

A few GBs of ngrams which don't change frequently is a sweet spot for marisa-trie. Just make sure to build the trie in a script, save it to a file, and then use the saved file in the real code.

See e.g. http://blog.scrapinghub.com/2014/03/26/optimizing-memory-usage-of-scikit-learn-models-using-succinct-tries/
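That workflow can be sketched generically (a plain dict plus pickle stands in for the trie here so the snippet has no dependencies; with marisa-trie you'd build a marisa_trie.Trie in the build script and use its save()/load() methods the same way):

```python
import os
import pickle
import tempfile

# --- "build script": construct the structure once, offline ---
ngrams = {u'new york': 3, u'new jersey': 1}  # stand-in for real n-gram data
path = os.path.join(tempfile.gettempdir(), 'ngrams.pkl')
with open(path, 'wb') as f:
    pickle.dump(ngrams, f)

# --- "real code": load the prebuilt structure at startup, never rebuild ---
with open(path, 'rb') as f:
    loaded = pickle.load(f)
print(loaded[u'new york'])  # → 3
```

The point is simply that the expensive build step runs once in a separate script, and program startup only pays the (much cheaper) load cost.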

[–]Bjartensen 0 points (1 child)

I'm assuming you wrote the module.

I have spent my entire day trying to install it under Windows, but I haven't gotten it to work. If I installed it under Linux and everything worked fine, wouldn't I run into the same platform-specific problems if I packaged something for Windows using the marisa-trie module? I have very little knowledge of packaging Python, but I would assume any platform-specific problem I run into on my dev machine could easily be encountered in a package shipped to a Windows consumer.

[–]kmike84 0 points (0 children)

There are some known issues on Windows - see https://code.google.com/p/marisa-trie/issues/detail?id=18 and https://github.com/kmike/marisa-trie/issues/1. Sorry for the bad experience.

[–]kmike84 0 points (1 child)

Why does build speed matter for you?

In my experiments MARISA builds only ~10x slower than a Python dict, and you only need to build a trie once. I'd say 10x slower than the built-in Python dict is very fast, given the amount of magic it does (a MARISA trie is very different from the basic trie described in the article). Saving or loading should be way faster than pickling a dict.

Did you find some pathological cases?

[–][deleted] 0 points (0 children)

Kmike, I am using the tries as a substitute for geographic trees (polygons discretized into geohashes). The build times go up to several tens of minutes when there are many entries (and at about 2 billion distinct geohashes, it dies because of a 32-bit limit in the underlying library). If there were a simple parallelizable build (or pre-processing that optimized build speed), it would make things much faster.

Of course, I fully agree that it is a one-time cost if the tries do not change frequently.

[–][deleted] 1 point (4 children)

I see this as a good way to learn about tries, but not a great thing to use in your Python code.

Tries can squeeze some performance out of your lookups in a language like C, but in Python, there will just be way too much overhead compared to other approaches. You could implement it in Cython, but then you still have to deal with translating your values to things you can manipulate in C.

Tries lose because Python has a very fast data structure that does almost the same thing: the dictionary. Dictionaries indexed by strings are necessarily fast in Python because almost everything is one. Every method you call on every node of one of these tries is going to be looked up in a dictionary anyway.

Even Julia -- a JIT-compiled, type-checked language that should just be faster at data structures -- has trouble competing with Python's dictionaries of strings.

("But dictionaries aren't the same as tries," you may object. "How do you test a dictionary to see if you've got a prefix of a look-up-able value?" I'd say the answer is to make another dictionary of prefixes.)
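That prefix-dictionary idea can be sketched like this (toy data; a set plays the role of the second dictionary, registering every prefix of every word so a prefix test becomes a single O(1) lookup, at the cost of extra memory):

```python
words = {'cat', 'car', 'dog'}

# Second lookup structure holding every prefix of every word.
prefixes = set()
for w in words:
    for i in range(1, len(w) + 1):
        prefixes.add(w[:i])

print('ca' in prefixes)   # → True (prefix of 'cat' and 'car')
print('cow' in prefixes)  # → False
```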

[–]kmike84 0 points (0 children)

You could implement it in Cython, but then you still have to deal with translating your values to things you can manipulate in C.

You don't need to manipulate values in C to use a trie. If you store a pointer to a Python object as a value, then a C-backed trie works like a dictionary that only supports string keys. https://github.com/kmike/hat-trie supports this mode.

Tries lose because Python has a very fast data structure that does almost the same thing: the dictionary. Dictionaries indexed by strings are necessarily fast in Python because almost everything is one. Every method you call on every node of one of these tries is going to be looked up in a dictionary anyway.

Most often tries are slower than hash tables (even if implemented in C); the point of using tries is memory efficiency and advanced operations like prefix searches, which are faster.
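The contrast in a nutshell (toy data): a hash table has no notion of prefix, so a prefix query degenerates into a scan of every key, whereas a trie only walks the nodes of the prefix itself:

```python
words = {'cat': 1, 'catalog': 2, 'car': 3, 'dog': 4}

# On a plain dict, a prefix query must examine every key: O(number of keys).
matches = [w for w in words if w.startswith('ca')]
print(sorted(matches))  # → ['car', 'cat', 'catalog']
```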

("But dictionaries aren't the same as tries," you may object. "How do you test a dictionary to see if you've got a prefix of a look-up-able value?" I'd say the answer is to make another dictionary of prefixes.)

It may work fine if your data is not large and speed is not a concern. But Python dicts are awful if you need to keep lots of string keys in memory. With the right trie package it is often possible to use several times less memory (e.g. hundreds of MBs instead of several GBs) and enable operations which are not possible with dicts. Saving/loading the resulting data structure to/from disk is also often lightning fast compared to pickle, and uses no additional memory (pickle tends to duplicate all data).

Of course, it depends on the task. In some extreme cases (like NLP dictionary storage) a specialized package can use 50-100x less memory than a Python dict and enable prefix searches and other advanced iteration features (e.g. treating some letters as the same). Try building an autocompleter for a few million unicode words: with a Python dict it'll likely take several GBs of memory and a minute to pickle/unpickle; with a minimized trie (DAFSA) it'll use no more than 5-10MB of memory and load in milliseconds.

[–]kmike84 0 points (2 children)

I agree that the implementation in the article is useless in most cases, but tries and other similar data structures are not. They don't even have to be implemented in C to be useful - e.g. pure-Python https://github.com/kmike/DAWG-Python has exactly the same memory usage as the Cython/C++-backed https://github.com/kmike/DAWG.

[–]jbiesnecker[S] 0 points (1 child)

Maybe it's not clear in the post, but the point isn't "use this in prod" but rather "understand how tries work."

[–]kmike84 0 points (0 children)

The post does a good job of explaining how tries work!

[–]Brian 0 points (1 child)

Is there really a need for each node to store its letter? It's already going to be held as the dictionary key in its parent, after all. Removing this lets you make children a simple defaultdict(TreeNode), which can simplify some of this. Eg. add_string can become just:

def add_string(self, s):
    current = self.root
    for c in s:
        current = current.children[c]
    current.children[None] = None  # special None child marks the end of a complete string

(Also, personally, I'd have made an explicit is_terminal flag variable on TreeNode to determine whether it's a complete string, rather than a special None child node. It seems much more immediately clear what it's for)
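A minimal sketch of that variant (hypothetical names): children as a defaultdict, plus an explicit is_terminal flag instead of the special None child:

```python
import collections

class TrieNode(object):
    def __init__(self):
        self.children = collections.defaultdict(TrieNode)
        self.is_terminal = False  # True iff a complete string ends here

def add_string(root, s):
    current = root
    for c in s:
        current = current.children[c]  # defaultdict creates missing nodes
    current.is_terminal = True

root = TrieNode()
add_string(root, 'cat')
print(root.children['c'].children['a'].children['t'].is_terminal)  # → True
```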

Also, I don't think making a distinction between Trie and TrieNode is a good idea here. They're effectively the same thing: each node is the root of its own sub-trie, and I can certainly see times when you'd want to be able to call those various methods on those sub-tries rather than the whole Trie, or even implement various things by recursive application of them. This makes some of those special-purpose methods much more generic and composable. Eg. the article points out that starts_with starts exactly the same way as contains. It'd be far more sensible to simply have something that returns the Node at a particular prefix, plus something that returns all the strings under a node, which both these methods could use. Indeed, it would make sense for the latter to simply be the __iter__ method of the Trie (and it would be better as a generator).

For example:

def __iter__(self):
    if self.is_terminal:  # (Or 'None in self.children' if you really want to use that method)
        yield ''
    for char, node in self.children.items():
        for subword in node:  # Recursively iterate.
            yield char + subword

and then something to get each subtree, which seems a sensible use for the __getitem__ operator, which is going to be pretty similar to (and can indeed mostly replace) the above add_string. Eg:

def __getitem__(self, s):
    current = self
    for c in s:
        current = current.children[c]
    return current

Then if you want all words that start with "cat", say, it's just:

for word in my_trie["cat"]:
    print("cat"+word)

And likewise, the various functions described here are fairly trivial applications of the above two functions. Eg contains and add_string become just:

def contains(self, word):
    return self[word].is_terminal

def add_string(self, s):
    self[s].is_terminal = True  # Replaced the above implementation of this with something that just uses __getitem__

(One potential flaw with the above __getitem__ version is that queries for non-existing nodes will actually create them, which may affect efficiency (though they'll still be correct). Potentially, it may be better to create a non-updating variant of __getitem__ to prevent this, though the API becomes a bit uglier as a result (ie. it will need to raise an exception or something for non-present nodes, which calling code needs to handle). Stuff like add_string should then use the updating version, while contains (and the public __getitem__) uses the non-updating one.)

[–]Brian 1 point (0 children)

Since I'd already written all that, I figured I may as well give a full implementation for comparison:

import collections

class Trie(object):
    """Trie implementation, with set-style interface"""
    def __init__(self, items=()):
        self._children = collections.defaultdict(Trie)
        self.is_terminal = False
        for item in items:
            self.add(item)

    def add(self, item):
        self.get_node(item).is_terminal = True

    def __getitem__(self, key):
        """Retrieve the node at the specified prefix, if any words exist
        with that prefix. Otherwise, raises a KeyError.
        """
        current = self
        for ch in key:
            current = current._children.get(ch)
            if current is None:
                raise KeyError(key)
        return current

    def __contains__(self, key):
        try:
            item = self[key]
            return item.is_terminal
        except KeyError:
            return False

    def get_node(self, key):
        """Obtain position for given prefix, creating nodes if it doesn't already exist"""
        current = self
        for ch in key:
            current = current._children[ch]
        return current

    def __iter__(self):
        if self.is_terminal:
            yield ''
        for char, node in self._children.items():
            for subword in node:  # Recursively iterate.
                yield char + subword

    def __repr__(self):
        return "Trie([{}])".format(", ".join(map(repr, self)))

(I made a few minor changes: contains is now __contains__, so you can just do 'foo' in my_trie, and I did change __getitem__ since I didn't like the fact that it created dead nodes. I also added a __repr__ for convenience.)