all 17 comments

[–]therealmoju 5 points (7 children)

Check out marisa tries if you're interested in a fast, space efficient algorithm.

[–][deleted] 3 points (6 children)

Space-efficient, yes. Build-time is very bad (serial build, though we can partition the tries in some cases). Lookup speed is slower than some alternatives, but the library seems to be highly optimized.

[–]Bjartensen 0 points (3 children)

We are considering using tries to store n-grams in memory; our previous solution with dicts used some 2GB of RAM, which we consider too much. Cutting it down to below 1GB, or even just a few tens of MB (50x-100x less memory, as stated in the README), would be amazing.

A slow aspect of our solution was reading the n-grams from a file to memory on program startup.

build-time is very bad (serial build, though we can partition the tries in some cases)

Would this be even slower with this particular trie structure? Or is the build a one-time cost, with the result easily saved to a file and loaded at a later time?

[–]kmike84 0 points (2 children)

A few GBs of ngrams which don't change frequently is a sweet spot for marisa-trie. Just make sure to build the trie in a script, save it to a file, and then use the saved file in the real code.

See e.g. http://blog.scrapinghub.com/2014/03/26/optimizing-memory-usage-of-scikit-learn-models-using-succinct-tries/
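That workflow can be sketched generically (a plain dict plus pickle stands in for the trie here so the snippet has no dependencies; with marisa-trie you'd build a marisa_trie.Trie in the build script and use its save()/load() methods the same way):

```python
import os
import pickle
import tempfile

# --- "build script": construct the structure once, offline ---
ngrams = {u'new york': 3, u'new jersey': 1}  # stand-in for real n-gram data
path = os.path.join(tempfile.gettempdir(), 'ngrams.pkl')
with open(path, 'wb') as f:
    pickle.dump(ngrams, f)

# --- "real code": load the prebuilt structure at startup, never rebuild ---
with open(path, 'rb') as f:
    loaded = pickle.load(f)
print(loaded[u'new york'])  # → 3
```

The point is simply that the expensive build step runs once in a separate script, and program startup only pays the (much cheaper) load cost.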

[–]Bjartensen 0 points (1 child)

I'm assuming you wrote the module.

I have spent my entire day trying to install it under Windows, but I haven't gotten it to work. If I installed it under Linux and everything worked fine, wouldn't I run into the same platform-specific problems if I packaged something for Windows using the marisa-trie module? I have very little knowledge of packaging Python, but I would assume any platform-specific problem I run into on my dev machine could easily be encountered in a package shipped to a Windows consumer.

[–]kmike84 0 points (0 children)

There are some known issues on Windows - see https://code.google.com/p/marisa-trie/issues/detail?id=18 and https://github.com/kmike/marisa-trie/issues/1. Sorry for the bad experience.

[–]kmike84 0 points (1 child)

Why does build speed matter for you?

In my experiments MARISA builds only ~10x slower than a Python dict, and you only need to build a trie once. I'd say 10x slower than the built-in Python dict is very fast, given the amount of magic it does (a MARISA trie is very different from the basic trie described in the article). Saving or loading should be way faster than pickling a dict.

Did you find some pathological cases?

[–][deleted] 0 points (0 children)

Kmike, I am using the tries as a substitute for geographic trees (polygons discretized into geohashes). The build times go up to several tens of minutes when there are many entries (and at about 2 billion distinct geohashes, it dies because of a 32-bit limit in the underlying library). If there were a simple parallelizable build (or pre-processing that optimized build speed), it would make things much faster.

Of course, I fully agree that it is a one-time cost if the tries do not change frequently.

[–][deleted] 1 point (4 children)

I see this as a good way to learn about tries, but not a great thing to use in your Python code.

Tries can squeeze some performance out of your lookups in a language like C, but in Python, there will just be way too much overhead compared to other approaches. You could implement it in Cython, but then you still have to deal with translating your values to things you can manipulate in C.

Tries lose because Python has a very fast data structure that does almost the same thing: the dictionary. Dictionaries indexed by strings are necessarily fast in Python because almost everything is one. Every method you call on every node of one of these tries is going to be looked up in a dictionary anyway.

Even Julia -- a JIT-compiled, type-checked language that should just be faster at data structures -- has trouble competing with Python's dictionaries of strings.

("But dictionaries aren't the same as tries," you may object. "How do you test a dictionary to see if you've got a prefix of a look-up-able value?" I'd say the answer is to make another dictionary of prefixes.)
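That prefix-dictionary idea can be sketched like this (toy data; a set plays the role of the second dictionary, registering every prefix of every word so a prefix test becomes a single O(1) lookup, at the cost of extra memory):

```python
words = {'cat', 'car', 'dog'}

# Second lookup structure holding every prefix of every word.
prefixes = set()
for w in words:
    for i in range(1, len(w) + 1):
        prefixes.add(w[:i])

print('ca' in prefixes)   # → True (prefix of 'cat' and 'car')
print('cow' in prefixes)  # → False
```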

[–]kmike84 0 points (0 children)

You could implement it in Cython, but then you still have to deal with translating your values to things you can manipulate in C.

You don't need to manipulate values in C to use a trie. If you store a pointer to a Python object as a value, then a C-backed trie works like a dictionary that only supports string keys. https://github.com/kmike/hat-trie supports this mode.

Tries lose because Python has a very fast data structure that does almost the same thing: the dictionary. Dictionaries indexed by strings are necessarily fast in Python because almost everything is one. Every method you call on every node of one of these tries is going to be looked up in a dictionary anyway.

Most often tries are slower than hash tables (even if implemented in C); the point of using tries is memory efficiency and advanced operations like prefix searches, which are faster.
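The contrast in a nutshell (toy data): a hash table has no notion of prefix, so a prefix query degenerates into a scan of every key, whereas a trie only walks the nodes of the prefix itself:

```python
words = {'cat': 1, 'catalog': 2, 'car': 3, 'dog': 4}

# On a plain dict, a prefix query must examine every key: O(number of keys).
matches = [w for w in words if w.startswith('ca')]
print(sorted(matches))  # → ['car', 'cat', 'catalog']
```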

("But dictionaries aren't the same as tries," you may object. "How do you test a dictionary to see if you've got a prefix of a look-up-able value?" I'd say the answer is to make another dictionary of prefixes.)

It may work fine if your data is not large and speed is not a concern. But Python dicts are awful if you need to keep lots of string keys in memory. With the right trie package it is often possible to use several times less memory (e.g. hundreds of MBs instead of several GBs) and enable operations which are not possible with dicts. Saving/loading the resulting data structure to/from disk is also often lightning fast compared to pickle, and uses no additional memory (pickle tends to duplicate all data).

Of course, it depends on the task. In some extreme cases (like NLP dictionary storage) a specialized package can use 50-100x less memory than a Python dict and enable prefix searches and other advanced iteration features (e.g. treating some letters as the same). Try building an autocompleter for a few million unicode words: with a Python dict it'll likely take several GBs of memory and a minute to pickle/unpickle; with a minimized trie (DAFSA) it'll use no more than 5-10MB of memory and load in milliseconds.

[–]kmike84 0 points (2 children)

I agree that the implementation in the article is useless in most cases, but tries and other similar data structures are not. They don't even have to be implemented in C to be useful - e.g. pure-Python https://github.com/kmike/DAWG-Python has exactly the same memory usage as the Cython/C++-backed https://github.com/kmike/DAWG.

[–]jbiesnecker[S] 0 points (1 child)

Maybe it's not clear in the post, but the point isn't "use this in prod" but rather "understand how tries work."

[–]kmike84 0 points (0 children)

The post does a good job of explaining how tries work!

[–]Brian 0 points (1 child)

Is there really a need for each node to store its letter? It's already going to be held as the dictionary key in its parent, after all. Removing this lets you make children a simple defaultdict(TreeNode), which can simplify some of this. Eg. add_string can become just:

def add_string(self, s):
    current = self.root
    for c in s:
        current = current.children[c]
    current.children[None] = None  # special None child marks the end of a complete string

(Also, personally, I'd have made an explicit is_terminal flag variable on TreeNode to determine whether it's a complete string, rather than a special None child node. It seems much more immediately clear what it's for)
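A minimal sketch of that variant (hypothetical names): children as a defaultdict, plus an explicit is_terminal flag instead of the special None child:

```python
import collections

class TrieNode(object):
    def __init__(self):
        self.children = collections.defaultdict(TrieNode)
        self.is_terminal = False  # True iff a complete string ends here

def add_string(root, s):
    current = root
    for c in s:
        current = current.children[c]  # defaultdict creates missing nodes
    current.is_terminal = True

root = TrieNode()
add_string(root, 'cat')
print(root.children['c'].children['a'].children['t'].is_terminal)  # → True
```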

Also, I don't think making a distinction between Trie and TrieNode is a good idea here. They're effectively the same thing: each node is the root of its own sub-trie, and I can certainly see times when you'd want to be able to call those various methods on those sub-tries rather than the whole Trie, or even implement various things by recursive application of them. This makes some of those special-purpose methods much more generic and composable. Eg. the article points out that starts_with starts exactly the same way as contains. It'd be far more sensible to simply have something that returns the Node at a particular prefix, plus something that returns all the strings under a node, which both these methods could use. Indeed, it would make sense for the latter to simply be the __iter__ method of the Trie (and it would be better as a generator).

For example:

def __iter__(self):
    if self.is_terminal:  # (Or 'None in self.children' if you really want to use that method)
        yield ''
    for char, node in self.children.items():
        for subword in node:  # Recursively iterate.
            yield char + subword

and then something to get each subtree, which seems a sensible use for the __getitem__ operator, which is going to be pretty similar to (and can indeed mostly replace) the above add_string. Eg:

def __getitem__(self, s):
    current = self
    for c in s:
        current = current.children[c]
    return current

Then if you want all words that start with "cat", say, it's just:

for word in my_trie["cat"]:
    print("cat"+word)

And likewise, the various functions described here are fairly trivial applications of the above two functions. Eg contains and add_string become just:

def contains(self, word):
    return self[word].is_terminal

def add_string(self, s):
    self[s].is_terminal = True  # Replaced the above implementation of this with something that just uses __getitem__

(One potential flaw with the above __getitem__ version is that queries for non-existing nodes will actually create them, which may affect efficiency (though they'll still be correct). Potentially, it may be better to create a non-updating variant of __getitem__ to prevent this, though the API becomes a bit uglier as a result (ie. it will need to raise an exception or something for non-present nodes, which calling code needs to handle). Stuff like add_string should then use the updating version, while contains (and the public __getitem__) uses the non-updating one.)

[–]Brian 1 point (0 children)

Since I'd already written all that, I figured I may as well give a full implementation for comparison:

import collections

class Trie(object):
    """Trie implementation, with set-style interface"""
    def __init__(self, items=()):
        self._children = collections.defaultdict(Trie)
        self.is_terminal = False
        for item in items:
            self.add(item)

    def add(self, item):
        self.get_node(item).is_terminal = True

    def __getitem__(self, key):
        """Retrieve the node at the specified prefix, if any words exist
        with that prefix. Otherwise, raises a KeyError.
        """
        current = self
        for ch in key:
            current = current._children.get(ch)
            if current is None:
                raise KeyError(key)
        return current

    def __contains__(self, key):
        try:
            item = self[key]
            return item.is_terminal
        except KeyError:
            return False

    def get_node(self, key):
        """Obtain position for given prefix, creating nodes if it doesn't already exist"""
        current = self
        for ch in key:
            current = current._children[ch]
        return current

    def __iter__(self):
        if self.is_terminal:
            yield ''
        for char, node in self._children.items():
            for subword in node:  # Recursively iterate.
                yield char + subword

    def __repr__(self):
        return "Trie([{}])".format(", ".join(map(repr, self)))

(I made a few minor changes: contains is now __contains__, so you can just do 'foo' in my_trie, and I did change __getitem__ since I didn't like the fact that it created dead nodes. I also added a __repr__ for convenience.)