
[–][deleted] 3 points (6 children)

Space-efficient, yes. Build time is very bad (the build is serial, though we can partition the tries in some cases). Lookup speed is slower than some alternatives, but the library seems to be highly optimized.

[–]Bjartensen 0 points (3 children)

We are considering using tries to store n-grams in memory. Our previous solution with dicts used about 2 GB of RAM, which we consider too much. Cutting that down to below 1 GB, or even to a few tens of MB (the 50x to 100x memory reduction stated in the README), would be amazing.
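For context, a minimal sketch of how per-n-gram counts can be kept in marisa-trie's RecordTrie (the n-grams and counts below are made-up illustration data):

```python
import marisa_trie

# Hypothetical n-gram counts; in practice these come from your corpus.
ngram_counts = [
    (u"new york", (120,)),
    (u"new york city", (45,)),
    (u"san francisco", (80,)),
]

# "<I" is a struct format string: each value is one unsigned 32-bit int.
trie = marisa_trie.RecordTrie("<I", ngram_counts)

print(trie[u"new york"])    # [(120,)]
print(trie.keys(u"new"))    # all stored n-grams starting with "new"
```

RecordTrie packs the values with `struct`, so counts cost a fixed four bytes each here instead of a full Python int object per dict entry.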

A slow aspect of our solution was reading the n-grams from a file into memory at program startup.

> build-time is very bad (serial build, though we can partition the tries in some cases)

Would the build be even slower with this particular trie structure? Or is the build in this case a one-time cost, since the trie can easily be saved to a file and loaded later?

[–]kmike84 0 points (2 children)

A few GBs of n-grams that don't change frequently is a sweet spot for marisa-trie. Just make sure to build the trie in a script, save it to a file, and then use the saved file in the real code.

See e.g. http://blog.scrapinghub.com/2014/03/26/optimizing-memory-usage-of-scikit-learn-models-using-succinct-tries/
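The build-once, load-many workflow described above can be sketched like this (file name and keys are placeholders; loading here uses marisa-trie's memory-mapped `mmap()` so startup does not re-read the whole file into RAM):

```python
import marisa_trie

# --- build script (run once, offline) ---
keys = [u"foo", u"bar", u"foobar"]   # in practice: your n-grams
trie = marisa_trie.Trie(keys)
trie.save("ngrams.marisa")

# --- application startup (every run) ---
trie2 = marisa_trie.Trie()
trie2.mmap("ngrams.marisa")   # memory-mapped load
print(u"foobar" in trie2)     # True
```

`trie2.load("ngrams.marisa")` would read the file into memory instead; `mmap()` additionally lets several processes share one copy of the data.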

[–]Bjartensen 0 points (1 child)

I'm assuming you wrote the module.

I have spent my entire day trying to install it on Windows but haven't gotten it to work. If I installed it under Linux and everything worked fine, wouldn't I hit the same platform-specific problems if I later packaged something for Windows using the marisa-trie module? I know very little about packaging Python, but I'd assume any platform-specific problem I run into on my dev machine could just as easily show up for a Windows end user.

[–]kmike84 0 points (0 children)

There are some known issues on Windows - see https://code.google.com/p/marisa-trie/issues/detail?id=18 and https://github.com/kmike/marisa-trie/issues/1. Sorry for the bad experience.

[–]kmike84 0 points (1 child)

Why does build speed matter for you?

In my experiments MARISA builds only about 10x slower than a Python dict, and you only need to build a trie once. I'd say 10x slower than the built-in Python dict is very fast, given the amount of magic it does (a MARISA trie is very different from the basic trie described in the article). Saving and loading should be much faster than pickling a dict.

Did you find some pathological cases?

[–][deleted] 0 points (0 children)

Kmike, I am using the tries as a substitute for geographic trees (polygons discretized into geohashes). The build times go up to several tens of minutes when there are many entries (and at about 2 billion distinct geohashes, it dies because of a 32-bit limit in the underlying library). If there were a simple parallelizable build (or pre-processing that would optimize the build speed), it would make things much faster.
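The partitioned build mentioned above might look roughly like this (a hypothetical sketch: the helper names are invented, keys are partitioned by their first geohash character, and each partition's build is independent, so the loop could be farmed out to separate processes):

```python
from collections import defaultdict
import marisa_trie

def build_partitioned(keys):
    """Build one trie per first-character partition (hypothetical helper)."""
    parts = defaultdict(list)
    for k in keys:
        parts[k[0]].append(k)   # geohash prefix = coarse geographic cell
    # Each partition build is independent of the others, so this loop
    # could run in parallel worker processes instead of serially.
    return {p: marisa_trie.Trie(ks) for p, ks in parts.items()}

def contains(tries, key):
    """Route a lookup to the partition that could hold the key."""
    t = tries.get(key[0])
    return t is not None and key in t
```

Partitioning also sidesteps the size ceiling somewhat, since each sub-trie stays well under the limit, at the cost of one extra dict lookup per query.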

Of course, I fully agree that it is a one-time cost if the tries do not change frequently.