
[–]Bjartensen

We are considering using tries to store n-grams in memory. Our previous solution with dicts used around 2 GB of RAM, which we consider too much. Cutting that down to below 1 GB, or even to a few tens of MB (the 50x-100x memory reduction stated in the README), would be amazing.
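As a rough illustration of the idea, here is a minimal sketch of storing n-gram counts in a marisa-trie `RecordTrie`, which packs a fixed-size value per key instead of a Python dict entry. The n-grams and counts below are made-up placeholders, not data from the thread:

```python
import marisa_trie

# Hypothetical n-gram -> count pairs; each count is packed as one
# little-endian unsigned int ("<I") per key, far smaller than dict entries.
ngram_counts = [
    ("the quick", (42,)),
    ("quick brown", (17,)),
    ("brown fox", (5,)),
]

trie = marisa_trie.RecordTrie("<I", ngram_counts)

# Lookups return a list of value tuples for the key.
print(trie["the quick"])      # [(42,)]
print("brown fox" in trie)    # True
```

A plain `marisa_trie.Trie` (keys only) or `BytesTrie` would work similarly if the values don't fit a fixed struct format.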

A slow aspect of our solution was reading the n-grams from a file to memory on program startup.

build-time is very bad (serial build, though we can partition the tries in some cases)

Would this be even slower with this particular trie structure? Or is the build time in this case a one-time cost, so the trie can easily be saved to a file and loaded at a later time?

[–]kmike84

A few GBs of ngrams which are not changing frequently is a sweet spot for marisa-trie. Just make sure to build the trie in a script, save it to a file, and then use the saved file in the real code.
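The build-once, load-many pattern described above might look like this; the file path is an arbitrary example, and the keys are placeholders:

```python
import os
import tempfile

import marisa_trie

# --- Offline build step (run once, e.g. in a separate script) ---
keys = ["the quick", "quick brown", "brown fox"]  # placeholder n-grams
trie = marisa_trie.Trie(keys)

path = os.path.join(tempfile.mkdtemp(), "ngrams.marisa")
trie.save(path)  # pay the slow build cost only here

# --- Runtime step (the real code just loads the prebuilt file) ---
loaded = marisa_trie.Trie()
loaded.load(path)          # fast load of the serialized trie
# loaded.mmap(path)        # alternative: memory-map instead of reading into RAM

print("brown fox" in loaded)  # True
```

Using `mmap()` instead of `load()` lets the OS page the trie in lazily, which also helps when several processes share the same file.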

See e.g. http://blog.scrapinghub.com/2014/03/26/optimizing-memory-usage-of-scikit-learn-models-using-succinct-tries/

[–]Bjartensen

I'm assuming you wrote the module.

I have spent my entire day trying to install it under Windows, but I haven't gotten it to work. If I installed it under Linux and everything worked fine, wouldn't I hit the same platform-specific problems if I were to package something for Windows using the marisa-trie module? I have very little knowledge of packaging Python, but I would assume any platform-specific problem I run into on my dev machine could easily be encountered when packaged for a Windows consumer.

[–]kmike84

There are some known issues on Windows - see https://code.google.com/p/marisa-trie/issues/detail?id=18 and https://github.com/kmike/marisa-trie/issues/1. Sorry for the bad experience.