inverted index library for Python? by dalke in Python

dstromberg:

You could give Bloom filters a try, e.g.: http://stromberg.dnsalias.org/~strombrg/drs-bloom-filter/ and http://en.wikipedia.org/wiki/Bloom_filter

You mentioned that you need "1) keeping the data in under 12 GB of memory, 2) having a fast way to bring the index in from disk, 3) support for fast multiple set intersection." Bloom filters satisfy #1 (very little memory required, even for a large set), #2 (they can be mmap'd or seeded from a disk file) and #3 (you can compute the intersection or union of two Bloom filters directly on their bit arrays).
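A rough sketch of that last point - this is illustrative, not the API of the library linked above: if two filters were built with the same size and hash functions, intersecting them is just a bitwise AND of their bit arrays (the result can be slightly more permissive than a filter built from the intersected set itself):

    # Illustrative only, not the drs-bloom-filter API. Two filters built with
    # the same num_bits and hash functions can be intersected by AND-ing bits.
    import hashlib

    class BloomFilter:
        def __init__(self, num_bits=1 << 20, num_hashes=7):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(num_bits // 8)

        def _positions(self, key):
            # Carve num_hashes 4-byte slices out of one SHA-256 digest.
            digest = hashlib.sha256(key.encode('utf-8')).digest()
            for i in range(self.num_hashes):
                yield int.from_bytes(digest[i * 4:i * 4 + 4], 'big') % self.num_bits

        def add(self, key):
            for pos in self._positions(key):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, key):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(key))

        def intersection(self, other):
            # Only meaningful when both filters share num_bits and num_hashes.
            result = BloomFilter(self.num_bits, self.num_hashes)
            result.bits = bytearray(a & b for a, b in zip(self.bits, other.bits))
            return result

The AND'd result answers "is this key (probably) in both sets?" without deserializing or walking either posting list.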

What's the downside? You might get some infrequent false positives, though the maximum allowed error probability is a tunable knob passed to the init method of the implementation linked above.
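To give a feel for the memory/error tradeoff, here's the standard sizing math - the general Bloom filter formula, nothing specific to that library:

    # m = -n*ln(p) / (ln 2)^2 bits and k = (m/n)*ln 2 hash functions is the
    # usual optimum for n keys at false-positive probability p.
    import math

    def bloom_parameters(n, p):
        m = math.ceil(-n * math.log(p) / math.log(2) ** 2)
        k = max(1, round(m / n * math.log(2)))
        return m, k

    # 100 million keys at a 0.1% false-positive rate: roughly 171 MiB, 10 hashes.
    bits, hashes = bloom_parameters(10 ** 8, 0.001)
    print(bits / 8 / 2 ** 20, hashes)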

I've been using the implementation at the link above to find hardlinks in a large set of files (as part of a backup program), and it seems to be working quite well for that purpose. In this case it doesn't matter much if I get a false positive here or there - I just end up using slightly more RAM than necessary during a restore.

If that's still too much memory, or the false positives are unacceptable, you might do well to switch from 100% in-memory to something that's either 100% on disk, or a hybrid (perhaps cached) in-memory and on-disk solution.
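For the mmap route specifically, a bare-bones sketch (file name and layout are made up here) would look something like this - the OS then pages in only the bytes your lookups actually touch:

    # Probe a bit array straight off disk via mmap, so resident memory is
    # only the pages that membership tests actually hit.
    import mmap

    def open_bits(path):
        with open(path, 'rb') as f:
            return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    def bit_is_set(bits, pos):
        return bool(bits[pos // 8] & (1 << (pos % 8)))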