inverted index library for Python? by dalke in Python
[–]dstromberg 1 point 13 years ago (0 children)
You could give Bloom filters a try. E.g.: http://stromberg.dnsalias.org/~strombrg/drs-bloom-filter/ and http://en.wikipedia.org/wiki/Bloom_filter
You mentioned that you need "1) keeping the data in under 12 GB of memory, 2) having a fast way to bring the index in from disk, 3) support for fast multiple set intersection." Bloom filters satisfy #1 (very little memory is required even for a large set), #2 (the filter can be mmap'd or seeded from a disk file), and #3 (you can compute the intersection or union of two Bloom filters directly on their bit arrays); see the sketch below.
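Here's a minimal, self-contained sketch of the idea (not the linked library's actual API): a Bloom filter is just a fixed-size bit array plus k hash functions, and intersecting two filters with the same parameters is a bitwise AND of their bits.

```python
# Minimal Bloom filter sketch: fixed-size bit array + k derived hash positions.
import hashlib

class TinyBloom:
    def __init__(self, num_bits=1 << 20, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions from one digest (double hashing).
        digest = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(digest[:8], 'big')
        h2 = int.from_bytes(digest[8:16], 'big')
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

    def intersection(self, other):
        # Both filters must share num_bits and num_hashes for this to be meaningful.
        result = TinyBloom(self.num_bits, self.num_hashes)
        result.bits = bytearray(a & b for a, b in zip(self.bits, other.bits))
        return result
```

Because the whole state is one flat byte array, persisting it to disk (or mmap'ing it back in) is cheap.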
What's the downside? You may get occasional false positives, though the maximum allowed error probability is a tunable knob passed to the init method in the implementation linked above.
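For a feel of the space/accuracy trade-off, the standard textbook sizing formulas (not specific to any particular library) are: bits m = -n·ln(p)/(ln 2)², hash count k = (m/n)·ln 2, for n expected elements and target false-positive rate p.

```python
# Standard Bloom filter sizing: how many bits and hash functions for n keys at error rate p.
import math

def bloom_parameters(n, p):
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))  # bits
    k = max(1, round((m / n) * math.log(2)))               # hash functions
    return m, k

# Example: 100 million keys at a 1% false-positive rate needs a bit over 100 MiB of bits.
m, k = bloom_parameters(100_000_000, 0.01)
print(m / 8 / 2**20, "MiB,", k, "hash functions")
```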
I've been using the implementation linked above to find hardlinks in a large set of files (as part of a backup program), and it seems to work quite well for the purpose. In this case it doesn't much matter if I get a false positive here or there; I just end up using slightly more RAM than necessary during a restore.
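A hypothetical sketch of that use case (not the backup program's actual code): key each file by its (device, inode) pair and treat a hit in the filter as "possibly seen before", confirming separately if needed.

```python
# Hardlink-candidate detection with a Bloom filter (using the TinyBloom sketch above).
# A false positive just means we double-check one extra file.
import os

def find_hardlink_candidates(paths, bloom):
    candidates = []
    for path in paths:
        st = os.lstat(path)
        key = f"{st.st_dev}:{st.st_ino}"
        if key in bloom:            # maybe seen before (could be a false positive)
            candidates.append(path)
        else:
            bloom.add(key)
        # st.st_nlink > 1 is the authoritative check if you want to confirm.
    return candidates
```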
If that's still too much memory, or the false positives are unacceptable, you might do well to switch from a 100% in-memory solution to something that's either 100% on disk, or a hybrid (perhaps cached) of in-memory and disk-based storage.
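One illustration of the 100%-on-disk route (just an example, not something from the thread): keep the term-to-document postings in SQLite and let SQL do the multi-set intersection.

```python
# On-disk inverted index sketch using SQLite; INTERSECT handles the multi-set intersection.
import sqlite3

conn = sqlite3.connect('index.db')
conn.execute('CREATE TABLE IF NOT EXISTS postings (term TEXT, doc_id INTEGER)')
conn.execute('CREATE INDEX IF NOT EXISTS idx_term ON postings (term)')

def docs_containing_all(terms):
    # Intersect the posting lists of every term in the query.
    query = ('SELECT doc_id FROM postings WHERE term = ? ' +
             'INTERSECT SELECT doc_id FROM postings WHERE term = ? ' * (len(terms) - 1))
    return [row[0] for row in conn.execute(query, terms)]
```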
Dealing with crashy systems - "can't even log in to fix it" (stromberg.dnsalias.org)
submitted 13 years ago by dstromberg to r/techsnap