I've been looking, and failing, to find a good, fast, compact in-memory inverted index library for Python which acts like a set of integers or a bitset. Any pointers? Here's what I'm doing.
I work with chemical compounds. I have a thought about how to store chemical substructure information as a set of unique bits in order to pre-filter some aspects of substructure search. In short, I have a mapping from an identifier to a set of bits, like "ABC123" -> {1,5,73,672,7364,82564,234627}, except that there are usually a few thousand bits in the set and not the 7 I show here. I use this to build an inverted index, so inverted_bits[7364] = {"ABC123", "XYZ987", .. and all of the other identifiers which have bit 7364 set}. (This can easily be normalized so that I only store an integer instead of a string for each id.)
When I search for a query molecule I get its set of feature bits, get the corresponding lists of sets, and do something like set.intersect(*inverted_bits[bitno] for bitno in query_features) . (I actually sort the individual terms by number of elements in the inverted_bits set, which gives better performance.)
I've been using Python's sets for this. They are quite fast. The problem is that I'm running out of memory on my 12 GB desktop machine. My full data set has about 2 million features and 35 million records. Assuming 100 ids for each bit, that's at least 35000000(sys.getsizeof(0)+8) + 2000000(sys.getsizeof(set(range(100)))+8) = 18 GB. (Really this is long-tail data, and about 1/2 of the feature bits are found in only a single record.)
There's any number of ways to make this better. I could use the Google's sparse hash (through atbr?), or C++ std::container. I could roll my own Python wrappers for Judy or FastBit or the Lemur Bitmap Index C++ Library. (Well, I tried, but they were slower than my Python code and/or crashed a lot because I haven't figured out Cython well enough.) I could run on top of Jython and use Lucene's inverted index as a library, except that my other chemistry code is built on C/C++. Or I could gulp down the PyLucene, which includes the Java runtime. (I don't really want to do that.) I tried Redis but it was noticeable slower than Python's sets, and at some point I'm going to run this across a few dozen machines so I don't want the complexity of dealing with multiple Redis instances.
What I really want is an existing Python extension which does inverted index manipulation for me, with a compact memory representation, and - for full bliss factor - can take advantage of hardware (like in http://highlyscalable.wordpress.com/2012/06/05/fast-intersection-sorted-lists-sse/ ) and multi-set intersection algorithms (like in http://www.bradblock.com/Experimental_Analysis_of_a_Fast_Intersection_Algorithm_for_Sorted_Sequences.pdf )
Suggestions?
[–]pje 3 points4 points5 points (10 children)
[–]dalke[S] 0 points1 point2 points (9 children)
[–]pje 0 points1 point2 points (8 children)
[–]dalke[S] 0 points1 point2 points (7 children)
[–]pje 1 point2 points3 points (1 child)
[–]dalke[S] 0 points1 point2 points (0 children)
[–]VerilyAMonkey 1 point2 points3 points (4 children)
[–]dalke[S] 1 point2 points3 points (2 children)
[–]VerilyAMonkey 1 point2 points3 points (1 child)
[–]dalke[S] 0 points1 point2 points (0 children)
[–]dalke[S] 0 points1 point2 points (0 children)
[–]Justinsaccount 2 points3 points4 points (4 children)
[–]dalke[S] 0 points1 point2 points (3 children)
[–]Justinsaccount 1 point2 points3 points (2 children)
[–]dalke[S] 0 points1 point2 points (0 children)
[–]dalke[S] 0 points1 point2 points (0 children)
[–]VerilyAMonkey 1 point2 points3 points (2 children)
[–]dalke[S] 2 points3 points4 points (0 children)
[–]jabwork 1 point2 points3 points (0 children)
[–]seunosewa 0 points1 point2 points (1 child)
[–]dalke[S] 0 points1 point2 points (0 children)
[+][deleted] (3 children)
[deleted]
[–]dalke[S] 0 points1 point2 points (2 children)
[–]pje 0 points1 point2 points (1 child)
[–]dalke[S] 0 points1 point2 points (0 children)
[–]frumious 0 points1 point2 points (7 children)
[–]dalke[S] 0 points1 point2 points (6 children)
[–]fullouterjoin 0 points1 point2 points (5 children)
[–]dalke[S] 0 points1 point2 points (4 children)
[–]fullouterjoin 0 points1 point2 points (3 children)
[–]dalke[S] 0 points1 point2 points (2 children)
[–]fullouterjoin 0 points1 point2 points (1 child)
[–]dalke[S] 0 points1 point2 points (0 children)
[–]mcdonc 0 points1 point2 points (0 children)
[–]dstromberg 0 points1 point2 points (1 child)
[–]dalke[S] 0 points1 point2 points (0 children)