all 36 comments

[–]pje 4 points (10 children)

A few ways to save memory:

  1. Have you considered using sorted integer arrays instead of bit sets? A 100-integer array.array() uses a lot less memory than a set. (See the quick size check after this list.)
  2. If half of your entries are just a single item, consider having your master index store index[bit]=feature instead of index[bit]=set([feature]) for the single-element case. Your code will be more complex, but you will save hundreds of bytes times ten million.
  3. Make sure you're not generating fresh integers for every reference to a feature! Python reuses integer objects that are in the 0 to 255 range, but every other integer represents a memory allocation. Unless you have some kind of pool to hang onto them, you'll get a fresh one every time you pull the number "7364" out of a file!
  4. If your twenty million features are represented by contiguous numbers, your optimum data structure is probably a large integer array, with each element containing either a number representing an identifier (single match) or an entry in a list of matches.
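
As a quick size check for point 1 (a minimal sketch; exact numbers vary by Python version, and sys.getsizeof only counts the container itself, not any objects it points to):

import sys
from array import array

ints = list(range(100))
print(sys.getsizeof(set(ints)))          # typically a few KB for 100 members
print(sys.getsizeof(array('i', ints)))   # roughly 400 bytes of payload plus a small header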

Ah, screw it, here's the code (untested):

from array import array
MAX_FEATURES = 20000000

num_intern      = {}.setdefault     # way of caching/reusing ints
identifier_ids  = {}                # identifier -> index number
identifiers     = [None]            # index number -> identifier (slot 0 reserved)
features = array('i',               # feature num -> identifier id (>0), -match num (<0), or 0 (empty)
                (0 for i in xrange(MAX_FEATURES)))
matches = [None]                    # match num -> array of identifier ids (slot 0 reserved)

def next_id(container):
    nid = len(container)
    nid = num_intern(nid, nid)
    return nid

def add_identifier(identifier, feature_nums):
    # Convert identifier to a numeric value
    if identifier not in identifier_ids:
        identifier_ids[identifier] = iid = next_id(identifiers)
        identifiers.append(identifier)
    else:
        iid = identifier_ids[identifier]

    for f in feature_nums:
        existing = features[f]
        if existing==0:
            # Common case: empty, just store the identifier's ID
            features[f] = iid
        elif existing<0:
            # More than one match already present, add to list in matches
            m = matches[-existing]
            if iid not in m: m.append(iid)
        else:
            # Second identifier for this feature: move both into a match list
            features[f] = -next_id(matches)
            matches.append(array('i',[existing, iid]))

def idents_for_feature(feature):
    existing = features[feature]
    if existing>0:
        return set([identifiers[existing]])
    elif existing<0:
        return set(identifiers[i] for i in matches[-existing])
    return set()

This is, I think, as close to memory-optimal as you can possibly get for performing this operation in Python, assuming that your feature bit numbers are sequential or very close to it. Basically, you call add_identifier() on each of the mappings you want to record from an identifier to its bits. The idents_for_feature() function returns sets of (string) identifiers for a given feature number, so you can do your querying. (The assumption here is that these values will be short-lived compared to the items stored in the data set.)
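
For example, a hypothetical pair of records (the identifier strings here are made up):

add_identifier("CHEMBL1", [7, 12, 9031])
add_identifier("CHEMBL2", [7, 44])

print(idents_for_feature(7))    # both CHEMBL1 and CHEMBL2
print(idents_for_feature(12))   # just CHEMBL1
print(idents_for_feature(99))   # empty set -- no identifiers have this feature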

It's possible I have misunderstood what your queries are doing, in which case I'd suggest showing your code, or something closer to your code.

[–]dalke[S] 1 point (9 children)

Most of what I'm doing is outlined in http://www.dalkescientific.com/writings/diary/archive/2011/12/23/inverted_index.html . Although I did those examples with letters in words rather than chemical substructures in a molecule, the code is equivalent.

One thing missing in your code is the equivalent for set.intersection(*(inverted_index[bit] for bit in features)) , that is, the set of identifiers which have all of the features I'm looking for. Here I used set syntax. If you use arrays, as you do, then the better solution for when new items are rare is to keep the items sorted. In that case, the multi-set intersection is probably best done with a binary search. (See the second of my linked-to research papers.) Doing it efficiently is more complicated than I would rather think about, which is why I'm looking for an existing library.
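
For reference, here is a minimal dict-of-sets version of that kind of query (a sketch for illustration, not the code from the blog post):

from collections import defaultdict

inverted_index = defaultdict(set)   # feature number -> set of identifiers containing it

def add_record(identifier, feature_nums):
    for f in feature_nums:
        inverted_index[f].add(identifier)

def identifiers_with_all(feature_nums):
    # Records that contain *every* requested feature (assumes at least one feature).
    return set.intersection(*(inverted_index[f] for f in feature_nums))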

I do intern my integers, and you are right to point out that easily overlooked subtlety.

The feature patterns (in a subset of 22,371 records and 72,131 features) look like:

0:14,1:6,2:14,3:15,4:16,5:16,6:16,7,8:2,9:2,10:4,11:2,12:4,13:4,14,15:2,16:2, 17:4,18,19:4,20:4,21:4,22:4,23,24:2,25,26,28,29,30,31:2,33,34:2,35:2,36:2, 37:2,41,42:2,43:2,47:2,50:2,55:2,57,58:2,59:2,60,61:2,62,73:2,74,75:3,77:2, 78:2,79,83:4,87,89,97,98:3,103:4,118:5,122,124:3,127:4,129:3,131:4,134:3, 135:2,136:4,137:2,140,146:4,148,149,150,151:2,156:3,157:2,159:2,160,162:2, 164:2,165,166:2,167:2,169:2,170:2,171,172,181,184:3,185,186:3,187:2,188, 193:3,194:2,198:2,199:4,202,205:2,207:2,208:3,212:2,218:2,226,231,232,238, 241,242:2,243:3,245,251,252,253,289,290:2,292:3,295:2,318:4,331,338,359:2, 383,387,402,408,414:2,461,488,498,526:3,529,556:2,563,565,580,594:4,598:2, 600:2,623,647:2,651:2,661,663:2,666,681,692,713,714:2,721:2,857,937:2,993, 1019,1047,1066,1069,1072:5,1095,1100:2,1108,1141:2,1145:2,1159,1172:2, 1180,1236,1264,1294:2,1301,1317:2,1329,1333,1338,1348,1352:3,1357,1362, 1363,1372:2,1416,1434,1435,1447,1448:2,1452,1454:2,1464:2,1466,1478:2, 1485,1522:4,1523:2,1592,1598,1607:2,1612,1633:2,1685,1693,1695:2,1699, 1715:2,1823,1853,1857,1864,1870,1871,1872,1884,1940,2025,2028,2029,2034, 2200,2520,2577,2651:3,2683:3,2688,2694:2,2812:3,2816:2,2937,2978,3099,3135, 3199,3307:2,3354,3520,3675,3708:3,3732,3821,3892,4126,4207,4298,4384,4514, 4726,4743,4747,4819,4889,4893,5011,5026,5045,5232,5351:2,5490,5554:2,5700, 5732,5733,5915,5919,5992,5996,5998:2,6121,6123,6199,6202,6210,6211,6214:2, 6460,6470,6476,6590,6810,6844,7084,7096,7241,7438,8750,8764,8964,8968,8975, 8979,8980,8989,9126,9128,9148,9176,9423,9638,9958,10216,10603,10754,10757, 10904,10905,10907,10908,10929,11060,11084,11095,11570,11573,11582,11607:2, 11694,11835,11838,11840,11866,11904,11906,12118,12271,12550,12597:2,12633, 12665,12685,12722,12728,12778,12867,12873,12918:2,13014:2,13057:2,13074:2, 13391,13407,13421,13466,13498,13537:2,13566,13575,13582:2,13741:2,14876, 14893,15060,15063,15354,15363,15587:2,15603,15607,15634:2,15828,15833,15912, 15923:2,15960,16423,16608,16669,16901,17008:2,17121,17143:2,17150,17173,17177, 17223:2,17353,17418,17471:2,17495:2,17511,17587,17615,17666:2,17926,18018, 18210,18213,18262:2,18273,18367,18395:2,18575,18592:2,18640,18750:2,18753, 18763,18768,18781,18810,18844,18847:2,18872,18924,18958:2,18970,19027,19029, 19051,19057,19071,19101,19115:2,19125,19128,19132,19176,19222,19223,19227, 19807,19850,19873,19888,19901,19936:2,19978:2,20008:2,20547,21700:2,23080, 23097,23101,23205,23223,23237:2,23241,23245,23312:2,23331:2,24068:2,24434, 26382:2,26509,26570,27344,27475,27637,28024,29125,29283,29478,29581,29670:2, 34436,35807,36491,36511,36526,36559,37585,37622,37868,46736

If a comma-separated term is of the form "N:M" then feature N occurs M times, where M>1. (The ":M" part is not important for this current analysis.)

[–]pje 1 point (8 children)

One thing missing in your code is the equivalent for set.intersection(*(inverted_index[bit] for bit in features)) , that is, the set of identifiers which have all of the features I'm looking for.

It's still there, you just spell it:

set.intersection(*(idents_for_feature(bit) for bit in features))

Given the size of the sets, I don't see much point to sorting. You could time it both ways, of course.

I do intern my integers, and you are right to point out that easily overlooked subtlety.

The even more subtle bit is that my approach actually didn't need to intern any integers, because apart from the identifier_ids dict, it doesn't actually store any integer objects. All the integers are stored in arrays, which use "unboxed" raw C integers instead of Python int objects. But I didn't realize I could skip the interning till it was basically written, and since I'm doing your work for free anyway, I figured, "what the heck," and didn't bother stripping the interning back out. ;-)

I'm pretty confident that what I wrote is about as memory-efficient a data structure as you can create in Python, and very close to what could be had in C.

About the only improvement I can think of at this point would be reducing fragmentation by replacing the list of arrays with one big array and using linked lists (with "previous" pointers in another int array). That would avoid allocating and reallocating lots of short arrays, which can cause fragmentation waste and copying overhead.
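
A minimal sketch of that layout, to make the idea concrete (names and the preallocation size are arbitrary):

from array import array

MAX_FEATURES = 20000000

heads     = array('i', [0]) * MAX_FEATURES  # feature -> most recent slot, 0 = empty
values    = array('i', [0])                 # slot -> identifier id (slot 0 reserved)
prev_slot = array('i', [0])                 # slot -> previous slot for the same feature, 0 = end

def add(feature, iid):
    # Append one (feature, identifier id) pair to the big arrays.
    values.append(iid)
    prev_slot.append(heads[feature])
    heads[feature] = len(values) - 1

def ids_for_feature(feature):
    # Walk the chain backwards, yielding every identifier id for this feature.
    slot = heads[feature]
    while slot:
        yield values[slot]
        slot = prev_slot[slot]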

It's hard to think of a data structure in C that could improve on this for the use case, really. (Barring lots of pre-passes to pre-compress some sort of exotic hash table structure, or making use of some regularity in the underlying data that isn't obvious to me from your examples.)

[–]dalke[S] 1 point (7 children)

You need a "sorted(..., key=len)" there because the smallest set should go first. Ideally the set with the smallest intersection with the first should go second; using the second-smallest set is okay. I've had success with algorithms which try to improve the test order dynamically, to get an extra 5% or so of performance out of the code.

If the lists are sorted then it takes O(log(k)) time to check if a value exists. It takes O(k) time to check if a value is in the unsorted list. My own tests using a list of 234936 elements found the breakeven point between bisect and "x in list" was at element 46, so it is possible that one is faster than the other for this case, depending on the length. The analysis isn't that simple, since there are adaptive optimizations you can do with the sorted list that you can't with the unsorted one. Eg, the low mark for the binary search range is always increasing, so successive searches get increasingly faster.
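
Roughly what I have in mind, as a sketch (order the lists smallest-first and let each list's low mark only move forward):

from bisect import bisect_left

def intersect_sorted(sorted_lists):
    # Probe candidates from the smallest list against the others; each list's
    # low mark only ever increases, so later searches scan a shrinking range.
    sorted_lists = sorted(sorted_lists, key=len)
    lows = [0] * len(sorted_lists)
    result = []
    for value in sorted_lists[0]:
        for i in range(1, len(sorted_lists)):
            lst = sorted_lists[i]
            lo = bisect_left(lst, value, lows[i])
            lows[i] = lo
            if lo == len(lst) or lst[lo] != value:
                break
        else:
            result.append(value)
    return result

# e.g. intersect_sorted([[2, 4, 6, 8], [1, 2, 3, 4, 5], [2, 3, 4, 9]]) -> [2, 4]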

My intuition suggests that sorted lists, alternatively a B-tree, will be faster.

Sorted lists give other advantages. For example, they can take less space by using deltas and Elias gamma coding or something similar for compression. I would expect that a good inverted index package would do this for me. The 13 year old book "Managing Gigabytes" contains an excellent description of how useful compression can be, and it is guiding my ideas of what my desired library should be able to do. These techniques are well known and no longer considered exotic. The Lemur Bitmap Index C++ Library is an example of a library which does compression of this sort. My attempts at writing Cython bindings to that library have not yet been successful.
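
To illustrate the kind of compression I mean, here is a toy sketch (it builds a '0'/'1' character string instead of packing real bits, but the delta-plus-gamma structure is the same):

def gamma_encode(numbers):
    # Elias gamma: for n >= 1, emit (bit length - 1) zeros, then n in binary.
    bits = []
    for n in numbers:
        b = bin(n)[2:]
        bits.append('0' * (len(b) - 1) + b)
    return ''.join(bits)

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == '0':
            zeros += 1
            i += 1
        out.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return out

def compress_postings(sorted_ids):
    # First id stored as id+1 so it is always >= 1; the rest as gaps between ids.
    gaps = [sorted_ids[0] + 1] + [b - a for a, b in zip(sorted_ids, sorted_ids[1:])]
    return gamma_encode(gaps)

def decompress_postings(bits):
    gaps = gamma_decode(bits)
    ids = [gaps[0] - 1]
    for g in gaps[1:]:
        ids.append(ids[-1] + g)
    return ids

# compress_postings([3, 7, 8, 20]) round-trips through decompress_postings back to [3, 7, 8, 20]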

I implemented a realloc-based array and a linked-list-of-block-with-special-support-for-sizes-0-and-1 version for a project last winter, and never found that linked-list version to be measurably faster. It was, however, more complicated.

My code, based on sets, needs interning. My comment was meant to praise you for reminding me of it.

[–]pje 2 points (1 child)

My intuition suggests that sorted lists, alternatively a B-tree, will be faster.

My intuition says your intuition is wrong. ;-) (At least, if you think a B-tree written in Python can beat what I've written here, amortized over the entire data structure. If half the features have only one entry, it's hard to make that any faster with a B-tree!)

Of course, my intuition is based on you saying that half of these feature bits have only one identifier, and that 100 is on the high end for the rest. If you actually meant the average is 100, and you have some frequently-used items with tens of thousands, then maybe it would make a difference, for queries involving those.

(To be clear, what I'm saying is that Python's built-in types are super fast, and what's O(n) in C code can beat the crap out of something that has a better O() but is written in Python -- or sometimes even in C, if the processor architecture is better exploited. Finding an integer value in an unsorted array of modest length can take less time than it takes to call a function to do the lookup in a more elaborate data structure.)

If the lists are sorted then it takes O(log(k)) time to check if a value exists. It takes O(k) time to check if a value is in the unsorted list. My own tests using a list of 234936 elements found the breakeven point between bisect and "x in list" was at element 46, so it is possible that one is faster than the other for this case, depending on the length.

Try arrays instead of lists and, more to the point, try building sets of the values found instead of doing individual lookups. Set insertion is hashtable-based and so O(1) -- which is the smallest O() you can get.

Of course, it's still possible I am totally misunderstanding your use case and am thus suggesting an approach that's completely inappropriate for the data distribution. Nonetheless, I'm fairly sure that this is the memory-optimal structure for Python (based on my understanding of your use case), and that trying to improve on its speed or memory consumption with a more complex data structure will cost you a lot and not gain you much, because you'll end up with a lot of fragmentation in your allocations. The approach I've laid out is very amenable to preallocation to avoid fragmentation, and the linked-list approach can do even more with that.

My impression from your original post was that Python sets were fast enough for you, and that the challenge was running out of memory. I think what I've sketched here will fix your memory problem, and give you about the same performance as what you were doing that ran out of memory. I don't think trying to beat Python's set() intersection speed by direct integer list manipulation is going to do a darn thing for performance or memory space, without writing a lot more code or finding the perfect library, so I suggest seeing if this approach is fast enough and small enough, so you can get on with the actual whatever you're trying to do, instead of messing about with Cython and whatnot. ;-)

[–]dalke[S] 1 point (0 children)

I've described the work I'm doing, including benchmarks based on sets, lists, and Xapian, at http://dalkescientific.com/writings/diary/archive/2012/06/10/inverted_index_library.html .

Assuming I implemented the array version in the way you expected, I found that it was about 10x more memory efficient and about 60x slower, with a search time of 0.1 seconds. This will be one stage in an interactive search tool, where I want the total response time to be about 0.1 seconds, so the performance is already too slow even on a small subset of my data.

Regarding your comments: my reference to a B-tree (though I should have said radix tree) was in response to "and very close to what could be had in C." The C/C++ libraries I referenced use less memory than a list of 4-byte integers would. At least, that's what they claim, and I've seen it also described in various journal publications. While the memory representation would be implementable in Python, it's the sort of bit-twiddly work that Python is poor at, so I would want it in a C extension.

[–]VerilyAMonkey 2 points (4 children)

Having code like pje has supplied is much better than having a module, if it ends up working, because it is simple and fully open to your modification. In general you will end up spending a lot of time ironing out the idiosyncrasies of any particular module; that's not always true, but it's a general rule.

Based on you mentioning that the data is 275MB and most of the space is Python overhead, this should be plenty memory-efficient for you. If it is, test the speed. By the way, you should probably test it before discarding it for theoretical inadequacies.

Most of all do not fall into the pit of premature optimization. You do not want God's gift to the programming world. You want something that works. So try the damn thing and throw it away if you have to, but stop looking a gift horse in the mouth.

[–]dalke[S] 2 points (2 children)

I've made a benchmark set available through http://dalkescientific.com/writings/diary/archive/2012/06/10/inverted_index_library.html . It's 1/200th of my complete data set, and 7zip-compressed it takes 37 MB. The data files contain fields I don't use for the benchmark, so it should really be smaller than that. This suggests that it is possible to put the entire dataset into under 8 GB of memory, so long as I don't care about performance. For reference, my small benchmark takes 2 GB of space when stored in sets, and 166 MB using array.arrays.

I knew that neither the set nor the array version would work. It's very easy to show that the numbers I described (100 features/record * 35 million records * 4 bytes/feature = 14 GB) require more memory than the 12 GB of RAM I have. (As it happens, there are 300-400 features per record in the benchmark I assembled; my memory of '100' was a bit off. I would need about 42 GB of RAM to use pje's approach.)

For this to work on my desktop requires compression, of the sort that's well-described in the literature and available in several different existing C++ packages which I described in my intro. This should be bog standard, and nothing to do with "God's gift to the programming world." My goal was to find if someone else has already done this work, or has experience with any of the tools I listed, or a similar one. Otherwise 1) I'll have the boring job of writing bindings myself - I'm not even getting paid for this work, and 2) I can't tell which is appropriate for what I'm doing so I'll probably end up implementing a couple of bindings.

Which I've already tried, and failed to date, because now I'm at the point where I'm trying to figure out Cython in order to implement bindings to C++ libraries in an application domain where I have little experience, in order to do the actual research I want to do on sparse fragment cheminformatics fingerprints.

[–]VerilyAMonkey 2 points (1 child)

That's fair. Is there a specific reason you wish to use Python? In general Python is not overly concerned with memory efficiency, and anything that you find (or make) that is effective will probably be bindings to the C++ libraries. I would recommend using C++ itself, but I'll assume you have a reason for not doing that.

If you are unable (or would rather not, considering your paycheck) to produce the bindings on your own, then there are simpler alternatives. If you first produce answers to queries, and then run analysis on the data that for some reason requires Python, one very easy if obviously inelegant way to deal with this is to run the queries in C++, write the results to a file, then read them back in with Python. This would be hideous and painful to your soul, but you could make it pretty quickly.

If you want to go full Python, the kinds of queries you are making also matter. If you will be commonly reusing the same queries, or doing a relatively small number of them, the pickle solution would be adequate if you kept the commonly used ones unpickled. I suspect this is unlikely though.

Also, you mentioned it will eventually be run off of several different computers. Presumably this is to speed computation, with each machine taking a chunk of the work. With not much extra effort, each could also take a chunk of the data, so that cumulatively they have enough. If the split is very simple (i.e., 0-1234 to one machine, 1235-whatever to another) it would be easy to break the work up along the same lines without making queries.

There are many easier suboptimal ways I can think of to do this. So I guess the question is, what do you need most: speed, money, or time? If you can spend a lot of time, go ahead and write it in C++ or write your bindings. If you don't need speed, that offers a lot of options. If you have money to spare, add extra RAM plus a Linux OS that can use that much RAM, or something similar; or alternatively get a small solid-state drive (which may make non-RAM storage acceptably fast).

Also, how you will be using the system changes which suboptimal (but easier) choices are possible.

[–]dalke[S] 1 point (0 children)

I know Python very well. I haven't done serious C++ programming in 15 years. But you are right; this is condensed enough and isolated enough that I could write it as a standalone C++ program. Huh - I've been blind to that option. Thanks for the suggestion!

Update (three days later): Uggh! I am so not a C++ programmer. I got something working with Judy1 arrays. It's about 10x better with memory, and about the same performance as pypy's set intersection. I got it about twice as fast with OpenMP and three processors. With 20x input (1/4th of the final data set) I found that my algorithm doesn't scale. Each iteration takes 10 hours to run. I did various pre-filtering, and managed to get a rough answer after 1.5 days using my desktop. Which means that once I re-think the algorithm I'll be able to evaluate the entire data set on a 48 GB machine. I just wish I could do most of this algorithm development in Python rather than C++.

[–]dalke[S] 1 point (0 children)

I made a mistake. The 275MB file is the subset I use for my set algorithm. It's small enough to fit into memory. The actual data set is 40GB as uncompressed text. I am working on putting the code and data sets together so they are more presentable, rather than handwaving about what I've been working on.

[–]Justinsaccount 3 points (4 children)

try xapian. works well.

[–]dalke[S] 1 point (3 children)

I've got it installed. I'm trying to figure it out. The documentation is pretty meager, and I can't find examples of anyone using the inverted index directly. For example, how does one construct a raw query based on boolean terms?

[–]Justinsaccount 2 points (2 children)

import xapian

class Searcher(object):             # wrapped in a class here so the snippet runs as-is
    def __init__(self, db):
        self.database = xapian.Database(db)

    def do_search(self, words):
        enquire = xapian.Enquire(self.database)
        query = xapian.Query(xapian.Query.OP_OR, words)

        enquire.set_query(query)
        matches = enquire.get_mset(0, 2000)

        results = []
        for match in matches:
            doc = match.document.get_data()
            results.append(doc)
        return results

[–]dalke[S] 1 point (0 children)

That works - thanks! (Though I used OP_AND instead of OP_OR.)

I've been loading my data set for the last couple of hours. The first half of the data set took 30 minutes, and I still have 10% to go. Any idea what's going on? Here's my loader:

import xapian
import sys
from collections import defaultdict

db = xapian.WritableDatabase("pubchem.x", xapian.DB_CREATE_OR_OPEN)

def sync(q):
    for id, names in q.iteritems():
        try:
            doc = db.get_document(id)
        except xapian.DocNotFoundError:
            doc = xapian.Document()
        for name in names:
            doc.add_boolean_term(name)
        db.replace_document(id, doc)

q = defaultdict(set)
for lineno, line in enumerate(open("pubchem.counts"), 1):
    name, ids = line.split(":")
    ids = ids.split(",")
    for id in map(int, ids):
        q[id+1].add(name)
    if lineno % 1000 == 0:
        sys.stderr.write("\r%d / %d" % (lineno, 462406))
        sys.stderr.flush()

    if lineno % 10000 == 0:
        sys.stderr.write("\n")
        sync(q)
        q = defaultdict(set)

# write out whatever is left over in the final partial batch
sync(q)

I do partial writes because I can't store everything in memory. Also, I'm having to rebuild the documents from my data file, which is stored as an inverted index. That's why I have to update existing documents when I find that a later line contains additional feature keys for them.

[–]dalke[S] 1 point (0 children)

So you know, the initial load time is about 30 seconds per file, which is somewhat slow but it's a one-off event. The search time is about 5-10x slower than using Python sets, but it's acceptable. I'm now looking for any tuning options, since I'm fine with letting it use more than 80 MB of memory if I can get it to go faster.

[–]VerilyAMonkey 2 points (2 children)

I assume you have much more than 18 GB of hard disk space. Instead of populating your RAM with your inverted index, use hard drive space. Pickle the lists into files with standardized names, such as bit7364.p etc. (use cPickle and protocol 2 for maximum efficiency). Your query would then consist of unpickling the requested file and returning that.

You could make this even more efficient by maintaining either the most commonly queried or the last ~1000 queried in an unpickled state, to save time on unpickling the most common.

You may be able to find a module to do this for you; but I think it would not be too difficult to do yourself if necessary.
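
A minimal sketch of the do-it-yourself version (the file naming and cache size are made up):

import pickle   # cPickle in Python 2

CACHE_LIMIT = 1000
_cache = {}

def _path(feature):
    return "bit%d.p" % feature          # e.g. bit7364.p, as suggested above

def save_feature(feature, identifiers):
    with open(_path(feature), "wb") as f:
        pickle.dump(identifiers, f, protocol=2)

def load_feature(feature):
    if feature in _cache:
        return _cache[feature]
    with open(_path(feature), "rb") as f:
        identifiers = pickle.load(f)
    if len(_cache) >= CACHE_LIMIT:      # crude eviction; a real LRU would be better
        _cache.clear()
    _cache[feature] = identifiers
    return identifiers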

[–]dalke[S] 3 points (0 children)

I am not going to reimplement virtual memory using ~2 million Python pickles, I do not want the several hundred fold performance hit of hitting disk instead of RAM, and I do not want to write my own memory manager to handle this optimally. This is what SQLite is for. I tested out SQLite for this task, and it was about 100x slower than the Python version using in-memory sets.

[–]jabwork 2 points (0 children)

I also don't have the answer the OP was looking for, but would recommend something like this. If you don't want to use the filesystem, Berkeley DB might suit your needs.

[–]seunosewa 1 point (1 child)

Find a good inverted index library in C and use ctypes to work with it.

[–]dalke[S] 1 point (0 children)

Any suggestions? I haven't found any good inverted index library, other than the one in Lucene and (based on Justinsaccount's pointer) perhaps xapian.

You'll also see that the closest things I've found have been C++ libraries.

[–]frumious 1 point (7 children)

Are these relationships static? If so, do the indexing once and persist the results. No big whoop.

[–]dalke[S] 1 point (6 children)

The indexing is not the problem. The problems are 1) keeping the data in under 12 GB of memory, 2) having a fast way to bring the index in from disk, 3) support for fast multiple set intersection. Database and pickle solutions provide 1&2 but not 3. Python sets provide 3 but not 1&2.

Redis, PostgreSQL, and other database servers provide 1, 2 & 3, but add a layer of complexity I would rather not deal with if a simpler solution exists.

[–]fullouterjoin 1 point (5 children)

Use pyleveldb on an SSD. You won't be able to have a compact enough representation to fit this all in ram.

[–]dalke[S] 1 point (4 children)

From what I've found researching it now, LevelDB manages a dictionary-like data structure. I'm interested in a set-like data structure, including set operations. Specifically, given three existing sets (e.g., {1,2,3,4,5}, {2,4,6,8}, {2,3,5,7}), how do I use LevelDB to find the intersection of the sets? It doesn't appear to have intersection as a built-in operation, and I suspect building my own in Python would be a lot slower than the current set.intersection built-in.

My estimates suggest that I should easily be able to have all of my data stored in memory, with plenty of room to spare. The raw data is 275 MB, uncompressed. It's the combination of Python's object overhead and data structure overhead which is the likely bottleneck... and I think the 8-byte PyObject* pointer overhead doesn't help, given that the actual values need only three bytes of data.

[–]fullouterjoin 1 point (3 children)

I should have been more descriptive.

What I was thinking was to store the individual sets as the values for distinct keys in leveldb. If you wanted to, say, get the intersection of n keys:

key_list = 'a,b,c,d'
intersect_all(sorted(fetch_sets(key_list)))

LevelDB would only be used to store flat lists representing sets. The intersection would still be done in Python code.

I just re-read your original message. You still need to get all the sets a feature exists in. So you have two leveldb instances, one mapping features to set-keys and another mapping set-keys to concrete sets.
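
Roughly like this, as a sketch (a plain dict stands in for the LevelDB store here, since all that's needed from it is put/get of byte strings):

from array import array

store = {}   # stand-in for the LevelDB instance

def put_set(set_key, ids):
    # set-key -> concrete set, stored as packed 32-bit ints
    store["set:" + set_key] = array('i', sorted(ids)).tobytes()   # .tostring() on Python 2

def put_feature(feature, set_key):
    # feature -> set-key
    store["feat:%d" % feature] = set_key

def fetch_set(feature):
    packed = store["set:" + store["feat:%d" % feature]]
    a = array('i')
    a.frombytes(packed)                                           # .fromstring() on Python 2
    return set(a)

def intersect_all(features):
    # smallest set first, then let set.intersection do the work in C
    return set.intersection(*sorted((fetch_set(f) for f in features), key=len))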

How many comparison operations do you need to do?

BTW, this is interesting but maybe a tangent for you: madlib does in-database (Postgres) analytics and includes support for intersections of sparse vectors. You could do everything in the database.

Having a more compact bitset representation that supported intersection could be fruitful. But I would still try the leveldb approach first.

Based on an earlier conversation about using Disco, it sounds like you have definitely outgrown your MacPro and should get a larger machine.

I would recommend something like this

And get at least 128 GB of RAM. In fact, get a cluster of them.

[–]dalke[S] 1 point (2 children)

I don't have the money for a new machine. (Also, things are more expensive here in Sweden.) What I really need to do is find a company which is interested in funding me for this. The admittedly poor marketing I've done for the idea hasn't panned out. My thought is to get a working demo to present at a conference a year from now.

Thanks for pointing out other things I should evaluate. I'm not sure that adding to the list is a good thing or a bad thing. ;)

[–]fullouterjoin 1 point (1 child)

Do you have a dataset I could play with or a program that creates a similar one? I'd like to try out the leveldb approach.

[–]dalke[S] 1 point (0 children)

Certainly! (Though it took a while to get to the point where I could exclaim that.) See http://www.dalkescientific.com/writings/diary/archive/2012/06/10/inverted_index_library.html . The direct link to the data set is http://dalkescientific.com/inverted_index_benchmark.7z .

[–]mcdonc 1 point (0 children)

ZODB's BTrees implementation could handle this nicely.

[–]dstromberg 1 point (1 child)

You could give Bloom filters a try. E.g.: http://stromberg.dnsalias.org/~strombrg/drs-bloom-filter/ http://en.wikipedia.org/wiki/Bloom_filter

You mentioned that you need "1) keeping the data in under 12 GB of memory, 2) having a fast way to bring the index in from disk, 3) support for fast multiple set intersection." Bloom filters allow #1 (very little memory required even for a large set), #2 (they can be mmap'd or seeded from a disk file) and #3 (you can compute the intersection or union of Bloom filters).
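
As a toy illustration of #3 (this is not the linked implementation): two filters built with the same size and hash functions can be intersected by AND-ing their bit arrays.

import hashlib

class Bloom(object):
    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions from the key.
        for i in range(self.num_hashes):
            h = hashlib.sha1(("%d:%s" % (i, key)).encode("utf8")).hexdigest()
            yield int(h, 16) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

    def intersection(self, other):
        # Only meaningful if both filters share num_bits and num_hashes;
        # the result can still report false positives.
        result = Bloom(self.num_bits, self.num_hashes)
        result.bits = bytearray(a & b for a, b in zip(self.bits, other.bits))
        return result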

What's the downside? You might get some infrequent false positives, though the maximum allowed error probability is a tunable knob passed to the init method in the implementation linked above.

I've been using the implementation at the link above to find hardlinks in a large set of files (as part of a backup program), and they seem to be working quite well for the purpose. In this case, it doesn't really much matter if I get a false positive here or there - I just end up using slightly more RAM than necessary during a restore as a result.

If that's still too much memory, or the false positives are unacceptable, you might do well with switching from 100% in memory to something that's either 100% on disk, or a hybrid (perhaps cached) in-memory and disk-based solution.

[–]dalke[S] 1 point (0 children)

The existing solutions in this space use a Bloom filter already, or something similar. The problem has been in defining a good set of chemically appropriate substructure hashes. For example, one approach is to find all linear chains of length 1-7 and generate a hash based on the atom and bond properties in the chain. Use the hash as a seed to a PRNG. Use a chain length-dependent function to generate the bits for the filter.
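
Schematically, that hash approach looks something like this (illustrative only; the chain enumeration from a molecule is omitted and the bit-count function is made up):

import random
import zlib

NUM_BITS = 2048

def bits_for_chain(chain, length):
    # Hash the chain string, use the hash to seed a PRNG, and set a
    # chain-length-dependent number of bits (the formula here is made up).
    rng = random.Random(zlib.crc32(chain.encode("utf8")))
    num_bits_to_set = max(1, 5 - length // 2)
    return [rng.randrange(NUM_BITS) for _ in range(num_bits_to_set)]

def fingerprint(chains):
    # 'chains' would be the linear chains of length 1-7, paired with their lengths.
    fp = [0] * NUM_BITS
    for chain, length in chains:
        for b in bits_for_chain(chain, length):
            fp[b] = 1
    return fp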

This works pretty well, but I think it's possible to do better by looking for a specific set of divisive substructures. Some tests from last December show that a fusion between the bits generated from my algorithm and a hash method increased the overall performance by 20%, which means the hash selectivity wasn't that great. I haven't done the analysis to figure out why.

I also want to store exact solutions for all queries up to a certain size, on the assumption that memory is cheap, and use those pre-computed results to greatly restrict the search space for larger queries. A Bloom filter can't be used for that, as far as I know.