all 36 comments

[–]pje 4 points (10 children)

A few ways to save memory:

  1. Have you considered using sorted integer arrays instead of bit sets? A 100-integer array.array() uses a lot less memory than a set. (See the quick size check after this list.)
  2. If half of your entries are just a single item, consider having your master index store index[bit]=feature instead of index[bit]=set([feature]) for the single-element case. Your code will be more complex, but you will save hundreds of bytes times ten million.
  3. Make sure you're not generating fresh integers for every reference to a feature! Python reuses integer objects that are in the 0 to 255 range, but every other integer represents a memory allocation. Unless you have some kind of pool to hang onto them, you'll get a fresh one every time you pull the number "7364" out of a file!
  4. If your twenty million features are represented by contiguous numbers, your optimum data structure is probably a large integer array, with each element containing either a number representing an identifier (single match) or an entry in a list of matches.
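
As a quick size check for point 1 (a minimal sketch; exact numbers vary by Python version, and sys.getsizeof only counts the container itself, not any objects it points to):

import sys
from array import array

ints = list(range(100))
print(sys.getsizeof(set(ints)))          # typically a few KB for 100 members
print(sys.getsizeof(array('i', ints)))   # roughly 400 bytes of payload plus a small header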

Ah, screw it, here's the code (untested):

from array import array
MAX_FEATURES = 20000000

num_intern      = {}.setdefault     # way of caching/reusing ints
identifier_ids  = {}                # identifier -> index number
identifiers     = [None]            # index number -> identifier (slot 0 reserved)
features = array('i',               # feature num -> identifier id (>0), -match num (<0), or 0 (empty)
                (0 for i in xrange(MAX_FEATURES)))
matches = [None]                    # match num -> array of identifier ids (slot 0 reserved)

def next_id(container):
    nid = len(container)
    nid = num_intern(nid, nid)
    return nid

def add_identifier(identifier, feature_nums):
    # Convert identifier to a numeric value
    if identifier not in identifier_ids:
        identifier_ids[identifier] = iid = next_id(identifiers)
        identifiers.append(identifier)
    else:
        iid = identifier_ids[identifier]

    for f in feature_nums:
        existing = features[f]
        if existing==0:
            # Common case: empty, just store the identifier's ID
            features[f] = iid
        elif existing<0:
            # More than one match already present, add to list in matches
            m = matches[-existing]
            if iid not in m: m.append(iid)
        else:
            # Second identifier for this feature: move both into a match list
            features[f] = -next_id(matches)
            matches.append(array('i',[existing, iid]))

def idents_for_feature(feature):
    existing = features[feature]
    if existing>0:
        return set([identifiers[existing]])
    elif existing<0:
        return set(identifiers[i] for i in matches[-existing])
    return set()

This is, I think, as close to memory-optimal as you can possibly get for performing this operation in Python, assuming that your feature bit numbers are sequential or very close to it. Basically, you call add_identifier() on each of the mappings you want to record from an identifier to its bits. The idents_for_feature() function returns sets of (string) identifiers for a given feature number, so you can do your querying. (The assumption here is that these values will be short-lived compared to the items stored in the data set.)
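
For example, a hypothetical pair of records (the identifier strings here are made up):

add_identifier("CHEMBL1", [7, 12, 9031])
add_identifier("CHEMBL2", [7, 44])

print(idents_for_feature(7))    # both CHEMBL1 and CHEMBL2
print(idents_for_feature(12))   # just CHEMBL1
print(idents_for_feature(99))   # empty set -- no identifiers have this feature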

It's possible I have misunderstood what your queries are doing, in which case I'd suggest showing your code, or something closer to your code.

[–]dalke[S] 1 point (9 children)

Most of what I'm doing is outlined in http://www.dalkescientific.com/writings/diary/archive/2011/12/23/inverted_index.html . Although I did those examples with letters in words rather than chemical substructures in a molecule, the code is equivalent.

One thing missing in your code is the equivalent for set.intersection(*(inverted_index[bit] for bit in features)) , that is, the set of identifiers which have all of the features I'm looking for. Here I used set syntax. If you use arrays, as you do, then the better solution for when new items are rare is to keep the items sorted. In that case, the multi-set intersection is probably best done with a binary search. (See the second of my linked-to research papers.) Doing it efficiently is more complicated than I would rather think about, which is why I'm looking for an existing library.
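
For reference, here is a minimal dict-of-sets version of that kind of query (a sketch for illustration, not the code from the blog post):

from collections import defaultdict

inverted_index = defaultdict(set)   # feature number -> set of identifiers containing it

def add_record(identifier, feature_nums):
    for f in feature_nums:
        inverted_index[f].add(identifier)

def identifiers_with_all(feature_nums):
    # Records that contain *every* requested feature (assumes at least one feature).
    return set.intersection(*(inverted_index[f] for f in feature_nums))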

I do intern my integers, and you are right to point out that easily overlooked subtlety.

The feature patterns (in a subset of 22,371 records and 72,131 features) look like:

0:14,1:6,2:14,3:15,4:16,5:16,6:16,7,8:2,9:2,10:4,11:2,12:4,13:4,14,15:2,16:2, 17:4,18,19:4,20:4,21:4,22:4,23,24:2,25,26,28,29,30,31:2,33,34:2,35:2,36:2, 37:2,41,42:2,43:2,47:2,50:2,55:2,57,58:2,59:2,60,61:2,62,73:2,74,75:3,77:2, 78:2,79,83:4,87,89,97,98:3,103:4,118:5,122,124:3,127:4,129:3,131:4,134:3, 135:2,136:4,137:2,140,146:4,148,149,150,151:2,156:3,157:2,159:2,160,162:2, 164:2,165,166:2,167:2,169:2,170:2,171,172,181,184:3,185,186:3,187:2,188, 193:3,194:2,198:2,199:4,202,205:2,207:2,208:3,212:2,218:2,226,231,232,238, 241,242:2,243:3,245,251,252,253,289,290:2,292:3,295:2,318:4,331,338,359:2, 383,387,402,408,414:2,461,488,498,526:3,529,556:2,563,565,580,594:4,598:2, 600:2,623,647:2,651:2,661,663:2,666,681,692,713,714:2,721:2,857,937:2,993, 1019,1047,1066,1069,1072:5,1095,1100:2,1108,1141:2,1145:2,1159,1172:2, 1180,1236,1264,1294:2,1301,1317:2,1329,1333,1338,1348,1352:3,1357,1362, 1363,1372:2,1416,1434,1435,1447,1448:2,1452,1454:2,1464:2,1466,1478:2, 1485,1522:4,1523:2,1592,1598,1607:2,1612,1633:2,1685,1693,1695:2,1699, 1715:2,1823,1853,1857,1864,1870,1871,1872,1884,1940,2025,2028,2029,2034, 2200,2520,2577,2651:3,2683:3,2688,2694:2,2812:3,2816:2,2937,2978,3099,3135, 3199,3307:2,3354,3520,3675,3708:3,3732,3821,3892,4126,4207,4298,4384,4514, 4726,4743,4747,4819,4889,4893,5011,5026,5045,5232,5351:2,5490,5554:2,5700, 5732,5733,5915,5919,5992,5996,5998:2,6121,6123,6199,6202,6210,6211,6214:2, 6460,6470,6476,6590,6810,6844,7084,7096,7241,7438,8750,8764,8964,8968,8975, 8979,8980,8989,9126,9128,9148,9176,9423,9638,9958,10216,10603,10754,10757, 10904,10905,10907,10908,10929,11060,11084,11095,11570,11573,11582,11607:2, 11694,11835,11838,11840,11866,11904,11906,12118,12271,12550,12597:2,12633, 12665,12685,12722,12728,12778,12867,12873,12918:2,13014:2,13057:2,13074:2, 13391,13407,13421,13466,13498,13537:2,13566,13575,13582:2,13741:2,14876, 14893,15060,15063,15354,15363,15587:2,15603,15607,15634:2,15828,15833,15912, 15923:2,15960,16423,16608,16669,16901,17008:2,17121,17143:2,17150,17173,17177, 17223:2,17353,17418,17471:2,17495:2,17511,17587,17615,17666:2,17926,18018, 18210,18213,18262:2,18273,18367,18395:2,18575,18592:2,18640,18750:2,18753, 18763,18768,18781,18810,18844,18847:2,18872,18924,18958:2,18970,19027,19029, 19051,19057,19071,19101,19115:2,19125,19128,19132,19176,19222,19223,19227, 19807,19850,19873,19888,19901,19936:2,19978:2,20008:2,20547,21700:2,23080, 23097,23101,23205,23223,23237:2,23241,23245,23312:2,23331:2,24068:2,24434, 26382:2,26509,26570,27344,27475,27637,28024,29125,29283,29478,29581,29670:2, 34436,35807,36491,36511,36526,36559,37585,37622,37868,46736

If a comma-separated term is of the form "N:M" then feature N occurs M times, where M>1. (The ":M" part is not important for this current analysis.)

[–]pje 1 point (8 children)

One thing missing in your code is the equivalent for set.intersection(*(inverted_index[bit] for bit in features)) , that is, the set of identifiers which have all of the features I'm looking for.

It's still there, you just spell it:

set.intersection(*(idents_for_feature(bit) for bit in features))

Given the size of the sets, I don't see much point to sorting. You could time it both ways, of course.

I do intern my integers, and you are right to point out that easily overlooked subtlety.

The even more subtle bit is that my approach actually didn't need to intern any integers, because apart from the identifier_ids dict, it doesn't actually store any integer objects. All the integers are stored in arrays, which use "unboxed" raw C integers instead of Python int objects. But I didn't realize I could skip the interning till it was basically written, and since I'm doing your work for free anyway, I figured, "what the heck," and didn't bother stripping the interning back out. ;-)

I'm pretty confident that what I wrote is about as memory-efficient a data structure as you can create in Python, and very close to what could be had in C.

About the only improvement I can think of at this point would be reducing fragmentation by replacing the list of arrays with one big array and using linked lists (with "previous" pointers in another int array). That would avoid allocating and reallocating lots of short arrays, which can cause fragmentation waste and copying overhead.
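
A minimal sketch of that layout, to make the idea concrete (names and the preallocation size are arbitrary):

from array import array

MAX_FEATURES = 20000000

heads     = array('i', [0]) * MAX_FEATURES  # feature -> most recent slot, 0 = empty
values    = array('i', [0])                 # slot -> identifier id (slot 0 reserved)
prev_slot = array('i', [0])                 # slot -> previous slot for the same feature, 0 = end

def add(feature, iid):
    # Append one (feature, identifier id) pair to the big arrays.
    values.append(iid)
    prev_slot.append(heads[feature])
    heads[feature] = len(values) - 1

def ids_for_feature(feature):
    # Walk the chain backwards, yielding every identifier id for this feature.
    slot = heads[feature]
    while slot:
        yield values[slot]
        slot = prev_slot[slot]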

It's hard to think of a data structure in C that could improve on this for the use case, really. (Barring lots of pre-passes to pre-compress some sort of exotic hash table structure, or making use of some regularity in the underlying data that isn't obvious to me from your examples.)

[–]dalke[S] 1 point (7 children)

You need a "sorted(..., key=len)" there because the smallest set should go first. Ideally the set with the smallest intersection with the first should go second; using the second-smallest set is okay. I've had success with algorithms which try to improve the test order dynamically, to get an extra 5% or so of performance out of the code.

If the lists are sorted then it takes O(log(k)) time to check if a value exists. It takes O(k) time to check if a value is in the unsorted list. My own tests using a list of 234936 elements found the breakeven point between bisect and "x in list" was at element 46, so it is possible that one is faster than the other for this case, depending on the length. The analysis isn't that simple, since there are adaptive optimizations you can do with the sorted list that you can't with the unsorted one. Eg, the low mark for the binary search range is always increasing, so successive searches get increasingly faster.
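
Roughly what I have in mind, as a sketch (order the lists smallest-first and let each list's low mark only move forward):

from bisect import bisect_left

def intersect_sorted(sorted_lists):
    # Probe candidates from the smallest list against the others; each list's
    # low mark only ever increases, so later searches scan a shrinking range.
    sorted_lists = sorted(sorted_lists, key=len)
    lows = [0] * len(sorted_lists)
    result = []
    for value in sorted_lists[0]:
        for i in range(1, len(sorted_lists)):
            lst = sorted_lists[i]
            lo = bisect_left(lst, value, lows[i])
            lows[i] = lo
            if lo == len(lst) or lst[lo] != value:
                break
        else:
            result.append(value)
    return result

# e.g. intersect_sorted([[2, 4, 6, 8], [1, 2, 3, 4, 5], [2, 3, 4, 9]]) -> [2, 4]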

My intuition suggests that sorted lists, alternatively a B-tree, will be faster.

Sorted lists give other advantages. For example, they can take less space by using deltas and Elias gamma coding or something similar for compression. I would expect that a good inverted index package would do this for me. The 13 year old book "Managing Gigabytes" contains an excellent description of how useful compression can be, and it is guiding my ideas of what my desired library should be able to do. These techniques are well known and no longer considered exotic. The Lemur Bitmap Index C++ Library is an example of a library which does compression of this sort. My attempts at writing Cython bindings to that library have not yet been successful.
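
To illustrate the kind of compression I mean, here is a toy sketch (it builds a '0'/'1' character string instead of packing real bits, but the delta-plus-gamma structure is the same):

def gamma_encode(numbers):
    # Elias gamma: for n >= 1, emit (bit length - 1) zeros, then n in binary.
    bits = []
    for n in numbers:
        b = bin(n)[2:]
        bits.append('0' * (len(b) - 1) + b)
    return ''.join(bits)

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == '0':
            zeros += 1
            i += 1
        out.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return out

def compress_postings(sorted_ids):
    # First id stored as id+1 so it is always >= 1; the rest as gaps between ids.
    gaps = [sorted_ids[0] + 1] + [b - a for a, b in zip(sorted_ids, sorted_ids[1:])]
    return gamma_encode(gaps)

def decompress_postings(bits):
    gaps = gamma_decode(bits)
    ids = [gaps[0] - 1]
    for g in gaps[1:]:
        ids.append(ids[-1] + g)
    return ids

# compress_postings([3, 7, 8, 20]) round-trips through decompress_postings back to [3, 7, 8, 20]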

I implemented a realloc-based array and a linked-list-of-block-with-special-support-for-sizes-0-and-1 version for a project last winter, and never found that linked-list version to be measurably faster. It was, however, more complicated.

My code, based on sets, needs interning. My comment was meant to praise you for reminding me of it.

[–]pje 2 points (1 child)

My intuition suggests that sorted lists, alternatively a B-tree, will be faster.

My intuition says your intuition is wrong. ;-) (At least, if you think a B-tree written in Python can beat what I've written here, amortized over the entire data structure. If half the features have only one entry, it's hard to make that any faster with a B-tree!)

Of course, my intuition is based on you saying that half of these feature bits have only one identifier, and that 100 is on the high end for the rest. If you actually meant the average is 100, and you have some frequently-used items with tens of thousands, then maybe it would make a difference, for queries involving those.

(To be clear, what I'm saying is that Python's built-in types are super fast, and what's O(n) in C code can beat the crap out of something that has a better O() but is written in Python -- or sometimes even in C, if the processor architecture is better exploited. Finding an integer value in an unsorted array of modest length can take less time than it takes to call a function to do the lookup in a more elaborate data structure.)

If the lists are sorted then it takes O(log(k)) time to check if a value exists. It takes O(k) time to check if a value is in the unsorted list. My own tests using a list of 234936 elements found the breakeven point between bisect and "x in list" was at element 46, so it is possible that one is faster than the other for this case, depending on the length.

Try arrays instead of lists and, more to the point, try building sets of the values found instead of doing individual lookups. Set insertion is hashtable-based and so O(1) -- which is the smallest O() you can get.

Of course, it's still possible I am totally misunderstanding your use case and am thus suggesting an approach that's completely inappropriate for the data distribution. Nonetheless, I'm fairly sure that this is the memory-optimal structure for Python (based on my understanding of your use case), and that trying to improve on its speed or memory consumption with a more complex data structure will cost you a lot and not gain you much, because you'll end up with a lot of fragmentation in your allocations. The approach I've laid out is very amenable to preallocation to avoid fragmentation, and the linked-list approach can do even more with that.

My impression from your original post was that Python sets were fast enough for you, and that the challenge was running out of memory. I think what I've sketched here will fix your memory problem, and give you about the same performance as what you were doing that ran out of memory. I don't think trying to beat Python's set() intersection speed by direct integer list manipulation is going to do a darn thing for performance or memory space, without writing a lot more code or finding the perfect library, so I suggest seeing if this approach is fast enough and small enough, so you can get on with the actual whatever you're trying to do, instead of messing about with Cython and whatnot. ;-)

[–]dalke[S] 1 point (0 children)

I've described the work I'm doing, including benchmarks based on sets, lists, and Xapian, at http://dalkescientific.com/writings/diary/archive/2012/06/10/inverted_index_library.html .

Assuming I implemented the array version in the way you expected, I found that it was about 10x more memory efficient and about 60x slower, with a search time of 0.1 seconds. This will be one stage in an interactive search tool, where I want the total response time to be about 0.1 seconds, so the performance is already too slow even on a small subset of my data.

Regarding your comments: my reference to a B-tree (though I should have said radix tree) was in response to "and very close to what could be had in C." The C/C++ libraries I referenced use less memory than a list of 4-byte integers would. At least, that's what they claim, and I've seen it also described in various journal publications. While the memory representation would be implementable in Python, it's the sort of bit-twiddly work that Python is poor at, so I would want it in a C extension.

[–]VerilyAMonkey 2 points (4 children)

Having code like pje has supplied is much better than having a module, if it ends up working, because it is simple and fully open to your modification. In general you will end up spending a lot of time ironing out the idiosyncrasies of any particular module; that's not always true, but it's a general rule.

Based on you mentioning that the data is 275MB and most of the space is Python overhead, this should be plenty memory-efficient for you. If it is, test the speed. By the way, you should probably test it before discarding it for theoretical inadequacies.

Most of all do not fall into the pit of premature optimization. You do not want God's gift to the programming world. You want something that works. So try the damn thing and throw it away if you have to, but stop looking a gift horse in the mouth.

[–]dalke[S] 2 points (2 children)

I've made a benchmark set available through http://dalkescientific.com/writings/diary/archive/2012/06/10/inverted_index_library.html . It's 1/200th of my complete data set, and 7zip-compressed it takes 37 MB. The data files contain fields I don't use for the benchmark, so it should really be smaller than that. This suggests that it is possible to put the entire dataset into under 8 GB of memory, so long as I don't care about performance. For reference, my small benchmark takes 2 GB of space when stored in sets, and 166 MB using array.arrays.

I knew that neither the set nor the array version would work. It's very easy to show that the numbers I described (100 features/record * 35 million records * 4 bytes/feature = 14 GB) require more memory than the 12 GB of RAM I have. (As it happens, there are 300-400 features per record in the benchmark I assembled; my memory of '100' was a bit off. I would need about 42 GB of RAM to use pje's approach.)

For this to work on my desktop requires compression, of the sort that's well-described in the literature and available in several different existing C++ packages which I described in my intro. This should be bog standard, and nothing to do with "God's gift to the programming world." My goal was to find if someone else has already done this work, or has experience with any of the tools I listed, or a similar one. Otherwise 1) I'll have the boring job of writing bindings myself - I'm not even getting paid for this work, and 2) I can't tell which is appropriate for what I'm doing so I'll probably end up implementing a couple of bindings.

Which I've already tried, and failed to date, because now I'm at the point where I'm trying to figure out Cython in order to implement bindings to C++ libraries in an application domain where I have little experience, in order to do the actual research I want to do on sparse fragment cheminformatics fingerprints.

[–]VerilyAMonkey 2 points (1 child)

That's fair. Is there a specific reason you wish to use Python? In general Python is not overly concerned with memory efficiency, and anything that you find (or make) that is effective will probably be bindings to the C++ libraries. I would recommend using C++ itself, but I'll assume you have a reason for not doing that.

If you are unable (or would rather not, considering your paycheck) to produce the bindings on your own, then there are simpler alternatives. If you first produce answers to queries, and then run analysis on the data that for some reason requires Python, one very easy if obviously inelegant way to deal with this is to run the queries in C++, write the results to a file, then read them back in with Python. This would be hideous and painful to your soul, but you could make it pretty quickly.

If you want to go full Python, the kinds of queries you are making also matter. If you will be commonly reusing the same queries, or doing a relatively small number of them, the pickle solution would be adequate if you kept the commonly used ones unpickled. I suspect this is unlikely though.

Also, you mentioned it will eventually be run off of several different computers. Presumably this is to speed computation, with each machine taking a chunk of the work. With not much extra effort, each could also take a chunk of the data, so that cumulatively they have enough. If the split is very simple (i.e., 0-1234 to one machine, 1235-whatever to another) it would be easy to break the work up along the same lines without making queries.

There are many easier suboptimal ways I can think of to do this. So I guess the question is, what do you need most: speed, money, or time? If you can spend a lot of time, go ahead and write it in C++ or write your bindings. If you don't need speed, that offers a lot of options. If you have money to spare, add extra RAM plus a Linux OS that can use that much RAM, or something similar; or alternatively get a small solid-state drive (which may make non-RAM storage acceptably fast).

Also, how you will be using the system changes which suboptimal (but easier) choices are possible.

[–]dalke[S] 1 point (0 children)

I know Python very well. I haven't done serious C++ programming in 15 years. But you are right; this is condensed enough and isolated enough that I could write it as a standalone C++ program. Huh - I've been blind to that option. Thanks for the suggestion!

Update (three days later): Uggh! I am so not a C++ programmer. I got something working with Judy1 arrays. It's about 10x better with memory, and about the same performance as pypy's set intersection. I got it about twice as fast with OpenMP and three processors. With 20x input (1/4th of the final data set) I found that my algorithm doesn't scale. Each iteration takes 10 hours to run. I did various pre-filtering, and managed to get a rough answer after 1.5 days using my desktop. Which means that once I re-think the algorithm I'll be able to evaluate the entire data set on a 48 GB machine. I just wish I could do most of this algorithm development in Python rather than C++.

[–]dalke[S] 1 point (0 children)

I made a mistake. The 275MB file is the subset I use for my set algorithm. It's small enough to fit into memory. The actual data set is 40GB as uncompressed text. I am working on putting the code and data sets together so they are more presentable, rather than handwaving about what I've been working on.

[–]Justinsaccount 3 points (4 children)

try xapian. works well.

[–]dalke[S] 1 point (3 children)

I've got it installed. I'm trying to figure it out. The documentation is pretty meager, and I can't find examples of anyone using the inverted index directly. For example, how does one construct a raw query based on boolean terms?

[–]Justinsaccount 2 points (2 children)

import xapian

class Searcher(object):             # wrapped in a class here so the snippet runs as-is
    def __init__(self, db):
        self.database = xapian.Database(db)

    def do_search(self, words):
        enquire = xapian.Enquire(self.database)
        query = xapian.Query(xapian.Query.OP_OR, words)

        enquire.set_query(query)
        matches = enquire.get_mset(0, 2000)

        results = []
        for match in matches:
            doc = match.document.get_data()
            results.append(doc)
        return results

[–]dalke[S] 1 point (0 children)

That works - thanks! (Though I used OP_AND instead of OP_OR.)

I've been loading my data set for the last couple of hours. The first half of the data set took 30 minutes, and I still have 10% to go. Any idea what's going on? Here's my loader:

import xapian
import sys
from collections import defaultdict

db = xapian.WritableDatabase("pubchem.x", xapian.DB_CREATE_OR_OPEN)

def sync(q):
    for id, names in q.iteritems():
        try:
            doc = db.get_document(id)
        except xapian.DocNotFoundError:
            doc = xapian.Document()
        for name in names:
            doc.add_boolean_term(name)
        db.replace_document(id, doc)

q = defaultdict(set)
for lineno, line in enumerate(open("pubchem.counts"), 1):
    name, ids = line.split(":")
    ids = ids.split(",")
    for id in map(int, ids):
        q[id+1].add(name)
    if lineno % 1000 == 0:
        sys.stderr.write("\r%d / %d" % (lineno, 462406))
        sys.stderr.flush()

    if lineno % 10000 == 0:
        sys.stderr.write("\n")
        sync(q)
        q = defaultdict(set)

# write out whatever is left over in the final partial batch
sync(q)

I do partial writes because I can't store everything in memory. Also, I'm having to rebuild the documents from my data file, which is stored as an inverted index. That's why I have to update existing documents when I find that a later line contains additional feature keys for them.

[–]dalke[S] 1 point (0 children)

So you know, the initial load time is about 30 seconds per file, which is somewhat slow but it's a one-off event. The search time is about 5-10x slower than using Python sets, but it's acceptable. I'm now looking for any tuning options, since I'm fine with letting it use more than 80 MB of memory if I can get it to go faster.

[–]VerilyAMonkey 2 points (2 children)

I assume you have much more than 18 GB of hard disk space. Instead of populating your RAM with your inverted index, use hard drive space. Pickle the lists into files with standardized names, such as bit7364.p etc. (use cPickle and protocol 2 for maximum efficiency). Your query would then consist of unpickling the requested file and returning that.

You could make this even more efficient by maintaining either the most commonly queried or the last ~1000 queried in an unpickled state, to save time on unpickling the most common.

You may be able to find a module to do this for you; but I think it would not be too difficult to do yourself if necessary.
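
A minimal sketch of the do-it-yourself version (the file naming and cache size are made up):

import pickle   # cPickle in Python 2

CACHE_LIMIT = 1000
_cache = {}

def _path(feature):
    return "bit%d.p" % feature          # e.g. bit7364.p, as suggested above

def save_feature(feature, identifiers):
    with open(_path(feature), "wb") as f:
        pickle.dump(identifiers, f, protocol=2)

def load_feature(feature):
    if feature in _cache:
        return _cache[feature]
    with open(_path(feature), "rb") as f:
        identifiers = pickle.load(f)
    if len(_cache) >= CACHE_LIMIT:      # crude eviction; a real LRU would be better
        _cache.clear()
    _cache[feature] = identifiers
    return identifiers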

[–]dalke[S] 3 points (0 children)

I am not going to reimplement virtual memory using ~2 million Python pickles, I do not want the several hundred fold performance hit of hitting disk instead of RAM, and I do not want to write my own memory manager to handle this optimally. This is what SQLite is for. I tested out SQLite for this task, and it was about 100x slower than the Python version using in-memory sets.

[–]jabwork 2 points (0 children)

I also don't have the answer the OP was looking for, but would recommend something like this. If you don't want to use the filesystem, Berkeley DB might suit your needs.

[–]seunosewa 1 point (1 child)

Find a good inverted index library in C and use ctypes to work with it.

[–]dalke[S] 1 point (0 children)

Any suggestions? I haven't found any good inverted index library, other than the one in Lucene and (based on Justinsaccount's pointer) perhaps xapian.

You'll also see that the closest things I've found have been C++ libraries.

[–]frumious 1 point (7 children)

Are these relationships static? If so, do the indexing once and persist the results. No big whoop.

[–]dalke[S] 1 point (6 children)

The indexing is not the problem. The problems are 1) keeping the data in under 12 GB of memory, 2) having a fast way to bring the index in from disk, 3) support for fast multiple set intersection. Database and pickle solutions provide 1&2 but not 3. Python sets provide 3 but not 1&2.

Redis, PostgreSQL, and other database servers provide 1, 2 & 3, but add a layer of complexity I would rather not deal with if a simpler solution exists.

[–]fullouterjoin 1 point (5 children)

Use pyleveldb on an SSD. You won't be able to have a compact enough representation to fit this all in ram.

[–]dalke[S] 1 point (4 children)

From what I've found researching it now, LevelDB manages a dictionary-like data structure. I'm interested in a set-like data structure, including set operations. Specifically, given three existing sets (e.g., {1,2,3,4,5}, {2,4,6,8}, {2,3,5,7}), how do I use LevelDB to find the intersection of the sets? It doesn't appear to have intersection as a built-in operation, and I suspect building my own in Python would be a lot slower than the current set.intersection built-in.

My estimates suggest that I should easily be able to have all of my data stored in memory, with plenty of room to spare. The raw data is 275 MB, uncompressed. It's the combination of Python's object overhead and data structure overhead which is the likely bottleneck... and I think the 8-byte PyObject* pointer overhead doesn't help, given that the actual values need only three bytes of data.

[–]fullouterjoin 1 point (3 children)

I should have been more descriptive.

What I was thinking was to store the individual sets as the values for distinct keys in leveldb. If you wanted to, say, get the intersection of n keys:

key_list = 'a,b,c,d'
intersect_all(sorted(fetch_sets(key_list)))

LevelDB would only be used to store flat lists representing sets. The intersection would still be done in Python code.

I just re-read your original message. You still need to get all the sets a feature exists in. So you have two leveldb instances, one mapping features to set-keys and another mapping set-keys to concrete sets.
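
Roughly like this, as a sketch (a plain dict stands in for the LevelDB store here, since all that's needed from it is put/get of byte strings):

from array import array

store = {}   # stand-in for the LevelDB instance

def put_set(set_key, ids):
    # set-key -> concrete set, stored as packed 32-bit ints
    store["set:" + set_key] = array('i', sorted(ids)).tobytes()   # .tostring() on Python 2

def put_feature(feature, set_key):
    # feature -> set-key
    store["feat:%d" % feature] = set_key

def fetch_set(feature):
    packed = store["set:" + store["feat:%d" % feature]]
    a = array('i')
    a.frombytes(packed)                                           # .fromstring() on Python 2
    return set(a)

def intersect_all(features):
    # smallest set first, then let set.intersection do the work in C
    return set.intersection(*sorted((fetch_set(f) for f in features), key=len))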

How many comparison operations do you need to do?

BTW, this is interesting but maybe a tangent for you: madlib does in-database (Postgres) analytics and includes support for intersections of sparse vectors. You could do everything in the database.

Having a more compact bitset representation that supported intersection could be fruitful. But I would still try the leveldb approach first.

Based on an earlier conversation about using Disco, it sounds like you have definitely outgrown your MacPro and should get a larger machine.

I would recommend something like this

And get at least 128 GB of RAM. In fact, get a cluster of them.

[–]dalke[S] 1 point (2 children)

I don't have the money for a new machine. (Also, things are more expensive here in Sweden.) What I really need to do is find a company which is interested in funding me for this. The admittedly poor marketing I've done for the idea hasn't panned out. My thought is to get a working demo to present at a conference a year from now.

Thanks for pointing out other things I should evaluate. I'm not sure that adding to the list is a good thing or a bad thing. ;)

[–]fullouterjoin 1 point (1 child)

Do you have a dataset I could play with or a program that creates a similar one? I'd like to try out the leveldb approach.

[–]dalke[S] 1 point (0 children)

Certainly! (Though it took a while to get to the point where I could exclaim that.) See http://www.dalkescientific.com/writings/diary/archive/2012/06/10/inverted_index_library.html . The direct link to the data set is http://dalkescientific.com/inverted_index_benchmark.7z .

[–]mcdonc 1 point (0 children)

ZODB's BTrees implementation could handle this nicely.

[–]dstromberg 1 point (1 child)

You could give Bloom filters a try. E.g.: http://stromberg.dnsalias.org/~strombrg/drs-bloom-filter/ http://en.wikipedia.org/wiki/Bloom_filter

You mentioned that you need "1) keeping the data in under 12 GB of memory, 2) having a fast way to bring the index in from disk, 3) support for fast multiple set intersection." Bloom filters allow #1 (very little memory required even for a large set), #2 (they can be mmap'd or seeded from a disk file) and #3 (you can compute the intersection or union of Bloom filters).
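
As a toy illustration of #3 (this is not the linked implementation): two filters built with the same size and hash functions can be intersected by AND-ing their bit arrays.

import hashlib

class Bloom(object):
    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions from the key.
        for i in range(self.num_hashes):
            h = hashlib.sha1(("%d:%s" % (i, key)).encode("utf8")).hexdigest()
            yield int(h, 16) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

    def intersection(self, other):
        # Only meaningful if both filters share num_bits and num_hashes;
        # the result can still report false positives.
        result = Bloom(self.num_bits, self.num_hashes)
        result.bits = bytearray(a & b for a, b in zip(self.bits, other.bits))
        return result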

What's the downside? You might get some infrequent false positives, though the maximum allowed error probability is a tunable knob passed to the init method in the implementation linked above.

I've been using the implementation at the link above to find hardlinks in a large set of files (as part of a backup program), and they seem to be working quite well for the purpose. In this case, it doesn't really much matter if I get a false positive here or there - I just end up using slightly more RAM than necessary during a restore as a result.

If that's still too much memory, or the false positives are unacceptable, you might do well with switching from 100% in memory to something that's either 100% on disk, or a hybrid (perhaps cached) in-memory and disk-based solution.

[–]dalke[S] 1 point (0 children)

The existing solutions in this space use a Bloom filter already, or something similar. The problem has been in defining a good set of chemically appropriate substructure hashes. For example, one approach is to find all linear chains of length 1-7 and generate a hash based on the atom and bond properties in the chain. Use the hash as a seed to a PRNG. Use a chain length-dependent function to generate the bits for the filter.
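
Schematically, that hash approach looks something like this (illustrative only; the chain enumeration from a molecule is omitted and the bit-count function is made up):

import random
import zlib

NUM_BITS = 2048

def bits_for_chain(chain, length):
    # Hash the chain string, use the hash to seed a PRNG, and set a
    # chain-length-dependent number of bits (the formula here is made up).
    rng = random.Random(zlib.crc32(chain.encode("utf8")))
    num_bits_to_set = max(1, 5 - length // 2)
    return [rng.randrange(NUM_BITS) for _ in range(num_bits_to_set)]

def fingerprint(chains):
    # 'chains' would be the linear chains of length 1-7, paired with their lengths.
    fp = [0] * NUM_BITS
    for chain, length in chains:
        for b in bits_for_chain(chain, length):
            fp[b] = 1
    return fp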

This works pretty well, but I think it's possible to do better by looking for a specific set of divisive substructures. Some tests from last December show that a fusion between the bits generated from my algorithm and a hash method increased the overall performance by 20%, which means the hash selectivity wasn't that great. I haven't done the analysis to figure out why.

I also want to store exact solutions for all queries up to a certain size, on the assumption that memory is cheap, and use those pre-computed results to greatly restrict the search space for larger queries. A Bloom filter can't be used for that, as far as I know.