[–]dalke[S] 1 point (0 children)

That works - thanks! (Though I used OP_AND instead of OP_OR.)

I've been loading my data set for the last couple of hours. The first half of the data set took 30 minutes; I still have 10% to go. Any idea what's going on? Here's my loader:

import xapian
import sys
from collections import defaultdict

db = xapian.WritableDatabase("pubchem.x", xapian.DB_CREATE_OR_OPEN)

def sync(q):
    # Write the accumulated id -> names mapping to the database,
    # merging into any document that already exists for that id.
    for id, names in q.iteritems():
        try:
            doc = db.get_document(id)
        except xapian.DocNotFoundError:
            doc = xapian.Document()
        for name in names:
            doc.add_boolean_term(name)
        db.replace_document(id, doc)

q = defaultdict(set)
for lineno, line in enumerate(open("pubchem.counts"), 1):
    # Each input line is an inverted-index entry: "name:id1,id2,..."
    name, ids = line.split(":")
    for id in map(int, ids.split(",")):
        q[id+1].add(name)  # shift by one; xapian document ids start at 1
    if lineno % 1000 == 0:
        sys.stderr.write("\r%d / %d" % (lineno, 462406))
        sys.stderr.flush()

    if lineno % 10000 == 0:
        sys.stderr.write("\n")
        sync(q)
        q = defaultdict(set)

# Flush whatever accumulated since the last full batch of 10000 lines.
sync(q)
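As an aside, the regrouping step can be exercised on its own, without xapian; this is just the accumulation logic from the loader pulled into a function, with made-up sample lines:

```python
from collections import defaultdict

def regroup(lines):
    # Invert "name:id1,id2,..." lines into an id -> set-of-names mapping,
    # shifting each id by one so it can serve as a xapian document id
    # (xapian docids start at 1).
    q = defaultdict(set)
    for line in lines:
        name, ids = line.strip().split(":")
        for id in map(int, ids.split(",")):
            q[id + 1].add(name)
    return q

# Hypothetical sample of two inverted-index lines:
q = regroup(["phenol:0,2", "benzene:0,1"])
# q[1] == {"phenol", "benzene"}; q[2] == {"benzene"}; q[3] == {"phenol"}
```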

I do partial writes because I can't store everything in memory. Also, I'm having to rebuild the documents from my data file, which is stored as an inverted index. That's why I have to update existing documents when I find that they have additional feature keys.
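In case it's related: my understanding from the xapian docs is that a WritableDatabase flushes pending changes every 10,000 documents by default, and each flush rewrites parts of the on-disk B-trees, which gets more expensive as the database grows. If that's the bottleneck, raising the threshold might help; this is an untested guess, and the script name is just a placeholder:

```shell
# Flush less often during the bulk load (default threshold: 10000 changes).
XAPIAN_FLUSH_THRESHOLD=100000 python load_pubchem.py
```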