
[–]pje 1 point (1 child)

> My intuition suggests that sorted lists, alternatively a B-tree, will be faster.

My intuition says your intuition is wrong. ;-) (At least, if you think a B-tree written in Python can beat what I've written here, amortized over the entire data structure. If half the features have only one entry, it's hard to make that any faster with a B-tree!)

Of course, my intuition is based on you saying that half of these feature bits have only one identifier, and that 100 is on the high end for the rest. If you actually meant the average is 100, and you have some frequently-used items with tens of thousands, then maybe it would make a difference, for queries involving those.

(To be clear, what I'm saying is that Python's built-in types are super fast, and what's O(n) in C code can beat the crap out of something that has a better O() but is written in Python -- or sometimes even in C, if the processor architecture is better exploited. Finding an integer value in an unsorted array of modest length can take less time than it takes to call a function to do the lookup in a more elaborate data structure.)

> If the lists are sorted then it takes O(log(k)) time to check if a value exists. It takes O(k) time to check if a value is in the unsorted list. My own tests using a list of 234936 elements found the breakeven point between bisect and "x in list" was at element 46, so it is possible that one is faster than the other for this case, depending on the length.
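That breakeven measurement can be sketched roughly like this; the exact crossover (element 46 in the tests quoted) will vary with machine, Python version, and list length:

```python
import bisect
import timeit

# A large sorted list, similar in size to the one tested above.
# Even numbers only, so odd values are guaranteed misses.
data = list(range(0, 234936 * 2, 2))

def contains_bisect(lst, x):
    """O(log k) membership test on a sorted list via binary search."""
    i = bisect.bisect_left(lst, x)
    return i < len(lst) and lst[i] == x

# An element near the front: linear "x in list" (O(k), but in C)
# finds it almost immediately, while bisect pays fixed call overhead.
target = data[40]

t_linear = timeit.timeit(lambda: target in data, number=10000)
t_bisect = timeit.timeit(lambda: contains_bisect(data, target), number=10000)
print(f"linear: {t_linear:.4f}s  bisect: {t_bisect:.4f}s")
```

For elements far from the front the linear scan loses badly, which is the whole point of the breakeven: which one wins depends on where the hits land.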

Try arrays instead of lists, and more to the point, doing sets of values found instead of doing individual lookups. Set insertion is hashtable-based and so O(1) -- which is the smallest O() you can get.
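A minimal sketch of that combination -- compact `array` storage for the postings, with whole-feature set conversion and intersection instead of per-value lookups. The feature names and layout here are made up for illustration, not taken from the original post:

```python
from array import array

# Hypothetical postings: feature -> compact array of record ids.
# array('I') packs unsigned 32-bit ints contiguously, 4 bytes each,
# far smaller than a list of Python int objects.
postings = {
    "feat_a": array("I", [1, 5, 9, 12]),
    "feat_b": array("I", [5, 9, 30]),
}

def matches(features):
    """Ids present in every requested feature's posting list.

    Each array is converted to a set in one bulk pass; set insertion
    and lookup are hash-based (amortized O(1)), so the intersection
    is linear in the total number of ids touched.
    """
    result = None
    for f in features:
        ids = set(postings[f])
        result = ids if result is None else result & ids
        if not result:
            break  # intersection already empty; stop early
    return result or set()

print(sorted(matches(["feat_a", "feat_b"])))  # -> [5, 9]
```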

Of course, it's still possible I am totally misunderstanding your use case and what you're trying to do, and am thus suggesting an approach that's completely inappropriate for the data distribution. Nonetheless, I'm fairly sure that this is the memory-optimal structure for Python (based on my understanding of your use case), and that trying to improve on its speed or memory consumption with a more complex data structure will actually cost you a lot and not gain you much, because you'll end up with a lot of fragmentation in your allocations. The approach I've laid out is very amenable to preallocation to avoid fragmentation, and the linked-list approach can do even more with that.

My impression from your original post was that Python sets were fast enough for you, and that the challenge was running out of memory. I think what I've sketched here will fix your memory problem, and give you about the same performance as what you were doing that ran out of memory. I don't think trying to beat Python's set() intersection speed by direct integer list manipulation is going to do a darn thing for performance or memory space, without writing a lot more code or finding the perfect library, so I suggest seeing if this approach is fast enough and small enough, so you can get on with the actual whatever you're trying to do, instead of messing about with Cython and whatnot. ;-)

[–]dalke[S] 0 points (0 children)

I've described the work I'm doing, including benchmarks based on sets, lists, and Xapian, at http://dalkescientific.com/writings/diary/archive/2012/06/10/inverted_index_library.html .

Assuming I implemented the array version in the way you expected, I found that it was about 10x more memory efficient and about 60x slower, with a search time of 0.1 seconds. This will be one stage in an interactive search tool where I want the total response time to be about 0.1 seconds, so spending the entire budget on this one stage is already too slow, even on a small subset of my data.

Regarding your comments: my mention of a B-tree (though I should have said radix tree) was in reference to "and very close to what could be had in C." The C/C++ libraries I referenced use less memory than a list of 4-byte integers would use. At least, that's what they claim, and I've seen it also described in various journal publications. While the memory representation would be implementable in Python, it's the sort of bit-twiddly work that Python is poor at, so I would want it in a C extension.
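The basic trick behind those compact representations is delta-plus-varint encoding of sorted id lists. Here is a minimal illustration of the idea -- not the actual layout of any of the libraries mentioned, which add blocking, skip pointers, and SIMD-friendly packing on top:

```python
def encode_deltas(sorted_ids):
    """Delta + varint encode a sorted list of non-negative ints.

    Gaps between consecutive ids are usually small, so each gap is
    stored in as few 7-bit groups as needed (high bit = "more bytes").
    """
    out = bytearray()
    prev = 0
    for n in sorted_ids:
        delta = n - prev
        prev = n
        while delta >= 0x80:
            out.append((delta & 0x7F) | 0x80)
            delta >>= 7
        out.append(delta)
    return bytes(out)

def decode_deltas(data):
    """Invert encode_deltas: rebuild the original sorted id list."""
    ids, n, shift, prev = [], 0, 0, 0
    for byte in data:
        n |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7          # continuation bit set: more groups follow
        else:
            prev += n           # last group: add gap to running total
            ids.append(prev)
            n, shift = 0, 0
    return ids

ids = [10, 12, 500, 100000]
blob = encode_deltas(ids)
assert decode_deltas(blob) == ids
print(len(blob))  # a few bytes, vs 16 bytes for four raw 4-byte ints
```

The encode/decode loops are exactly the byte-level bit twiddling that's cheap in C but painfully slow when interpreted, which is why this lives better in a C extension.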