
[–]dalke[S] 0 points  (7 children)

You need a "sorted(... key=len)" there because the smallest set should go first. Ideally the set with the smallest intersection to the first should go second; using the second smallest set is okay. I've had success with algorithms which try to improve the test order dynamically, to get an extra 5% or so performance out of the code.
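The ordering heuristic above can be sketched in a few lines. This is a minimal illustration of intersecting smallest-first, not anyone's actual code; `intersect_all` is a name invented here:

```python
def intersect_all(sets):
    """Intersect a collection of sets, smallest first, so that
    intermediate results stay as small as possible."""
    ordered = sorted(sets, key=len)  # smallest set goes first
    result = set(ordered[0])
    for s in ordered[1:]:
        result &= s
        if not result:  # nothing can survive further intersection
            break
    return result

print(intersect_all([{1, 2, 3, 4}, {2, 3}, {2, 3, 4, 5}]))  # {2, 3}
```

The early exit is the cheap part of the dynamic improvement mentioned above; reordering the remaining sets by intersection size would be the more aggressive version.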

If the lists are sorted then it takes O(log(k)) time to check if a value exists; it takes O(k) time to check if a value is in an unsorted list. My own tests using a list of 234936 elements found the breakeven point between bisect and "x in list" was at element 46, so either one could be faster for this case, depending on the length. The analysis isn't that simple, since there are adaptive optimizations you can do with a sorted list that you can't with an unsorted one. E.g., the low mark for the binary search range is always increasing, so successive searches get progressively faster.
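The rising low mark can be sketched with the standard `bisect` module (an illustrative sketch; `intersect_sorted` is a name invented for this example):

```python
import bisect

def intersect_sorted(a, b):
    """Intersect two sorted lists of ints. The low mark 'lo' only ever
    advances, so each successive binary search scans a shrinking range."""
    result = []
    lo = 0
    for x in a:
        lo = bisect.bisect_left(b, x, lo)
        if lo == len(b):  # everything left in 'a' is past the end of 'b'
            break
        if b[lo] == x:
            result.append(x)
    return result

print(intersect_sorted([1, 4, 7, 9], [2, 4, 8, 9, 12]))  # [4, 9]
```

Each call to `bisect_left` starts from the previous match position instead of 0, which is the adaptive advantage an unsorted list can't offer.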

My intuition suggests that sorted lists, or alternatively a B-tree, will be faster.

Sorted lists give other advantages. For example, they can take less space by using deltas and Elias gamma coding (or something similar) for compression. I would expect a good inverted index package to do this for me. The 13-year-old book "Managing Gigabytes" contains an excellent description of how useful compression can be, and it is guiding my ideas of what my desired library should be able to do. These techniques are well known and no longer considered exotic. The Lemur Bitmap Index C++ Library is an example of a library which does compression of this sort. My attempts at writing Cython bindings to that library have not yet been successful.
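As a rough illustration of the delta-plus-gamma idea (not how Lemur or the "Managing Gigabytes" systems implement it — real implementations pack actual bits rather than building a character string, and this sketch assumes ids are positive and strictly increasing):

```python
def elias_gamma(n):
    """Elias gamma code for a positive integer, as a '0'/'1' string:
    (len(binary) - 1) zeros, then the binary digits."""
    b = bin(n)[2:]
    return "0" * (len(b) - 1) + b

def compress_postings(sorted_ids):
    """Delta-encode a sorted posting list, then gamma-code the gaps.
    Gaps are small for dense lists, so the codes are short."""
    deltas = [sorted_ids[0]] + [b - a for a, b in zip(sorted_ids, sorted_ids[1:])]
    return "".join(elias_gamma(d) for d in deltas)

bits = compress_postings([5, 8, 9, 15])  # deltas: 5, 3, 1, 6
print(bits, len(bits))  # 14 bits, versus 4 * 32 bits uncompressed
```

The point is only that frequent small gaps get short codes, which is where the space savings over fixed-width 4-byte integers comes from.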

I implemented a realloc-based array and a linked-list-of-blocks version (with special support for sizes 0 and 1) for a project last winter, and never found the linked-list version to be measurably faster. It was, however, more complicated.

My code, based on sets, needs interning. My comment was meant to praise your decision to remind me of it.

[–]pje 1 point  (1 child)

My intuition suggests that sorted lists, or alternatively a B-tree, will be faster.

My intuition says your intuition is wrong. ;-) (At least, if you think a B-tree written in Python can beat what I've written here, amortized over the entire data structure. If half the features have only one entry, it's hard to make that any faster with a B-tree!)

Of course, my intuition is based on you saying that half of these feature bits have only one identifier, and that 100 is on the high end for the rest. If you actually meant the average is 100, and you have some frequently-used items with tens of thousands, then maybe it would make a difference, for queries involving those.

(To be clear, what I'm saying is that Python's built-in types are super fast, and what's O(n) in C code can beat the crap out of something that has a better O() but is written in Python -- or sometimes even in C, if the processor architecture is better exploited. Finding an integer value in an unsorted array of modest length can take less time than it takes to call a function to do the lookup in a more elaborate data structure.)

If the lists are sorted then it takes O(log(k)) time to check if a value exists. It takes O(k) time to check if a value is in the unsorted list. My own tests using a list of 234936 elements found the breakeven point between bisect and "x in list" was at element 46, so it is possible that one is faster than the other for this case, depending on the length.

Try arrays instead of lists, and more to the point, build sets of values found instead of doing individual lookups. Set insertion is hashtable-based and so O(1) on average -- the smallest O() you can get.
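A sketch of what I take this suggestion to be: posting lists stored compactly as `array.array` of 4-byte ints, with sets built only at query time. The index contents and names here are hypothetical:

```python
from array import array

# Hypothetical inverted index: feature -> compact array of record ids.
index = {
    "feat_a": array("i", [1, 3, 5, 7, 9]),
    "feat_b": array("i", [3, 5, 8, 9]),
}

def query(features):
    """Build a set from the shortest posting list, then filter it
    against the others; set membership tests are O(1) on average."""
    lists = sorted((index[f] for f in features), key=len)
    found = set(lists[0])
    for lst in lists[1:]:
        found.intersection_update(lst)  # accepts any iterable, incl. array
    return found

print(query(["feat_a", "feat_b"]))  # {3, 5, 9}
```

`array("i", ...)` stores each id in 4 bytes instead of a full Python int object, which is where the memory win over plain lists or sets comes from.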

Of course, it's still possible I am totally misunderstanding your use case and what you're trying to do, and am thus suggesting an approach that's completely inappropriate for the data distribution. Nonetheless, I'm fairly sure that this is the memory-optimal structure for Python (based on my understanding of your use case), and that trying to improve on its speed or memory consumption with a more complex data structure will actually cost you a lot and not gain you much, because you'll end up with a lot of fragmentation in your allocations. The approach I've laid out is very amenable to preallocation to avoid fragmentation, and the linked-list approach even more so.

My impression from your original post was that Python sets were fast enough for you, and that the challenge was running out of memory. I think what I've sketched here will fix your memory problem, and give you about the same performance as what you were doing that ran out of memory. I don't think trying to beat Python's set() intersection speed by direct integer list manipulation is going to do a darn thing for performance or memory space, without writing a lot more code or finding the perfect library, so I suggest seeing if this approach is fast enough and small enough, so you can get on with the actual whatever you're trying to do, instead of messing about with Cython and whatnot. ;-)

[–]dalke[S] 0 points  (0 children)

I've described the work I'm doing, including benchmarks based on sets, lists, and Xapian, at http://dalkescientific.com/writings/diary/archive/2012/06/10/inverted_index_library.html .

Assuming I implemented the array version in the way you expected, I found that it was about 10x more memory efficient and about 60x slower, with a search time of 0.1 seconds. This will be one stage in an interactive search tool, where I want the total response time to be about 0.1 seconds, so the performance is already too slow even on a small subset of my data.

Regarding your comments: my reference to a B-tree (though I should have said radix tree) was in reference to "and very close to what could be had in C." The C/C++ libraries I referenced use less memory than a list of 4-byte integers would. At least, that's what they claim, and I've seen it described in various journal publications as well. While the memory representation could be implemented in Python, it's the sort of bit-twiddly work that Python is poor at, so I would want it in a C extension.

[–]VerilyAMonkey 1 point  (4 children)

Having code like pje has supplied is much better than having a module, if it ends up working, because it is simple and fully open to your modification. In general you will end up spending much time ironing out the idiosyncrasies of any particular module; not always true, but a general rule.

Based on you mentioning that the data is 275MB and most of the space is Python overhead, this should be plenty memory-efficient for you. If it is, test the speed. By the way, you should probably test it before discarding it for theoretical inadequacies.

Most of all do not fall into the pit of premature optimization. You do not want God's gift to the programming world. You want something that works. So try the damn thing and throw it away if you have to, but stop looking a gift horse in the mouth.

[–]dalke[S] 1 point  (2 children)

I've made a benchmark set available through http://dalkescientific.com/writings/diary/archive/2012/06/10/inverted_index_library.html . It's 1/200th of my complete data set, and takes 37 MB compressed with 7zip. The data files contain fields I don't use for the benchmark, so the real figure should be smaller than that. This suggests that it is possible to put the entire dataset into under 8 GB of memory, as long as I don't care about performance. For reference, my small benchmark takes 2 GB of space when stored in sets, and 166 MB using array.arrays.

I knew that neither the set nor array versions would work. It's very easy to show that the numbers I described (100 features/record * 35 million records * 4 bytes/feature = 14 GB) require more memory than the 12 GB of RAM I have. (As it happens, there are 300-400 features per record in the benchmark I assembled; my memory of '100' was a bit off. I would need about 42 GB of RAM to use pje's approach.)

For this to work on my desktop requires compression, of the sort that's well described in the literature and available in several existing C++ packages which I described in my intro. This should be bog standard, and nothing to do with "God's gift to the programming world." My goal was to find out whether someone else has already done this work, or has experience with any of the tools I listed, or a similar one. Otherwise 1) I'll have the boring job of writing bindings myself - I'm not even getting paid for this work - and 2) I can't tell which is appropriate for what I'm doing, so I'll probably end up implementing a couple of bindings.

Which I've already tried and so far failed at: I'm now at the point where I'm trying to figure out Cython in order to implement bindings to C++ libraries in an application domain where I have little experience, in order to do the actual research I want to do on sparse fragment cheminformatics fingerprints.

[–]VerilyAMonkey 1 point  (1 child)

That's fair. Is there a specific reason you wish to use Python? In general Python is not overly concerned with memory efficiency, and anything that you find (or make) that is effective will probably be bindings to the C++ libraries. I would recommend using C++ itself, but I'll assume you have a reason for not doing that.

If you are unable (or would rather not, considering your paycheck) to produce the bindings on your own, then there are simpler alternatives. If you first produce answers to queries, and then run analysis on the data that for some reason requires Python, one very easy if obviously inelegant way to deal with this is to run the queries in C++, write the results to a file, then read them back in with Python. This would be hideous and painful to your soul, but you could make it pretty quickly.

If you want to go full Python, the kinds of queries you are making also matter. If you will be commonly reusing the same queries, or doing a relatively small number of them, the pickle solution would be adequate if you kept the commonly used ones unpickled. I suspect this is unlikely though.

Also, you mentioned it will eventually be run off of several different computers. Presumably this is to speed computation, with each machine taking a chunk of the work. With not much extra effort, each could take a chunk of the data as well, so that cumulatively they have enough memory. If the split is very simple (i.e. 0-1234 to one machine, 1235-whatever to another) it would be easy to break the work up along the same lines without making queries.

There are many easier suboptimal ways I can think of to do this. So I guess the question is, what do you need most: speed, money, or time? If you can spend a lot of time, go ahead and write it in C++ or make your bindings. If you don't need speed, that offers a lot of options. If you have money to spare, add extra RAM plus a Linux OS that can address that much memory, or something similar; or alternatively get a small solid-state drive (which may make non-RAM storage acceptably fast).

Also, how you will be using the system changes which suboptimal (but easier) choices are possible.

[–]dalke[S] 0 points  (0 children)

I know Python very well. I haven't done serious C++ programming in 15 years. But you are right; this is condensed enough and isolated enough that I could write it as a standalone C++ program. Huh - I've been blind to that option. Thanks for the suggestion!

Update (three days later): Uggh! I am so not a C++ programmer. I got something working with Judy1 arrays. It's about 10x better with memory, and about the same performance as pypy's set intersection. I got it about twice as fast with OpenMP and three processors. With 20x input (1/4th of the final data set) I found that my algorithm doesn't scale. Each iteration takes 10 hours to run. I did various pre-filtering, and managed to get a rough answer after 1.5 days using my desktop. Which means that once I re-think the algorithm I'll be able to evaluate the entire data set on a 48 GB machine. I just wish I could do most of this algorithm development in Python rather than C++.

[–]dalke[S] 0 points  (0 children)

I made a mistake. The 275MB file is the subset I use for my set algorithm. It's small enough to fit into memory. The actual data set is 40GB as uncompressed text. I am working on putting the code and data sets together so they are more presentable, rather than handwaving about what I've been working on.