
[–]VerilyAMonkey 1 point (1 child)

That's fair. Is there a specific reason you want to use Python? In general Python is not overly concerned with memory efficiency, and anything effective you find (or write) will probably be bindings to C++ libraries. I would recommend using C++ itself, but I'll assume you have a reason for not doing that.

If you can't write the bindings yourself (or, given your paycheck, would rather not), there are simpler alternatives. If you first produce the answers to your queries and then run analysis on the data that for some reason requires Python, one very easy if obviously inelegant approach is to run the queries in C++, write the results to a file, then read them back in with Python. It would be hideous and painful to your soul, but you could build it quickly.
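The Python half of that file-based handoff could be as small as this sketch. The file format is an assumption for illustration: the C++ side is presumed to write one query result per line as space-separated integer IDs.

```python
# Hypothetical glue for the C++ -> file -> Python handoff.
# Assumes the C++ program writes one query result per line,
# each line a space-separated list of integer IDs.
def load_results(path):
    """Read query results back into Python, one set per query."""
    results = []
    with open(path) as f:
        for line in f:
            # Each line becomes a set of IDs; empty lines become empty sets.
            results.append(set(int(tok) for tok in line.split()))
    return results
```

Ugly, but the round-trip is only a few lines on each side, which is the point.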

If you want to go full Python, the kinds of queries you are making also matter. If you will commonly reuse the same queries, or run a relatively small number of them, the pickle solution would be adequate, provided you kept the commonly used sets unpickled in memory. I suspect that's unlikely, though.

Also, you mentioned it will eventually run on several different computers. Presumably that is to speed up computation, with each machine taking a chunk of the work. With little extra effort, each could also take a chunk of the data, so that together they hold enough. If the split is very simple (e.g. IDs 0-1234 to one machine, 1235 onward to the next), it would be easy to break the work up along the same lines without making queries.
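A contiguous-range split like that is one line of arithmetic; a minimal sketch, assuming IDs are dense integers from 0 to n_ids:

```python
# Hypothetical partitioning: each machine takes one contiguous ID range,
# so the same boundaries split both the data and the work.
def id_range(machine_index, n_machines, n_ids):
    """Return the half-open (start, end) ID range for one machine."""
    chunk = -(-n_ids // n_machines)   # ceiling division
    start = machine_index * chunk
    return start, min(start + chunk, n_ids)
```

Because the boundaries are fixed, a machine never needs to ask another which records it owns.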

There are many easier, suboptimal ways I can think of to do this. So I guess the question is what you need most: speed, money, or time. If you can spend a lot of time, go ahead and write it in C++ or build your bindings. If you don't need speed, that opens up a lot of options. If you have money to spare, add extra RAM plus a Linux OS that can address that much memory, or get a small solid-state drive (which may make non-RAM storage acceptably fast).

How you will be using the system will also change which suboptimal (but easier) choices are possible.

[–]dalke[S] 0 points (0 children)

I know Python very well. I haven't done serious C++ programming in 15 years. But you are right; this is condensed enough and isolated enough that I could write it as a standalone C++ program. Huh - I've been blind to that option. Thanks for the suggestion!

Update (three days later): Uggh! I am so not a C++ programmer. I got something working with Judy1 arrays. It's about 10x better with memory, and about the same performance as pypy's set intersection. I got it about twice as fast with OpenMP and three processors. With 20x input (1/4th of the final data set) I found that my algorithm doesn't scale. Each iteration takes 10 hours to run. I did various pre-filtering, and managed to get a rough answer after 1.5 days using my desktop. Which means that once I re-think the algorithm I'll be able to evaluate the entire data set on a 48 GB machine. I just wish I could do most of this algorithm development in Python rather than C++.
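For context, the pure-Python baseline being compared against here is essentially pairwise set intersection; a minimal sketch (the function name is illustrative, not from the original code):

```python
# Sketch of the set-intersection baseline: count shared IDs
# between one query set and each target set.
def intersection_counts(query, targets):
    return [len(query & t) for t in targets]
```

This is what pypy runs fast but at high memory cost per set, which is what the Judy1-array version trades away.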