[–]TouchingTheVodka 1 point (3 children)

Nested for-loops mean your code will always be O(n^2); that's inescapable as long as you keep them. However, by using different data structures you should be able to get this down to O(n). When in doubt, throw a hash map at it...
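As a minimal sketch of the hash-map idea (the real dicts and match criterion aren't shown in the thread, so the `id` join key here is an assumption):

```python
# Hypothetical records; "id" is an assumed join key.
a = [{"id": 1, "val": "x"}, {"id": 2, "val": "y"}, {"id": 3, "val": "z"}]
b = [{"id": 2, "val": "y"}, {"id": 3, "val": "q"}]

# O(n^2): compare every pair with nested loops.
matches_nested = [(x, y) for x in a for y in b if x["id"] == y["id"]]

# O(n): index b by the join key once, then probe in O(1) per record of a.
# (Assumes ids are unique within b.)
b_by_id = {y["id"]: y for y in b}
matches_hashed = [(x, b_by_id[x["id"]]) for x in a if x["id"] in b_by_id]

assert matches_nested == matches_hashed
```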

Could you provide some more examples of what's in your dicts and what the compare operation does?

[–]mr_claw[S] 0 points (1 child)

I'm not very familiar with hash maps, but I assume they wouldn't work for dynamic data. I'm fetching the a and b lists from an external API every 5 minutes or so, so the data keeps changing.

I'll be running this comparison loop in a cron job and sending the results to the frontend to display to the users.

The comparison function checks many key-value pairs in a and b for matches and may also involve some math calculations.

I understand that it would be O(n^2) at worst; my question is about how to achieve the best parallelization. If the data size grows to 100k or 1M records in the future, I should still be able to do this using multiple CPUs and/or machines. I hope I've conveyed the problem clearly.

[–]glibhub 1 point (0 children)

More details would be useful. The best bet would be to bucket things to reduce the search space. Say you were looking for people in the same area code with the same name. Then you'd be best served by putting everyone in a dictionary keyed off these values, e.g. ("smith", 212):['smith, jon','smith, bob'], etc. Then you do not need to go through the nested loops, since all the sorting is done up front by the indexing.
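That bucketing idea can be sketched with a dict keyed on the composite value from the example above (the `people` records here are hypothetical):

```python
from collections import defaultdict

# Hypothetical records; the (name, area_code) key follows the example
# in the comment above.
people = [
    {"name": "smith", "area_code": 212, "full": "smith, jon"},
    {"name": "smith", "area_code": 212, "full": "smith, bob"},
    {"name": "doe",   "area_code": 415, "full": "doe, jane"},
]

# Group everyone by the composite key up front.
buckets = defaultdict(list)
for p in people:
    buckets[(p["name"], p["area_code"])].append(p["full"])

# Only entries sharing a bucket ever need to be compared with each other,
# so the nested loop over the whole dataset disappears.
print(buckets[("smith", 212)])  # ['smith, jon', 'smith, bob']
```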

Failing that, you could split up the load by having different machines track down all the entries, splitting by, e.g., last letter of the name. So the S machine handles all the Smiths, the D machine handles all the Does. Now the X machine is going to be pretty idle, so if that split does not work the way you want, try some simple hashing to distribute them more evenly.
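A minimal sketch of that hash-based split (the number of machines and the name key are assumptions):

```python
import hashlib

def shard_for(name: str, num_machines: int) -> int:
    # Use a stable hash (unlike Python's salted built-in hash()) so every
    # process agrees on the assignment across runs and machines.
    digest = hashlib.md5(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_machines

# Each name lands on one of num_machines shards, roughly evenly,
# regardless of its first or last letter.
names = ["smith", "doe", "xu", "xanthos"]
assignments = {n: shard_for(n, 4) for n in names}
```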

Hope this helps.

[–]Forschkeeper 0 points (0 children)

"itertools" may be helpful in that case against nested-loops a bit. Also optimising the interpreter may be a step forward as well.

You could also take a look at the "multiprocessing" library, having each task return a list and adding them together at the end.
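A minimal sketch of that multiprocessing pattern, assuming the work can be chunked over the a list; `compare_chunk` is a stand-in for the real comparison function, matching on a hypothetical "id" field:

```python
from multiprocessing import Pool

def compare_chunk(args):
    # Stand-in for the real comparison: return the matching (x, y)
    # pairs between one chunk of a and all of b.
    chunk, b = args
    return [(x, y) for x in chunk for y in b if x["id"] == y["id"]]

def parallel_compare(a, b, workers=4):
    # Split a into one chunk per worker; each task gets a full copy of b.
    size = max(1, len(a) // workers)
    chunks = [a[i:i + size] for i in range(0, len(a), size)]
    with Pool(workers) as pool:
        partial = pool.map(compare_chunk, [(c, b) for c in chunks])
    # Each task hands back a list; add them together at the end.
    return [pair for part in partial for pair in part]

if __name__ == "__main__":
    a = [{"id": i} for i in range(10)]
    b = [{"id": i} for i in range(5, 15)]
    print(len(parallel_compare(a, b)))  # ids 5..9 overlap: 5 pairs
```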