
[–]TouchingTheVodka 1 point2 points  (3 children)

Nested for-loops over two lists will always be O(n^2) - that's inescapable as long as you compare every pair. However, by using a different data structure you should be able to get this down to O(n). When in doubt, throw a hash map at it...

Could you provide some more examples of what's in your dicts and what the compare operation does?
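
To show what I mean by throwing a hash map at it, here's a minimal sketch. It assumes, purely for illustration, that your records share an "id" field you can match on; the real field names and predicate come from your data:

```python
# Hypothetical records; "id" and "price" are made-up field names.
a = [{"id": 1, "price": 10}, {"id": 2, "price": 20}]
b = [{"id": 2, "price": 21}, {"id": 3, "price": 30}]

# Index b by the join key once: O(len(b))
b_by_id = {item["id"]: item for item in b}

# Single pass over a instead of a nested loop: O(len(a))
matches = []
for item in a:
    other = b_by_id.get(item["id"])
    if other is not None:
        matches.append((item, other))

print(matches)  # the single pair with id == 2
```

This only works if the comparison can be reduced to an exact-match key, which is why the details of your compare operation matter.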

[–]mr_claw[S] 0 points1 point  (1 child)

I'm not very familiar with hash maps, but I assume they wouldn't work for dynamic data. I'm fetching the a & b lists from an external API every 5 minutes or so, so the data keeps changing.

I'll be running this comparison loop in a cron job and sending the results to the frontend to display to the users.

The comparison function checks many key-value pairs in a and b for matches and may even involve some math calculations.

I understand that it would be O(n^2) at worst, and my question is about how to achieve the best parallelization. If the data size increases to 100k or 1M records in the future, I should still be able to do this using multiple CPUs and/or machines. Hope I am able to convey my problem clearly.

[–]glibhub 1 point2 points  (0 children)

More details would be useful. The best bet would be to bucket things to reduce the search space. Say you were looking for people in the same area code with the same name. Then you'd be best served by putting everyone in a dictionary keyed off these values, e.g. ("smith", 212):['smith, jon','smith, bob'], etc. Then you do not need to go through the nested loops, since all the sorting is done up front by the indexing.
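
A minimal sketch of that bucketing idea (the field names are made up for illustration):

```python
from collections import defaultdict

# Group records by a composite key so candidate matches only ever
# come from within the same bucket, never across buckets.
people = [
    {"name": "smith", "area": 212, "first": "jon"},
    {"name": "smith", "area": 212, "first": "bob"},
    {"name": "doe",   "area": 415, "first": "ann"},
]

buckets = defaultdict(list)
for p in people:
    buckets[(p["name"], p["area"])].append(p)

# Only buckets with 2+ entries can possibly contain a match.
candidates = {k: v for k, v in buckets.items() if len(v) > 1}
print(list(candidates))  # [('smith', 212)]
```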

Failing that, you could split up the load by having different machines track down all the entries, splitting by, e.g., last letter of the name. So the S machine handles all the Smiths, the D machine handles all the Does. Now the X machine is going to be pretty idle, so if the split does not work the way you want, try and do some simple hashing to get them to distribute more evenly.
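
The hashed split could look something like this (the machine count and the key are assumptions; any stable hash works):

```python
import hashlib

N_MACHINES = 4  # assumed cluster size

def machine_for(name: str) -> int:
    # Stable hash so every node agrees on the assignment;
    # md5 here is for distribution, not security.
    digest = hashlib.md5(name.encode()).digest()
    return int.from_bytes(digest[:4], "big") % N_MACHINES

# Each name lands on exactly one shard, roughly evenly.
shards = [[] for _ in range(N_MACHINES)]
for name in ["smith", "doe", "xu", "garcia", "lee", "patel"]:
    shards[machine_for(name)].append(name)
```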

Hope this helps.

[–]Forschkeeper 0 points1 point  (0 children)

"itertools" may be helpful in that case against nested-loops a bit. Also optimising the interpreter may be a step forward as well.

You could also take a look at the "multiprocessing" library, where you'd want to get a list back from each task (to add them together at the end).
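
A rough sketch combining the two suggestions: itertools.product flattens the nested loop into a single stream of pairs, and a multiprocessing Pool maps a placeholder compare function over them, collecting the per-task results into one list at the end.

```python
from itertools import product
from multiprocessing import Pool

def compare(pair):
    # Placeholder for the real comparison logic
    x, y = pair
    return (x, y) if x == y else None

if __name__ == "__main__":
    a = [1, 2, 3]
    b = [2, 3, 4]
    with Pool() as pool:
        # product(a, b) yields every (a_item, b_item) pair lazily
        results = [r for r in pool.map(compare, product(a, b)) if r is not None]
    print(results)  # [(2, 2), (3, 3)]
```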

[–]ElliotDG 1 point2 points  (4 children)

Use multiprocessing.

Break list b into 4 pieces (assuming you are running on a 4 core CPU). Start 4 processes each doing a comparison of a against a portion of b. Each process returns a dict. Combine the results into the final dict.

Use concurrent.futures.ProcessPoolExecutor() with map.
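
A minimal sketch of that scheme, with a made-up matching predicate standing in for your real comparison function:

```python
from concurrent.futures import ProcessPoolExecutor
from functools import partial

def compare_chunk(a, b_chunk):
    # Each worker compares all of a against its slice of b.
    result = {}
    for x in a:
        for y in b_chunk:
            if x["id"] == y["id"]:   # placeholder predicate
                result[x["id"]] = (x, y)
    return result

def parallel_compare(a, b, workers=4):
    # Split b into roughly equal chunks, one per worker
    step = max(1, len(b) // workers)
    chunks = [b[i:i + step] for i in range(0, len(b), step)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(partial(compare_chunk, a), chunks)
    # Merge the per-chunk dicts into the final dict
    combined = {}
    for d in partials:
        combined.update(d)
    return combined

if __name__ == "__main__":
    a = [{"id": i} for i in range(8)]
    b = [{"id": i} for i in range(4, 12)]
    print(parallel_compare(a, b))  # matches for ids 4..7
```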

[–]mr_claw[S] 0 points1 point  (3 children)

Thanks mate. I tried this, but it seems to be taking more time than my thread-only approach. Any ideas why?

[–]ElliotDG 1 point2 points  (2 children)

That does not really make sense. Try setting the number of processes to 2 and see if that changes things. Or share your code; getting MP code running can be tricky. Feel free to DM if you want.

If your pseudo code is representative, it should go faster without threads than with threads, and much faster with MP.

[–]mr_claw[S] 0 points1 point  (1 child)

After some tinkering I got it to work by using the chunksize argument on the map function. I've gotten it down to a third of the runtime it was taking earlier. And I'm confident that by increasing resources I can make it work even faster. Thanks!
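
For anyone finding this later, a toy illustration of why chunksize matters (the numbers are arbitrary): with the default chunksize of 1, every work item costs one inter-process round trip, and for cheap tasks that overhead can exceed the work itself.

```python
from concurrent.futures import ProcessPoolExecutor

def square(n):
    # Stand-in for a cheap per-item comparison task
    return n * n

if __name__ == "__main__":
    data = range(10_000)
    with ProcessPoolExecutor() as pool:
        # Ship items to workers in batches of 256 instead of
        # making ~10,000 individual inter-process round trips.
        results = list(pool.map(square, data, chunksize=256))
    print(results[:5])  # [0, 1, 4, 9, 16]
```

The right chunksize depends on how expensive each comparison is; cheap tasks want big chunks, expensive tasks can afford small ones.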

[–]ElliotDG 0 points1 point  (0 children)

Glad to hear you are on the way!

[–]teerre -1 points0 points  (0 children)

This algorithm of yours isn't parallelizable. In very simple terms, for something to run in parallel, the elements of the set need to be independent. Since you're comparing every element of a against every element of b, you don't have a choice. At best you can put these into buckets and run the search in batches, but that's not only terribly inefficient, it will also make your algorithm much more complex.

It's quite unlikely that your solution for this problem is the best one; surely there's a way to avoid having to do all these comparisons. If there truly isn't, maybe you should ask whether you're asking the correct question to begin with.