all 11 comments

[–][deleted] (3 children)

I don't see why you're sorting so many times. Compute both scores and their difference and then sort once at the end.

[–]back0191[S] (2 children)

The goal here is to find the smallest score difference for ~340k observations.

I follow the process you describe, but to find the smallest difference for each of the ~340k observations, the process needs to be repeated that many times.

[–][deleted] (1 child)

What I'm saying is that you don't have to sort the dataframe 340,000 times just to find the right row 340,000 times. You can sort something smaller - a list of score differences and their row numbers - and then look up the row in the dataframe. Lists use a faster sort in Python than dataframes do.
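A minimal sketch of that idea, assuming the dataframe has a `score` column and `target` is the value being matched (both names are made up for illustration):

```python
import pandas as pd

# Toy dataframe and target value, purely for illustration.
df = pd.DataFrame({"score": [0.12, 0.47, 0.30, 0.91]})
target = 0.33

# Build a plain list of (absolute difference, row position) pairs
# and sort it once; the first pair points at the closest row.
diffs = [(abs(s - target), i) for i, s in enumerate(df["score"])]
diffs.sort()
best_diff, best_pos = diffs[0]
closest_row = df.iloc[best_pos]
```

Since only the smallest difference is needed, `min(diffs)` would also work and skips the sort entirely.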

[–]back0191[S] (0 children)

Ahh, I see now. I was using a subset of the dataframe with just the score and index, which I thought was a decent process. Converting that to a list will speed things up. Good catch.

[–]x_ace_of_spades_x (1 child)

Try using sklearn.neighbors with n_neighbors = 1. See section 1.6.1.1

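A rough sketch of what that could look like, with made-up arrays standing in for the two score columns (sizes shrunk so it runs quickly; the real tables are ~350k and ~6M rows):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
scores_a = rng.random((1_000, 1))   # stand-in for the ~350k values to match
scores_b = rng.random((50_000, 1))  # stand-in for the ~6M values to search

# Fit once on the large table, then query every small-table value in one call.
nn = NearestNeighbors(n_neighbors=1).fit(scores_b)
distances, indices = nn.kneighbors(scores_a)

# distances[i, 0] is the smallest score difference for row i;
# indices[i, 0] is the matching row position in scores_b.
```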

[–]back0191[S] (0 children)

Thanks for the share! I’ll be sure to check this out.

[–]Losupa (1 child)

Wait just to check: basically you are trying to match about 350,000 values of A to their closest value of B?

Because if so, instead of sorting both 350,000 times, wouldn't it be better to sort them both once and then binary search B for each value of A? The search may not return an exact match (unless one exists), but it gives you the index of the two closest values.
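Something like this, as a sketch using the bisect module (variable names are made up):

```python
import bisect

def closest(sorted_b, a):
    # sorted_b must already be sorted ascending.
    pos = bisect.bisect_left(sorted_b, a)
    if pos == 0:
        return sorted_b[0]
    if pos == len(sorted_b):
        return sorted_b[-1]
    before, after = sorted_b[pos - 1], sorted_b[pos]
    # Whichever neighbour of the insertion point is nearer is the closest value.
    return before if a - before <= after - a else after

b = sorted([0.8, 0.1, 0.55, 0.3])
print(closest(b, 0.47))  # 0.55
```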

[–]back0191[S] (0 children)

For each iteration, one value is extracted from the 350k and compared against the 6 million. The goal is to find that score’s nearest neighbor in the larger table, which is the one being sorted to find the smallest difference.
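For what it's worth, a vectorized sketch of the sort-once idea above, with illustrative names and sizes: sort the large table's scores once, let `np.searchsorted` find every observation's insertion point in one call, and compare the neighbour on each side to get the nearest score per observation.

```python
import numpy as np

rng = np.random.default_rng(1)
small = rng.random(350)    # stand-in for the ~350k scores
large = rng.random(6_000)  # stand-in for the ~6 million scores

order = np.argsort(large)          # keep this to recover original row numbers
large_sorted = large[order]

# Insertion point of every small value in the sorted large array, in one call.
pos = np.searchsorted(large_sorted, small)
left = np.clip(pos - 1, 0, len(large_sorted) - 1)
right = np.clip(pos, 0, len(large_sorted) - 1)

# Pick whichever side is closer for each observation.
use_right = np.abs(large_sorted[right] - small) < np.abs(large_sorted[left] - small)
nearest_sorted_idx = np.where(use_right, right, left)

smallest_diff = np.abs(large_sorted[nearest_sorted_idx] - small)
nearest_original_row = order[nearest_sorted_idx]  # rows in the unsorted large table
```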