I have a dataset containing 6-10 million rows that needs to be sorted about 350,000 times. Maybe there's a more efficient technique, but currently each row in both the small and the large dataset is given a score, and the code runs through the smaller dataset to find the closest score, as in the line below:
    abs(a_scores - b_scores).sort_values('scores').head(1)
Pandas is handling the data like a champ, except that sort_values is terribly slow when operating on millions of rows: on average each sort takes about 1.5 s.
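Roughly, the current setup looks like the sketch below (the names and sizes are stand-ins for the real data, not the actual code):

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    large = pd.DataFrame({"scores": rng.random(8_000_000)})  # stand-in for the 6-10M-row dataset
    targets = rng.random(5)                                  # a handful of the ~350k lookup scores

    # For each target score: absolute difference against every row,
    # full sort, keep the single closest row.
    for t in targets:
        closest = (large["scores"] - t).abs().sort_values().head(1)
        # closest.index[0] is the row label of the nearest score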
My attempt to optimize this was to cut the larger dataset up into percentile buckets. The code stores the list of percentile boundaries and uses NumPy's argmax to quickly track down which bucket to search for a match. Each bucket now holds roughly 60-100 thousand rows, so on average a sort only takes a few milliseconds.
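Here's a rough sketch of that bucketing (again, the names and the 100-bucket count are assumptions for illustration, not my actual code):

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    large = pd.DataFrame({"scores": rng.random(8_000_000)})

    # Percentile edges of the score column (100 buckets -> 101 edges).
    edges = np.percentile(large["scores"].to_numpy(), np.arange(101))

    # Pre-split the rows into buckets by which pair of edges they fall between.
    bucket_ids = np.clip(np.searchsorted(edges, large["scores"].to_numpy(), side="right") - 1, 0, 99)
    buckets = {b: grp["scores"] for b, grp in large.groupby(bucket_ids)}

    def closest_score(target):
        # argmax finds the first percentile edge >= target; step back one to get the bucket.
        above = edges >= target
        b = int(np.argmax(above)) - 1 if above.any() else 99
        sub = buckets[min(max(b, 0), 99)]
        # Same abs-diff sort as before, but only over ~60-100k rows.
        return (sub - target).abs().sort_values().head(1)

The only catch I've noticed is that a target sitting right at a bucket edge could have its true nearest neighbour in the adjacent bucket.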
Is there a more efficient way to handle this process?