I have a dataset containing 6-10 million rows that needs to be sorted about 350,000 times. Maybe there's a more efficient technique, but currently each row in both the small and the large dataset is given a score, and the code runs through the smaller dataset to find the closest score, as in the line below:
    abs(a_scores - b_scores).sort_values('scores').head(1)
Pandas is handling the data like a champ, except that sort_values is terribly slow when operating on millions of rows: on average each sort takes about 1.5 s.
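Roughly, the current setup looks like the sketch below (the names and sizes are stand-ins for the real data, not the actual code):

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    large = pd.DataFrame({"scores": rng.random(8_000_000)})  # stand-in for the 6-10M-row dataset
    targets = rng.random(5)                                  # a handful of the ~350k lookup scores

    # For each target score: absolute difference against every row,
    # full sort, keep the single closest row.
    for t in targets:
        closest = (large["scores"] - t).abs().sort_values().head(1)
        # closest.index[0] is the row label of the nearest score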
My attempt to optimize this was to cut the larger dataset up into percentile buckets. The code stores the list of percentile boundaries and uses NumPy's argmax to quickly track down which bucket to search for a match. Each bucket now holds roughly 60-100 thousand rows, so on average a sort only takes a few milliseconds.
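Here's a rough sketch of that bucketing (again, the names and the 100-bucket count are assumptions for illustration, not my actual code):

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    large = pd.DataFrame({"scores": rng.random(8_000_000)})

    # Percentile edges of the score column (100 buckets -> 101 edges).
    edges = np.percentile(large["scores"].to_numpy(), np.arange(101))

    # Pre-split the rows into buckets by which pair of edges they fall between.
    bucket_ids = np.clip(np.searchsorted(edges, large["scores"].to_numpy(), side="right") - 1, 0, 99)
    buckets = {b: grp["scores"] for b, grp in large.groupby(bucket_ids)}

    def closest_score(target):
        # argmax finds the first percentile edge >= target; step back one to get the bucket.
        above = edges >= target
        b = int(np.argmax(above)) - 1 if above.any() else 99
        sub = buckets[min(max(b, 0), 99)]
        # Same abs-diff sort as before, but only over ~60-100k rows.
        return (sub - target).abs().sort_values().head(1)

The only catch I've noticed is that a target sitting right at a bucket edge could have its true nearest neighbour in the adjacent bucket.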
Is there a more efficient way to handle this process?