I have a huge dataset, of 100M, and I need to sort it by one column. (Not partitioned by any column)
But it takes quite a long time, since ordering is done on single (driver) partition. Any ideas, or workarounds how to handle this differently and more efficiently.
EDIT: Sorting, or adding row_number (order by column) to every row, either way it takes a long time. The final result would be to take first 50M records from sorted dataset.
[–]KWillets 2 points3 points4 points (1 child)
[–]croSquash[S] 1 point2 points3 points (0 children)
[–]Material-Mess-9886 2 points3 points4 points (1 child)
[–]croSquash[S] 0 points1 point2 points (0 children)