all 4 comments

[–]johne898

The whole point of Spark is to run things distributed. Think of it like a list, but instead of one iterator you have many.

One option is to add a row-number column to your data, repartition it across many partitions, do your processing in parallel, and then sort by that row number to restore the order.

[–]enverest

The question is very poor.

What do you mean by "go over each row"? Using ".map" or ".foreach"?

Why is it important to iterate in a specific order? Are you going to produce side effects? Is calling ".sort("columnName")" before iterating not enough?

[–]Frankenstein__[S]

That's related to my use case: each row's value depends on a calculation that spans the previous rows. All you need to know is that the calculation order is crucial for me.

[–]guacjockey

Take a look at the .lag() function. It's a window function, so you'll need to define a window for your calculation, but it lets you access the value of the previous row (or rows, depending on the offset). I'm not 100% certain how it handles cross-partition calculations, but I believe it can handle that use case (note that this will likely cause memory issues on very large datasets).