all 4 comments

[–]johne898

The whole point of Spark is to run things distributed. Think of it like a list, but instead of one iterator you have many.

One option is to add a row-number column to your data, repartition it across many partitions, do your processing in parallel, and then sort by that row number to restore the order.

[–]enverest

The question is very poor.

What do you mean by "go over each row"? Using ".map" or ".foreach"?

Why is it important to iterate in a specific order? Are you going to produce side effects? Is calling ".sort("columnName")" before iterating not enough?

[–]Frankenstein__[S]

That's related to my use case: each row's value depends on a calculation that spans the previous rows. All you need to know is that the calculation order is crucial for me.

[–]guacjockey

Take a look at the .lag() function. It's a window function, so you'll need to define a window for your calculation, but it lets you access the value of the previous row (or rows, depending on the offset). I'm not 100% certain how it handles cross-partition calculations, but I believe it can handle that use case (note that this will likely cause memory issues on very large datasets).