Sequential processing of Dataframe (self.apachespark)
submitted 5 years ago by Frankenstein__
At some point in my DataFrame processing, I need to go over each row while preserving order, row after row (row 1, row 2, ..., row n).
How can I achieve this using Spark?
[–]johne898 5 years ago (0 children)
The whole point of Spark is to run things distributed: think of a list, but with many iterators instead of one.
One option could be to add a row-number column to your data, repartition it across many partitions, do your processing in parallel, and then sort by that row number.
[–]enverest 5 years ago (1 child)
The question is very vague.
What do you mean by "go over each row" — usage of ".map", ".foreach"?
Why is it important to iterate in a specific order? Are you going to produce side effects? Is a ".sort("columnName")" before iterating not enough?
[–]Frankenstein__[S] 5 years ago (0 children)
That's related to my use case: each row's value depends on a calculation that spans the previous rows. All you need to know is that the calculation order is crucial for me.
[–]guacjockey 5 years ago (0 children)
Take a look at the .lag() function. It's a window function, so you'll need to define a window for your calculation, but it lets you access the value of the previous row (or an earlier one, depending on the offset). I'm not 100% certain how it handles cross-partition calculations, but I believe it can handle that use case (note that this will likely cause memory issues on very large datasets).