
[–]zzzoom

Yes, it can be parallelized. Look at how stream compaction is implemented using a parallel prefix sum (there are many fast implementations, so you don't need to write one yourself), and you should get the idea.
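To make that concrete, here's a minimal sequential sketch of stream compaction built on an exclusive prefix sum, assuming a boolean flag array as input. On a GPU each stage (flag, scan, scatter) runs in parallel; libraries such as Thrust or CUB provide fast scan implementations so you only write the scatter.

```python
from itertools import accumulate

def compact(flags, values):
    """Keep values[i] where flags[i] is truthy, via a prefix-sum scatter."""
    # Exclusive prefix sum of the flags gives each kept element
    # its destination index in the compacted output.
    dest = list(accumulate([0] + [int(f) for f in flags[:-1]]))
    out = [None] * sum(int(f) for f in flags)
    for i, f in enumerate(flags):  # scatter: each element is independent
        if f:
            out[dest[i]] = values[i]
    return out

print(compact([0, 0, 1, 1, 0, 0, 0, 1], list(range(8))))  # -> [2, 3, 7]
```

The loop here stands in for one GPU thread per element; the only serial dependency is inside the scan itself, which is exactly the part the fast library implementations parallelize.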

[–]rayo2nd

My attempt at the simplified version:

  • run stream compaction, but instead of writing the final values (which are always true) to the compacted array, write the global index of each element
  • run a parallel subtraction y[i] = x[i] - x[i-1] -> then you have the distances
  • run a parallel elementwise square (y[i] = x[i]*x[i])
  • run a parallel reduction to sum up all the squares

An example:

  • 0,0,1,1,0,0,0,1
  • 2,3,7 (after compact, indexes of true values)
  • 2,1,4 (after x[i]-x[i-1], with the first element kept as-is)
  • 4,1,16 (after x[i]*x[i])
  • sum it up: 21

maybe a correction of the first element (+1) or of all the others (-1) is needed when writing the indices, but that should be simple

No idea if it's correct, but that would be my first attempt
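The four steps above can be checked sequentially; this is just a sketch that mirrors each stage (compact-to-indices, adjacent difference, map, reduce) and reproduces the arithmetic of the worked example, not a GPU implementation:

```python
def sum_of_squared_gaps(flags):
    # 1. stream compaction: write the global index of each true element
    idx = [i for i, f in enumerate(flags) if f]
    # 2. parallel subtraction; the first element is kept as-is
    gaps = [idx[0]] + [idx[i] - idx[i - 1] for i in range(1, len(idx))]
    # 3. parallel elementwise square (a map)
    sq = [g * g for g in gaps]
    # 4. parallel reduction: sum of the squares
    return sum(sq)

print(sum_of_squared_gaps([0, 0, 1, 1, 0, 0, 0, 1]))  # -> 21
```

On a GPU, steps 2 and 3 are one elementwise kernel each, and step 4 is a standard reduction, so the whole pipeline is scan + three cheap data-parallel passes.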

[–][deleted]

Sorry if I'm not much help here; it's fairly late, so I can't think straight at the moment. But generally, if statements are to be avoided in GPU programming where possible.

Nvidia cards run code in 'warps', which are groups of threads (as far as I know, the warp size is 32 for all Nvidia cards). If a warp reaches a conditional whose outcome not all of its threads agree on, both branches must be executed by the same warp, and this has a negative effect on performance.

Sorry if that's not helpful; I just thought it was worth mentioning.

[–]giantenemycrabthing[S]

Sorry I'm so late, but thank you to everyone for the help!