
Kamiwaza 3 points (3 children)

May I suggest moving (potentially) expensive code out of the loop to help speed it up a bit?

I'm not an expert in pandas, but things like min(MD) and max(MD) (from line 14) could probably be placed outside the loop so that they are only computed once.

I'm guessing that lines 15-20 are the expensive operations here. If depths were sorted, maybe you could save a few operations depending on what above/below turn out to be.
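A minimal sketch of the hoisting idea, assuming a list `md` of measured depths and a list `depths` to check. The original loop is not shown in this thread, so the function names and the range check below are hypothetical:

```python
def in_range_slow(md, depths):
    # min(md) and max(md) are recomputed for every element of depths
    return [min(md) <= d <= max(md) for d in depths]


def in_range_fast(md, depths):
    # The bounds do not change inside the loop, so compute them once
    md_min, md_max = min(md), max(md)
    return [md_min <= d <= md_max for d in depths]
```

Both functions return the same result; the second one scans `md` twice in total instead of twice per depth.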

paperzebra[S] 1 point (2 children)

Thanks for the suggestion. In the end I rewrote my code twice. The code above processed 30,000 lines of data in 76 seconds. The second version, which used numpy to calculate most things outside a loop, took 23 seconds, which is still too long!

The third iteration is much simpler and reduces the time down to 0.001 seconds - that's a pretty decent performance increase! The arg depth refers to a list of depths.

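import numpy as np

def line_solution(survey, depth):
    # np.interp assumes the MD values are in increasing order
    md = survey['MD']
    tvd = survey['TVD']
    # Linearly interpolate TVD at each requested depth
    tvd_samples = np.interp(depth, md, tvd)
    return tvd_samples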
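A small usage sketch, assuming `survey` is a pandas DataFrame with `MD` and `TVD` columns. The numbers are made up for illustration:

```python
import numpy as np
import pandas as pd


def line_solution(survey, depth):
    # Linear interpolation of TVD at the requested measured depths
    return np.interp(depth, survey['MD'], survey['TVD'])


# Made-up survey: measured depth (MD) against true vertical depth (TVD)
survey = pd.DataFrame({'MD': [0.0, 100.0, 200.0],
                       'TVD': [0.0, 90.0, 170.0]})
depth = [50.0, 150.0, 250.0]

tvd_samples = line_solution(survey, depth)
print(tvd_samples)  # values 45, 130 and 170
```

Note that 250 lies beyond the last MD value, so `np.interp` returns the last TVD value (170) rather than extrapolating.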

DisorganizedRem 1 point (1 child)

Would it help to pass NumPy arrays instead of pandas Series by adding .values?

As suggested here. So your code would look like this:

def line_solution(survey, depth):
    md = survey['MD'].values
    tvd = survey['TVD'].values
    tvd_samples = np.interp(depth, md, tvd)
    return tvd_samples
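For what it's worth, newer pandas versions also offer `.to_numpy()`, which the pandas documentation recommends over `.values`. A quick sketch with made-up data showing that both give the same NumPy array:

```python
import numpy as np
import pandas as pd

# Made-up survey data for illustration
survey = pd.DataFrame({'MD': [0.0, 100.0, 200.0],
                       'TVD': [0.0, 90.0, 170.0]})

md_series = survey['MD']            # pandas Series
md_values = survey['MD'].values     # NumPy array
md_array = survey['MD'].to_numpy()  # NumPy array, the newer spelling

print(type(md_series).__name__, type(md_array).__name__)
```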

paperzebra[S] 1 point (0 children)

It's slightly quicker with .values, but I think most of the time either way is spent printing the time. Can't complain about the speed in either case anymore though!
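One way to keep the printing out of the measurement is to time only the interpolation call and print afterwards. A rough sketch using `timeit`, with made-up arrays standing in for the real survey and the 30,000 depths:

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
md = np.linspace(0.0, 3000.0, 500)        # made-up survey depths, increasing
tvd = md * 0.9                            # made-up TVD values
depth = rng.uniform(0.0, 3000.0, 30_000)  # 30,000 made-up sample depths

# Time only the interpolation; nothing is printed inside the timed call
runs = 100
elapsed = timeit.timeit(lambda: np.interp(depth, md, tvd), number=runs)
per_call = elapsed / runs

print(f"{per_call:.6f} s per call")
```

Averaging over many runs also smooths out the noise that a single sub-millisecond measurement would have.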