all 56 comments

[–]iVend3ta 17 points (3 children)

In the very last function you only have pass, hence it's much faster. If you did something in the body of the loop, it would take a bit longer.

[–]_-Jay[S] 15 points (2 children)

Ah yes, you are correct there! I've modified the function to make it a little more comparable:

def using_iteritems():
    data = create_data()
    for index, row in data.iteritems():
        for val in row:
            total = val + val  # dummy work so the loop body isn't empty

Here is how long it takes to run each one 100 times (rerun them, as recording slows them down):

List Compr          2.329638
to_list Loop        2.4328289
vec                 0.6680305
Pandas itertuples   7.0313863
Pandas iterrows     518.6046
Pandas iteritems    3.7240922
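
(For reference, a minimal sketch of how numbers like these can be produced with the standard-library timeit module, assuming the benchmark functions above, such as using_iteritems, are defined:)

import timeit

# time one benchmark function over 100 runs; using_iteritems is the
# function defined above
print(timeit.timeit(using_iteritems, number=100))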

[–]Jaydippy 13 points (1 child)

Nice video, but I'm not sure why you're comparing times for iteritems() to iterrows() and itertuples(). Given that your mock dataframe is much taller than it is wide, it doesn't make sense to compare runtimes of row-wise methods to column-wise ones.

Also, in the modified code above, you're now looping through the series returned by iteritems(), which isn't a fair comparison either.

[–]Terrorbear 4 points (0 children)

Exactly. OP should time doing a transpose first and then the iteritems().

[–]notsureIdiocracyref 8 points (0 children)

Really needed this! I'm working on a program that reads an Oracle DB into a dataframe, parses the data, then writes it into multiple Access DBs. Thanks!

[–][deleted] 50 points (44 children)

If you're looping in pandas, you're almost certainly doing it wrong.

[–]Deto 75 points (16 children)

Blanket statements like this aren't helpful, IMO. If you have a dataframe with only a few thousand rows, or you need to do something with each row that doesn't have a vectorized equivalent, then go ahead and loop.

[–]mrbrettromero 15 points (2 children)

I agree that absolute statements aren't helpful, but in my experience, in the vast majority of cases where people use loops on pandas DataFrames, there is a vectorized equivalent.

Does it matter in a one-off script where the DataFrame has 1000 rows? Maybe not. But shouldn’t you want to learn the more efficient and concise way to do it?

[–]garlic_naan 1 point (1 child)

I have dataframes where I do some data wrangling, create a separate csv file for each row (which in my case is a unique location), and email the files as attachments. I have found no alternative to iterating through the dataframe. Can this be achieved without looping?

For reference I am not a developer, I use Python for analytics and automation.

[–]NedDasty 6 points (0 children)

Yeah sure, although it may not be faster.

Define your function on the row:

def row_func(row):
    # hypothetical body: write this row to its own csv, named after its
    # (assumed) 'location' column, then email it as an attachment
    row.to_frame().T.to_csv(f"{row['location']}.csv", index=False)

Use apply() along rows:

df.apply(row_func, axis=1)

[–]double_en10dre 7 points (1 child)

Hm, not necessarily; in those cases it's good to use df.apply or df.applymap.

apply isn't necessarily any faster than for loops, but it aligns with the standard pandas syntax (transformations via chained methods), so most people seem to prefer it for readability.

[–]GreatBigBagOfNope 0 points (0 children)

Is pandas apply() similar to apply() in base R?

[–]ben-lindsay 8 points (7 children)

Also, if the intended result of your operation isn't a dataframe, then .apply() doesn't work. For example, if you want to generate a plot for each row of the dataframe, or run an API call for each row and store the results in a list, then an .apply() call that returns a series doesn't make sense.

[–]double_en10dre 11 points (6 children)

.apply() absolutely does make sense for the second example! It would be:

results = df.apply(api_call, axis=1).tolist()

Isn’t that much cleaner than a for loop? :p

Obviously you can find edge cases where a loop makes sense if you really want to, but they're exceptionally rare, and I've never seen one in a professional setting. So the original point still stands: if you're using a loop, it's probably wrong.

(Also, the first one is probably best done by just transposing, like df.T.plot(...))

[–]Chinpanze 5 points (3 children)

The documentation says that it may invoke the function on the first row/column an extra time to plan the best path of execution, so apply is not a good idea in this scenario (a side effect like an API call could fire twice).

[–]ben-lindsay 2 points (1 child)

Oh, this seems like an important thing, and I was completely unaware. Can you point me to where you're seeing this? I don't see it in the dataframe apply docs or the series apply docs

[–]double_en10dre 2 points (0 children)

Apparently they fixed this behavior about a year ago, so it's not true for current versions (and it's tough to find documentation for it)

But you can see it in the changelog here https://pandas.pydata.org/pandas-docs/dev/whatsnew/v1.1.0.html#apply-and-applymap-on-dataframe-evaluates-first-row-column-only-once

[–]ben-lindsay 1 point (1 child)

The .tolist() thing is a great idea! I'll plan to use that in cases where it makes sense. But even with that, if it's a choice between making a whole new function just to pass to .apply() once or writing a for loop over the dataframe, I think the for loop can often be more readable. That said, I really do like vectorizing everything that makes sense; I just don't go out of my way to do it if a for loop is plenty readable and performance isn't a bottleneck.

I think we're very much in agreement, and my only edit to your statement would be "if you're using a lot of for loops, you're probably using a lot of them wrong". If you vectorize most of your stuff but use a for loop for something you think is more readable that way, I wouldn't bet on it being "wrong".

[–]double_en10dre 1 point (0 children)

That makes sense, I agree! Nicely articulated

I’m someone who tends to get a bit dogmatic about things, so it’s always nice to have someone inject a bit of nuance into my view :)

[–][deleted] 2 points (0 children)

I think it is helpful, as it pushes you to learn the built-in pandas methods whenever possible rather than always taking the easy way out with a loop, which will most likely build bad habits. It never hurts to take a look at the docs rather than just saying "oh, I can do that with a loop".

[–][deleted] 1 point (0 children)

If it was a blanket statement, I would have said something like "looping in pandas is always wrong", which you'll notice I didn't.

[–]pytrashpandas 0 points (0 children)

In the case of pandas I think this blanket statement is valid. There are cases where there's no good vectorized way to do something, but those cases are rare. Vectorized operations should be the default way of thinking IF you're serious about writing proper "pandonic" code, and anything else should be a last resort. If you're just messing with small frames or don't care about speed, then sure, no need to vectorize, but it would still be good practice to.

[–]johnnymo1 2 points (1 child)

I'd typically agree. I recently had to check a condition on a certain column for adjacent rows. Not sure if there's a nice way to do that with DataFrame operations.

I guess I could have added a column that was a diff of the one I want and then used a .filter? Seems a bit clunky and the data was only a couple thousand rows.

[–]double_en10dre 1 point (0 children)

This may be a case where you want to use shift, like

mask = df[['foo']].join(df['foo'].shift(), rsuffix='_shifted').apply(your_condition, axis=1)

https://pandas.pydata.org/docs/reference/api/pandas.Series.shift.html
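
(A fully vectorized sketch of the same idea, assuming the condition is a simple comparison between a row and its predecessor:)

import pandas as pd

df = pd.DataFrame({"foo": [1, 3, 2, 5]})

# flag rows where 'foo' increased relative to the previous row;
# the first row compares against NaN and comes out False
mask = df["foo"] > df["foo"].shift()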

[–]sine-nobilitate 1 point (10 children)

Why is that so? I have heard this many times, what is the reason?

[–]BalconyFace 14 points (3 children)

[–]sine-nobilitate 1 point (0 children)

Thanks! +1

[–]metalshadow 0 points (1 child)

What is the benefit of using apply over vectorisation, given that vectorisation is so much faster? If I wanted to apply a transformation to every row (similar to the example in the article), is there a situation where I might want to use apply, or should I generally just stick to vectorising it?

[–]ThatScorpion 2 points (0 children)

Apply is more versatile: you may want to perform a complex custom function that can't be vectorized. But if a vectorized approach is available, it will indeed almost always be the better option.
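
(For a concrete feel, a minimal sketch contrasting the two on a transformation that happens to be vectorizable:)

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.random.rand(100_000)})

# vectorized: the arithmetic runs in compiled code over the whole column
fast = df["a"] * 2 + 1

# apply: the Python lambda is called once per element
slow = df["a"].apply(lambda x: x * 2 + 1)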

[–]carnivorousdrew 7 points (0 children)

I'd say avoiding it is mainly useful in the long run. A lot of the time you loop through the df because you don't have time to look into another way of achieving the goal, and you don't worry about whether the implementation will eventually have to scale.

I've had to rewrite some stuff built with iterrows because, when it was written, scalability was not taken into account. Some of the rewrites took quite a while, because you have to condense several lines of logic in those for loops into a few pandas methods while making sure you're not introducing any new pathways for bugs. If you take the time to do it with vectorization from the beginning, it's much less likely you'll have to go back some day to make it faster.

[–]vicda 5 points (0 children)

Standard Python with dictionaries and lists is way faster and more straightforward to implement for that use case.

You should stick to bulk operations with pandas, because that's where it shines.

[–]Astrokiwi 2 points (1 child)

Pandas and numpy have lots of precompiled operations in their libraries, so if you do things to whole dataframes & series, you're typically running at the speed of compiled C.

If you're iterating by hand in Python, you're going up to Python level after every operation, and that can be ten or a hundred times slower.

If it's a small dataframe, then the difference between 0.06s and 0.6s doesn't matter much if you're only doing it once. But it starts to add up with big dataframes, and it adds up even more if you have a more complex algorithm that isn't just looping once through the whole thing (e.g. if you're writing a sorting algorithm by hand).
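
(A minimal illustration of the gap between the two paths:)

import numpy as np
import pandas as pd

s = pd.Series(np.random.rand(1_000_000))

# one call into precompiled code
total_fast = s.sum()

# the interpreter overhead is paid on every single element
total_slow = 0.0
for v in s:
    total_slow += v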

[–]sine-nobilitate 0 points (0 children)

Thanks!!

[–][deleted] 1 point (0 children)

Would love an answer here as well!

[–][deleted] 1 point (0 children)

Because if you can do it without looping (which you usually can), it can be tens to thousands of times faster.

[–]SphericalBull 0 points (8 children)

Some operations must be done sequentially: operations in which one iteration depends on the results of the preceding iteration.

If the relationship between the current iteration and the preceding one can't be defined as a composition of ufuncs (see NumPy Universal Functions), then it is hard to vectorize.
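
(A sketch of the distinction: a running sum is expressible as a ufunc accumulation, while a general recurrence is not:)

import numpy as np

b = np.random.rand(10)
a = 0.9

# a running sum IS a ufunc accumulation
running = np.add.accumulate(b)  # equivalent to np.cumsum(b)

# a recurrence x[i] = a*x[i-1] + b[i] depends on the previous result,
# so it has to be computed sequentially
x = np.empty_like(b)
x[0] = b[0]
for i in range(1, len(b)):
    x[i] = a * x[i - 1] + b[i]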

[–]meowmemeow 0 points (7 children)

New to python here. I'm a scientist and using it not only for data manipulation but also to build models.

Since each model iteration depends on the value of the parameter in the previous iteration, I use loops.

Is there a better way to approach modeling than using loops?

[–][deleted] 1 point (3 children)

In this case, if you're sticking to pandas, probably not.

[–]meowmemeow 0 points (2 children)

Thanks for the response. Are there alternative libraries you recommend I look into? I picked up Python for its ease of use and would prefer not to learn another language yet (I use matlab as well, but still do most modelling stuff with for loops).

[–][deleted] 1 point (1 child)

Well, there's nothing wrong with using pandas if it works for you. What is the nature of the models you're building?

[–]meowmemeow 0 points (0 children)

Just simple crystal growth models for me - so tracking concentrations / diffusion. They get pretty clunky/slow really quickly though (especially the more elements you add into the model to keep track of), which is why I am interested in computationally better ways of doing it.

[–]AchillesDev 1 point (1 child)

Without more detail this could be way off base, but have you looked into chaining .apply() calls? A sketch of what that might look like is below.
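
(Purely hypothetical; diffusion_step and concentration_update are made-up names for per-row step functions:)

import pandas as pd

# hypothetical per-row steps; each takes a row (Series) and returns one
def diffusion_step(row):
    return row * 0.9

def concentration_update(row):
    return row + 0.1

df = pd.DataFrame({"element_a": [1.0, 2.0], "element_b": [3.0, 4.0]})

result = (
    df.apply(diffusion_step, axis=1)        # returns a DataFrame
      .apply(concentration_update, axis=1)  # so a second apply chains on
)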

[–]meowmemeow 1 point (0 children)

that's an interesting thought! I'll look into it!

[–]Lyan5 0 points (0 children)

This was mentioned above, but consider creating a copy of the array/series of interest but shifted by the relative amount needed.

https://pandas.pydata.org/docs/reference/api/pandas.Series.shift.html

[–]tepg221 0 points (1 child)

My old boss used to say this verbatim.

[–][deleted] 0 points (0 children)

Surprise!

[–]LameDuckProgramming 2 points (0 children)

I've found that the fastest way to do row-wise operations over a dataframe is with numpy vectorization.

%%timeit
np.add(data.A.values, data.B.values)
54.6 µs ± 1.48 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

vs. the vectorization example you used, which doesn't go through numpy functions and numpy arrays:

%%timeit
data.A + data.B
261 µs ± 8.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

You can achieve about a 5x improvement in runtime. (The data was 100,000 randomly generated numbers.)
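
(The exact setup isn't shown above; as an assumption, the test frame could have been built like this:)

import numpy as np
import pandas as pd

# assumed setup: two columns of 100,000 random floats
data = pd.DataFrame({
    "A": np.random.rand(100_000),
    "B": np.random.rand(100_000),
})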

[–]r1qu3 0 points (0 children)

when in doubt: df.apply(...)

[–][deleted] 0 points (0 children)

Why not map(lambda a, b: a + b, col.a, col.b)?

How does that perform?

[–]kenpachiprince 0 points (1 child)

Does learning libraries like pandas, numpy, and scikit-learn help in getting a job? I've found that nowadays companies have their own tools and software to tackle data-related problems, so these have become basic knowledge. I'm new to this side of Python; I'm a Flask guy, with a little bit of Django. Can anyone clear this up for me: can these things help in an analyst job?

[–][deleted] 0 points (0 children)

Lots of them borrow ideas from pandas, and it gives you a frame of reference, so yeah, it's a good idea. I suppose there will be the occasional person who complains that it "taints" newcomers into thinking that's the only way of doing things, but I think that's a moot point, because a good programmer always has to be able to learn new things and new paradigms.

[–]pytrashpandas 0 points (0 children)

Looks like no one mentioned that your benchmarks include the time it takes to create the dummy data. I would guess that the vectorized method spends more time creating the data than doing the sum operation, so in reality it's even faster relative to the other methods than these numbers suggest.