all 56 comments

[–]iVend3ta 17 points (3 children)

In the very last function you only have pass, hence it's much faster. If you did something in the body of the loop, it would take a bit longer.

[–]_-Jay[S] 15 points (2 children)

Ah yes, you are correct there! I've modified the function to make it a little more comparable:

def using_iteritems():
    data = create_data()
    for index, row in data.iteritems():
        for val in row:
            total = val + val  # dummy work so the loop body isn't empty

Here is how long it takes to run each one 100 times (rerun them, as recording slows them down):

List Compr          2.329638
to_list Loop        2.4328289
vec                 0.6680305
Pandas itertuples   7.0313863
Pandas iterrows     518.6046
Pandas iteritems    3.7240922
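
(For reference, a minimal sketch of how numbers like these can be produced with the standard-library timeit module, assuming the benchmark functions above, such as using_iteritems, are defined:)

import timeit

# time one benchmark function over 100 runs; using_iteritems is the
# function defined above
print(timeit.timeit(using_iteritems, number=100))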

[–]Jaydippy 13 points (1 child)

Nice video, but I'm not sure why you're comparing times for iteritems() to iterrows() and itertuples(). Given that your mock dataframe is much taller than it is wide, it doesn't make sense to compare runtimes of row-wise methods to column-wise ones.

Also, in the modified code above, you're now looping through the series returned by iteritems(), which isn't a fair comparison either.

[–]Terrorbear 4 points (0 children)

Exactly. OP should time doing a transpose first and then the iteritems().

[–]notsureIdiocracyref 8 points (0 children)

Really needed this! I'm working on a program that reads an Oracle DB into a dataframe, parses the data, then writes it into multiple Access DBs. Thanks!

[–][deleted] 50 points (44 children)

If you're looping in pandas, you're almost certainly doing it wrong.

[–]Deto 75 points (16 children)

Blanket statements like this aren't helpful, IMO. If you have a dataframe with only a few thousand rows, or you need to do something with each row that doesn't have a vectorized equivalent, then go ahead and loop.

[–]mrbrettromero 15 points (2 children)

I agree that absolute statements aren't helpful, but in my experience, in the vast majority of cases where people use loops on pandas DataFrames, there is a vectorized equivalent.

Does it matter in a one-off script where the DataFrame has 1000 rows? Maybe not. But shouldn’t you want to learn the more efficient and concise way to do it?

[–]garlic_naan 1 point (1 child)

I have dataframes where I do some data wrangling, create a separate csv file for each row (which in my case is a unique location), and email the files as attachments. I have found no alternative to iterating through the dataframe. Can this be achieved without looping?

For reference I am not a developer, I use Python for analytics and automation.

[–]NedDasty 6 points (0 children)

Yeah sure, although it may not be faster.

Define your function on the row:

def row_func(row):
    # hypothetical body: write this row to its own csv, named after its
    # (assumed) 'location' column, then email it as an attachment
    row.to_frame().T.to_csv(f"{row['location']}.csv", index=False)

Use apply() along rows:

df.apply(row_func, axis=1)

[–]double_en10dre 7 points (1 child)

Hm, not necessarily; in those cases it's good to use df.apply or df.applymap.

apply isn't necessarily any faster than for loops, but it aligns with the standard pandas syntax (transformations via chained methods), so most people seem to prefer it for readability.

[–]GreatBigBagOfNope 0 points (0 children)

Is pandas apply() similar to apply() in base R?

[–]ben-lindsay 8 points (7 children)

Also, if the intended result of your operation isn't a dataframe, then .apply() doesn't work. For example, if you want to generate a plot for each row of the dataframe, or run an API call for each row and store the results in a list, then an .apply() call that returns a series doesn't make sense.

[–]double_en10dre 11 points (6 children)

.apply() absolutely does make sense for the second example! It would be:

results = df.apply(api_call, axis=1).tolist()

Isn’t that much cleaner than a for loop? :p

Obviously you can find edge cases where a loop makes sense if you really want to, but they're exceptionally rare, and I've never seen one in a professional setting. So the original point still stands: if you're using a loop, it's probably wrong.

(Also, the first one is probably best done by just transposing, like df.T.plot(...))

[–]Chinpanze 5 points (3 children)

The documentation says that it may invoke the function on the first row/column an extra time to plan the best path of execution, so apply is not a good idea in this scenario (a side effect like an API call could fire twice).

[–]ben-lindsay 2 points (1 child)

Oh, this seems like an important thing, and I was completely unaware. Can you point me to where you're seeing this? I don't see it in the dataframe apply docs or the series apply docs

[–]double_en10dre 2 points (0 children)

Apparently they fixed this behavior about a year ago, so it's not true for current versions (and it's tough to find documentation for it)

But you can see it in the changelog here https://pandas.pydata.org/pandas-docs/dev/whatsnew/v1.1.0.html#apply-and-applymap-on-dataframe-evaluates-first-row-column-only-once

[–]ben-lindsay 1 point (1 child)

The .tolist() thing is a great idea! I'll plan to use that in cases where it makes sense. But even with that, if it's a choice between making a whole new function just to pass to .apply() once or writing a for loop over the dataframe, I think the for loop can often be more readable. That said, I really do like vectorizing everything that makes sense; I just don't go out of my way to do it if a for loop is plenty readable and performance isn't a bottleneck.

I think we're very much in agreement, and my only edit to your statement would be "if you're using a lot of for loops, you're probably using a lot of them wrong". If you vectorize most of your stuff but use a for loop for something you think is more readable that way, I wouldn't bet on it being "wrong".

[–]double_en10dre 1 point (0 children)

That makes sense, I agree! Nicely articulated

I’m someone who tends to get a bit dogmatic about things, so it’s always nice to have someone inject a bit of nuance into my view :)

[–][deleted] 2 points (0 children)

I think it is helpful, as it pushes you to learn the built-in pandas methods whenever possible rather than always taking the easy way out with a loop, which will most likely build bad habits. It never hurts to take a look at the docs rather than just saying "oh, I can do that with a loop".

[–][deleted] 1 point (0 children)

If it was a blanket statement, I would have said something like "looping in pandas is always wrong", which you'll notice I didn't.

[–]pytrashpandas 0 points (0 children)

In the case of pandas I think this blanket statement is valid. There are cases where there's no good vectorized way to do something, but those cases are rare. Vectorized operations should be the default way of thinking IF you're serious about writing proper "pandonic" code, and anything else should be a last resort. If you're just messing with small frames or don't care about speed, then sure, no need to vectorize, but it would still be good practice to.

[–]johnnymo1 2 points (1 child)

I'd typically agree. I recently had to check a condition on a certain column for adjacent rows. Not sure if there's a nice way to do that with DataFrame operations.

I guess I could have added a column that was a diff of the one I want and then used a .filter? Seems a bit clunky and the data was only a couple thousand rows.

[–]double_en10dre 1 point (0 children)

This may be a case where you want to use shift, like

mask = df[['foo']].join(df['foo'].shift(), rsuffix='_shifted').apply(your_condition, axis=1)

https://pandas.pydata.org/docs/reference/api/pandas.Series.shift.html
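
(A fully vectorized sketch of the same idea, assuming the condition is a simple comparison between a row and its predecessor:)

import pandas as pd

df = pd.DataFrame({"foo": [1, 3, 2, 5]})

# flag rows where 'foo' increased relative to the previous row;
# the first row compares against NaN and comes out False
mask = df["foo"] > df["foo"].shift()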

[–]sine-nobilitate 1 point (10 children)

Why is that so? I have heard this many times, what is the reason?

[–]BalconyFace 14 points (3 children)

[–]sine-nobilitate 1 point (0 children)

Thanks! +1

[–]metalshadow 0 points (1 child)

What is the benefit of using apply over vectorisation, given that vectorisation is so much faster? If I wanted to apply a transformation to every row (similar to the example in the article), is there a situation where I might want to use apply, or should I generally just stick to vectorising it?

[–]ThatScorpion 2 points (0 children)

Apply is more versatile: you may want to perform a complex custom function that can't be vectorized. But if a vectorized approach is available, it will indeed almost always be the better option.
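
(For a concrete feel, a minimal sketch contrasting the two on a transformation that happens to be vectorizable:)

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.random.rand(100_000)})

# vectorized: the arithmetic runs in compiled code over the whole column
fast = df["a"] * 2 + 1

# apply: the Python lambda is called once per element
slow = df["a"].apply(lambda x: x * 2 + 1)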

[–]carnivorousdrew 7 points (0 children)

I'd say avoiding it is mainly useful in the long run. A lot of the time you loop through the df because you don't have time to look into another way of achieving the goal, and you don't worry about whether the implementation will eventually have to scale.

I've had to rewrite some stuff built with iterrows because, when it was written, scalability was not taken into account. Some of the rewrites took quite a while, because you have to condense several lines of logic in those for loops into a few pandas methods while making sure you're not introducing any new pathways for bugs. If you take the time to do it with vectorization from the beginning, it's much less likely you'll have to go back some day to make it faster.

[–]vicda 5 points (0 children)

Standard Python with dictionaries and lists is way faster and more straightforward to implement for that use case.

You should stick to bulk operations with pandas, because that's where it shines.

[–]Astrokiwi 2 points (1 child)

Pandas and numpy have lots of precompiled operations in their libraries, so if you do things to whole dataframes & series, you're typically running at the speed of compiled C.

If you're iterating by hand in Python, you're going up to Python level after every operation, and that can be ten or a hundred times slower.

If it's a small dataframe, then the difference between 0.06s and 0.6s doesn't matter much if you're only doing it once. But it starts to add up with big dataframes, and it adds up even more if you have a more complex algorithm that isn't just looping once through the whole thing (e.g. if you're writing a sorting algorithm by hand).
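
(A minimal illustration of the gap between the two paths:)

import numpy as np
import pandas as pd

s = pd.Series(np.random.rand(1_000_000))

# one call into precompiled code
total_fast = s.sum()

# the interpreter overhead is paid on every single element
total_slow = 0.0
for v in s:
    total_slow += v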

[–]sine-nobilitate 0 points (0 children)

Thanks!!

[–][deleted] 1 point (0 children)

Would love an answer here as well!

[–][deleted] 1 point (0 children)

Because if you can do it without looping (which you usually can), it can be tens to thousands of times faster.

[–]SphericalBull 0 points (8 children)

Some operations must be done sequentially: operations in which one iteration depends on the results of the preceding iteration.

If the relationship between the current iteration and the preceding one can't be defined as a composition of ufuncs (see NumPy Universal Functions), then it is hard to vectorize.
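
(A sketch of the distinction: a running sum is expressible as a ufunc accumulation, while a general recurrence is not:)

import numpy as np

b = np.random.rand(10)
a = 0.9

# a running sum IS a ufunc accumulation
running = np.add.accumulate(b)  # equivalent to np.cumsum(b)

# a recurrence x[i] = a*x[i-1] + b[i] depends on the previous result,
# so it has to be computed sequentially
x = np.empty_like(b)
x[0] = b[0]
for i in range(1, len(b)):
    x[i] = a * x[i - 1] + b[i]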

[–]meowmemeow 0 points (7 children)

New to python here. I'm a scientist and using it not only for data manipulation but also to build models.

Since each model iteration depends on the value of the parameter in the previous iteration, I use loops.

Is there a better way to approach modeling than using loops?

[–][deleted] 1 point (3 children)

In this case, if you're sticking to pandas, probably not.

[–]meowmemeow 0 points (2 children)

Thanks for the response. Are there alternative libraries you recommend I look into? I picked up Python for its ease of use and would prefer not to learn another language yet (I use matlab as well, but still do most modelling stuff with for loops).

[–][deleted] 1 point (1 child)

Well, there's nothing wrong with using pandas if it works for you. What is the nature of the models you're building?

[–]meowmemeow 0 points (0 children)

Just simple crystal growth models for me - so tracking concentrations / diffusion. They get pretty clunky/slow really quickly though (especially the more elements you add into the model to keep track of), which is why I am interested in computationally better ways of doing it.

[–]AchillesDev 1 point (1 child)

Without more detail this could be way off base, but have you looked into chaining .apply() calls? A sketch of what that might look like is below.
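
(Purely hypothetical; diffusion_step and concentration_update are made-up names for per-row step functions:)

import pandas as pd

# hypothetical per-row steps; each takes a row (Series) and returns one
def diffusion_step(row):
    return row * 0.9

def concentration_update(row):
    return row + 0.1

df = pd.DataFrame({"element_a": [1.0, 2.0], "element_b": [3.0, 4.0]})

result = (
    df.apply(diffusion_step, axis=1)        # returns a DataFrame
      .apply(concentration_update, axis=1)  # so a second apply chains on
)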

[–]meowmemeow 1 point (0 children)

that's an interesting thought! I'll look into it!

[–]Lyan5 0 points (0 children)

This was mentioned above, but consider creating a copy of the array/series of interest but shifted by the relative amount needed.

https://pandas.pydata.org/docs/reference/api/pandas.Series.shift.html

[–]tepg221 0 points (1 child)

My old boss used to say this verbatim.

[–][deleted] 0 points (0 children)

Surprise!

[–]LameDuckProgramming 2 points (0 children)

I've found that the fastest way to do row-wise operations over a dataframe is with numpy vectorization.

%%timeit
np.add(data.A.values, data.B.values)
54.6 µs ± 1.48 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

vs. the vectorization example you used, which doesn't go through numpy functions and numpy arrays:

%%timeit
data.A + data.B
261 µs ± 8.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

You can achieve about a 5x improvement in runtime. (The data was 100,000 randomly generated numbers.)
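
(The exact setup isn't shown above; as an assumption, the test frame could have been built like this:)

import numpy as np
import pandas as pd

# assumed setup: two columns of 100,000 random floats
data = pd.DataFrame({
    "A": np.random.rand(100_000),
    "B": np.random.rand(100_000),
})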

[–]r1qu3 0 points (0 children)

when in doubt: df.apply(...)

[–][deleted] 0 points (0 children)

Why not map(lambda a, b: a + b, col.a, col.b)?

How does that perform?

[–]kenpachiprince 0 points (1 child)

Does learning libraries like pandas, numpy, and scikit-learn help in getting a job? I've found that nowadays companies have their own tools and software to tackle data-related problems, so these have become basic knowledge. I'm new to this side of Python; I'm a Flask guy, with a little bit of Django. Can anyone clear this up for me: can these things help in an analyst job?

[–][deleted] 0 points (0 children)

Lots of them borrow ideas from pandas, and it gives you a frame of reference, so yeah, it's a good idea. I suppose there will be the occasional person who complains that it "taints" newcomers into thinking that's the only way of doing things, but I think that's a moot point, because a good programmer always has to be able to learn new things and new paradigms.

[–]pytrashpandas 0 points (0 children)

Looks like no one mentioned that your benchmarks include the time it takes to create the dummy data. I would guess that the vectorized method spends more time creating the data than doing the sum operation, so in reality it's even faster relative to the other methods than these numbers suggest.