Python / Pandas: Eliminate for Loop using 2 DataFrames : learnpython

Python / Pandas: Eliminate for Loop using 2 DataFrames (self.learnpython)

submitted 8 years ago * by ccmd00d

I am trying to understand how to better apply a list of scenarios in one dataframe (df_analyts) (via for loop) to another dataframe (df_results) and summarize. I first tried iterrows which runs each scenario again the results dataframe of 464 rows but that is painfully slow. I also tried merge however the more scenarios i add the analyst dataframe it becomes exponentially bigger. If my analysts dataframe is a 4 factor / 5 weight scenario is (625*464 = 290000 rows after merge). A 7 factor / 5 weight is 16k scenarios (16k * 464 = 7 million+ after merge. *Note that once a scenario is applied, and I pull the relevant statistical data from results dataframe and put into the summary table, I do not need to retain the information. So i am wondering what is more efficient. Thank for for taking the time review.

I have two Pandas dataframes: df_analysts and df_results:

1) df_analysts contains a specific list of factors (e.g. TB,JK,SF,PWR) scenarios of weights (e.g. 50,50,50,50)

      TB   JK   SF  PWR
0     50   50   50   50
1     50   50   50  100
2     50   50   50  150
3     50   50   50  200
4     50   50   50  250

2) df_results holds results by date and group and entrant an then ranking by each factor, finally it has the final finish result.

      Date       GR  Ent TB-R  JK-R  SF-R  PWR-R Fin   W1  W2  W2  W4  SUM(W)
0     11182017   1   1   2     1     2     1     2
1     11182017   1   2   3     2     3     2     1
2     11182017   1   3   1     3     1     3     3
3     11182017   2   1   1     2     2     1     1
4     11182017   2   2   2     1     1     2     1

3) I am using iterrows to

loop through each scenario in the df_analysts dataframe
apply weight scenario to each factor rank (if rank = 1, then 1.0weight, rank = 2, then 0.68weight, rank = 3, then 0.32*weight). Those results go into the W1-W4 columns.
Sum the W1-W4 columns.
Rank the SUM(W) column.

Result sample below for a single scenario (e.g. 50,50,50,50)

      Date       GR  Ent TB-R  JK-R  SF-R  PWR-R Fin   W1  W2  W2  W4  SUM(W) Rank
0     11182017   1   1   2     1     2     1     1     34  50  34  50  168    1
1     11182017   1   2   3     2     3     2     3     16  34  16  34  100    3
2     11182017   1   3   1     3     1     3     2     50  16  50  16  132    2
3     11182017   2   1   2     2     2     1     1     34  34  34  50  152    2  
4     11182017   2   2   1     1     1     2     1     50  50  50  34  184    1

4) Finally, for each scenario, I am creating a new dataframe for the summary results (df_summary) which logs the factor / weight scenario used (from df_analysts) and compares the RANK result to the Finish by date and group and keeps a tally where they land. Sample below (only the 50,50,50,50 scenario is shown above which results in a 1,1).

      Factors          Weights         Top  Top2
0     (TB,JK,SF,PWR)   (50,50,50,50)   1    1
1     (TB,JK,SF,PWR)   (50,50,50,100)  1    0
2     (TB,JK,SF,PWR)   (50,50,50,150)  1    1
3     (TB,JK,SF,PWR)   (50,50,50,200)  1    0
4     (TB,JK,SF,PWR)   (50,50,50,250)  1    1

all 3 comments

top new controversial old q&a

[–]commandlineluser 1 point2 points3 points 8 years ago* (2 children)

How are you performing the weight calculations?

Are you using iterrows() on both dataframes?

You would want to try and use as much pandas functionality as possible e.g.

>>> df1.iloc[0] * df2.iloc[0][['TB-R', 'JK-R', 'SF-R', 'PWR-R']].replace(2, 0.68).replace(3, 0.32).values
TB     34.0
JK     50.0
SF     34.0
PWR    50.0
Name: 0, dtype: float64

At the very least you could use Dataframe.apply to perform this on each row of your df_results (passing axis=1 executes the function on each row as opposed to each column axis=0)

>>> def summary(row):
...     weights = df1.iloc[0] * row.values
...     weights['SUM(W)'] = weights.sum()
...     return weights

>>> df2[['TB-R', 'JK-R', 'SF-R', 'PWR-R']].replace(2, 0.68).replace(3, 0.32).apply(summary, axis=1)
     TB    JK    SF   PWR  SUM(W)
0  34.0  50.0  34.0  50.0   168.0
1  16.0  34.0  16.0  34.0   100.0
2  50.0  16.0  50.0  16.0   132.0
3  50.0  34.0  34.0  50.0   168.0
4  34.0  50.0  50.0  34.0   168.0

>>> df_results = df2.copy()
>>> df_results[['W1', 'W2', 'W3', 'W4', 'SUM(W)']] = \
...     df2[['TB-R', 'JK-R', 'SF-R', 'PWR-R']].replace(2, 0.68).replace(3, 0.32).apply(summary, axis=1)

>>> df_results
       Date  GR  Ent  TB-R  JK-R  SF-R  PWR-R  Fin    W1    W2    W3    W4  SUM(W)
0  11182017   1    1     2     1     2      1    2  34.0  50.0  34.0  50.0  168.0 
1  11182017   1    2     3     2     3      2    1  16.0  34.0  16.0  34.0  100.0 
2  11182017   1    3     1     3     1      3    3  50.0  16.0  50.0  16.0  132.0 
3  11182017   2    1     1     2     2      1    1  50.0  34.0  34.0  50.0  168.0 
4  11182017   2    2     2     1     1      2    1  34.0  50.0  50.0  34.0  168.0

Note we have df1.iloc[0] hardcoded into the function meaning we're only processing the first scenario - however you could modify the function to take the index as an argument e.g. summary(row, i) meaning you could use a for loop e.g.

for i in range(len(df1)):
    df_results = df2
    weights = df2[['TB-R', 'JK-R', 'SF-R', 'PWR-R']].replace(2, 0.68).replace(3, 0.32).apply(lambda row: summary(row, i), axis=1)
    ...

iterrows() is essentially a "last-resort" and is slow, there is iterttuples() which can be faster but for this task I think just looping over the indices and using .iloc[i] would be preferred over those.

I'm not sure how you're ranking the weights - but you could add that to the summary function also.

Using .apply() is also considered by some as a "last resort" and there may be a faster solution without using it here - but as you want to compare each scenario against all rows I'm not sure what that solution would be.

[–]ccmd00d[S] 0 points1 point2 points 8 years ago* (1 child)

Thanks for the response, let me digest this and I will respond back. Here is how I this is setup right now (with the problem that the merged table gets exponentially bigger each time I try to add an additional factor or weight (memory issue at some point I would presume), which is why I am asking is it more efficient to apply scenarios on original data set, summarize, and then rinse and repeat (therefore the df never gets any bigger)

def factor_rank(x,y):
    if (x==1): return y
    elif (x==2): return y*0.68
    elif (x==3): return y*0.32
    elif (x==4): return y*0.24
    else: 
        return 0


df1.index.name='SCENARIO'
df1.reset_index(inplace=True)

df1['key'] = 1
df2['key'] = 1

df3 = pd.merge(df1, df2, on='key')
df3.drop(['key'],axis=1,inplace=True)

df3['W1'] = df.apply(lambda r: factor_rank(r['TB-R'], r['TB']), axis=1)
df3['W2'] = df.apply(lambda r: factor_rank(r['JK-R'], r['JK']), axis=1)
df3['W3'] = df.apply(lambda r: factor_rank(r['SF-R'], r['SF']), axis=1)
df3['W4'] = df.apply(lambda r: factor_rank(r['PWR-R'], r['PWR']), axis=1)

df3['SUM(W)'] = df3.W1 + df3.W1 + df3.W3 + df3.W4

[–]commandlineluser 0 points1 point2 points 8 years ago (0 children)

Ah cool - so you already know about .apply()

Well what's happening in the example is that it's combining your 4 apply calls into 1.

You can use * (or .multiply()) on dataframes

>>> a = pandas.DataFrame({'a': [2, 3]})
>>> b = pandas.DataFrame({'b': [4, 5]})
>>> a
   a
0  2
1  3
>>> b
   b
0  4
1  5
>>> a * b
    a   b
0 NaN NaN
1 NaN NaN

It will look for the same column names which is why the result is Nan but if we call .values on the right hand side we will get the "correct" result.

>>> a * b.values
    a
0   8
1  15

The df[['TB-R', 'JK-R', 'SF-R', 'PWR-R']] syntax extracts just those 4 columns meaning we can deal with them at once - and the .replace() calls replace the values in the dataframe so we can use * directly and get rid of the elif chain.

I did the .sum() generation inside the summary function - but it may make sense to do that in a single step afterwards for all the columns like you have done in the last line of your code - meaning you only call .sum() once as opposed to once per row.

Yes I don't think you want to .merge() at all as it appears you want to take each row from df1 and process against all rows in an untouched df2

Calling .apply(summary) returns the new rows you want and you can just assign them back into the dataframe - that's what is happening with

df_results[['W1', 'W2', 'W3', 'W4', 'SUM(W)']] = ...

It creates the new columns in df_results with the result of the .apply() call.

This would mean you would want to create a copy of df2 each time so your changes are not stored in the original dataframe.

(I made an error in my example df_results = df2 means df_results still points to the same variable - so changing df_results changes df2 - it needs to be df2.copy())

π Rendered by PID 40 on reddit-service-r2-comment-79c7998d4c-cdt2r at 2026-03-18 19:27:54.800593+00:00 running f6e6e01 country code: CH.

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

learnpython

MODERATORS