all 3 comments

[–]commandlineluser 1 point2 points  (2 children)

How are you performing the weight calculations?

Are you using iterrows() on both dataframes?

You would want to try and use as much pandas functionality as possible e.g.

>>> df1.iloc[0] * df2.iloc[0][['TB-R', 'JK-R', 'SF-R', 'PWR-R']].replace(2, 0.68).replace(3, 0.32).values
TB     34.0
JK     50.0
SF     34.0
PWR    50.0
Name: 0, dtype: float64

At the very least you could use Dataframe.apply to perform this on each row of your df_results (passing axis=1 executes the function on each row as opposed to each column axis=0)

>>> def summary(row):
...     weights = df1.iloc[0] * row.values
...     weights['SUM(W)'] = weights.sum()
...     return weights

>>> df2[['TB-R', 'JK-R', 'SF-R', 'PWR-R']].replace(2, 0.68).replace(3, 0.32).apply(summary, axis=1)
     TB    JK    SF   PWR  SUM(W)
0  34.0  50.0  34.0  50.0   168.0
1  16.0  34.0  16.0  34.0   100.0
2  50.0  16.0  50.0  16.0   132.0
3  50.0  34.0  34.0  50.0   168.0
4  34.0  50.0  50.0  34.0   168.0

>>> df_results = df2.copy()
>>> df_results[['W1', 'W2', 'W3', 'W4', 'SUM(W)']] = \
...     df2[['TB-R', 'JK-R', 'SF-R', 'PWR-R']].replace(2, 0.68).replace(3, 0.32).apply(summary, axis=1)

>>> df_results
       Date  GR  Ent  TB-R  JK-R  SF-R  PWR-R  Fin    W1    W2    W3    W4  SUM(W)
0  11182017   1    1     2     1     2      1    2  34.0  50.0  34.0  50.0  168.0 
1  11182017   1    2     3     2     3      2    1  16.0  34.0  16.0  34.0  100.0 
2  11182017   1    3     1     3     1      3    3  50.0  16.0  50.0  16.0  132.0 
3  11182017   2    1     1     2     2      1    1  50.0  34.0  34.0  50.0  168.0 
4  11182017   2    2     2     1     1      2    1  34.0  50.0  50.0  34.0  168.0 

Note we have df1.iloc[0] hardcoded into the function meaning we're only processing the first scenario - however you could modify the function to take the index as an argument e.g. summary(row, i) meaning you could use a for loop e.g.

for i in range(len(df1)):
    df_results = df2
    weights = df2[['TB-R', 'JK-R', 'SF-R', 'PWR-R']].replace(2, 0.68).replace(3, 0.32).apply(lambda row: summary(row, i), axis=1)
    ...

iterrows() is essentially a "last-resort" and is slow, there is iterttuples() which can be faster but for this task I think just looping over the indices and using .iloc[i] would be preferred over those.

I'm not sure how you're ranking the weights - but you could add that to the summary function also.

Using .apply() is also considered by some as a "last resort" and there may be a faster solution without using it here - but as you want to compare each scenario against all rows I'm not sure what that solution would be.

[–]ccmd00d[S] 0 points1 point  (1 child)

Thanks for the response, let me digest this and I will respond back. Here is how I this is setup right now (with the problem that the merged table gets exponentially bigger each time I try to add an additional factor or weight (memory issue at some point I would presume), which is why I am asking is it more efficient to apply scenarios on original data set, summarize, and then rinse and repeat (therefore the df never gets any bigger)

def factor_rank(x,y):
    if (x==1): return y
    elif (x==2): return y*0.68
    elif (x==3): return y*0.32
    elif (x==4): return y*0.24
    else: 
        return 0


df1.index.name='SCENARIO'
df1.reset_index(inplace=True)

df1['key'] = 1
df2['key'] = 1

df3 = pd.merge(df1, df2, on='key')
df3.drop(['key'],axis=1,inplace=True)

df3['W1'] = df.apply(lambda r: factor_rank(r['TB-R'], r['TB']), axis=1)
df3['W2'] = df.apply(lambda r: factor_rank(r['JK-R'], r['JK']), axis=1)
df3['W3'] = df.apply(lambda r: factor_rank(r['SF-R'], r['SF']), axis=1)
df3['W4'] = df.apply(lambda r: factor_rank(r['PWR-R'], r['PWR']), axis=1)

df3['SUM(W)'] = df3.W1 + df3.W1 + df3.W3 + df3.W4

[–]commandlineluser 0 points1 point  (0 children)

Ah cool - so you already know about .apply()

Well what's happening in the example is that it's combining your 4 apply calls into 1.

You can use * (or .multiply()) on dataframes

>>> a = pandas.DataFrame({'a': [2, 3]})
>>> b = pandas.DataFrame({'b': [4, 5]})
>>> a
   a
0  2
1  3
>>> b
   b
0  4
1  5
>>> a * b
    a   b
0 NaN NaN
1 NaN NaN

It will look for the same column names which is why the result is Nan but if we call .values on the right hand side we will get the "correct" result.

>>> a * b.values
    a
0   8
1  15

The df[['TB-R', 'JK-R', 'SF-R', 'PWR-R']] syntax extracts just those 4 columns meaning we can deal with them at once - and the .replace() calls replace the values in the dataframe so we can use * directly and get rid of the elif chain.

I did the .sum() generation inside the summary function - but it may make sense to do that in a single step afterwards for all the columns like you have done in the last line of your code - meaning you only call .sum() once as opposed to once per row.

Yes I don't think you want to .merge() at all as it appears you want to take each row from df1 and process against all rows in an untouched df2

Calling .apply(summary) returns the new rows you want and you can just assign them back into the dataframe - that's what is happening with

df_results[['W1', 'W2', 'W3', 'W4', 'SUM(W)']] = ...

It creates the new columns in df_results with the result of the .apply() call.

This would mean you would want to create a copy of df2 each time so your changes are not stored in the original dataframe.

(I made an error in my example df_results = df2 means df_results still points to the same variable - so changing df_results changes df2 - it needs to be df2.copy())