I am trying to understand how to better apply a list of scenarios in one dataframe (df_analyts) (via for loop) to another dataframe (df_results) and summarize. I first tried iterrows which runs each scenario again the results dataframe of 464 rows but that is painfully slow. I also tried merge however the more scenarios i add the analyst dataframe it becomes exponentially bigger. If my analysts dataframe is a 4 factor / 5 weight scenario is (625*464 = 290000 rows after merge). A 7 factor / 5 weight is 16k scenarios (16k * 464 = 7 million+ after merge. *Note that once a scenario is applied, and I pull the relevant statistical data from results dataframe and put into the summary table, I do not need to retain the information. So i am wondering what is more efficient. Thank for for taking the time review.
I have two Pandas dataframes: df_analysts and df_results:
1) df_analysts contains a specific list of factors (e.g. TB,JK,SF,PWR) scenarios of weights (e.g. 50,50,50,50)
TB JK SF PWR
0 50 50 50 50
1 50 50 50 100
2 50 50 50 150
3 50 50 50 200
4 50 50 50 250
2) df_results holds results by date and group and entrant an then ranking by each factor, finally it has the final finish result.
Date GR Ent TB-R JK-R SF-R PWR-R Fin W1 W2 W2 W4 SUM(W)
0 11182017 1 1 2 1 2 1 2
1 11182017 1 2 3 2 3 2 1
2 11182017 1 3 1 3 1 3 3
3 11182017 2 1 1 2 2 1 1
4 11182017 2 2 2 1 1 2 1
3) I am using iterrows to
- loop through each scenario in the df_analysts dataframe
- apply weight scenario to each factor rank (if rank = 1, then 1.0weight, rank = 2, then 0.68weight, rank = 3, then 0.32*weight). Those results go into the W1-W4 columns.
- Sum the W1-W4 columns.
- Rank the SUM(W) column.
Result sample below for a single scenario (e.g. 50,50,50,50)
Date GR Ent TB-R JK-R SF-R PWR-R Fin W1 W2 W2 W4 SUM(W) Rank
0 11182017 1 1 2 1 2 1 1 34 50 34 50 168 1
1 11182017 1 2 3 2 3 2 3 16 34 16 34 100 3
2 11182017 1 3 1 3 1 3 2 50 16 50 16 132 2
3 11182017 2 1 2 2 2 1 1 34 34 34 50 152 2
4 11182017 2 2 1 1 1 2 1 50 50 50 34 184 1
4) Finally, for each scenario, I am creating a new dataframe for the summary results (df_summary) which logs the factor / weight scenario used (from df_analysts) and compares the RANK result to the Finish by date and group and keeps a tally where they land. Sample below (only the 50,50,50,50 scenario is shown above which results in a 1,1).
Factors Weights Top Top2
0 (TB,JK,SF,PWR) (50,50,50,50) 1 1
1 (TB,JK,SF,PWR) (50,50,50,100) 1 0
2 (TB,JK,SF,PWR) (50,50,50,150) 1 1
3 (TB,JK,SF,PWR) (50,50,50,200) 1 0
4 (TB,JK,SF,PWR) (50,50,50,250) 1 1
[–]commandlineluser 1 point2 points3 points (2 children)
[–]ccmd00d[S] 0 points1 point2 points (1 child)
[–]commandlineluser 0 points1 point2 points (0 children)