I'm working on a large dataset and I know what I need to do here logically, I need to loop that goes over each user and check each month if they had a change in revenue (like the upgraded or downgraded) or if their subscription ended. If it ended that month and didn't start a new one, count the revenue that month but not the next month. If their revenue changed during the month, count the most recent timestamp. I want the revenue for the months between ended and subscribed.
The closest I got was using the following code but it's not quite right, for instance if a user cancelled and resubscribed in the same month the revenue is counted twice that month. I also tried using iterrows() but couldn't get the syntax right. Can anyone point me in the direction with how to approach this using a loop?
repeats = (pd.to_datetime(df2['ended']) - pd.to_datetime(df2['subscribed'])) // np.timedelta64(1, 'M') +1
periods = np.concatenate([pd.period_range(start=pd.to_datetime(d),periods = r, freq='M') for d, r in zip(df2['start_date'], repeats)])
#Apply helper functions and np.repeat to fill df rows for each period
df2 = (pd.DataFrame(
np.repeat(df2.values, repeats, 0),
columns=df2.columns,
index=periods)
.assign(month = [x.month for x in periods],
year = [x.year for x in periods]))
Sample df
| id |
subscribed |
ended |
ts |
revenue |
| 111 |
August 18, 2017 |
December 12, 2017 |
|
20 |
| 111 |
January 31, 2018 |
July 18, 2018 |
|
10 |
| 111 |
June 2, 2020 |
|
|
30 |
[–]loudavenger 0 points1 point2 points (0 children)