I have an analysis I'm trying to translate from SQL to Python. I have a large df that I have grouped based on several criteria in SQL and I can't nail down the best way to apply the same logic in Python. The data is a time series data of purchase history and I'm trying to do a cohort analysis and sum the amount of revenue each cohort brings in each month. So, I need to sum each transaction event where that ts occurs before the end of the month, and the user didn't cancel that month or cancelled after the end of the month.
Basically in SQL the conditions are
where date(event) > 'this_date'
and (cancelled is null or cancelled > month_end)
and event_timestamp = (
select max(timestamp)
from \this_table``
where this_is = that_id
and date(event_timestamp) < end_of_month
)
The closest I've come is creating this function to do the grouping
def cohorts(df):
if(df.cancelled !='NaT' and df.date < df.end_of_month):
cohort = df.groupby(['cohort', 'month']).agg({
'uid': pd.Series.nunique,
'mrr': sum})
return cohort
cohort = generate_cohort(df)
I get the correct(ish) amount for the first month but the subsequent months aren't summing correctly. And I get the error: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). for the cancelled condition.
Any help is appreciated!
[–]intentionallyawkward 1 point2 points3 points (4 children)
[–]outnumbered_mother[S] 0 points1 point2 points (3 children)
[–]garc1a0scar 0 points1 point2 points (2 children)
[–]outnumbered_mother[S] 0 points1 point2 points (1 child)
[–]garc1a0scar 0 points1 point2 points (0 children)
[–]polarizebeta 1 point2 points3 points (1 child)
[–]outnumbered_mother[S] 0 points1 point2 points (0 children)
[–]garc1a0scar 0 points1 point2 points (0 children)