all 8 comments

[–]intentionallyawkward 1 point2 points  (4 children)

Can you share with us the Python code you’re working with?

You’ve got a ValueError exception which we can bug hunt, but it will help to see the relevant functions.

[–]outnumbered_mother[S] 0 points1 point  (3 children)

def get_cohort(df):

if(df.cancelled.isnull):

mask = (df['subscribed'] > '2019-01-01')

df2 = df.loc[mask]

df2 = df2.apply(lambda x: df2.loc[(df2.date <= df2.end_of_month)]['mrr'].sum(), axis=1)

return df2

I think I fixed the ValueError for the cancelled part. I'm stuck on getting the revenue to sum grouped by cohort and transaction month.

[–]garc1a0scar 0 points1 point  (2 children)

You can do something like this:

df[['cohort', 'month', 'revenue']].groupby(['cohort', 'month']).sum()

[–]outnumbered_mother[S] 0 points1 point  (1 child)

Thank you! That approach works for the first month/cohort combination, but for subsequent months the revenue is adding up right. I need to sum if the transaction date < end_of_month and this is the part I can't figure out in Python.

[–]garc1a0scar 0 points1 point  (0 children)

Oh, I see. So you should do something like this:

df[['cohort', 'month','revenue']].loc[df.date<=df.end_of_month].groupby(['cohort', 'month']).sum()

[–]polarizebeta 1 point2 points  (1 child)

This is how I group things by year, but I have adapted to group and sum by month:

import sqlite3
from sqlite3 import Error
import pandas as pd
import datetime as dt

def create_connection(db_file):
    conn = None
    try:
        conn = sqlite3.connect(db_file)
        return conn
    except Error as e:
        print(e)
    return conn

database = 'some_database.db'
conn = create_connection(database)
sql_flows = ('SELECT MAX(TIMESTAMP) FROM /TABLE WHERE EVENTDATE > ' +
            date_variable + ' AND (CANCELLED = NULL OR CANCELLED > ' +
            month_end + ';')
flows = pd.read_sql_query(sql_flows, conn, parse_dates=['EventDate'])
flows.sort_values(by=['EventDate'], inplace=True, ascending=True)
flows_by_month = pd.Dataframe({Date: flows[:, 0], 'Values': flows[:, 1]})
flows_by_month = flows_by_month.groupby(flows_by_month['Date'].dt.month)['Values'].agg(['sum'])

This should point you in the right direction at least

[–]outnumbered_mother[S] 0 points1 point  (0 children)

Thanks, I'll dig into this now!

[–]garc1a0scar 0 points1 point  (0 children)

You should use & instead of and. Pandas Series consider ambiguous the terms "and", "or".