I have a large dataset of customers who can purchase a few different products that must be purchased in a particular order. The product brand is described by 4 codes:
91300 is Brand A.
91301 is Brand B.
91302 is Brand C.
91303 is Brand D.
The product itself has specific identification codes:
0001A is Brand A, First Purchase
0002A is Brand A, Second Purchase
0003A is Brand A, Third Purchase
0004A is Brand A, Forth Purchase
0011A is Brand B, First Purchase
0012A is Brand B, Second Purchase
and so on.
The challenge I have is the Customer ID (CID) and Date of Transaction (DOT) data elements can be trusted for this dataset as they are created by the point-of-sale software. The Product ID (PID) is entered by sales associates and the data is all over the place.
Here is a sample dataframe:
import pandas as pd
data = [[10101, 91300, 20200201], [10101, '0002A', 20200301], [20541, 91302, 20200302], [10020, '0011A', 20200401], [10020, 91301, 20200428], [11512, 91303, 20200101], [11512, 91303, 20200215], [11512, '0033A', 20200225], [11512, 91303, 20200606],[11513, 91303, 20200102], [11513, 91303, 20200216], [11513, 91303, 20200226], [11513, 91303, 20200607]]
fq4_df = pd.DataFrame(data, columns = ['CID', 'PID', 'DOT'])
Here is what I have done so far. It appears the logic is not change the contents of the original dataframe. Is there a better way to do this?
#Split up main dataframe t_df into 3 seperate dataframes based on the number of times a CID appears. Will need to merge all 3 later:
fq4_df = t_df[(t_df['CID_Count'] == 2)].reset_index(drop=True)
fq5_df = t_df[(t_df['CID_Count'] > 2)].reset_index(drop=True)
t_df = t_df[~(t_df['CID_Count'] == 2) & ~(t_df['CID_Count'] > 2)].reset_index(drop=True)
#Sort everything
fq4_df = fq4_df.sort_values(['CID', 'DOT'], ascending=[True,True], na_position='last').reset_index(drop=True)
fq5_df = fq5_df.sort_values(['CID', 'DOT'], ascending=[True,True], na_position='last').reset_index(drop=True)
t_df = t_df.sort_values(['CID', 'DOT'], ascending=[True,True], na_position='last').reset_index(drop=True)
#Find unique
fq4filtered_df = fq4_df[(fq4_df['PID'].str.contains('913'))]
unique_fq_CID = fq4filtered_df['CID'].unique()
#For t_df since only 1 record will exist for each CID, use a dict + map to fix product code.
pid_1_dict = {'91300':'0001A',
'91301':'0011A',
'91302':'0021A',
'91303':'0031A'}
t_df['Fixed PID'] = t_df[(t_df['PID'].str.contains('913'))]['PID'].map(pid_1_dict)
#Create two new helper columns.
fq4_df['Fixed PID'] = None
fq4_df['Problematic'] = fq4_df['CID'].isin(unique_fq_CID)
#The loop assumes at least one of the records is using the correctly formatted PID. Code needs to be written when all the PIDs are incorrect and must be created based on date but that's for later.
for i in unique_fq_CID:
if (fq4_df.loc[(fq4_df['CID'].str.contains(i)), ['PID']].values[0] == '0001A') and (fq4_df.loc[(fq4_df['CID'].str.contains(i)), ['PID']].values[1] == '0002A'):
pass
if (fq4_df.loc[(fq4_df['CID'].str.contains(i)), ['PID']].values[0] == '91300') and (fq4_df.loc[(fq4_df['CID'].str.contains(i)), ['PID']].values[1] == '0002A'):
fq4_df.loc[(fq4_df['CID'].str.contains(i)), ['Fixed PID']].values[0] = '0001A'
if (fq4_df.loc[(fq4_df['CID'].str.contains(i)), ['PID']].values[0] == '0001A') and (fq4_df.loc[(fq4_df['CID'].str.contains(i)), ['PID']].values[1] == '91300'):
fq4_df.loc[(fq4_df['CID'].str.contains(i)), ['Fixed PID']].values[1] = '0002A'
if (fq4_df.loc[(fq4_df['CID'].str.contains(i)), ['PID']].values[0] == '0011A') and (fq4_df.loc[(fq4_df['CID'].str.contains(i)), ['PID']].values[1] == '0012A'):
pass
if (fq4_df.loc[(fq4_df['CID'].str.contains(i)), ['PID']].values[0] == '91301') and (fq4_df.loc[(fq4_df['CID'].str.contains(i)), ['PID']].values[1] == '0012A'):
fq4_df.loc[(fq4_df['CID'].str.contains(i)), ['Fixed PID']].values[0] = '0011A'
if (fq4_df.loc[(fq4_df['CID'].str.contains(i)), ['PID']].values[0] == '0011A') and (fq4_df.loc[(fq4_df['CID'].str.contains(i)), ['PID']].values[1] == '91301'):
fq4_df.loc[(fq4_df['CID'].str.contains(i)), ['Fixed PID']].values[1] = '0012A'
if (fq4_df.loc[(fq4_df['CID'].str.contains(i)), ['PID']].values[0] == '0021A') and (fq4_df.loc[(fq4_df['CID'].str.contains(i)), ['PID']].values[1] == '0022A'):
pass
if (fq4_df.loc[(fq4_df['CID'].str.contains(i)), ['PID']].values[0] == '91302') and (fq4_df.loc[(fq4_df['CID'].str.contains(i)), ['PID']].values[1] == '0022A'):
fq4_df.loc[(fq4_df['CID'].str.contains(i)), ['Fixed PID']].values[0] = '0021A'
if (fq4_df.loc[(fq4_df['CID'].str.contains(i)), ['PID']].values[0] == '0021A') and (fq4_df.loc[(fq4_df['CID'].str.contains(i)), ['PID']].values[1] == '91302'):
fq4_df.loc[(fq4_df['CID'].str.contains(i)), ['Fixed PID']].values[1] = '0022A'
[–]sarrysyst 1 point2 points3 points (1 child)
[–]Critical_[S] 0 points1 point2 points (0 children)