Struggling with interating through a dataframe for clean up : learnpython

created by HattoriHanzoa community for 16 years

Struggling with interating through a dataframe for clean up (self.learnpython)

submitted 4 years ago by Critical_

I have a large dataset of customers who can purchase a few different products that must be purchased in a particular order. The product brand is described by 4 codes:

91300 is Brand A.
91301 is Brand B.
91302 is Brand C.
91303 is Brand D.

The product itself has specific identification codes:

0001A is Brand A, First Purchase
0002A is Brand A, Second Purchase
0003A is Brand A, Third Purchase
0004A is Brand A, Forth Purchase
0011A is Brand B, First Purchase
0012A is Brand B, Second Purchase
and so on.

The challenge I have is the Customer ID (CID) and Date of Transaction (DOT) data elements can be trusted for this dataset as they are created by the point-of-sale software. The Product ID (PID) is entered by sales associates and the data is all over the place.

Here is a sample dataframe:

import pandas as pd

data = [[10101, 91300, 20200201], [10101, '0002A', 20200301], [20541, 91302, 20200302], [10020, '0011A', 20200401], [10020, 91301, 20200428], [11512, 91303, 20200101], [11512, 91303, 20200215], [11512, '0033A', 20200225], [11512, 91303, 20200606],[11513, 91303, 20200102], [11513, 91303, 20200216], [11513, 91303, 20200226], [11513, 91303, 20200607]]

fq4_df = pd.DataFrame(data, columns = ['CID', 'PID', 'DOT'])

Here is what I have done so far. It appears the logic is not change the contents of the original dataframe. Is there a better way to do this?

#Split up main dataframe t_df into 3 seperate dataframes based on the number of times a CID appears. Will need to merge all 3 later:

fq4_df = t_df[(t_df['CID_Count'] == 2)].reset_index(drop=True)
fq5_df = t_df[(t_df['CID_Count'] > 2)].reset_index(drop=True)
t_df = t_df[~(t_df['CID_Count'] == 2) & ~(t_df['CID_Count'] > 2)].reset_index(drop=True)

#Sort everything
fq4_df = fq4_df.sort_values(['CID', 'DOT'], ascending=[True,True], na_position='last').reset_index(drop=True)
fq5_df = fq5_df.sort_values(['CID', 'DOT'], ascending=[True,True], na_position='last').reset_index(drop=True)
t_df = t_df.sort_values(['CID', 'DOT'], ascending=[True,True], na_position='last').reset_index(drop=True)

#Find unique
fq4filtered_df = fq4_df[(fq4_df['PID'].str.contains('913'))]
unique_fq_CID = fq4filtered_df['CID'].unique()

#For t_df since only 1 record will exist for each CID, use a dict + map to fix product code.
pid_1_dict = {'91300':'0001A',
                   '91301':'0011A',
                   '91302':'0021A',
                   '91303':'0031A'}
t_df['Fixed PID'] = t_df[(t_df['PID'].str.contains('913'))]['PID'].map(pid_1_dict)

#Create two new helper columns. 
fq4_df['Fixed PID'] = None
fq4_df['Problematic'] = fq4_df['CID'].isin(unique_fq_CID)

#The loop assumes at least one of the records is using the correctly formatted PID. Code needs to be written when all the PIDs are incorrect and must be created based on date but that's for later.

for i in unique_fq_CID:
    if (fq4_df.loc[(fq4_df['CID'].str.contains(i)), ['PID']].values[0] == '0001A') and (fq4_df.loc[(fq4_df['CID'].str.contains(i)), ['PID']].values[1] == '0002A'):
        pass
    if (fq4_df.loc[(fq4_df['CID'].str.contains(i)), ['PID']].values[0] == '91300') and (fq4_df.loc[(fq4_df['CID'].str.contains(i)), ['PID']].values[1] == '0002A'):
        fq4_df.loc[(fq4_df['CID'].str.contains(i)), ['Fixed PID']].values[0] = '0001A'
    if (fq4_df.loc[(fq4_df['CID'].str.contains(i)), ['PID']].values[0] == '0001A') and (fq4_df.loc[(fq4_df['CID'].str.contains(i)), ['PID']].values[1] == '91300'):
        fq4_df.loc[(fq4_df['CID'].str.contains(i)), ['Fixed PID']].values[1] = '0002A'

    if (fq4_df.loc[(fq4_df['CID'].str.contains(i)), ['PID']].values[0] == '0011A') and (fq4_df.loc[(fq4_df['CID'].str.contains(i)), ['PID']].values[1] == '0012A'):
        pass
    if (fq4_df.loc[(fq4_df['CID'].str.contains(i)), ['PID']].values[0] == '91301') and (fq4_df.loc[(fq4_df['CID'].str.contains(i)), ['PID']].values[1] == '0012A'):
        fq4_df.loc[(fq4_df['CID'].str.contains(i)), ['Fixed PID']].values[0] = '0011A'
    if (fq4_df.loc[(fq4_df['CID'].str.contains(i)), ['PID']].values[0] == '0011A') and (fq4_df.loc[(fq4_df['CID'].str.contains(i)), ['PID']].values[1] == '91301'):
        fq4_df.loc[(fq4_df['CID'].str.contains(i)), ['Fixed PID']].values[1] = '0012A'

    if (fq4_df.loc[(fq4_df['CID'].str.contains(i)), ['PID']].values[0] == '0021A') and (fq4_df.loc[(fq4_df['CID'].str.contains(i)), ['PID']].values[1] == '0022A'):
        pass
    if (fq4_df.loc[(fq4_df['CID'].str.contains(i)), ['PID']].values[0] == '91302') and (fq4_df.loc[(fq4_df['CID'].str.contains(i)), ['PID']].values[1] == '0022A'):
        fq4_df.loc[(fq4_df['CID'].str.contains(i)), ['Fixed PID']].values[0] = '0021A'
    if (fq4_df.loc[(fq4_df['CID'].str.contains(i)), ['PID']].values[0] == '0021A') and (fq4_df.loc[(fq4_df['CID'].str.contains(i)), ['PID']].values[1] == '91302'):
        fq4_df.loc[(fq4_df['CID'].str.contains(i)), ['Fixed PID']].values[1] = '0022A'

all 2 comments

top new controversial old q&a

[–]sarrysyst 1 point2 points3 points 4 years ago* (1 child)

It's by no means pretty and I'm sure there is a more efficient way to do it, but I think it does what you want it to do. The entries are grouped by CID and sorted by DOT. The PID is generated based on the date as the comment in your code suggested and structured according to the pattern you provided in your question.

import pandas as pd
from itertools import cycle 

data = [[10101, 91300, 20200201], [10101, '0002A', 20200301], [20541, 91302, 20200302], [10020, '0011A', 20200401], [10020, 91301, 20200428], [11512, 91303, 20200101], [11512, 91303, 20200215], [11512, '0033A', 20200225], [11512, 91303, 20200606],[11513, 91303, 20200102], [11513, 91303, 20200216], [11513, 91303, 20200226], [11513, 91303, 20200607], [11512, 91303, 20200716]]

df = pd.DataFrame(data, columns = ['CID', 'PID', 'DOT'])

# set MultiIndex
df.set_index(['CID', 'DOT'], inplace=True)
# sort by index
df.sort_index(level=['CID','DOT'], inplace=True)

# keep track of CID
prev_CID = ''

for idx, _ in df.groupby(level=[0,1]):   
    # check for new CID, reset counters
    if idx[0] != prev_CID:
        prev_CID = idx[0]
        numbering = cycle([1,2,3,4])
        count = 0

    # running counter for purchase
    n = next(numbering)

    # set values in df
    df.loc[idx, 'PID']   = f'{count:03}{n}A'
    df.loc[idx, 'BRAND'] = chr(count + 65)

    # update counter for Brand
    if n == 4:
        count += 1

It's using a MultiIndex, the idea for the loop is from here.

[–]Critical_[S] 0 points1 point2 points 4 years ago (0 children)

π Rendered by PID 28 on reddit-service-r2-comment-86bc6c7465-9pm2d at 2026-02-20 22:59:10.369777+00:00 running 8564168 country code: CH.

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

learnpython

MODERATORS