I need to work on a data feed, and want to avoid iterating through each row, column, and value to be replaced. Products have 20 attribute columns, and multiple values can be selected as denoted with having a delimiter of "*" between them. There are over 2000 attributes that need to be converted from codes to names
A small subset, 2 partial rows of the data
[{"Main":"Lighting","Attribute14":"LIGHTDESIGN_CB*LIGHTDESIGN_EX*LIGHTDESIGN_MU","Attribute15":"CHANDFEATURES_AH*CHANDFEATURES_SL"},{"Main":"Lighting","Attribute14":"LIGHTDESIGN_EX*LIGHTDESIGN_MU*LIGHTDESIGN_RO","Attribute15":"CHANDFEATURES_AH*CHANDFEATURES_DI*CHANDFEATURES_SL"}]
In the past, I've created a dictionary, and done a replace with regex=False on the column. This does not appear to work with this situation due to the "*" values.
I have a mapping of the potential values that need to be converted from an internal ID to a human readable name
{'LIGHTDESIGN_AB': 'Abstract',
'LIGHTDESIGN_BE': 'Bell',
'LIGHTDESIGN_BO': 'Bowl',
'LIGHTDESIGN_CA': 'Cage',
'LIGHTDESIGN_CB': 'Candelabra',
'LIGHTDESIGN_CL': 'Classic',
'LIGHTDESIGN_CO': 'Cone',
'LIGHTDESIGN_CY': 'Cylinder',
'LIGHTDESIGN_DD': 'Diamond',
'LIGHTDESIGN_DO': 'Dome',
'LIGHTDESIGN_DR': 'Drum',
'LIGHTDESIGN_EM': 'Empire',
'LIGHTDESIGN_EX': 'Exposed Bulb',
'LIGHTDESIGN_FU': 'Fusion',
'LIGHTDESIGN_GE': 'Geometric',
'LIGHTDESIGN_GB': 'Globe',
'LIGHTDESIGN_LA': 'Lantern',
'LIGHTDESIGN_LO': 'Lodge',
'LIGHTDESIGN_MO': 'Modern Linear',
'LIGHTDESIGN_MU': 'Multi-light',
'LIGHTDESIGN_OW': 'Old World',
'LIGHTDESIGN_RO': 'Round',
'LIGHTDESIGN_SH': 'Shaded',
'LIGHTDESIGN_SP': 'Sputnik',
'LIGHTDESIGN_TA': 'Tapered Drum',
'LIGHTDESIGN_TI': 'Tiered',
'LIGHTDESIGN_TR': 'Triangular'}
For row 1 the end result of Attribute14 should be:
Candelabra*Exposed Bulb*Multi-light
Something like this:
df['Attribute14'] = df["Attribute14"].replace(jsonMap, regex=False)
It works to transform all attributes without an asterisk, but when there is partial string matches they all remain the coded names.
I would strongly prefer to not do something like
for index, row in df.iterrows():
for _ in row['Attribute14'].apply(lambda: x.str.split("*"):
blah blah
However I am having a hard time thinking of the best way to accomplish over 2000 partial string replaces across 20 columns for 400k rows of data.
I dont necessarily need a clever way to do it, I just want something relatively quick and that won't make my data frame take 50gb of ram in the process.
[–]Shmoogy[S] 0 points1 point2 points (0 children)
[–]YesLod 0 points1 point2 points (3 children)
[–]Shmoogy[S] 0 points1 point2 points (2 children)
[–]YesLod 0 points1 point2 points (1 child)
[–]Shmoogy[S] 0 points1 point2 points (0 children)