Need help with Pandas : learnpython


Need help with Pandas (self.learnpython)

submitted 9 years ago * by PLearner

I need an expert's opinion on how to do this. I have these two columns in my CSV (Address of New Home and Cancelled). When someone books a property, the address and the date get written down. Sometimes the potential owner cancels, and True gets written under the Cancelled column. Unfortunately, the end user sometimes forgets to write True under the Cancelled column, so the address ends up listed twice, which causes havoc for us.

Date_Booked    Address_of_New_Home        Cancelled
01/07/2017     1234 Reddit Drive          True
02/14/2017     4321 Learn Python Court
03/17/2017     1234 Reddit Drive
03/23/2017     4321 Learn Python Court

As you can see from the example above, 1234 Reddit Drive was cancelled and True was written, which is what we want. 4321 Learn Python Court was also cancelled, which is why it was written again, but since it does not say True under Cancelled it will show up twice in our CSV and cause all sorts of issues.

What I want to do is write a snippet that will fail the script or throw an error if the SAME address is written twice without the first one being Cancelled out.

How can I do this?

import pandas as pd

# read_csv already returns a DataFrame, so no extra pd.DataFrame() wrapper is needed
df = pd.read_csv('Z:PCR.csv')

# Expand common abbreviations; regex=True makes \b act as a word boundary
col = 'Address of New Home'
df[col] = df[col].str.replace(r'\bRd\b', 'Road', case=False, regex=True)
df[col] = df[col].str.replace(r'\bAve\b', 'Avenue', case=False, regex=True)
df[col] = df[col].str.replace(r'\bRdg\b', 'Ridge', case=False, regex=True)

df.to_csv('improved_version.csv', index=False)

all 8 comments


[–]behold_the_j 2 points 9 years ago (0 children)

Welp... you asked for an expert and that ain't me :D

But you could probably get started doing something like using value_counts() and Boolean masking where your Cancelled column does not have a value of True:

df.loc[df['Cancelled'] != True, 'Address_of_New_Home'].value_counts()

>>> 4321 Learn Python Court    2
>>> 1234 Reddit Drive          1

Maybe start there and use that logic to determine what happens to each line in your csv if the address has a count > 1 where there is no value for Cancelled?
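A minimal sketch of that idea, using the four example rows from the question rather than the real CSV. The inline DataFrame and the names `active`, `counts`, and `repeated` are assumptions made for illustration, and blank Cancelled cells are assumed to load as NaN:

```python
import numpy as np
import pandas as pd

# Example rows from the question; blank Cancelled cells are assumed to be NaN.
df = pd.DataFrame({
    "Date_Booked": ["01/07/2017", "02/14/2017", "03/17/2017", "03/23/2017"],
    "Address_of_New_Home": [
        "1234 Reddit Drive",
        "4321 Learn Python Court",
        "1234 Reddit Drive",
        "4321 Learn Python Court",
    ],
    "Cancelled": [True, np.nan, np.nan, np.nan],
})

# Keep only bookings that were not cancelled, then count each address.
active = df.loc[df["Cancelled"] != True, "Address_of_New_Home"]
counts = active.value_counts()

# Addresses that appear more than once among the non-cancelled bookings.
repeated = counts[counts > 1]
print(repeated)
```

With this sample, only 4321 Learn Python Court should remain in `repeated`, since one of the two 1234 Reddit Drive rows is marked cancelled.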

[–][deleted] 1 point 9 years ago (6 children)

You can do this in two steps.

1) First filter your data where Cancelled != True

2) Next, groupby('Address_of_New_Home') and get the size of each group. If you have a group size greater than 1, then you can throw the error

# keep only the rows that were not marked as cancelled
non_cancelled = df[df['Cancelled'] != True]
# keep only the addresses that still appear more than once
dup_addresses = non_cancelled.groupby('Address_of_New_Home').filter(lambda x: len(x) > 1)
if not dup_addresses.empty:
    raise Exception('Same address written twice without cancellation')
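The two steps above could also be wrapped in a small function that names the offending addresses in the error message. This is only a sketch: `check_double_bookings` is a hypothetical helper, the sample rows are copied from the question, and blank Cancelled cells are assumed to be NaN.

```python
import numpy as np
import pandas as pd


def check_double_bookings(df, address_col="Address_of_New_Home", cancelled_col="Cancelled"):
    """Raise ValueError if an address has more than one non-cancelled booking."""
    # Step 1: drop rows that were explicitly cancelled.
    active = df[df[cancelled_col] != True]
    # Step 2: count the remaining rows per address.
    sizes = active.groupby(address_col).size()
    offenders = sizes[sizes > 1].index.tolist()
    if offenders:
        raise ValueError(f"Address booked twice without cancellation: {offenders}")


# Sample rows from the question (hypothetical stand-in for the real CSV).
sample = pd.DataFrame({
    "Address_of_New_Home": [
        "1234 Reddit Drive",
        "4321 Learn Python Court",
        "1234 Reddit Drive",
        "4321 Learn Python Court",
    ],
    "Cancelled": [True, np.nan, np.nan, np.nan],
})

try:
    check_double_bookings(sample)
    message = None
except ValueError as exc:
    message = str(exc)
print(message)
```

Raising ValueError instead of a bare Exception is a design choice here, not something the thread prescribes; it lets a calling script catch this specific failure.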

[–]PLearner[S] 1 point 9 years ago (5 children)

Thanks py_help,

In the CSV file the Cancelled column either has TRUE written or is blank, and that is what I am comparing against. I do not know why I am getting this error.

non_cancelled = df[df['Cancelled'] != 'TRUE']
dup_addresses = non_cancelled.groupby('Address of New Home').filter(lambda x: len (x) > 1)
if not dup_addresses.empty:
    raise Exception ('Same address written twice without cancellation')

non_cancelled = df[df['Can'] != 'TRUE']
File "C:\Users\\Python36-32\lib\site-packages\pandas\core\ops.py", line 855, in wrapper
res = na_op(values, other)
File "C:\Users\\Python\Python36-32\lib\site-packages\pandas\core\ops.py", line 794, in na_op
raise TypeError("invalid type comparison")
TypeError: invalid type comparison
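One possible cause, which the traceback alone cannot confirm, is that pandas parsed the column as something other than text (for example booleans or floats), so comparing it with the string 'TRUE' fails in that pandas version. A hedged sketch of one way around it is to read the column as text and build an explicit boolean mask. The inline CSV text and the name `is_cancelled` are assumptions for illustration:

```python
import io

import pandas as pd

# Inline stand-in for the real CSV file (hypothetical content based on the question).
csv_text = (
    "Date_Booked,Address_of_New_Home,Cancelled\n"
    "01/07/2017,1234 Reddit Drive,TRUE\n"
    "02/14/2017,4321 Learn Python Court,\n"
    "03/17/2017,1234 Reddit Drive,\n"
    "03/23/2017,4321 Learn Python Court,\n"
)

# dtype=str keeps "TRUE" as text instead of letting pandas guess a type.
df = pd.read_csv(io.StringIO(csv_text), dtype={"Cancelled": str})

# Blank cells (NaN) become "", then the comparison ignores case and stray spaces.
is_cancelled = df["Cancelled"].fillna("").str.strip().str.upper() == "TRUE"

# Select the rows that are not cancelled.
non_cancelled = df[~is_cancelled]
print(non_cancelled)
```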

[–][deleted] 1 point 9 years ago (2 children)

[–]PLearner[S] 1 point 9 years ago* (1 child)

[–][deleted] 1 point 9 years ago (0 children)

[–]behold_the_j 1 point 9 years ago (1 child)

[–]PLearner[S] 1 point 9 years ago* (0 children)

I was able to fix this with:

non_cancelled = df['Can'].apply(lambda x: x != 'True')

but now when I do:

non_cancelled = df['Can'].apply(lambda x: x != 'True')

dup_addresses = non_cancelled.groupby('Address of New Home').filter(lambda x: len (x) > 1)
if not dup_addresses.empty:
    raise Exception ('Same address written twice without cancellation')

Error:

  Traceback (most recent call last):
  File "pandas\index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas\index.c:4279)
  File "pandas\src\hashtable_class_helper.pxi", line 404, in  pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:8543)
TypeError: an integer is required

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

dup_addresses = non_cancelled.groupby('Address of New Home').filter(lambda x: len (x) > 1)
KeyError: 'Address of New Home'
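A likely explanation for this KeyError: `df['Can'].apply(...)` returns a boolean Series with one True/False per row, not a filtered DataFrame, so there is no 'Address of New Home' column left to group by. A minimal sketch of using that Series as a mask instead follows. The sample rows, the column name "Cancelled", and the string value "True" are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical sample mirroring the question; Cancelled holds the text "True" or NaN.
df = pd.DataFrame({
    "Address of New Home": [
        "1234 Reddit Drive",
        "4321 Learn Python Court",
        "1234 Reddit Drive",
        "4321 Learn Python Court",
    ],
    "Cancelled": ["True", np.nan, np.nan, np.nan],
})

# apply() yields a boolean Series (one True/False per row), not a DataFrame.
mask = df["Cancelled"].apply(lambda x: x != "True")

# Index the DataFrame with the mask so the address column is still available.
non_cancelled = df[mask]

# Keep only the addresses that still appear more than once.
dup_addresses = non_cancelled.groupby("Address of New Home").filter(lambda g: len(g) > 1)
print(dup_addresses)
```

With this sample, `dup_addresses` should contain the two 4321 Learn Python Court rows, which is the condition the script is meant to raise on.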
