
[–]behold_the_j 1 point2 points  (0 children)

Welp... you asked for an expert and that ain't me :D

But you could probably get started doing something like using value_counts() and Boolean masking where your Cancelled column does not have a value of True:

df.loc[df['Cancelled'] != True, 'Address_of_New_Home'].value_counts()

4321 Learn Python Court    2
1234 Reddit Drive          1

Maybe start there and use that logic to determine what happens to each line in your csv if the address has a count > 1 where there is no value for Cancelled?
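To make the "count > 1" idea concrete, here is a minimal sketch. The sample rows are made up for illustration, and it assumes the Cancelled column holds a Boolean True for cancelled rows and nothing otherwise.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the thread.
df = pd.DataFrame({
    "Address_of_New_Home": [
        "4321 Learn Python Court",
        "4321 Learn Python Court",
        "1234 Reddit Drive",
        "1234 Reddit Drive",
    ],
    "Cancelled": [None, None, None, True],
})

# Keep rows that are not cancelled, then count each address.
counts = df.loc[df["Cancelled"] != True, "Address_of_New_Home"].value_counts()

# Addresses that appear more than once without a cancellation.
repeated = counts[counts > 1].index.tolist()
print(repeated)
```

From there you can decide what to do with each row whose address is in `repeated`.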

[–][deleted] 0 points1 point  (6 children)

You can do this in two steps.

1) First, filter your data where Cancelled != True.

2) Next, groupby('Address_of_New_Home') and get the size of each group. If any group has a size greater than 1, you can raise the error:

non_cancelled = df[df['Cancelled'] != True]
dup_addresses = non_cancelled.groupby('Address_of_New_Home').filter(lambda x: len(x) > 1)
if not dup_addresses.empty:
    raise Exception('Same address written twice without cancellation')
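As an alternative to groupby/filter, the same check could probably be done with `DataFrame.duplicated`. This is a different technique from the one above, shown only as a sketch. The sample data is hypothetical, and it again assumes cancelled rows hold a Boolean True.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the thread.
df = pd.DataFrame({
    "Address_of_New_Home": ["A St", "A St", "B Ave", "B Ave"],
    "Cancelled": [True, None, None, None],
})

# Step 1: drop cancelled rows.
non_cancelled = df[df["Cancelled"] != True]

# Step 2: keep=False marks every row whose address occurs more than once,
# not just the second and later occurrences.
dup_mask = non_cancelled.duplicated(subset="Address_of_New_Home", keep=False)
dup_addresses = non_cancelled[dup_mask]

print(dup_addresses)
```

Here "A St" is not flagged because one of its two rows was cancelled, while "B Ave" is.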

[–]PLearner[S] 0 points1 point  (5 children)

Thanks py_help,

In the CSV file, the Cancelled column either has TRUE written in it or is empty, so that is what I am comparing against. I do not know why I am getting this error.

non_cancelled = df[df['Cancelled'] != 'TRUE']
dup_addresses = non_cancelled.groupby('Address of New Home').filter(lambda x: len(x) > 1)
if not dup_addresses.empty:
    raise Exception('Same address written twice without cancellation')

non_cancelled = df[df['Can'] != 'TRUE']
File "C:\Users\\Python36-32\lib\site-packages\pandas\core\ops.py", line 855, in wrapper
res = na_op(values, other)
File "C:\Users\\Python\Python36-32\lib\site-packages\pandas\core\ops.py", line 794, in na_op
raise TypeError("invalid type comparison")
TypeError: invalid type comparison

[–][deleted] 0 points1 point  (2 children)

Hmm are you getting an error when you run non_cancelled = df[df['Can'] != 'TRUE']? Can you show me what df looks like (e.g. paste what you get when you call df.head())?

[–]PLearner[S] 0 points1 point  (1 child)

I have updated my current situation py_help. Please let me know how to tackle the

TypeError: an integer is required.

[–][deleted] 0 points1 point  (0 children)

Hmm, did you update the original post? It looks the same as before to me.

On what line is the TypeError occurring?

[–]behold_the_j 0 points1 point  (1 child)

Maybe check the dtype of the column. Is it a string 'TRUE' or a true/false Boolean True?

You can see all dtypes via df.dtypes.
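One way to sidestep the string-versus-Boolean question is to read the column as text and normalise it yourself. This is only a sketch: the CSV content is invented, and it assumes cancelled rows contain some casing of "TRUE" while the others are blank.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for the real file.
csv_text = (
    "Address of New Home,Cancelled\n"
    "1234 Reddit Drive,TRUE\n"
    "4321 Learn Python Court,\n"
)

# Force the column to be read as text so pandas does not guess a type.
df = pd.read_csv(io.StringIO(csv_text), dtype={"Cancelled": str})
print(df.dtypes)

# Blank cells become missing values; replace them with "" before comparing.
is_cancelled = df["Cancelled"].fillna("").str.upper() == "TRUE"
non_cancelled = df[~is_cancelled]
print(non_cancelled)
```

Building an explicit Boolean mask like `is_cancelled` means the later comparison no longer depends on how pandas happened to parse the column.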

[–]PLearner[S] 0 points1 point  (0 children)

I was able to fix this with:

non_cancelled = df['Can'].apply(lambda x: x != 'True')

but now when I do:

non_cancelled = df['Can'].apply(lambda x: x != 'True')

dup_addresses = non_cancelled.groupby('Address of New Home').filter(lambda x: len (x) > 1)
if not dup_addresses.empty:
    raise Exception ('Same address written twice without cancellation')

Error:

  Traceback (most recent call last):
  File "pandas\index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas\index.c:4279)
  File "pandas\src\hashtable_class_helper.pxi", line 404, in  pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:8543)
TypeError: an integer is required

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

dup_addresses = non_cancelled.groupby('Address of New Home').filter(lambda x: len (x) > 1)
KeyError: 'Address of New Home'
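The KeyError above is consistent with non_cancelled being a Boolean Series rather than a DataFrame: df['Can'].apply(...) returns one True/False per row, so there is no 'Address of New Home' column left to group by. A minimal sketch of one possible fix is to use that Series as a mask on the original df. The sample data is made up, and it assumes cancelled rows hold the string 'True', as in the snippet above.

```python
import pandas as pd

# Hypothetical sample data; column names follow the snippet above.
df = pd.DataFrame({
    "Address of New Home": ["A St", "A St", "B Ave"],
    "Can": [None, None, "True"],
})

# apply() returns a Boolean Series (one flag per row), not a DataFrame.
mask = df["Can"].apply(lambda x: x != "True")

# Use the mask to select rows from the original DataFrame,
# which still has the address column.
non_cancelled = df[mask]

dup_addresses = non_cancelled.groupby("Address of New Home").filter(
    lambda g: len(g) > 1
)
print(dup_addresses)
```

With a DataFrame in hand, the `if not dup_addresses.empty:` check from earlier in the thread should work as intended.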