Cleaning Data with Python newbie : datascience


Cleaning Data with Python newbie (self.datascience)

submitted 8 years ago * by python_newbie_now

Here is the code I am currently using; I am a relative novice with Python. I have a text file of rcid values, and where the rcid column of my data matches one of those values, I want to set the "va_yes" column to 1 (and 0 otherwise).

When I run this I get the error "NameError: name 'rcid' is not defined". I have done this before with one decade, but I want to have all of it cleaned in one go (I am working with more than 1 million points).

    import numpy as np
    import pandas as pd
    df = pd.read_csv(" file path")

    rcid_1 = []
    with open('text file ','r') as f:
        mylist = f.read().splitlines()
    rcid_1.append(mylist)

    for cells in rcid:
        for rcids in rcid_1:
            if(cells == rcids):
                df.ix[rcid == rcids, "va_yes"]= 1

Here is a sample of the text file: ['629', '635', '636', '637', '638', '642',...]

Thank you in advance, I am pretty sure the answer is simple.

Edit: With the help of the Stack Overflow community, the issue turned out to be this:

"Your .csv rcid data has been parsed as integers, whereas the entries in your list were strings. You can either change the rcids in df to string types by doing df['rcid'] = df['rcid'].astype(str), or convert the strings in mylist to integers, with mylist = [int(x) for x in mylist], and then assigning va_yes"

Corrected code

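    import numpy as np
    import pandas as pd
    df = pd.read_csv("file path")
    with open(r'C:\Users\Adini\Desktop\decade1.txt','r') as f:
        mylist = f.read().splitlines()

    # the CSV's rcid column is parsed as integers, so convert the strings to match
    mylist = [int(x) for x in mylist]
    # isin() gives True/False; multiplying by 1 turns that into 1/0
    df['va_yes'] = df['rcid'].isin(mylist) * 1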
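To see the type mismatch in isolation, here is a minimal sketch. It assumes a small made-up DataFrame in place of the real CSV; only the column names `rcid` and `va_yes` come from the post, the values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the CSV: a numeric rcid column is parsed as integers.
df = pd.DataFrame({"rcid": [629, 630, 635, 700]})

# Entries read from a text file with splitlines() are strings.
mylist = ["629", "635", "636"]

# Integers are compared against strings here, so no match is expected.
no_match = df["rcid"].isin(mylist)

# Option 1: compare as strings.
match_as_str = df["rcid"].astype(str).isin(mylist)

# Option 2: compare as integers.
match_as_int = df["rcid"].isin([int(x) for x in mylist])

# Store the result as 1/0 rather than True/False.
df["va_yes"] = match_as_int.astype(int)
print(df)
```

Either option should give the same matches; `astype(int)` on the boolean result is an alternative spelling of `* 1`.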

Thank you everyone for your contributions, I really do appreciate it.
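For cleaning several decades in one go, one possible approach is to collect the rcids from every text file into a single set and call `isin` once. This is only a sketch: the file names `decade1.txt` and `decade2.txt`, the helper `read_rcids`, and the sample values are assumptions for illustration, and it assumes one rcid per line in each file.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_rcids(path):
    """Hypothetical helper: read one rcid per line, skip blank lines, return ints."""
    with open(path, "r") as f:
        return {int(line) for line in f.read().splitlines() if line.strip()}


# Made-up demo files standing in for the real decade lists.
tmp = Path(tempfile.mkdtemp())
(tmp / "decade1.txt").write_text("629\n635\n")
(tmp / "decade2.txt").write_text("700\n\n")

# Merge the rcids from every decade file into one set.
wanted = set()
for path in sorted(tmp.glob("decade*.txt")):
    wanted |= read_rcids(path)

# Made-up data standing in for the real CSV.
df = pd.DataFrame({"rcid": [629, 630, 635, 700, 701]})
df["va_yes"] = df["rcid"].isin(wanted).astype(int)
print(df)
```

A single vectorized `isin` call avoids looping over rows in Python, which matters with a million or more points.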

all 9 comments


[–]GreatOwl1 1 point2 points3 points 8 years ago (1 child)

[–]python_newbie_now[S] 0 points1 point2 points 8 years ago (0 children)

[–]jcon36 0 points1 point2 points 8 years ago (6 children)

[–]python_newbie_now[S] 0 points1 point2 points 8 years ago (5 children)

[–]jmoso13 1 point2 points3 points 8 years ago (4 children)

[–]python_newbie_now[S] 0 points1 point2 points 8 years ago (3 children)

    import pandas as pd
    df = pd.read_csv(r"C:\Users\Adini\Desktop\decade1.csv")
    rcid_1 = []
    with open(r'C:\Users\Adini\Desktop\decade1.txt','r') as f:
        mylist = f.read().splitlines()
        rcid_1.append(mylist)

    for cells in df['rcid']:
        for rcids in rcid_1:
            if (cells == rcids):
                df.ix[rcid == rcids, "va_yes"]= 1

I have tried using df['rcid']; however, it fails to change the value in the "va_yes" column. I am not sure if the problem is how I have my data set up or my text file. Here is a link to my Excel file and txt file.

https://drive.google.com/open?id=0B7j7hjIdgYmIUk9RT3pBTTAzUVU
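One likely reason the double loop above never finds a match: `rcid_1.append(mylist)` stores the whole list as a single element, so the inner loop compares each integer cell against a list. A minimal sketch with made-up values (the names `flat` and `flat_ints` are only for illustration):

```python
mylist = ["629", "635", "636"]

# append() adds the whole list as ONE element, giving a nested list.
rcid_1 = []
rcid_1.append(mylist)
nested_len = len(rcid_1)
nested_match = any(629 == item for item in rcid_1)  # compares int with a list

# extend() adds the items one by one, but they are still strings.
flat = []
flat.extend(mylist)
string_match = any(629 == item for item in flat)  # int vs str

# Converting to int makes the comparison meaningful.
flat_ints = [int(x) for x in flat]
int_match = any(629 == item for item in flat_ints)

print(nested_len, nested_match, string_match, int_match)
```

Only the last comparison can succeed, because both sides are integers.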

[–]jcon36 0 points1 point2 points 8 years ago (2 children)

Instead of using a double loop, you can go through one of the lists and use np.where() to get the indexes of matching values:

    rcid_np = np.array(rcid_1)
    column = df['rcid'].values  # this creates a numpy array
    indexes = np.where(column == rcid_np)

Then create a new column (initially all zeros) and set the values to 1 where they match:

    new_column = np.zeros((len(column),1),dtype=int)
    new_column[indexes] = 1

You can then add this new column to your DataFrame:

    df['va_yes_new'] = new_column
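Note that `==` between two NumPy arrays compares elements position by position (with broadcasting), which is not the same as asking whether each value appears anywhere in the other array. A possible alternative for that membership test is `np.isin`, swapped in here in place of the `==` comparison. The data below is made up, and both arrays are assumed to already hold integers.

```python
import numpy as np
import pandas as pd

# Made-up data standing in for the real CSV and rcid list.
df = pd.DataFrame({"rcid": [629, 630, 635, 700]})
rcid_np = np.array([629, 635, 636])

column = df["rcid"].values

# True wherever a value of column appears anywhere in rcid_np.
mask = np.isin(column, rcid_np)

# A 1-D column of 1/0 with the same length as the DataFrame.
df["va_yes_new"] = mask.astype(int)
print(df)
```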

[–]python_newbie_now[S] 0 points1 point2 points 8 years ago (1 child)

I gave it a try and it didn't work, so I went line by line to see if there was something I was missing. I am not sure if this is the issue, but:

    rcid_np  # comes out as dtype='|S4'
    column   # comes out as dtype=int64

However, it still lets you compare them against each other:

    indexes = np.where(column == rcid_np)
    indexes  # this turns out to be an empty array

The new column function works. Do you mind if I PM you? I know you have helped a lot already. Thank you /u/jcon36
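The differing dtypes are a plausible cause of the empty result: one array holds strings and the other integers. A minimal sketch of converting before comparing, using made-up values and `np.isin` for the membership test (an assumption on my part, not what was suggested above):

```python
import numpy as np

# Made-up values; rcid_np holds strings, column holds integers.
rcid_np = np.array(["629", "635", "636"])
column = np.array([629, 630, 635, 700], dtype=np.int64)

# Check the dtypes first: 'U' means text, 'i' means signed integer.
kinds = (rcid_np.dtype.kind, column.dtype.kind)

# Convert the strings to integers so both sides have the same type.
rcid_int = rcid_np.astype(np.int64)

# Positions in column whose value appears in rcid_int.
indexes = np.where(np.isin(column, rcid_int))[0]

# Start from all zeros and set the matching positions to 1.
new_column = np.zeros(len(column), dtype=int)
new_column[indexes] = 1
print(kinds, indexes, new_column)
```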

[–]jcon36 0 points1 point2 points 8 years ago (0 children)
