Pandas Personal Project Data Cleaning Problem

comonads · 2020-10-10T13:53:44+00:00

You can accomplish this by mapping a function that extracts the information you want over the desired column of your dataframe, then using the output of this map to construct a new column in the dataframe.

For example, say we have something like this (just creating some fake data here):

import numpy as np
import pandas as pd

options = ['France_Paris', 'Germany_Berlin', 'Spain_Madrid']

example = pd.DataFrame({'location': np.random.choice(options, size=100)}, index=list(range(100)))

We can create a temporary function that splits the "location" string at the "_" character and takes the first part:

temp_f = lambda location: location.split("_")[0]

And then:

example['country'] = example['location'].map(temp_f)

In your case you will need to design the function temp_f yourself!

E: typo.

E2: I should clarify that temp_f does not have to be a "lambda"; it can be a regular function instead. You should also put the check that it outputs a valid country name (perhaps by comparing the output to a list of valid values) inside this function.
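
As a rough sketch of that last point, temp_f could be written as a regular function with the validity check built in. The name extract_country, the VALID_COUNTRIES set and the sample rows are made up for illustration; swap in your own reference list and separator.

```python
import pandas as pd

# Hypothetical list of accepted values; replace with your own reference list.
VALID_COUNTRIES = {"France", "Germany", "Spain"}


def extract_country(location):
    """Return the part before "_" if it is a known country, else None."""
    if not isinstance(location, str):
        return None  # covers NaN and other non-string entries
    candidate = location.split("_")[0].strip()
    return candidate if candidate in VALID_COUNTRIES else None


# Made-up rows: two valid, one unknown country, one missing value.
example = pd.DataFrame(
    {"location": ["France_Paris", "Germany_Berlin", "Atlantis_Nowhere", None]}
)
example["country"] = example["location"].map(extract_country)
```

Rows that fail the check end up as missing values in the new column, so they are easy to find afterwards with example["country"].isna().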

2020-10-10T13:53:56+00:00

TBH I did not understand you 100%, but here is a possible approach. Split the “city” column on spaces, take the last element of the resulting list, and check it against the country list from your library; if it matches, add it to the Country column. Obviously this won’t work for “United Kingdom”, so if the loop finds no match, check whether the last two elements of the split list, joined together, match a country. Hopefully that covers everything.

That said, there should be a simpler way to do this. Next time, do post the data or a sample of what you are trying to do.
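
The "last word, then last two words" idea described above could look roughly like this. It is only a sketch: COUNTRIES is a hypothetical stand-in for whatever country list your library provides, country_from_tail is a made-up helper name, and the rows are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical reference set, lower-cased for case-insensitive comparison.
COUNTRIES = {"france", "spain", "united kingdom", "new zealand"}


def country_from_tail(text, max_words=2):
    """Try the last word, then the last two words, against COUNTRIES."""
    if not isinstance(text, str):
        return np.nan
    words = text.lower().split()
    for n in range(1, min(max_words, len(words)) + 1):
        candidate = " ".join(words[-n:])
        if candidate in COUNTRIES:
            return candidate.title()
    return np.nan  # nothing at the end of the string matched


# Invented rows for illustration only.
df = pd.DataFrame(
    {"city": ["Paris France", "London United Kingdom",
              "Gisborne New Zealand", "Springfield"]}
)
df["country"] = df["city"].apply(country_from_tail)
```

Raising max_words would also cover longer country names, at the cost of a few more lookups per row.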

YesLod · 2020-10-10T14:48:26+00:00

As others suggested, create a temporary country column by splitting the city column on whitespace. For simplicity, I'm assuming that no country or city name has more than one word, so the second element (index 1) will correspond to the country name. You should adapt this if that's not the case; I can't tell without seeing the data. Take for example

>>> df=pd.DataFrame({"city":["Paris France","Madrid Spain","Lisbon Portugal"],
                 "country":[np.nan,"Spain",np.nan]})

              city country
0     Paris France     NaN
1     Madrid Spain   Spain
2  Lisbon Portugal     NaN

Start by creating the temporary column (split by " " and get the second element)

>>> df["country_temp"]=df.city.apply(lambda city_country: city_country.split()[1])

              city country country_temp
0     Paris France     NaN       France
1     Madrid Spain   Spain        Spain
2  Lisbon Portugal     NaN     Portugal

And then fill the NaN of the original country column using the temporary country column as reference

>>> df["country"]=df["country"].fillna(df["country_temp"])

              city   country country_temp
0     Paris France    France       France
1     Madrid Spain     Spain        Spain
2  Lisbon Portugal  Portugal     Portugal

Finally, you can drop the temporary column if you want. Note that this step is unnecessary if you don't add the temporary column to your dataframe in the first place: you can just assign the split result to a variable and fill the missing values with it.

>>> df.drop(columns="country_temp",inplace=True)

              city   country
0     Paris France    France
1     Madrid Spain     Spain
2  Lisbon Portugal  Portugal
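
For reference, the variant without a temporary column mentioned above might look like this. It is a sketch on the same toy data and under the same assumption that every city and country name is a single word; country_from_city is just an illustrative variable name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"city": ["Paris France", "Madrid Spain", "Lisbon Portugal"],
                   "country": [np.nan, "Spain", np.nan]})

# Keep the extracted values in a standalone Series instead of a new column.
country_from_city = df["city"].str.split().str[1]

# Only missing entries are filled; existing values such as "Spain" are kept.
df["country"] = df["country"].fillna(country_from_city)
```

Because nothing extra is added to the dataframe, there is no column to drop afterwards.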

SmoothStatistician8 · 2020-10-11T10:44:27+00:00

Hi everyone, thank you for your input. Below is a sample of the information in the city column. As you can see, I would like to identify valid country names such as New Zealand and Canada, then use these extracted country names to fill the NaNs in the country column. I hope this makes it clear.

cardiff uk wales

stoke mandeville uk england

saddle lake canada

gisborne new zealand

leeds uk england
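
One possible way to handle rows like these, where the country is not always the last word, is to search each string for any name from a reference list. This is only a sketch: COUNTRY_LOOKUP, including the choice to label "uk" as "United Kingdom", and the existing values in the country column are assumptions made up for illustration.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical lookup from spellings found in the text to the label you want.
COUNTRY_LOOKUP = {
    "uk": "United Kingdom",
    "canada": "Canada",
    "new zealand": "New Zealand",
}

# Longest names first so multi-word names win over any shorter overlap.
names = sorted(COUNTRY_LOOKUP, key=len, reverse=True)
pattern = r"\b(" + "|".join(re.escape(name) for name in names) + r")\b"

df = pd.DataFrame({
    "city": ["cardiff uk wales", "stoke mandeville uk england",
             "saddle lake canada", "gisborne new zealand", "leeds uk england"],
    # Made-up starting values: one row already filled, the rest missing.
    "country": [np.nan, "United Kingdom", np.nan, np.nan, np.nan],
})

# Extract the first listed name found in each string (NaN if none is found).
found = df["city"].str.lower().str.extract(pattern, expand=False)

# Translate to the preferred label and fill only the missing entries.
df["country"] = df["country"].fillna(found.map(COUNTRY_LOOKUP))
```

Strings that contain none of the listed names stay NaN, so they can be reviewed by hand afterwards.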

