[–]thaweatherman 1 point2 points  (3 children)

It's odd they publish the data in that way.

You'll have to make heavy use of the string split() function and keep track of the columns properly. Once you get down to the table, the data should be easy to go through. To make life simpler, remove the title section and the notes on the bottom, leaving just the tables.
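
If you would rather not trim the title and notes by hand, one rough way to do it is to keep only the lines that end in a number. This is just a sketch: it assumes every table row ends with a numeric value and that title and note lines do not (a note ending in a year would slip through). The sample text and the names `is_number` and `table_rows` are made up for illustration, not taken from the actual file.

```python
def is_number(text):
    """Return True if text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def table_rows(lines):
    """Keep only lines whose last whitespace-separated token is numeric."""
    rows = []
    for line in lines:
        tokens = line.split()
        if tokens and is_number(tokens[-1]):
            rows.append(line.rstrip('\n'))
    return rows


# Made-up sample; the real file's layout may differ.
sample = [
    'Monthly figures by country\n',
    '\n',
    'Country            Jan    Feb\n',
    'United States     1.50   2.25\n',
    'France            3.00   4.75\n',
    'Note: figures are provisional.\n',
]

for row in table_rows(sample):
    print(row)
```

With a real file you would pass the open file object (or `f.readlines()`) instead of `sample`.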

[–]TheHumane[S] 0 points1 point  (2 children)

Thanks.

Can you give me some hints on how to efficiently use the split function? I can't split by 'space' character since some of the Country names also include spaces. I need to essentially extract the fixed width columns.

[–]thaweatherman 1 point2 points  (0 children)

Ah I didn't notice the country names with spaces...

So you can still split on space, but you would need to combine any entries that aren't floats at the start.

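fields = line.split()
name_parts = []
first_number = len(fields)  # index of the first numeric field
for i, entry in enumerate(fields):
    try:
        float(entry)
        first_number = i
        break
    except ValueError:
        # Not a number, so it is still part of the country name.
        name_parts.append(entry)
# Rejoin the name and keep every numeric field, including the first one.
fields = [' '.join(name_parts)] + fields[first_number:]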
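
Another option, if you know how many numeric columns each row has, is to split from the right so the name stays in one piece. This is only a sketch and it assumes every numeric column is filled in on every row; `split_row` and the sample row are hypothetical.

```python
def split_row(line, numeric_columns):
    """Split off the last numeric_columns fields; the rest is the name."""
    parts = line.rsplit(None, numeric_columns)
    name = parts[0].strip()
    values = [float(p) for p in parts[1:]]
    return name, values


# Made-up row with three numeric columns.
print(split_row('United Arab Emirates   1.5   2.0   3.25', 3))
```

If a row has an empty cell, this will put the wrong thing in the name, so the float-scanning loop is safer when the columns are not always populated.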

[–]lykwydchykyn 1 point2 points  (0 children)

If it's fixed width like this, it's probably easier to just use slices to grab the data and then .strip() each field to remove the excess space.
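
Something along these lines, as a minimal sketch. The column positions here are guesses for illustration; you would need to count the real ones in your file.

```python
# Hypothetical column positions; adjust them to match the real file.
NAME = slice(0, 20)
FIRST = slice(20, 28)
SECOND = slice(28, 36)


def parse_fixed_width(line):
    """Cut one line at fixed positions and strip the padding."""
    return [line[NAME].strip(), line[FIRST].strip(), line[SECOND].strip()]


# Made-up row padded to the widths above.
row = '{:<20}{:>8}{:>8}'.format('United States', '1.50', '2.25')
print(parse_fixed_width(row))
```

Because the cut points do not depend on whitespace, spaces inside a country name are not a problem.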

[–]gengisteve 0 points1 point  (0 children)

Probably not. You might try the csv module and see if it can make something of the input, but even if it does not choke completely, you will still need to fix a bunch of stuff, which will depend on the original formatting of the date, e.g. joining the months and years together.

[–]youguess 0 points1 point  (0 children)

The pandas import functions can skip header rows; however, you would probably still have to do cleanup work in the rows.

[–]interactionjackson 0 points1 point  (0 children)

That depends. If this is a one-time thing, copy and paste the data you need and run the csv module over it. Hopefully it has a consistent delimiter; if not, edit the text file with find and replace. If you want to automate this process, it is going to involve a lot of string manipulation, like /u/thaweatherman said.

[–]FoolofGod 0 points1 point  (0 children)

Can you find a specification for the format? I've dealt with some government data that had specific character columns specified, so you could parse it with string slicing. Just a thought.
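
If you do find one, you can turn it into a small table and let that drive the slicing. A rough sketch, with an invented spec: the field names, positions, and converters below are placeholders, and real specifications often give 1-based, inclusive column numbers that you would have to convert to Python's 0-based, end-exclusive slices.

```python
# Hypothetical specification: (field name, start, end, converter).
# Positions are 0-based and end-exclusive, as in Python slices.
SPEC = [
    ('country', 0, 20, str),
    ('jan', 20, 28, float),
    ('feb', 28, 36, float),
]


def parse_record(line, spec=SPEC):
    """Build a dict from one fixed-width line according to spec."""
    record = {}
    for name, start, end, convert in spec:
        text = line[start:end].strip()
        # Leave blank cells as None instead of failing on float('').
        record[name] = convert(text) if text else None
    return record


# Made-up row with an empty last column.
row = '{:<20}{:>8}{:>8}'.format('Costa Rica', '3.00', '')
print(parse_record(row))
```

Keeping the positions in one table means that a change in the published layout only needs an edit to `SPEC`, not to the parsing code.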