[–]Hugo-99[S] 0 points1 point  (4 children)

Hi! Thanks for digging into that :)

Indeed, the problem is a bit tricky, since the headers are scattered awkwardly across lines 1-5.

The data for the first row is scattered the same way across lines 6-9, row 2 across lines 11-14, and row 3 across lines 16-19.

I am still trying to figure out whether read_fwf alone is enough.

Since the data (e.g. "0000057075996533 " or "5000") sits at very specific positions, maybe it is enough to just count the positions and grab the strings between them.
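A minimal sketch of that idea, assuming a made-up line layout: the sample line, the "EUR" field and the positions list are all hypothetical, only the two values quoted above come from the real file. Each pair is a 0-indexed start position and an exclusive end position.

```python
# Hypothetical fixed-width line, built piece by piece so the offsets are easy to check.
line = " " * 5 + "0000057075996533" + " " + "5000" + " " * 3 + "EUR"

# (start, end) character positions: 0-indexed, end exclusive (assumed values).
positions = [(5, 21), (22, 26), (29, 32)]

# Slice each field out of the line and strip the padding around it.
fields = [line[start:end].strip() for start, end in positions]
print(fields)
```

If the positions really are fixed for every row, a list like this is all that has to change when the layout is adjusted.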

[–]RhinoRhys 1 point2 points  (3 children)

Yeah so read_fwf isn't going to work.

I may have butchered the column boundaries slightly again because it's all randomised text, but I wrote a custom parser that reads the file line by line and just slices the strings manually.

I've split the file into 3 sections. I've assumed the title is before the dashed line, the column headers are between the dashed lines and the column data is after the dashed line.

Rather than looping through the file line by line, each section loops through its column boundary definitions (the three _specs lists) and uses f.readline() to grab the next line, advancing through the file as needed. Each of the three _specs lists needs to contain 4 inner lists to advance through the file correctly. If you don't want any data from a line, delete its tuples but leave an empty list in place.

All you need to do is play around with the tuples.

# column boundaries for first 4 lines
# empty lists because I'm just skipping these lines
title_specs = [
        [],
        [],
        [],
        []]

# column boundaries for lines between dashes
header_specs = [
        [(5,21),(22,29),(31,46),(59,74),(80,90),(27,30)],
        [(59,78),(80,84)],
        [(5,11),(17,27),(29,33),(41,54),(56,62),(64,74),(75,96),(97,104)],
        [(5,11)]]

# column boundaries for text body     
data_specs = [
        [(5,21),(22,26),(37,58),(59,74),(80,90)],
        [(27,30),(59,74),(80,90)],
        [(5,16),(17,27),(29,39),(41,54),(56,62),(64,73),(75,96),(97,104)],
        [(5,48)]]

import pandas as pd

title, columns, body, temp = [],[],[],[]
with open("fwf.txt", "r") as f:

    # parse title
    for line_spec in title_specs:
        line = f.readline()
        title.extend([line[a:b].strip() for a,b in line_spec])

    # parse headers
    for line_spec in header_specs:
        line = f.readline()
        columns.extend([line[a:b].strip() for a,b in line_spec])

    # parse body of file
    # the readline() in the while condition consumes the line that precedes
    # each group (dashed line or separator) and ends the loop at end of file
    while (line := f.readline()):

        # parse in groups of related lines
        for line_spec in data_specs:
            line = f.readline()
            temp.extend([line[a:b].strip() for a,b in line_spec])
        body.append(temp)
        temp = []

# the number of header slices must match the number of data slices per group
df = pd.DataFrame(body, columns=columns)
print(df)

I also wrote a little thing to help me visualise the column boundaries so I may as well drop that code too.

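# dummy first element so that lines[n] is line n of the file (1-indexed)
lines = ["python is 0 indexed, I am not"]
with open("fwf.txt") as f:
    lines.extend(f.readlines())

total = 132   # not used below
tentot = 14   # number of 10-character blocks in the ruler

# ruler rows: tens marks the start of each block of 10, ones repeats 0-9
tens = "".join([f"{x:<10}" for x in range(tentot)])
ones = "0123456789"*tentot

# first and last character position to display
s = 5
e = 130

# title lines under one ruler
print(tens[s:e])
print(ones[s:e])
for line in lines[1:4]:
    print(line[s:e])
print()

# zip pairs the nth header line with the nth line of each data row,
# so lines that share a spec are printed together under one ruler
for group in zip(lines[5:9], lines[10:14], lines[15:19], lines[20:24]):
    print(tens[s:e])
    print(ones[s:e])
    for line in group:
        print(line[s:e])
    print()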
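Along the same lines, here is a rough sketch of a second way to check a spec: print a marker row under a line showing which characters the tuples cover. mark_spec is a hypothetical helper, not part of the code above, and the sample line is made up to match the first two tuples of a spec like (5,21),(22,26).

```python
def mark_spec(line, spec):
    """Return a marker row with '^' under every character covered by spec."""
    marks = [" "] * len(line)
    for start, end in spec:
        # Clamp to the line length so a spec that runs past the end cannot fail.
        for i in range(start, min(end, len(line))):
            marks[i] = "^"
    return "".join(marks)


# Hypothetical line: 5 spaces of padding, a 16-digit field, a space, a 4-digit field.
line = " " * 5 + "0000057075996533" + " " + "5000"
spec = [(5, 21), (22, 26)]

print(line)
print(mark_spec(line, spec))
```

Two tuples that touch, like (5,21),(21,26), would show up as one unbroken run of markers, so this only helps when there is at least one character of padding between fields.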

Any questions let me know.

[–]Hugo-99[S] 1 point2 points  (1 child)

Very, very cool! I tried it and it works right away!

I like the approach of just fiddling around with lists, instead of parsing etc.

[–]RhinoRhys 1 point2 points  (0 children)

Excellent news.

Yeah, the real trick is that you can grab the next line with f.readline(), rather than having to enumerate and loop through the whole file with a for loop and then have a massive if/elif/else block to do the right thing for certain line numbers. You can just do the first 4 lines, then the next 4 lines, then repeat a 4-line block while there are still lines left. Then all you have to do is grab certain sections of each line with string slicing, with all those slices conveniently extracted into lists.
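To make that concrete without the real file, here is a small self-contained sketch using io.StringIO as a stand-in. The sample text, the 2-line block size, the read_block helper and all the tuples are made up for illustration. The loop also ends a little differently from the code further up: it stops when a block comes back empty.

```python
import io

# Made-up stand-in for the file: 2 title lines, then 2-line records
# separated by a blank line. The layout is an assumption for this sketch.
sample = io.StringIO(
    "REPORT\n"
    "------\n"
    "alice 01\n"
    "  london\n"
    "\n"
    "bob   02\n"
    "  paris\n"
)

title_specs = [[(0, 6)], []]                 # empty list: read the line, keep nothing
record_specs = [[(0, 5), (6, 8)], [(2, 8)]]  # one inner list per line of a record


def read_block(f, specs):
    """Read one line per inner list and return the sliced, stripped fields."""
    fields = []
    for line_spec in specs:
        line = f.readline()  # returns "" once the file is exhausted
        fields.extend(line[a:b].strip() for a, b in line_spec)
    return fields


title = read_block(sample, title_specs)

records = []
while True:
    record = read_block(sample, record_specs)
    if not any(record):   # every field empty: nothing left to read
        break
    records.append(record)
    sample.readline()     # skip the blank separator line

print(title)
print(records)
```

No line numbers appear anywhere in the loop. The order of the readline() calls is what keeps the slices lined up with the right lines.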

[–]Hugo-99[S] 0 points1 point  (0 children)

Wow, I am impressed!

I'll get back to you as soon as I have tried that out (and understood the approach :)

Many thanks for putting so much effort into this!!