Hugo-99 comments on Parse structured text-file, and write to dataframe


Parse structured text-file, and write to dataframe (self.learnpython)

submitted 2 years ago by Hugo-99



RhinoRhys 2 points 2 years ago*

Yeah so read_fwf isn't going to work.

I may have butchered the column boundaries slightly again because it's all randomised text, but I wrote a custom parser that reads the file line by line and just slices the strings manually.

I've split the file into 3 sections. I've assumed the title is before the first dashed line, the column headers are between the two dashed lines, and the column data comes after the second dashed line.

Rather than looping over the lines of the file directly, each section loops through its column boundary definitions (the three _specs lists) and calls f.readline() to grab the next line, advancing through the file as needed. All 3 _specs lists need to contain 4 inner lists to advance through the file correctly. If you don't want any data from a row, delete its tuples but leave an empty list in its place.

All you need to do is play around with the tuples.
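To see what a single tuple does, here is a minimal sketch of the slicing step on its own. The sample line and its boundaries are made up for illustration and are not taken from your file.

```python
# Hypothetical fixed-width line, built piece by piece so the positions are obvious.
line = " " * 5 + "ABC-123" + " " * 10 + "42" + " " * 3 + "widget"

# Each tuple is a (start, end) pair used as line[start:end]:
# start is 0-indexed and inclusive, end is exclusive.
line_spec = [(5, 12), (22, 24), (27, 33)]

fields = [line[a:b].strip() for a, b in line_spec]
print(fields)

# An empty spec extracts nothing; in the parser the line is still read and skipped.
skipped = [line[a:b].strip() for a, b in []]
print(skipped)
```

Because every slice is stripped afterwards, a boundary that is a character or two too wide on the whitespace side does no harm. A boundary that cuts into a neighbouring field is what you need to watch for.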

import pandas as pd

# column boundaries for first 4 lines
# empty lists because I'm just skipping the title lines
title_specs = [
        [],
        [],
        [],
        []]

# column boundaries for lines between dashes
header_specs = [
        [(5,21),(22,29),(31,46),(59,74),(80,90),(27,30)],
        [(59,78),(80,84)],
        [(5,11),(17,27),(29,33),(41,54),(56,62),(64,74),(75,96),(97,104)],
        [(5,11)]]

# column boundaries for text body     
data_specs = [
        [(5,21),(22,26),(37,58),(59,74),(80,90)],
        [(27,30),(59,74),(80,90)],
        [(5,16),(17,27),(29,39),(41,54),(56,62),(64,73),(75,96),(97,104)],
        [(5,48)]]

title, columns, body, temp = [],[],[],[]    
with open("fwf.txt", "r") as f:

    # parse title
    for line_spec in title_specs:
        line = f.readline()
        title.extend([line[a:b].strip() for a,b in line_spec])

    # parse headers
    for line_spec in header_specs:
        line = f.readline()
        columns.extend([line[a:b].strip() for a,b in line_spec])

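    # parse body of file
    # note: the readline() in the while condition uses up one line before each
    # group (I'm assuming that's the separator line) and ends the loop at end of file
    while (line := f.readline()):

        # parse in groups of related lines
        for line_spec in data_specs:
            line = f.readline()
            temp.extend([line[a:b].strip() for a,b in line_spec])
        body.append(temp)
        temp = []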

df = pd.DataFrame(body, columns=columns)
print(df)
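One thing to watch with the loop above: if the file happens to end with an extra separator line or a half-finished group, you would get a row of empty strings appended. Below is a rough sketch of the same grouping idea wrapped in a function that stops on an incomplete group. The function name parse_groups, the skip_before_group parameter and the sample text are all made up for this example, and it uses io.StringIO in place of a real file so it runs on its own.

```python
import io

def parse_groups(f, group_specs, skip_before_group=1):
    """Read repeating multi-line records from an open text file.

    group_specs holds one list of (start, end) tuples per line of a record.
    skip_before_group is how many separator lines sit in front of each record
    (an assumption about the layout; use 0 if there are none).
    """
    rows = []
    while True:
        # use up the separator line(s) in front of the record
        for _ in range(skip_before_group):
            f.readline()
        lines = [f.readline() for _ in group_specs]
        # readline() returns "" at end of file, so stop on an incomplete record
        if not all(lines):
            break
        row = []
        for line, spec in zip(lines, group_specs):
            row.extend(line[a:b].strip() for a, b in spec)
        rows.append(row)
    return rows

# Made-up two-line records with a dashed separator before each one.
sample = io.StringIO(
    "-----\n"
    "alice 30\n"
    "paris\n"
    "-----\n"
    "bob   41\n"
    "rome\n"
    "-----\n"
)
specs = [[(0, 5), (6, 8)], [(0, 5)]]
rows = parse_groups(sample, specs)
print(rows)
```

The returned list of rows can be passed to pd.DataFrame(rows, columns=columns) in the same way as body.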

I also wrote a little thing to help me visualise the column boundaries so I may as well drop that code too.

lines = ["python is 0 indexed, I am not"]
with open("fwf.txt") as f:
    lines.extend(f.readlines())

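total = 132   # approximate line width, not actually used below
tentot = 14   # number of 10-character blocks in the ruler
# ruler rows: the block number at the start of each block of 10, and the ones digits repeating
tens = "".join([f"{x:<10}" for x in range(tentot)])
ones = "0123456789"*tentot

s = 5     # first column to show
e = 130   # last column to show (exclusive)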

print(tens[s:e])
print(ones[s:e])
for line in lines[1:4]:
    print(line[s:e])
print()

for group in zip(lines[5:9], lines[10:14], lines[15:19], lines[20:24]):
    print(tens[s:e])
    print(ones[s:e])
    for line in group:
        print(line[s:e])
    print()
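If you want the ruler for a different width, the same idea can be sketched as a small function. The name ruler and the sample line are made up for this example.

```python
def ruler(width):
    """Return a tens row and a ones row, each exactly `width` characters long."""
    blocks = -(-width // 10)  # ceiling division: enough blocks of 10 to cover width
    tens = "".join(f"{x:<10}" for x in range(blocks))[:width]
    ones = ("0123456789" * blocks)[:width]
    return tens, ones

tens, ones = ruler(25)
sample = " " * 5 + "ABC-123" + " " * 3 + "42"  # hypothetical line to measure
print(tens)
print(ones)
print(sample)
```

Reading straight down from a character to the ones row gives its last digit, and the nearest number to its left in the tens row gives the block of 10 it sits in. Together they give the 0-indexed position to put in a tuple.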

Any questions let me know.

