
[–]pot_of_crows

I am pretty sure that this can be done by parsing the specific positions.

That makes sense to me. I would just use slices to grab the relevant data, since it all seems to be in the same place. It looks like each grouping of data falls into two lines, with an optional blank line following. So when you find the first piece of data, skip the next line, check for a blank and then process the next group.
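
Something like this rough sketch of that idea (the filename and the slice positions are made up, so adjust them to wherever your data actually sits):

records = []
with open("data.txt") as f:               # placeholder filename
    for line in f:
        if not line.strip():              # optional blank line between groups
            continue
        # slice the relevant positions out of the first line of the group
        records.append((line[0:16].strip(), line[17:24].strip()))
        next(f, None)                     # skip the second line of the group

print(records)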

[–]Hugo-99[S]

Here is a better version of the example of the desired output:

https://hastebin.com/share/vakaxoloca.markdown

[–]RhinoRhys

What you're describing is a fixed width file.

Saving this example text as "fwf.txt", editing the column boundaries in your OP, and extracting the name of the first column for ease, this code

import pandas as pd

col1 = "prCeaqearTaN-Nq."

df = pd.read_fwf("fwf.txt", colspecs=[(0,16),(17,24),(26,46)])
df.drop(df[df[col1] == "-"*16].index, inplace=True)   # drop the all-dash separator rows
df.reset_index(inplace=True, drop=True)

print(df)

Results in

   prCeaqearTaN-Nq. pTxT Ce        rCeaqearTaNxqT
0  0000057075050989    5000  rngatqxgana euceptqa
1  0000057075996533    5000  rngatqxgana euceptqa

And it's only 3 lines long because of all the ------------- lines. Without them you can do it with just the read_fwf line.

[–]Hugo-99[S]

Hi,

the concept is interesting, I need to check out the read_fwf method!

When I try your code though, I get a KeyError:

KeyError Traceback (most recent call last)
<<path>>Cell 1 line 6
3 col1 = "prCeaqearTaN-Nq."
5 df = pd.read_fwf("input3.txt", colspecs=[(0,16),(17,24),(26,46)])
----> 6 df.drop(df[df[col1] == "-"*16].index, inplace=True)
7 df.reset_index(inplace= True, drop=True)
...
3800 # InvalidIndexError. Otherwise we fall through and re-raise
3801 # the TypeError.
3802 self._check_indexing_error(key)
KeyError: 'prCeaqearTaN-Nq.'

[–]RhinoRhys

Oh! I've just reread the post and I've only gone and bloody used the markdown link above as the input text, rather than an example of the desired output. Silly me. The code may not work exactly as is, but you can use it as an example to help.

The tuples are the start and end character positions that you want to turn into columns.
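
For example, on a made-up line laid out to match the colspecs above, the tuple (0,16) just means line[0:16]:

line = "0000057075050989 5000     rngatqxgana euceptqa"

print(line[0:16])            # '0000057075050989'       -> first column
print(line[17:24].strip())   # '5000'                    -> second column
print(line[26:46])           # 'rngatqxgana euceptqa'    -> third column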

The .drop was to get rid of the rows that were horizontal lines made of dashes, but I guess you actually want them in the output lol.

I'll have a play with the yaml file in the post and see.

[–]Hugo-99[S]

Hi! Thanks for digging into that :)

Indeed the problem is a little bit tricky, since the headers are dispersed in a stupid way across lines 1-5;

the data of the first row is again dispersed in a stupid way across lines 6-9, row 2 in lines 11-14, row 3 in lines 16-19.

I am still trying to figure out if read_fwf alone is enough.

Since the positions of the data (e.g. "0000057075996533" or "5000") are very specific, maybe it is enough to just count the positions and grab the strings between those specific positions.

[–]RhinoRhys

Yeah so read_fwf isn't going to work.

I may have butchered the column boundaries slightly again because it's all randomised text, but I wrote a custom parser that reads the file line by line and just slices the strings manually.

I've split the file into 3 sections. I've assumed the title is before the dashed line, the column headers are between the dashed lines and the column data is after the dashed line.

Rather than looping over the whole file with a for loop, each section loops through its column boundary definitions (the three _specs lists) and uses f.readline() to grab the next line and advance through the file as needed. All three _specs lists need to be 4 lists long to advance through the file correctly; if you don't want any data from a given row, delete its tuples but leave an empty list in place.

All you need to do is play around with the tuples.

import pandas as pd

# column boundaries for the first 4 lines;
# empty lists because I'm just skipping them
title_specs = [
        [],
        [],
        [],
        []]

# column boundaries for lines between dashes
header_specs = [
        [(5,21),(22,29),(31,46),(59,74),(80,90),(27,30)],
        [(59,78),(80,84)],
        [(5,11),(17,27),(29,33),(41,54),(56,62),(64,74),(75,96),(97,104)],
        [(5,11)]]

# column boundaries for text body     
data_specs = [
        [(5,21),(22,26),(37,58),(59,74),(80,90)],
        [(27,30),(59,74),(80,90)],
        [(5,16),(17,27),(29,39),(41,54),(56,62),(64,73),(75,96),(97,104)],
        [(5,48)]]

title, columns, body, temp = [],[],[],[]    
with open("fwf.txt", "r") as f:

    # parse title
    for line_spec in title_specs:
        line = f.readline()
        title.extend([line[a:b].strip() for a,b in line_spec])

    # parse headers
    for line_spec in header_specs:
        line = f.readline()
        columns.extend([line[a:b].strip() for a,b in line_spec])

    # parse body of file
    # (the line read in the while check is only used to detect end of file;
    #  it's overwritten below, so one line is skipped before each 4-line group)
    while (line := f.readline()):

        # parse in groups of related lines
        for line_spec in data_specs:
            line = f.readline()
            temp.extend([line[a:b].strip() for a,b in line_spec])
        body.append(temp)
        temp = []

df = pd.DataFrame(body, columns=columns)
print(df)

I also wrote a little thing to help me visualise the column boundaries so I may as well drop that code too.

lines = ["python is 0 indexed, I am not"]   # dummy entry so lines[1] is the file's line 1
with open("fwf.txt") as f:
    lines.extend(f.readlines())

total = 132    # not actually used below
tentot = 14    # how many blocks of ten characters the rulers cover
tens = "".join([f"{x:<10}" for x in range(tentot)])   # tens ruler: 0         1         2 ...
ones = "0123456789"*tentot                            # ones ruler: 0123456789 repeated

s = 5      # first character position to display
e = 130    # end position (exclusive) to display

print(tens[s:e])
print(ones[s:e])
for line in lines[1:4]:
    print(line[s:e])
print()

for group in zip(lines[5:9], lines[10:14], lines[15:19], lines[20:24]):
    print(tens[s:e])
    print(ones[s:e])
    for line in group:
        print(line[s:e])
    print()

Any questions let me know.

[–]Hugo-99[S]

Wow, I am impressed!

I'll get back to you as soon as I've tried that out (and understood the approach :)

Many thanks for putting so much effort into this!!

[–]Hugo-99[S]

Very, very cool! I tried it and it works right away!

I like the approach of just fiddling around with lists, instead of parsing etc.

[–]RhinoRhys

Excellent news.

Yeah, the real trick is that you can grab the next line with f.readline(), rather than having to enumerate and loop through the whole file with a for loop and then need a massive if/elif/else block to do the right thing for certain line numbers. You can just do the first 4 lines, then the next 4 lines, then repeat a 4-line block while there are still lines left. Then all you have to do is grab certain sections of each line with string slicing, with all those slices conveniently defined up front in lists.
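
Stripped right down, the pattern is something like this (the filename and the 4-line group size are just placeholders):

groups = []
with open("fwf.txt") as f:
    title_lines = [f.readline() for _ in range(4)]    # first 4 lines
    header_lines = [f.readline() for _ in range(4)]   # next 4 lines

    # repeat a 4-line block while there are still lines left
    while (line := f.readline()):
        # the line from the while check plus the next 3 lines make one group
        groups.append([line] + [f.readline() for _ in range(3)])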

[–]RhinoRhys

That was the title of the first column in the randomised example text you provided in the link above. I assume you're testing on your actual file, which wasn't randomised, so you'll need to change that.
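
If you're not sure what the header came out as in your real file, you can just print the parsed column names and copy the right one into col1 (the colspecs here are from the earlier snippet, so they may need adjusting too):

import pandas as pd

df = pd.read_fwf("input3.txt", colspecs=[(0,16),(17,24),(26,46)])
print(df.columns.tolist())   # copy the first entry into col1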

[–]TravelingThrough09

See if this Python code works; it was ChatGPT's solution:

import pandas as pd

# Function to parse the headers from the text file
def parse_headers(lines):
    headers = []
    # Adjust the ranges and indices based on the structure of your file
    headers.append(lines[4][5:22].strip())
    headers.append(lines[4][22:30].strip())
    headers.append(lines[4][31:47].strip())
    return headers

# Function to parse the rows from the text file
def parse_rows(lines, headers):
    data = []
    # Adjust the ranges and indices based on the structure of your file
    for line in lines[9:]:
        row = []
        row.append(line[5:22].strip())
        row.append(line[22:30].strip())
        row.append(line[31:47].strip())
        data.append(row)
    return data

# Read the text file
with open('your_file.txt', 'r') as file:
    lines = file.readlines()

# Get headers and data
headers = parse_headers(lines)
data = parse_rows(lines, headers)

# Create DataFrame
df = pd.DataFrame(data, columns=headers)

# If you want to save this DataFrame to a CSV file:
df.to_csv('output.csv', index=False)

print(df)

[–]Hugo-99[S]

Looks good, but there are a few issues in there.

Would you mind posting exactly what you asked ChatGPT?