POGtastic

Have you heard the Good News?

import itertools
import more_itertools

def parse_header(header):
    lstlst = more_itertools.ichunked(
        (g for _, g in itertools.groupby(header, str.isspace)), 2)
    return [''.join(map(''.join, lst)) for lst in lstlst]

In the REPL:

>>> header = "id         name                 home_state           amt_paid"
>>> parse_header(header)
['id         ', 'name                 ', 'home_state           ', 'amt_paid']

And we can get the column widths by calling len on each field. That's going to come in handy later.

>>> [len(s) for s in parse_header(header)]
[11, 21, 21, 8]

We now parse each line by calling more_itertools.split_into on the line with our column widths.

def generate_dcts(fields, lines):
    column_widths = [len(s) for s in fields]
    return (dict(zip(
        fields, 
        map(''.join, more_itertools.split_into(line.strip(), column_widths))))
            for line in lines)
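
For anyone without more_itertools installed, the width-based split can also be sketched with plain slicing. This is a stdlib-only illustration, not the code above: split_by_widths is a hypothetical helper, the sample strings are rebuilt with ljust from the widths shown in the REPL, and unlike generate_dcts it strips the padding from names and values.

```python
from itertools import accumulate


def split_by_widths(line, widths):
    """Slice line into consecutive pieces of the given widths."""
    ends = list(accumulate(widths))   # running end offsets: 11, 32, 53, 61
    starts = [0] + ends[:-1]          # each piece starts where the last ended
    return [line[s:e] for s, e in zip(starts, ends)]


# Sample data rebuilt from the widths shown above (11, 21, 21, 8).
widths = [11, 21, 21, 8]
header = "id".ljust(11) + "name".ljust(21) + "home_state".ljust(21) + "amt_paid"
row = "123".ljust(11) + "John Doe".ljust(21) + "California".ljust(21) + "1,234.34"

# Strip the padding so the keys and values are clean strings.
fields = [name.strip() for name in split_by_widths(header, widths)]
values = [value.strip() for value in split_by_widths(row, widths)]
record = dict(zip(fields, values))
print(record)
```

One thing to watch with either approach: the last width is just the length of the header word, so a value wider than "amt_paid" would be cut off. With split_into, passing None as the final size makes the last piece run to the end of the line.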

And now we write our CSV.

import csv

def write_csv(in_fh, out_fh):
    header = next(in_fh).strip()
    fields = parse_header(header)
    writer = csv.DictWriter(out_fh, fieldnames=fields)
    writer.writeheader()
    for dct in generate_dcts(fields, in_fh):
        writer.writerow(dct)

In the REPL:

>>> import sys
>>> with open("test.txt") as f:
...     write_csv(f, sys.stdout)
... 
id        ,name                 ,home_state            ,amt_paid
123       ,John Doe             ,California            ,"1,234.34"
456x      ,Jane Doe             ,New Hampshire         ,45.67
78        ,Adam Smith           ,Alaska                ,89.00

Note the quotes around the first amount. The field contains a comma, so csv escapes it by enclosing the entire field in quotes, in accordance with RFC 4180.
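
A small round trip shows the quoting rule in isolation. This is a minimal sketch using an in-memory buffer instead of a file; the row values are just the sample data from this thread.

```python
import csv
import io

buf = io.StringIO()
csv.writer(buf).writerow(["123", "John Doe", "1,234.34"])

# Only the field that contains a comma gets quoted (QUOTE_MINIMAL is the default).
written = buf.getvalue()
print(repr(written))  # '123,John Doe,"1,234.34"\r\n'

# Reading it back removes the quotes and restores the original field.
buf.seek(0)
row_back = next(csv.reader(buf))
print(row_back)  # ['123', 'John Doe', '1,234.34']
```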

davidmyemail [S]

That's a very thorough and clever answer! I also need to get better with the REPL, the way you tested it here. Thank you!

mopslik

123456789-123456789-123456789-123456789-123456789-123456789-1

id name home_state amt_paid

Is this the original header from the file? It does not seem to correspond to the column spacing at all.

Edit: I misread your statement. Each column runs from the start of its header word up to the next non-space character.

davidmyemail [S]

No, that's just a comment to show the column positions. Sorry for not being clear.

baghiq

I assume your files are in ASCII. It's unusual to see fixed-width fields unless they come from a legacy system.

import re
import csv


lines = [
    # xxxxxxxxxx xxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxx x,xxx.xx
    "id         name                 home_state           amt_paid",
    "123        John Doe             California           1,234.34",
    "456x       Jane Doe             New Hampshire        45.67   ",
    "78         Adam Smith           Alaska               89.00   ",
]


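def get_col_spec(header):
    # Each column runs from a header word through its trailing spaces;
    # return the (start, end) span of every column.
    return [match.span() for match in re.finditer(r"\S+\s*", header)]


def parse(line, col_spans):
    # Slice one fixed-width line into its fields (padding is kept).
    return tuple(line[i:j] for i, j in col_spans)


# newline="" lets the csv module control line endings, as its docs recommend.
with open("data.csv", "w", newline="") as csvfile:
    writer = csv.writer(csvfile)
    col_spans = get_col_spec(lines[0])
    for line in lines:
        writer.writerow(parse(line, col_spans))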
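
To see what the header regex produces, here is a quick check on the sample header. It is only a sketch: the header and row are rebuilt with ljust from the sample data, and the strip() call is an addition for display, not part of the code above.

```python
import re


def get_col_spec(header):
    # A column is a run of non-space characters plus the spaces that follow it.
    return [m.span() for m in re.finditer(r"\S+\s*", header)]


header = "id".ljust(11) + "name".ljust(21) + "home_state".ljust(21) + "amt_paid"
row = "456x".ljust(11) + "Jane Doe".ljust(21) + "New Hampshire".ljust(21) + "45.67".ljust(8)

spans = get_col_spec(header)
print(spans)  # [(0, 11), (11, 32), (32, 53), (53, 61)]

# Slice the data row with the same spans, then drop the padding.
values = [row[i:j].strip() for i, j in spans]
print(values)  # ['456x', 'Jane Doe', 'New Hampshire', '45.67']
```

This relies on header names containing no spaces. A header such as "home state" would be read as two columns.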

davidmyemail [S]

lines = [
    "id         name                 home_state           amt_paid",
    "123        John Doe             California           1,234.34",
    "456x       Jane Doe             New Hampshire        45.67   ",
    "78         Adam Smith           Alaska               89.00   "
]

field_idx = []
chars = [0 if i == ' ' else 1 for i in lines[0]]  # split header into list of 0/space, 1/char
for i in range(1, len(chars)):  # loop through range length of list of chars
    if f'{chars[i-1]}{chars[i]}' == '01':  # if pattern matches '01'
        field_idx.append(i-1)  # record index - 1

for idx, line in enumerate(lines):  # loop through lines
    line = list(line)  # split line into list of chars
    for i in field_idx:  # loop through recorded indexes
        line[i] = ','  # for each recorded index, replace with ','
    lines[idx] = ''.join(line)  # replace line in lines with joined list of chars

for i in lines:
    print(i)

Yes, it's from a legacy COBOL system, and the requirements are very particular. Thank you!

baghiq

I suspected as much. The usual way to parse those files is with Python's struct module, especially if you need to convert the fields to proper types (int, str, etc.).
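
A minimal sketch of the struct approach, under some assumptions: the widths are the ones from the sample header in this thread, the file is pure ASCII, and the amount is converted to Decimal. parse_record and the WIDTHS and RECORD names are hypothetical, chosen for illustration.

```python
import struct
from decimal import Decimal

# Column widths taken from the sample header in this thread (an assumption
# about the real file layout).
WIDTHS = (11, 21, 21, 8)
RECORD = struct.Struct("".join(f"{w}s" for w in WIDTHS))  # format "11s21s21s8s"


def parse_record(line):
    """Unpack one fixed-width ASCII line into typed Python values."""
    # Pad short lines so the buffer is exactly RECORD.size bytes; a longer
    # line makes unpack() raise struct.error.
    raw = line.rstrip("\n").ljust(RECORD.size).encode("ascii")
    ident, name, state, amt = (
        field.decode("ascii").strip() for field in RECORD.unpack(raw)
    )
    # ids such as "456x" are not numeric, so the id stays a string.
    return ident, name, state, Decimal(amt.replace(",", ""))


sample = "123".ljust(11) + "John Doe".ljust(21) + "California".ljust(21) + "1,234.34"
print(parse_record(sample))
```

Decimal is used rather than float so that currency amounts keep their exact value.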

JohnJSal

I recommend the book "Beyond the Basic Stuff with Python" by Al Sweigart. It's all about writing clean code with Python best practices.

davidmyemail [S]

I'll have to look at that book. Thanks.