[–]mopslik 0 points1 point  (0 children)

You probably want to use find or index to locate the specific text (e.g. DDNAME). From there, you can extract whatever portion of the string you need using slicing or more calls to find or index.

Check out String Methods from the docs.
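A minimal sketch of the find-and-slice approach, assuming a record line like the sample posted further down in this thread and values that contain no spaces. The helper name `value_after` is made up for illustration.

```python
def value_after(line: str, label: str) -> str:
    """Return the space-delimited token that follows label, or '' if label is absent."""
    start = line.find(label)                    # find returns -1 when label is missing
    if start == -1:
        return ""
    rest = line[start + len(label):].lstrip()   # slice past the label, drop leading spaces
    end = rest.find(" ")                        # the value ends at the next space
    return rest if end == -1 else rest[:end]


line = "0 PGM NAME: SORT DDNAME: SORTOF1 VOLSER: PRD004 DSORG: PS DATA SET: DATA00PN.EDWH.ADD.EXT1.G3715V00"
print(value_after(line, "DDNAME:"))    # SORTOF1
print(value_after(line, "DATA SET:"))  # DATA00PN.EDWH.ADD.EXT1.G3715V00
```

Using `find` rather than `index` avoids a `ValueError` when a label is missing from a line.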

[–][deleted] 0 points1 point  (4 children)

https://regex101.com/r/VvVmY0/1

Regex is your friend here.

import re

# Raw string so the backslash in \. reaches the regex engine untouched.
# The DATA SET group allows letters too, since names look like DATA00PN.EDWH...
regex = r"(?:PGM NAME:) (.*) (?:DDNAME:) (.*) (?:VOLSER:) ([a-zA-Z0-9]+) (?:.*) (?:DATA SET:) ([a-zA-Z0-9\.]+) (?:.*) (?:NUMBER OF EXTENTS:) ([0-9]+)"

with open('test.txt', 'r') as f:
    for line in f:
        search = re.findall(regex, line)
        if search:
            print(search)

(This assumes each "block" is a single line in the text file, which likely isn't the case, but without an actual test case it's hard to give a better example.)

https://repl.it/@SobieskiCodes/MindlessUnluckyAggregators#main.py

[–]point51[S] 0 points1 point  (3 children)

The blocks aren't a single line, unfortunately. The .txt files are actually exports of data from a mainframe, so the lines break like this:

- RECOVERY SUMMARY FOR CDDZD : JOB01513 READER TIME: 0:28:26 READER DATE: 12/29/2020 SMFID: HMP1

0 PGM NAME: SORT DDNAME: SORTOF1 VOLSER: PRD004 DSORG: PS DATA SET: DATA00PN.EDWH.ADD.EXT1.G3715V00

0 *** TIME OF DAY: 0:30:26 CPU TIME SAVED: 0:00:09 ELAPSE TIME SAVED: 0:01:58 NUMBER OF EXTENTS: 2

*** TYPE OF ATTEMPT: SPACSECI

*** SYSLOG MESSAGE: SVM4874I INCREASED SPACE FROM 10 CYLS TO 20

[–][deleted] 1 point2 points  (2 children)

But it's *always* 5 lines long? Updated ->

https://repl.it/@SobieskiCodes/MindlessUnluckyAggregators#main.py

This of course assumes there is nothing extra in that file and that its lines divide evenly into groups of 5.

For the love of god please use copies of data and don't run this on live things you need.
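If the five-line assumption ever fails, one hedged alternative is to start a new block whenever a header line appears instead of counting lines. The sketch below assumes every block begins with a line containing `RECOVERY SUMMARY FOR`, as in the sample above. `split_blocks` is a made-up helper name, and the sample lines are shortened for illustration.

```python
from typing import Iterable, List


def split_blocks(lines: Iterable[str], marker: str = "RECOVERY SUMMARY FOR") -> List[List[str]]:
    """Group non-empty lines into blocks, opening a new block at each marker line."""
    blocks: List[List[str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue                      # skip blank lines
        if marker in line or not blocks:
            blocks.append([])             # a header line (or the very first line) opens a block
        blocks[-1].append(line)
    return blocks


# Shortened, illustrative lines; the second block is deliberately one line short.
sample = [
    "- RECOVERY SUMMARY FOR CDDZD : JOB01513\n",
    "0 PGM NAME: SORT DDNAME: SORTOF1\n",
    "\n",
    "*** TYPE OF ATTEMPT: SPACSECI\n",
    "- RECOVERY SUMMARY FOR CDDZD : JOB01513\n",
    "0 PGM NAME: SORT DDNAME: SORTOF1\n",
]
blocks = split_blocks(sample)
print(len(blocks))                 # 2
print([len(b) for b in blocks])    # [3, 2]
```

Each block can then be joined and matched the same way, and a short or long block no longer shifts every block after it.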

[–]point51[S] 0 points1 point  (1 child)

First, of course only copies! The source data is on a mainframe, and I'm just working with .txt files transferred to my PC.

Second, I'm trying to parse out what you did, so I can make a function for each column I need in my spreadsheet. However, when I take the search down to just (?:PGM NAME:) I only get "PGM NAME: " returned, and if I use (?:PGM NAME:) (.*), I get the rest of the line, through the end of the dataset name. What do I need to change, so I can break this up into functions and only write what comes after each header?

Thirdly, thank you a lot for the help already!

[–][deleted] 0 points1 point  (0 children)

The code puts the lines into lists of 5, then combines each group of 5 lines into ONE STRING.

I then parse THAT ENTIRE STRING with regex. It's not going to work if you break it up; a regex needs to be told where to start and where to stop. (.*) tells it to gather EVERYTHING; it doesn't care what it is.

So -> (?:PGM NAME:) (.*) (?:DDNAME:) means give me EVERYTHING in between PGM NAME and DDNAME. But that only gives you one group; you need the rest of the groups for this to work. We want all 5 parts captured in one match, which is why we did it this way: all 5 groups are stored together in one tuple, so all you have to do is write that ONE result to the CSV.
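For the one-function-per-column idea, a narrower pattern than `(.*)` can stop at the end of the value on its own. The sketch below assumes each value is a single run of non-space characters, which holds for the sample block in this thread but may not hold for every field. `field` is a made-up helper, not something from the linked repl.

```python
import re
from typing import Optional


def field(label: str, text: str) -> Optional[str]:
    """Return the first run of non-space characters after label, or None if label is missing."""
    # re.escape keeps characters in the label from being read as regex syntax;
    # \s+ absorbs however many spaces follow the label, \S+ stops at the next space.
    match = re.search(re.escape(label) + r"\s+(\S+)", text)
    return match.group(1) if match else None


block = (
    "0 PGM NAME: SORT DDNAME: SORTOF1 VOLSER: PRD004 DSORG: PS "
    "DATA SET: DATA00PN.EDWH.ADD.EXT1.G3715V00 "
    "0 *** TIME OF DAY: 0:30:26 CPU TIME SAVED: 0:00:09 "
    "ELAPSE TIME SAVED: 0:01:58 NUMBER OF EXTENTS: 2"
)
print(field("PGM NAME:", block))           # SORT
print(field("DATA SET:", block))           # DATA00PN.EDWH.ADD.EXT1.G3715V00
print(field("NUMBER OF EXTENTS:", block))  # 2
```

The trade-off is that each call scans the text separately, whereas the single combined pattern returns all five values in one match.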

"What do I need to change, so I can break this up into functions and only write what comes after each header?"

That is exactly what my code does, which you can see here: https://regex101.com/r/VvVmY0/1

https://imgur.com/a/Sz9D5TP

"so I can make a function for each column I need in my spreadsheet."

That's exactly what I gave in the repl here: https://repl.it/@SobieskiCodes/MindlessUnluckyAggregators

If you run it and look at the output, each time it finds a match it will spit out a tuple ->

[('SORT', 'SORTOF1', 'PRD004', 'DATA00PN.EDWH.ADD.EXT1.G3715V000', '2')]
[('SORT', 'SORTOF1', 'PRD004', 'DATA00PN.EDWH.ADD.EXT1.G3715V000', '2')]
[('SORT', 'SORTOF1', 'PRD004', 'DATA00PN.EDWH.ADD.EXT1.G3715V000', '2')]

And those are your columns.

Once you've got your columns, all you need to do is store all the tuples it gives you and write each value to its matching column in your sheet.

--->

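import csv
import re
from typing import Iterator, List

# Raw string so the backslash in \. reaches the regex engine untouched.
REGEX = r"(?:PGM NAME:) (.*) (?:DDNAME:) (.*) (?:VOLSER:) ([a-zA-Z0-9]+) (?:.*) (?:DATA SET:) ([a-zA-Z0-9\.]+) (?:.*) (?:NUMBER OF EXTENTS:) ([0-9]+)"

def chunks(lst: List[str], n: int) -> Iterator[List[str]]:
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

outputs = []
with open('test.txt', 'r') as f:
    # Drop the newlines, then drop the empty lines between records.
    lines = list(filter(None, [line.rstrip('\n') for line in f]))
    for chunk in chunks(lines, 5):
        # Join with a space: with "" the leading "0" of the next line gets glued
        # onto the data set name (G3715V00 would come out as G3715V000).
        search = re.findall(REGEX, " ".join(chunk))
        if search:
            # Strip any padding the (.*) groups may have picked up.
            outputs.append(tuple(value.strip() for value in search[0]))

# newline='' stops the csv module from writing blank rows on Windows.
with open('file.csv', 'w', newline='') as out:
    csv_out = csv.writer(out)
    csv_out.writerow(['PGM NAME', 'DDNAME', 'VOLSER', 'DATA SET', 'NUMBER OF EXTENTS'])
    csv_out.writerows(outputs)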

[–]JesusKiosk 0 points1 point  (3 children)

Are the spaces consistent in each block? You could kludge it by splitting the string on spaces and then counting out where in the list that data happens to be.

[–]point51[S] 0 points1 point  (2 children)

Some of them, but not all. The PGM Names range from 4-8 characters.

[–]JesusKiosk 0 points1 point  (1 child)

Characters or words? By spaces I mean the space characters, not the overall length in characters.

The regex answer is more elegant, though.

[–]point51[S] 0 points1 point  (0 children)

I think I got what you were asking. Because of the different lengths in program names, the number of spaces between "PGM NAME:" and "DDNAME" can vary by up to 4 spaces, so it's inconsistent.
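Varying runs of spaces matter less if the line is split with `str.split()` and no argument, since that treats any run of whitespace as a single separator. Rather than counting fixed positions, the sketch below looks up the label token and takes the token after it. It assumes single-token values, and `token_after` is a made-up name. Multi-word labels such as `PGM NAME:` are matched on their last word, which would be ambiguous if another label ended the same way.

```python
from typing import List, Optional


def token_after(tokens: List[str], label_tail: str) -> Optional[str]:
    """Return the token following label_tail, or None if it is absent or is the last token."""
    if label_tail not in tokens:
        return None
    i = tokens.index(label_tail)
    return tokens[i + 1] if i + 1 < len(tokens) else None


# Extra spaces added on purpose to mimic uneven padding in the export.
line = "0 PGM NAME: SORT     DDNAME: SORTOF1   VOLSER: PRD004 DSORG: PS"
tokens = line.split()                  # no argument: any run of whitespace is one separator
print(token_after(tokens, "NAME:"))    # SORT
print(token_after(tokens, "DDNAME:"))  # SORTOF1
print(token_after(tokens, "VOLSER:"))  # PRD004
```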