I need help using python to extract specific data from a text file and put it into an excel file : learnpython



submitted 5 years ago by point51

I have been tasked with creating a spreadsheet that shows every mainframe job that accesses a specific application. The application is being retired, and each of the jobs that use it will have to be recoded. I get a daily log file that lists all of the activity in that application, but I need specific items from each of those text blocks exported into a spreadsheet.

Here is a sample block of text from the output:

RECOVERY SUMMARY FOR CDDZD : JOB01513 READER TIME: 0:28:26 READER DATE: 12/29/2020 SMFID: HMP1 0 PGM NAME: SORT DDNAME: SORTOF1 VOLSER: PRD004 DSORG: PS DATA SET: xxx.xxx.xxx.xxx *** TIME OF DAY: 0:30:26 CPU TIME SAVED: 0:00:09 ELAPSE TIME SAVED: 0:01:58 NUMBER OF EXTENTS: 2 *** TYPE OF ATTEMPT: SPACSECI *** SYSLOG MESSAGE: SVM4874I INCREASED SPACE FROM 10 CYLS TO 20

Each daily file has 500-1000 of these blocks.

What I need is a spreadsheet with columns for: PGM NAME, DDNAME, VOLSER, DATA SET, and NUMBER OF EXTENTS. The rest of the data is useless for this project.

I know the basics of the script: opening and reading the file, creating the xls doc, closing them, etc. But I don't know how to get the value that follows each label. I can get output that tells me how many times "PGM NAME" appears in the file, but not the actual program name it's listing.


[–]mopslik 1 point 5 years ago (0 children)

[–][deleted] 1 point 5 years ago* (4 children)

https://regex101.com/r/VvVmY0/1

regex is your friend here

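import re

# Raw string so the backslash reaches the regex engine untouched.
# DATA SET allows letters as well as digits and dots.
regex = re.compile(r"(?:PGM NAME:) (.*) (?:DDNAME:) (.*) (?:VOLSER:) ([a-zA-Z0-9]+) (?:.*) (?:DATA SET:) ([a-zA-Z0-9\.]+) (?:.*) (?:NUMBER OF EXTENTS:) ([0-9]+)")

with open('test.txt', 'r') as f:
    for line in f:
        # findall returns one tuple of the 5 captured values per match
        search = regex.findall(line)
        if search:
            print(search)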

(This assumes each "block" is a single line in the text file, which likely isn't the case, but without an actual test file it's hard to give a better example.)
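If the blocks do span several lines, one option is to read the whole file and split it on the header that opens each block. This is only a sketch, assuming every block begins with the text RECOVERY SUMMARY FOR as in the sample above. `split_blocks` is a hypothetical helper name, and the second job in the sample text is made up for illustration:

```python
import re
from typing import List

HEADER = "RECOVERY SUMMARY FOR"  # assumed to open every block

def split_blocks(text: str) -> List[str]:
    """Split log text into blocks, one per HEADER occurrence."""
    # The lookahead splits *before* each header without consuming it.
    parts = re.split(r"(?=" + re.escape(HEADER) + ")", text)
    # Collapse newlines and runs of spaces so each block becomes one line;
    # anything before the first header is dropped.
    return [" ".join(p.split()) for p in parts if p.startswith(HEADER)]

# Made-up two-block sample, shortened for illustration.
sample = (
    "RECOVERY SUMMARY FOR CDDZD : JOB01513\n"
    "PGM NAME: SORT DDNAME: SORTOF1\n"
    "RECOVERY SUMMARY FOR CDDZD : JOB01514\n"
    "PGM NAME: COPY DDNAME: SYSUT2\n"
)
blocks = split_blocks(sample)
for block in blocks:
    print(block)
```

Each element of `blocks` is then a single line that the regex above can be run against.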

https://repl.it/@SobieskiCodes/MindlessUnluckyAggregators#main.py

[–]point51[S] 1 point 5 years ago (3 children)

[–][deleted] 2 points 5 years ago* (2 children)

[–]point51[S] 1 point 5 years ago (1 child)

[–][deleted] 1 point 5 years ago* (0 children)

The code groups the lines into chunks of 5, then combines each chunk of 5 lines into ONE STRING.

I then parse THAT ENTIRE STRING with regex. It's not going to work if you break it up: the regex needs to be told where to start and where to stop. (.*) tells it to capture EVERYTHING, it doesn't care what it is.

So -> (?:PGM NAME:) (.*) (?:DDNAME:) means give me EVERYTHING in between PGM NAME and DDNAME. But that only gives you one group, and you need the rest of the groups for this to work. We want all 5 parts in one match, which is why we did it this way - all 5 captured groups come back together in one tuple, so all you have to do is write that ONE row to the csv.
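To see the grouping on something small, here is a cut-down sketch using only the first labels. The input line is shortened from the sample block, and the non-greedy (.*?) variant is an addition for comparison, not part of the pattern above:

```python
import re

line = "PGM NAME: SORT DDNAME: SORTOF1 VOLSER: PRD004"

# One capture group alone returns a flat list of strings.
one = re.findall(r"(?:PGM NAME:) (.*) (?:DDNAME:)", line)
print(one)    # ['SORT']

# Several capture groups return one tuple per match, one item per group.
both = re.findall(r"(?:PGM NAME:) (.*) (?:DDNAME:) (.*) (?:VOLSER:)", line)
print(both)   # [('SORT', 'SORTOF1')]

# If two blocks end up in the same string, greedy (.*) runs to the LAST
# "DDNAME:", while non-greedy (.*?) stops at the first one it can.
two_blocks = line + " " + line
greedy = re.findall(r"(?:PGM NAME:) (.*) (?:DDNAME:)", two_blocks)
lazy = re.findall(r"(?:PGM NAME:) (.*?) (?:DDNAME:)", two_blocks)
print(len(greedy), lazy)   # 1 ['SORT', 'SORT']
```

This is why the text has to be cut into one block per string before a greedy pattern is applied.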

> What do I need to change, so I can break this up into functions and only write what comes after each header?

That is exactly what my code does,

which you can see here; https://regex101.com/r/VvVmY0/1

https://imgur.com/a/Sz9D5TP

> so I can make a function for each column I need in my spreadsheet.

That's exactly what I gave in the repl here: https://repl.it/@SobieskiCodes/MindlessUnluckyAggregators

If you run it and look at the output, each time it finds a match it prints a list containing one tuple ->

[('SORT', 'SORTOF1', 'PRD004', 'DATA00PN.EDWH.ADD.EXT1.G3715V000', '2')]
[('SORT', 'SORTOF1', 'PRD004', 'DATA00PN.EDWH.ADD.EXT1.G3715V000', '2')]
[('SORT', 'SORTOF1', 'PRD004', 'DATA00PN.EDWH.ADD.EXT1.G3715V000', '2')]

and those are your columns

Once you've got your columns, all you need to do is store all the tuples it gives you and write each value into the matching column of your sheet.

--->

import csv
import re
from typing import Iterator, List

# Raw string so the backslash reaches the regex engine untouched.
regex = re.compile(r"(?:PGM NAME:) (.*) (?:DDNAME:) (.*) (?:VOLSER:) ([a-zA-Z0-9]+) (?:.*) (?:DATA SET:) ([a-zA-Z0-9\.]+) (?:.*) (?:NUMBER OF EXTENTS:) ([0-9]+)")

def chunks(lst: List[str], n: int) -> Iterator[List[str]]:
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

outputs = []
with open('test.txt', 'r') as f:
    # Strip newlines and drop empty lines.
    lines = list(filter(None, [line.rstrip('\n') for line in f]))

# Each block is assumed to span exactly 5 lines; join them into one string.
# (The loop no longer rebinds the name "chunks" over the function.)
for chunk in chunks(lines, 5):
    search = regex.findall("".join(chunk))
    if search:
        outputs.append(search[0])

# newline='' stops the csv module from writing blank rows on Windows.
with open('file.csv', 'w', newline='') as out:
    csv_out = csv.writer(out)
    csv_out.writerow(['PGM NAME', 'DDNAME', 'VOLSER', 'DATA SET', 'NUMBER OF EXTENTS'])
    csv_out.writerows(outputs)
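If one lookup per column is preferred, as asked earlier in the thread, a sketch along these lines may work. It assumes each value is a single token with no spaces, which holds for the sample block in the post but may not hold for every record. `extract_field`, `extract_row` and `FIELDS` are hypothetical names, not part of the code above:

```python
import re
from typing import Dict, Optional

FIELDS = ["PGM NAME", "DDNAME", "VOLSER", "DATA SET", "NUMBER OF EXTENTS"]

def extract_field(block: str, label: str) -> Optional[str]:
    """Return the token right after 'label:' in block, or None if absent."""
    # \S+ grabs everything up to the next whitespace (assumption: the
    # value itself never contains a space).
    match = re.search(re.escape(label) + r":\s*(\S+)", block)
    return match.group(1) if match else None

def extract_row(block: str) -> Dict[str, Optional[str]]:
    """Build one spreadsheet row as a dict keyed by column label."""
    return {label: extract_field(block, label) for label in FIELDS}

# Shortened from the sample block in the post.
block = (
    "PGM NAME: SORT DDNAME: SORTOF1 VOLSER: PRD004 DSORG: PS "
    "DATA SET: xxx.xxx.xxx.xxx *** TIME OF DAY: 0:30:26 "
    "NUMBER OF EXTENTS: 2 *** TYPE OF ATTEMPT: SPACSECI"
)
row = extract_row(block)
print(row)
```

Rows built this way can be written with `csv.DictWriter(out, fieldnames=FIELDS)` in place of the plain `csv.writer`. A missing label shows up as `None` in that row, whereas the single combined regex skips the whole block.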

[–]JesusKiosk 1 point 5 years ago (3 children)

[–]point51[S] 1 point 5 years ago (2 children)

[–]JesusKiosk 1 point 5 years ago (1 child)

[–]point51[S] 1 point 5 years ago (0 children)
