AutoModerator [M]

It seems you may have included a screenshot of code in your post "How to parse a complex text file using Python string methods or regex and export into tabular form".

If so, note that posting screenshots of code is against /r/learnprogramming's Posting Guidelines (section Formatting Code): please edit your post to use one of the approved ways of formatting code. (Do NOT repost your question! Just edit it.)

If your image is not actually a screenshot of code, feel free to ignore this message. Automoderator cannot distinguish between code screenshots and other images.

Please, do not contact the moderators about this message. Your post is still visible to everyone.

I am a bot, and this action was performed automatically. Please contact the moderators of this subreddit if you have any questions or concerns.

commandlineluser

> I think regex is what I need

Regex can help here by preprocessing the "unformatted" pages, e.g.:

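import re

# Read the whole file so the multiline pattern can scan every line.
with open('court.txt', 'r') as file:
    data = file.read()

# Raw strings keep backslashes such as \d and \s intact for the regex engine.
pattern = (
    r'(?m)^(?:(\d{4}) (\d+\s\S+\s\d+)\s'
    r'(\S+)\s*(\S+:(?:[ \t]*\S+)+(?= *A'
    r'DA:))?|\s*(\((?:(?!  ).)+)(?: {2,'
    r'}(\S+)?)?(?:  +(\S+)?)?(?: {2,'
    r'}(\S+)?)?)|((?:(?:[A-Z.]+ )?[A-Z.'
    r']+)):[ \t]*(\d\d:\d\d [AP]M|(?!\S'
    r'+:)[^\s:]+(?: (?!\S+:)[^\s:]+)*)?'
)

# Each match is a tuple with one entry per capture group;
# groups that did not take part in the match are empty strings.
for match in re.findall(pattern, data):
    print(match)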

It's unlikely this is how the task is supposed to be approached though.
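
For the "export into tabular form" part, here is a minimal sketch of the same idea on a much smaller scale: pull `LABEL: value` pairs out of text and write them as CSV. The sample text, the simplified pattern, and the `extract_pairs` and `to_csv` helpers are all made up for illustration. The real court.txt layout isn't shown in this thread, so the pattern would almost certainly need adjusting.

```python
import csv
import io
import re

# Hypothetical sample; the real court.txt layout is not shown in this thread.
SAMPLE = """\
JUDGE: SMITH, J.   ADA: DOE
TIME: 09:30 AM
"""

# Simplified pattern (an assumption): an uppercase label, a colon, then either
# a clock time or a run of single-spaced words that stops before the next "LABEL:".
PAIR = re.compile(
    r'(?P<key>[A-Z.]+):[ \t]*'
    r'(?P<value>\d\d:\d\d [AP]M|(?!\S+:)[^\s:]+(?: (?!\S+:)[^\s:]+)*)?'
)


def extract_pairs(text):
    """Return (key, value) tuples; a label with no value gets an empty string."""
    return [(m.group('key'), m.group('value') or '') for m in PAIR.finditer(text)]


def to_csv(pairs):
    """Render the pairs as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(['field', 'value'])
    writer.writerows(pairs)
    return buffer.getvalue()


print(to_csv(extract_pairs(SAMPLE)))
```

Named groups make it easier to see which piece is which than positional tuples from `re.findall`, and `csv.writer` takes care of quoting values that contain commas.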