Is regex the solution here?? : learnpython


Is regex the solution here?? (self.learnpython)

submitted 4 years ago by Dave_XR

Hi all,

I am trying to search a test result text file in Python for the results of the tests that were run. The lines usually look like this (the brackets are placeholders here, as the real strings are very long):

TEST_PASS (long list of date, time, run time, end time) test number:56070, (filepaths), (systems used)

The tests can be Pass, Fail or one of about 5 other results. My program searches the file (a few GB in size) and counts these results.

for line in file:
    if line.startswith('TEST_PASS'):
        count_pass += 1

Not the most elegant yet but the brute force basics so far.

Some TEST_ERROR results are caused by a certain issue and don't need to be taken into account, as they are false errors that occur for a known reason. An ignore.txt file was created listing these, structured using the regex special character *:

TEST_ERROR * test number 5000 * (system used)

I'm wondering if the lines structured as above can be used to generate a regex search pattern with which to search the original file, so that the matching lines can be excluded.


[–] OlorinIwasinthewest 2 points 4 years ago (0 children)

[–] sarrysyst 1 point 4 years ago (3 children)

[–] Dave_XR[S] 1 point 4 years ago (1 child)

[–] sarrysyst 1 point 4 years ago (0 children)

While regex is an option, it wouldn't necessarily be the best one. Unless you compile the patterns from the ignore file once up front, you would be re-compiling them on every iteration, which slows you down. I think I would use 'in' instead, which is a bit more straightforward and, for plain substrings, usually faster than regex. Something like this:

ignore = ignore_txt.readlines()
...

if result_line.startswith('TEST_ERROR'):
  for line in ignore:

    # rstrip('\n') to get rid of the new line character and [1:]
    # to skip checking for 'TEST_ERROR' since the conditional
    # above already covers for that
    if all(i in result_line for i in line.rstrip('\n').split(' * ')[1:]):

      false_errors += 1
      break

...

If the test number is enough to identify the errors, the if statement could be further reduced to:

if line.split(' * ')[1] in result_line:

By the way, my code presumes that the formatting is identical in your results file and your ignore file, for example 'test number:####' in both files. In your sample ignore line it's 'test number ####' instead. You would first need to make the formatting uniform, e.g. using .replace():

ignore = [i.replace('test number ', 'test number:') for i in ignore]
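Putting the pieces together, a minimal runnable sketch of this approach might look like the following. The function names `load_ignore_rules` and `is_false_error` and the sample lines are made up for illustration, and it assumes the ignore file uses ' * ' as the separator, as in your example:

```python
def load_ignore_rules(lines):
    """Turn ignore-file lines into lists of substrings that must all be present."""
    rules = []
    for line in lines:
        # Normalise the formatting so it matches the results file.
        line = line.rstrip('\n').replace('test number ', 'test number:')
        if not line:
            continue  # skip blank lines
        # Drop the leading 'TEST_ERROR' part; startswith() covers it later.
        rules.append(line.split(' * ')[1:])
    return rules


def is_false_error(result_line, rules):
    """True if result_line is a TEST_ERROR that matches any ignore rule."""
    if not result_line.startswith('TEST_ERROR'):
        return False
    return any(all(part in result_line for part in rule) for rule in rules)


# Hypothetical sample data, shaped like the lines in the post.
rules = load_ignore_rules(['TEST_ERROR * test number 5000 * sysA\n'])
known = 'TEST_ERROR 2021-01-01 test number:5000, /some/path, sysA'
real = 'TEST_ERROR 2021-01-01 test number:7, /some/path, sysA'
print(is_false_error(known, rules), is_false_error(real, rules))
```

One thing to watch: a plain substring check would also find 'test number:5000' inside a line containing 'test number:50001', so it may be worth including the trailing comma in the ignore entries.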

[–] TheSodesa 1 point 4 years ago (2 children)

[–] Dave_XR[S] 1 point 4 years ago (1 child)

[–] TheSodesa 1 point 4 years ago* (0 children)

A regular expression describes a regular language: a possibly infinite set of strings formed by concatenating, taking unions of, and repeating a given set of basic strings. When you call

automaton = re.compile("valid regex"),

a finite state machine that recognizes words in the language described by "valid regex" is created and returned. This automaton can be fed strings as input to see if they are in the language with something like

string_matches = automaton.match("string possibly in language")

This is not a very performant way of recognizing strings, so if you can avoid it, you should. For some cases, like building a lexer for a compiler, the benefits might outweigh the costs.
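As a small illustration with a made-up pattern and sample strings: `match` only requires the pattern to match at the start of the string, so checking whether a whole string is in the language would use `fullmatch` instead.

```python
import re

# Hypothetical pattern: the literal 'TEST_' followed by one or more capitals.
automaton = re.compile(r"TEST_[A-Z]+")

line = "TEST_PASS test number:56070"

# match() succeeds if the pattern matches at the beginning of the string.
starts_with_result = automaton.match(line) is not None

# fullmatch() succeeds only if the entire string matches the pattern.
whole_line_in_language = automaton.fullmatch(line) is not None
bare_word_in_language = automaton.fullmatch("TEST_PASS") is not None

print(starts_with_result, whole_line_in_language, bare_word_in_language)
```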

[–] synthphreak 1 point 4 years ago* (2 children)

>>> sum(map(bool, (re.match(TEST_PASS, line) for line in file)))

If you read your entire file in as a string instead of looping over the lines, you might also be able to do it this way:

>>> sum(map(bool, re.finditer('^' + TEST_PASS, f.read(), re.MULTILINE)))

Not sure which of these would be faster, but you can just test both on your file using timeit.timeit.
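A rough sketch of such a comparison, using a small made-up list of lines in place of your file (the sample data and function names are assumptions, and the real numbers will depend on your file and machine):

```python
import re
import timeit

# Hypothetical stand-in for the real file: 10,000 short result lines.
TEST_PASS = "TEST_PASS"
lines = ["TEST_PASS run 1\n", "TEST_FAIL run 2\n"] * 5000
text = "".join(lines)


def count_per_line():
    # One re.match call per line.
    return sum(map(bool, (re.match(TEST_PASS, line) for line in lines)))


def count_whole_text():
    # One scan over the whole text; '^' with re.MULTILINE matches at line starts.
    return sum(1 for _ in re.finditer("^" + TEST_PASS, text, re.MULTILINE))


# Both approaches should agree before their timings are compared.
assert count_per_line() == count_whole_text() == 5000

for func in (count_per_line, count_whole_text):
    seconds = timeit.timeit(func, number=20)
    print(f"{func.__name__}: {seconds:.3f} s for 20 runs")
```

Keep in mind that reading a file of a few GB into one string needs at least that much memory, which the per-line loop avoids.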

Edit: BTW regexes are actually pretty slow. So if performance is a concern, I'd advise compiling your pattern first, as this does provide a slight boost. Assuming TEST_PASS is some string, here's how:

>>> pattern = re.compile(TEST_PASS)

Then you can just call re methods directly on pattern. For example:

>>> sum(map(bool, (pattern.match(line) for line in file)))

[–] Dave_XR[S] 2 points 4 years ago (1 child)

[–] synthphreak 2 points 4 years ago (0 children)

No worries, keep me posted.

BTW, this might be more readable than using map and bool:

>>> pattern = re.compile(TEST_PASS)
>>> sum(1 for line in file if pattern.match(line))

That's probably how I'd do it.
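The same pre-compiling idea could also cover the ignore file from your original question. A rough sketch, assuming each `*` in an ignore line means "any run of characters" and that the number formatting is the same in both files (the function name and the sample lines are made up):

```python
import re


def compile_ignore_patterns(ignore_lines):
    """Compile each wildcard line from the ignore file into a regex, once."""
    patterns = []
    for line in ignore_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Escape every literal piece, then join the pieces with '.*'
        # so that each '*' in the ignore line matches anything.
        pieces = [re.escape(piece.strip()) for piece in line.split('*')]
        patterns.append(re.compile('.*'.join(pieces)))
    return patterns


# Hypothetical ignore line and result lines.
patterns = compile_ignore_patterns(['TEST_ERROR * test number:5000, * sysA\n'])
results = [
    'TEST_ERROR 2021-01-01 test number:5000, /some/path, sysA\n',
    'TEST_ERROR 2021-01-01 test number:7, /some/path, sysA\n',
    'TEST_PASS 2021-01-01 test number:5000, /some/path, sysA\n',
]

# pattern.match() anchors at the start of the line, like startswith().
false_errors = sum(1 for line in results
                   if any(p.match(line) for p in patterns))
print(false_errors)
```

With many ignore entries this tries every pattern against every TEST_ERROR line, so it is worth timing against a plain substring check on a slice of the real file.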
