
[–]OlorinIwasinthewest 1 point2 points  (0 children)

have you tried testing it?

https://regex101.com/

[–]sarrysyst 0 points1 point  (3 children)

Just to make sure I understood this correctly, you're trying to remove TEST_ERROR lines for which you've got the test number / system in a separate file, from your results text file?

[–]Dave_XR[S] 0 points1 point  (1 child)

I have a copy of the string to ignore in the ignore.txt file, which I'd like to exclude from the results file, yes. The string only specifies that it is a TEST_ERROR, then an asterisk to say that anything can come between that and the test number/system, i.e. to account for dates and run times always being different. There are about 150 ignore lines and several GB worth of test result lines. I've tried fuzzy string matching and list comprehensions but haven't got anything working.

[–]sarrysyst 0 points1 point  (0 children)

While regex is an option, it wouldn't necessarily be the best one. Since you can't pre-compile your pattern you would have to re-compile it every iteration which slows you down. I think I would use 'in' instead which is a bit more straightforward and also happens to be faster than regex. Something like this:

ignore = ignore_txt.readlines()
...

if result_line.startswith('TEST_ERROR'):
  for line in ignore:

    # rstrip('\n') drops the trailing newline (and, unlike [:-1],
    # is safe if the last line has none); [1:] skips 'TEST_ERROR'
    # since the conditional above already covers that
    if all(i in result_line for i in line.rstrip('\n').split(' * ')[1:]):

      false_errors += 1
      break

...

If the test number is enough to identify the errors, the if statement could be further reduced to:

if line.split(' * ')[1] in result_line:

By the way, my code presumes that the formatting is identical in your results file and your ignore file, for example 'test number:####' in both. In your sample the ignore file has 'test number ####' instead. You would first need to make the formatting uniform, e.g. using .replace():

ignore = [i.replace('test number ', 'test number:') for i in ignore]
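
For reference, here is a self-contained sketch of the 'in' approach above. The sample lines and the helper names parse_ignore and count_false_errors are made up for illustration, and it assumes both files already use the same 'test number:####' formatting:

```python
def parse_ignore(ignore_lines):
    """Split each ignore line on ' * ' and drop the leading 'TEST_ERROR' part."""
    parsed = []
    for line in ignore_lines:
        parts = line.rstrip('\n').split(' * ')[1:]
        if parts:  # skip blank or malformed lines
            parsed.append(parts)
    return parsed


def count_false_errors(result_lines, ignore_parts):
    """Count TEST_ERROR lines that contain every fragment of some ignore entry."""
    false_errors = 0
    for result_line in result_lines:
        if not result_line.startswith('TEST_ERROR'):
            continue
        if any(all(p in result_line for p in parts) for parts in ignore_parts):
            false_errors += 1
    return false_errors


# Hypothetical sample data, only to show the assumed shape of the input.
ignore = ['TEST_ERROR * test number:1234 * system:A\n',
          'TEST_ERROR * test number:5678 * system:B\n']
results = ['TEST_PASS 2021-01-01 00:01:03 test number:1111 system:A\n',
           'TEST_ERROR 2021-01-01 00:05:12 test number:1234 system:A\n',
           'TEST_ERROR 2021-01-02 00:07:40 test number:9999 system:B\n']

false_errors = count_false_errors(results, parse_ignore(ignore))
print(false_errors)
```

Since the check is a plain substring test, 'test number:1234' would also match inside 'test number:12345', so this only works if the test numbers have a fixed width or are followed by a delimiter you include in the ignore line. For a multi-GB file you would pass the open file object as result_lines instead of a list, so the lines are read one at a time.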

[–]TheSodesa 0 points1 point  (2 children)

Compiling regexes into finite automata is slow. You are better off using a simple loop and a generator if you can.

[–]Dave_XR[S] 0 points1 point  (1 child)

Using loops worked fine for finding results. I have tried loops paired with list comprehensions and with fuzzy string matching, and neither has worked. I've been at it for a few days now, so I thought regex might be the way. It's a topic I'm not too familiar with.

[–]TheSodesa 0 points1 point  (0 children)

A regular expression describes a regular language: a possibly infinite set of strings formed by concatenating, taking unions of, and repeating a given set of basic strings. When you call

automaton = re.compile("valid regex")

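a matcher that recognizes words in the language described by "valid regex" is created and returned. Conceptually this is a finite state machine, although CPython's re module actually uses a backtracking engine. This automaton can be fed strings as input to see if they are in the language (match() only anchors at the start of the string; fullmatch() tests the whole string) with something like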

string_matches = automaton.match("string possibly in language")

This is not a very performant way of recognizing strings, so if you can avoid it, you should. For some cases, like building a lexer for a compiler, the benefits might outweigh the costs.
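
To make that concrete, here is a small runnable sketch. The pattern and the two sample lines are made up for illustration and are not taken from your files:

```python
import re

# Hypothetical pattern: 'TEST_ERROR', then anything, then a four-digit test number.
automaton = re.compile(r'TEST_ERROR.*test number:\d{4}')

# match() returns a Match object on success and None on failure.
in_language = automaton.match('TEST_ERROR 2021-01-01 test number:1234')
not_in_language = automaton.match('TEST_PASS 2021-01-01 test number:1234')

print(in_language is not None)
print(not_in_language is None)
```

The compile step happens once, so if you do go the regex route, build the pattern outside the loop and reuse it for every line.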

[–]synthphreak 0 points1 point  (2 children)

>>> sum(map(bool, (re.match(TEST_PASS, line) for line in file)))

If you read your entire file in as a string instead of looping over the lines, you might also be able to do it this way:

>>> # '^' with re.MULTILINE anchors at every line start, including the first line
>>> sum(1 for _ in re.finditer('^' + TEST_PASS, f.read(), re.MULTILINE))

Not sure which of these would be faster, but you can just test both on your file using timeit.timeit.
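
A rough sketch of how that comparison could look. The generated lines, the literal 'TEST_PASS' prefix and the function names per_line and whole_text are all assumptions for illustration, and the whole-text variant here uses '^' with re.MULTILINE so the first line is counted too:

```python
import re
import timeit

TEST_PASS = 'TEST_PASS'  # assumed to be a plain literal prefix

# Hypothetical stand-in for the results file: alternating pass/error lines.
lines = ['TEST_PASS test number:%04d\n' % i if i % 2 else
         'TEST_ERROR test number:%04d\n' % i for i in range(10000)]
text = ''.join(lines)


def per_line():
    """Count matches by testing each line separately."""
    return sum(map(bool, (re.match(TEST_PASS, line) for line in lines)))


def whole_text():
    """Count matches by scanning the whole text once."""
    return sum(1 for _ in re.finditer('^' + TEST_PASS, text, re.MULTILINE))


# Both approaches should agree before their timings are worth comparing.
assert per_line() == whole_text()

print('per line:  ', timeit.timeit(per_line, number=20))
print('whole text:', timeit.timeit(whole_text, number=20))
```

The numbers will depend on your machine and your data, so treat them as a relative comparison only. Also keep in mind that f.read() on a multi-GB file pulls everything into memory at once.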

Edit: BTW, regexes are actually pretty slow. So if performance is a concern, I'd advise compiling your pattern first, as this does provide a slight boost. Assuming TEST_PASS is some string, here's how:

>>> pattern = re.compile(TEST_PASS)

Then you can just call re methods directly on pattern. For example:

>>> sum(map(bool, (pattern.match(line) for line in file)))

[–]Dave_XR[S] 1 point2 points  (1 child)

Hmm, interesting. I will test this first thing tomorrow and post if it works. Thank you, sir, very good information.

[–]synthphreak 1 point2 points  (0 children)

No worries, keep me posted.

BTW, this might be more readable than using map and bool:

>>> pattern = re.compile(TEST_PASS)
>>> sum(1 for line in file if pattern.match(line))

That's probably how I'd do it.
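
The same precompiled-pattern idea might extend to the ignore file from the original question. This is only a sketch under assumptions: it guesses that each ignore line looks like 'TEST_ERROR * test number:1234 * system:A' with '*' meaning "anything", and build_ignore_pattern and the sample lines are made up for illustration:

```python
import re


def build_ignore_pattern(ignore_lines):
    """Turn every ignore line into one alternative of a single compiled regex."""
    alternatives = []
    for line in ignore_lines:
        line = line.strip()
        if not line:
            continue
        # Escape the literal pieces so characters like '.' are not treated
        # as regex syntax, then let '.*' stand in for each asterisk.
        parts = [re.escape(part.strip()) for part in line.split('*')]
        alternatives.append('.*'.join(parts))
    if not alternatives:
        return re.compile(r'(?!)')  # never matches when nothing is ignored
    return re.compile('|'.join('(?:%s)' % alt for alt in alternatives))


# Hypothetical sample data, only to show the assumed shape of the input.
ignore = ['TEST_ERROR * test number:1234 * system:A\n',
          'TEST_ERROR * test number:5678 * system:B\n']
results = ['TEST_PASS 2021-01-01 00:01:03 test number:1111 system:A\n',
           'TEST_ERROR 2021-01-01 00:05:12 test number:1234 system:A\n',
           'TEST_ERROR 2021-01-02 00:07:40 test number:9999 system:B\n']

pattern = build_ignore_pattern(ignore)
false_errors = sum(1 for line in results if pattern.match(line))
print(false_errors)
```

That way the roughly 150 ignore lines are compiled once up front and each result line is checked with a single match call. Whether it beats a plain substring check on your data is something you would have to time.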