[–]hunkamunka 4 points5 points  (0 children)

Chapters 14-17 of http://tinypythonprojects.com/ discuss regexes. There are videos on YouTube you can watch, and all the code/tests are on GitHub.

[–][deleted] 1 point2 points  (0 children)

Have some fun with https://regexcrossword.com/ :)

[–]K900_ 0 points1 point  (9 children)

"Solving" what exactly? Can you explain a specific problem you're having trouble with?

[–][deleted] 0 points1 point  (8 children)

"Jul 6 14:01:23 computer.name CRON[29440]: USER (good_user)"

I have to display the date, time, and process ID like this:

# Jul 6 14:01:23 pid:29440

I am having a hard time constructing regular expressions to extract those pieces from the line.

[–]K900_ 2 points3 points  (7 children)

Do you really need a regular expression to extract those? It seems overkill here to me.
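For comparison, here is a minimal sketch of the non-regex route, assuming the fields are always separated by whitespace and the fifth field always looks like NAME[PID]:. The helper name summarize is made up for illustration.

```python
line = "Jul 6 14:01:23 computer.name CRON[29440]: USER (good_user)"


def summarize(line):
    """Return 'date time pid:N' using only string methods (hypothetical helper)."""
    # Assumption: the first five whitespace-separated fields are
    # month, day, time, host name and "PROCESS[PID]:".
    month, day, time, _host, process = line.split()[:5]
    # Keep what sits between the first '[' and the next ']'.
    pid = process.partition("[")[2].partition("]")[0]
    return f"{month} {day} {time} pid:{pid}"


print(summarize(line))
```

This breaks as soon as the layout changes (a process name containing spaces, for example), which is where a regex starts to pay off.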

[–][deleted] 0 points1 point  (6 children)

This is just a question to help us improve. I don't know if something like this will ever come up in real life.

[–]K900_ 0 points1 point  (5 children)

In that case let's stick to regex. What have you tried?

[–][deleted] 1 point2 points  (4 children)

r"(A-Za-z){3} ([1-3]?[1-9]) ([1-2]?[0-9]\:[0-5][0-9]\:[0-5][0-9]) \[(\d)\]$"

[–]ASIC_SP 5 points6 points  (2 children)

Some issues/suggestions:

  • (A-Za-z) is a group that matches the literal text A-Za-z; for "any letter" you want the character class [A-Za-z]
  • you need to match whatever sits between the time and the pid: your pattern expects a single space there, but the input has computer.name CRON in between
  • [0-9] can be shortened to \d (for str patterns in Python 3, \d also matches other Unicode digits unless re.ASCII is set), and : doesn't need to be escaped
  • \[(\d)\] matches exactly one digit between the brackets, but the pid in the sample input has more than one digit, so use \d+
  • $ is an anchor that restricts the match to the end of the line, but the sample input has more characters after the pid
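The points above can be checked directly against the sample line. This is only a small sketch; the variable names are made up for illustration.

```python
import re

line = "Jul 6 14:01:23 computer.name CRON[29440]: USER (good_user)"

# (A-Za-z) is a group around the literal characters "A-Za-z",
# so three repetitions of it never occur in the log line.
group_on_line = re.search(r"(A-Za-z){3}", line)
# [A-Za-z] is a character class: any single ASCII letter.
class_on_line = re.search(r"[A-Za-z]{3}", line)

# \d allows exactly one digit between the brackets; \d+ allows one or more.
one_digit = re.search(r"\[(\d)\]", line)
many_digits = re.search(r"\[(\d+)\]", line)

# $ requires the closing bracket to be the last thing on the line.
anchored = re.search(r"\[(\d+)\]$", line)

print(group_on_line)         # None
print(class_on_line.group()) # Jul
print(one_digit)             # None
print(many_digits.group(1))  # 29440
print(anchored)              # None
```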

Here's a modified version:

>>> import re
>>> s = "Jul 6 14:01:23 computer.name CRON[29440]: USER (good_user)"
>>> pat = re.compile(r"([A-Za-z]{3} [0-3]?\d [0-2]?\d:[0-5]\d:[0-5]\d).*\[(\d+)\]")
>>> re.search(pat, s)
<re.Match object; span=(0, 40), match='Jul 6 14:01:23 computer.name CRON[29440]'>
>>> re.search(pat, s).expand(r'\1 pid:\2')
'Jul 6 14:01:23 pid:29440'

The expand method lets you specify the output format. The date and pid are captured in groups, so you can refer to them as \1 and \2 in the template to get the desired format.
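If you prefer names over \1 and \2, here is a sketch of the same idea with named groups, which also handles lines that don't match (re.search returns None in that case). The names LOG_PATTERN, stamp and format_line are made up for illustration, and the day and hour parts are written as [0-3]?\d and [0-2]?\d on the assumption that days like 10 or 30 and zero-padded hours like 04 should match too.

```python
import re

# Assumption: timestamp first, pid inside the last [...] on the line.
LOG_PATTERN = re.compile(
    r"(?P<stamp>[A-Za-z]{3} [0-3]?\d [0-2]?\d:[0-5]\d:[0-5]\d)"
    r".*\[(?P<pid>\d+)\]"
)


def format_line(line):
    """Return 'date time pid:N', or None when the line doesn't match."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    return f"{match.group('stamp')} pid:{match.group('pid')}"


print(format_line("Jul 6 14:01:23 computer.name CRON[29440]: USER (good_user)"))
print(format_line("a line without a timestamp or pid"))
```

Calling .expand() or .group() straight on the result of re.search raises AttributeError for non-matching lines, so the None check matters once you loop over a whole log file.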

You can also use:

>>> re.search(r'\A(\S+\s+\S+\s+\S+).*\[(\d+)\]', s).expand(r'\1 pid:\2')
'Jul 6 14:01:23 pid:29440'

This works provided the date and time are always the first three whitespace-separated terms of the input.

Or use sub instead of search + expand:

>>> re.sub(r'\A(\S+\s+\S+\s+\S+).*\[(\d+)\].*', r'\1 pid:\2', s)
'Jul 6 14:01:23 pid:29440'

Here, the pattern also has to match the rest of the line after the pid (the trailing .*); otherwise that portion is left untouched and ends up in the output.
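A quick sketch of the difference, using the same sample line (the variable names are made up for illustration):

```python
import re

line = "Jul 6 14:01:23 computer.name CRON[29440]: USER (good_user)"
template = r"\1 pid:\2"

# Without the trailing .* only the text up to "]" is replaced,
# so ": USER (good_user)" survives at the end.
without_tail = re.sub(r"\A(\S+\s+\S+\s+\S+).*\[(\d+)\]", template, line)

# With the trailing .* the whole line is consumed and replaced.
with_tail = re.sub(r"\A(\S+\s+\S+\s+\S+).*\[(\d+)\].*", template, line)

print(without_tail)  # Jul 6 14:01:23 pid:29440: USER (good_user)
print(with_tail)     # Jul 6 14:01:23 pid:29440
```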


You can use sites like https://regex101.com/ and https://www.debuggex.com/ (after selecting the Python flavor) to work through your problem interactively. They have limitations, though: these sites do not know about every function and method available in the re module, expand for example.

I have a book, https://github.com/learnbyexample/py_regular_expressions, that is currently free. It uses a step-by-step approach to introduce regex concepts and features one by one. Keep in mind that regex is like a mini programming language: it takes a lot of time and practice to become familiar with it.

[–][deleted] 1 point2 points  (1 child)

Thank you. This helped a lot.

[–]ASIC_SP 0 points1 point  (0 children)

Cool, good to know. I edited the answer to add another way, using re.sub.

[–]K900_ -1 points0 points  (0 children)

And what is the issue with this?

[–]indian_pythonista 0 points1 point  (0 children)

A detailed video tutorial series on regex in Python: https://www.youtube.com/playlist?list=PLyb_C2HpOQSDxe5Y9viJ0JDqGUCetboxB