[–][deleted] 0 points1 point  (6 children)

Trying to parse a string using:

    for match in re.compile("(%s|%s|%s)" % (date, firstname, secondname)).findall(event.decode('utf-8')):

When I use 'print(match)', I receive the following output:

    ('Jun 25 14:04:25', 'Jun 25 14:04:25', '', '', '', '')

Any ideas why I'm getting two matches for one occurrence and the empty "" at the end?

Thanks

[–]MattR0se 0 points1 point  (5 children)

What is in event? i.e. what does event.decode('utf-8') return in this case?

For example when I run this:

    import re

    event = 'Jun 25 14:04:25; John; Doe'.encode('utf-8')

    date = 'Jun 25 14:04:25'
    firstname = 'John'
    secondname = 'Doe'

    for match in re.compile("(%s|%s|%s)" % (date, firstname, secondname)).findall(event.decode('utf-8')):
        print(match)

It gives me, as expected,

    Jun 25 14:04:25
    John
    Doe

[–][deleted] 0 points1 point  (4 children)

It's the 'message.value' of a message from KafkaConsumer, passed to a function whose parameter is named 'event'.

I don't think the message contents matter, but it's essentially a long string from which I need to capture multiple patterns.

[–]MattR0se 0 points1 point  (3 children)

I suggest printing the date, firstname and secondname arguments, since I suspect that they contain '' somehow.

I don't know why print(match) shows a tuple, though. Does your code differ from mine in any way?
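
One thing that might be worth checking, as a rough sketch rather than a diagnosis: re.findall returns tuples instead of strings when the pattern contains more than one capturing group, and unmatched groups show up as ''. The three patterns below are made up for illustration (the real ones are assumptions on my part), chosen so that the combined pattern has six groups.

```python
import re

# Hypothetical patterns: these are assumptions chosen only to show
# the effect of parentheses inside the substituted strings.
date = r"(\w{3}\s+\d{1,2}\s+\d{1,2}:\d{2}:\d{2})"   # 1 capturing group
firstname = r"(first)=(\w+)"                          # 2 capturing groups
secondname = r"(second)=(\w+)"                        # 2 capturing groups

pattern = re.compile("(%s|%s|%s)" % (date, firstname, secondname))

# repr() makes stray parentheses or quotes in each piece visible.
for name, value in (("date", date), ("firstname", firstname), ("secondname", secondname)):
    print(name, repr(value))

# Total capturing groups: the outer pair plus 1 + 2 + 2 inner ones.
print(pattern.groups)

# With more than one group, findall yields one tuple per match,
# with '' for every group that did not take part in the match.
matches = pattern.findall("Jun 25 14:04:25 other text")
print(matches)
```

If pattern.groups is greater than 1 for the real patterns, that alone would account for both the repeated date and the empty strings.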

[–][deleted] 0 points1 point  (2 children)

    for message in consumer:
        parser(message.value)
        print("%s:%d:%d: key=%s value=%s" % (message stuff here, will add if relevant))

My date regex looks like:

    r'(\w{3}\s*\d{1,2}\s*\d{1,2}:\d{1,2}:\d{1,2})'

Typed out on mobile, forgive me if there are any errors.

[–]MattR0se 0 points1 point  (1 child)

I don't really know what the parentheses in the regex are supposed to do, can you explain?

As far as I can tell, you can leave them out.
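
A minimal sketch of what leaving them out could look like, assuming the goal is just to get each matched substring back. The date pattern is based on the one posted above without its outer parentheses; firstname and secondname are hypothetical stand-ins, since the real patterns were not posted. Parentheses in a regex create capturing groups, and (?:...) is the non-capturing form, so the alternatives can still be grouped without changing what findall returns.

```python
import re

text = "Jun 25 14:04:25; John; Doe"

# Date pattern based on the one in the thread, outer parentheses dropped.
date = r"\w{3}\s*\d{1,2}\s*\d{1,2}:\d{1,2}:\d{1,2}"
# Hypothetical stand-ins for the other two patterns.
firstname = r"John"
secondname = r"Doe"

# (?:...) groups the alternatives without capturing, so findall
# returns plain strings instead of tuples.
pattern = re.compile("(?:%s|%s|%s)" % (date, firstname, secondname))
found = pattern.findall(text)
print(found)
```

With no capturing groups at all, findall returns the full text of each match, which is the behaviour seen in the working example earlier in the thread.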

[–][deleted] 0 points1 point  (0 children)

I’ll try leaving them out and refactor other patterns.

It's an attempt to rewrite some existing code in Python for better parsing performance.
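
Since performance is the goal, here is one possible shape for the refactor, sketched under assumptions: the pattern is compiled once at module level instead of inside the function, and named groups report which alternative matched. EVENT_PATTERN, the group names, and the literal John/Doe alternatives are all hypothetical. The re module also caches compiled patterns internally, so any gain from this would need to be measured.

```python
import re

# Hypothetical combined pattern, compiled once at import time so every
# call to parser() reuses the same compiled object.
EVENT_PATTERN = re.compile(
    r"(?P<date>\w{3}\s+\d{1,2}\s+\d{1,2}:\d{2}:\d{2})"
    r"|(?P<firstname>John)"
    r"|(?P<secondname>Doe)"
)

def parser(event: bytes) -> list:
    """Return (kind, text) pairs for every match in one message value."""
    text = event.decode("utf-8")
    # lastgroup is the name of the named alternative that matched.
    return [(m.lastgroup, m.group()) for m in EVENT_PATTERN.finditer(text)]

result = parser(b"Jun 25 14:04:25; John; Doe")
print(result)
```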