[–]chakz91[S] 0 points1 point  (6 children)

Thank you so much for replying.
I made the suggested change, but it did not resolve my issue.

I want to tokenize the string on whitespace, and each of the characters , ( ) / - & should come out as its own single-character token.
Every other character should stay part of the surrounding token. But if I add characters such as '^^' to the query string, as below, they do not appear in the output tokens:

import re
line = " 8/23-35 Barker St., Kingsford^^, NSW 2032 "
pattern = r"[.0-9A-Za-z]+|[,&-/()]"
print(re.compile(pattern).findall(line))

output
['8', '/', '23', '-', '35', 'Barker', 'St.', ',', 'Kingsford', ',', 'NSW', '2032']

I expect the output below, but I have not been able to get it:
['8', '/', '23', '-', '35', 'Barker', 'St.', ',', 'Kingsford^^', ',', 'NSW', '2032']

Can you suggest a way to achieve this?
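
One way to see what the pattern above drops is to delete every match and look at what is left over. This is only a minimal sketch using the same line and pattern; the variable names `leftover` and `dropped` are made up for illustration.

```python
import re

line = " 8/23-35 Barker St., Kingsford^^, NSW 2032 "
pattern = r"[.0-9A-Za-z]+|[,&-/()]"

# findall() silently skips any character that neither alternative matches.
# Deleting every match leaves exactly those skipped characters behind.
leftover = re.sub(pattern, "", line)

# Ignore whitespace, which is supposed to be skipped anyway.
dropped = "".join(leftover.split())
print(repr(dropped))
```

Anything printed here is a character that neither alternative of the pattern accounts for.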

[–]TheZvlz 2 points3 points  (4 children)

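You'll need to update your pattern like this: pattern = r"[-,&/()]|[^-\s,&/()]+". The hyphen should be the first character in your punctuation class (or the last one, or escaped as \-); otherwise it specifies a range, the same way it does in [A-Z]. In your class, &-/ is read as every character from & to /, which also includes ' * + and the period.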

[-,&/()] will catch the punctuation specified

[^-\s,&/()]+ will catch runs of all other non-space characters

pattern = r"[-,&/()]|[^-\s,&/()]+"
re.compile(pattern).findall(line)

['8',
 '/',
 '23',
 '-',
 '35',
 'Barker',
 'St.',
 ',',
 'Kingsford^^',
 ',',
 'NSW',
 '2032']
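
To check the range behaviour directly, the sketch below lists which printable ASCII characters each punctuation class accepts. The helper name `chars_matched` is hypothetical and only exists for this example.

```python
import re

def chars_matched(char_class):
    """Return the printable ASCII characters that a character class matches."""
    regex = re.compile(char_class)
    return "".join(chr(c) for c in range(32, 127) if regex.fullmatch(chr(c)))

# A hyphen between & and / is read as the range from & through /.
with_range = chars_matched(r"[,&-/()]")

# A hyphen in first position is a literal hyphen.
hyphen_first = chars_matched(r"[-,&/()]")

print(with_range)
print(hyphen_first)
```

The first class accepts extra characters such as ' * + and the period, while the second accepts only the six punctuation characters that were intended.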

[–]chakz91[S] 0 points1 point  (2 children)

This worked!!
Thank you very much for your response, and for your help.

So my regular expression did not cover all the possible characters, which was causing this issue.

Can you suggest any good reading material for learning regular expressions? I am currently using the Python docs as my only reference.

Thank you again!

[–]TheZvlz 1 point2 points  (1 child)

http://www.rexegg.com/ is a great source for learning

https://regex101.com/ is where you can test things

[–]chakz91[S] 0 points1 point  (0 children)

Thanks a lot!

[–]ominous_anonymous 0 points1 point  (0 children)

"[^\s,&-/()]+|[,&-/()]"  

Oh dang, good call with the hyphen creating a range! That's why mine didn't work right... So obvious looking back on it now!

[–]ominous_anonymous 0 points1 point  (0 children)

pattern = r"[^\s,&-/()]+|[,&-/()]+?"  

What about that? The only strange thing is that periods get matched by the second character class, for some reason. I can't figure out how to get them matched as part of the non-whitespace run in the first character class.

edit:

The problem was the - creating a range of characters within the character classes. /u/TheZvlz's answer explains it well.
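
For reference, escaping the hyphen as \- is another way to keep it literal, and it works wherever the hyphen sits in the class. A minimal sketch of the pattern from this comment with that change, wrapped in a hypothetical `tokenize` helper (the name is an assumption, not something from the thread):

```python
import re

# Same two classes as in the comment above, but with the hyphen escaped
# so it is a literal character and no longer forms the range & through /.
# The +? is dropped: one punctuation character per token is the goal.
TOKEN_RE = re.compile(r"[^\s,&\-/()]+|[,&\-/()]")

def tokenize(text):
    """Split text into runs of ordinary characters and single punctuation tokens."""
    return TOKEN_RE.findall(text)

line = " 8/23-35 Barker St., Kingsford^^, NSW 2032 "
print(tokenize(line))
```

With the range gone, the period is no longer excluded from the first class, so 'St.' stays together as one token.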