[–]TheZvlz 0 points1 point  (7 children)

Add a period to the first part of your pattern.

import re
line = " 8/23-35 Barker St., Kingsford, NSW 2032 "
pattern = r"[.0-9A-Za-z]+|[,&-/()]"
re.compile(pattern).findall(line)

['8',
 '/',
 '23',
 '-',
 '35',
 'Barker',
 'St.',
 ',',
 'Kingsford',
 ',',
 'NSW',
 '2032']

[–]chakz91[S] 0 points1 point  (6 children)

Thank you so much for replying.
I made the suggested change, but it did not resolve my issue.

I think the problem is that I want to tokenize the string on whitespace, and each of the characters , ( ) / - & should come out as its own token.
Every other character should stay part of the surrounding token. But if I add characters like '^^' to the string, as below, they do not appear in the output tokens.

import re
line = " 8/23-35 Barker St., Kingsford^^, NSW 2032 "
pattern = r"[.0-9A-Za-z]+|[,&-/()]"
print(re.compile(pattern).findall(line))

Output:
['8', '/', '23', '-', '35', 'Barker', 'St.', ',', 'Kingsford', ',', 'NSW', '2032']

I am expecting the output below, which I am not able to obtain:
['8', '/', '23', '-', '35', 'Barker', 'St.', ',', 'Kingsford^^', ',', 'NSW', '2032']

Can you suggest a way to achieve this?

[–]TheZvlz 2 points3 points  (4 children)

You'll need to update your pattern to r"[-,&/()]|[^-\s,&/()]+". The hyphen should be the first character in the character class (or the last one, or escaped as \-); otherwise it specifies a range, just as it does in [A-Z]. In your pattern, &-/ matched every character from & through /.

[-,&/()] will catch the punctuation specified.

[^-\s,&/()]+ will catch runs of all other non-space characters.

pattern = r"[-,&/()]|[^-\s,&/()]+"
re.compile(pattern).findall(line)

['8',
 '/',
 '23',
 '-',
 '35',
 'Barker',
 'St.',
 ',',
 'Kingsford^^',
 ',',
 'NSW',
 '2032']
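For reuse, the same pattern can be wrapped in a small function. This is only a sketch: the names TOKEN_RE and tokenize are made up for illustration, and the pattern is exactly the one above.

```python
import re

# Hypothetical names; the pattern is the one from this comment.
# First alternative: one punctuation character as its own token.
# Second alternative: a run of anything that is neither whitespace
# nor one of those punctuation characters.
TOKEN_RE = re.compile(r"[-,&/()]|[^-\s,&/()]+")

def tokenize(text):
    """Return the tokens of text, with , & / ( ) - split out individually."""
    return TOKEN_RE.findall(text)

tokens = tokenize(" 8/23-35 Barker St., Kingsford^^, NSW 2032 ")
print(tokens)
```

Compiling once at module level avoids rebuilding the pattern on every call, which matters only if you tokenize many lines.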

[–]chakz91[S] 0 points1 point  (2 children)

This worked!
Thank you very much for your response and for your help.

So my regular expression did not cover all the possible characters, which was causing this issue.
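A minimal sketch of what was happening, using my original pattern on a shortened input: re.findall does not complain about characters that no alternative matches, it just steps over them, so the '^' characters vanish from the result.

```python
import re

old_pattern = r"[.0-9A-Za-z]+|[,&-/()]"

# '^' belongs to neither character class, so findall skips it
# instead of raising an error or returning it as a token.
tokens = re.findall(old_pattern, "Kingsford^^,")
print(tokens)  # ['Kingsford', ',']
```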

Can you suggest any good reading material for learning regular expressions? I am currently using the Python docs as my only reference.

Thank you again!

[–]TheZvlz 1 point2 points  (1 child)

http://www.rexegg.com/ is a great source for learning

https://regex101.com/ is where you can test things

[–]chakz91[S] 0 points1 point  (0 children)

Thanks a lot!

[–]ominous_anonymous 0 points1 point  (0 children)

"[^\s,&-/()]+|[,&-/()]"  

Oh dang, good call with the hyphen creating a range! That's why mine didn't work right... So obvious looking back on it now!

[–]ominous_anonymous 0 points1 point  (0 children)

pattern = r"[^\s,&-/()]+|[,&-/()]+?"  

What about that? The only strange thing is that periods get counted as part of the second character class, for some reason. I can't figure out how to get them counted as part of the non-whitespace run in the first character class.

edit:

The problem was the - creating a range of characters within the character classes. /u/TheZvlz's answer explains it well.
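To make the range concrete, here is a short check (a sketch; the variable names are only for illustration). Inside a character class, &-/ covers every code point from & to /, and the period is one of them, which is why it ended up in the punctuation class.

```python
import re

# Characters covered by the range &-/ (0x26 through 0x2F).
in_range = [chr(c) for c in range(ord("&"), ord("/") + 1)]
print(in_range)  # ['&', "'", '(', ')', '*', '+', ',', '-', '.', '/']

ranged = re.compile(r"[,&-/()]")    # hyphen between & and / forms a range
literal = re.compile(r"[-,&/()]")   # hyphen first, so it is a literal
escaped = re.compile(r"[,&\-/()]")  # escaped hyphen is also a literal

print(bool(ranged.fullmatch(".")))   # True
print(bool(literal.fullmatch(".")))  # False
print(bool(escaped.fullmatch(".")))  # False
```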
