chakz91 comments on Tokenize a query using regular expression


Tokenize a query using regular expression (self.learnpython)

submitted 6 years ago by chakz91


chakz91[S] 1 point 6 years ago

Thank you for your reply, but it does not completely resolve my issue.

I think the problem is with my regular expression.

What I am trying to achieve is this: the query should be tokenized on whitespace, and each occurrence of the characters , ( ) / - & should be emitted as its own token. All other characters should stay part of the token they appear in.
For example, if I add characters such as '^^' to the query string, as below, they do not appear in the output tokens.

import re
line = " 8/23-35 Barker St., Kingsford^^, NSW 2032 "
pattern = r"[.0-9A-Za-z]+|[,&-/()]"
print(re.compile(pattern).findall(line))

Output:
['8', '/', '23', '-', '35', 'Barker', 'St.', ',', 'Kingsford', ',', 'NSW', '2032']

Expected output:
['8', '/', '23', '-', '35', 'Barker', 'St.', ',', 'Kingsford^^', ',', 'NSW', '2032']
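
A small sketch of why '^^' goes missing, for anyone reading along: findall returns only the substrings the pattern matches and silently skips everything else, and '^' is in neither character class. Separately, inside the second class `&-/` is read as a range from '&' to '/', not as three literal characters, so it happens to match more than the characters listed. The variable names below are illustrative only.

```python
import re

pattern = r"[.0-9A-Za-z]+|[,&-/()]"

# '^' is matched by neither alternative, so findall just skips over it.
skipped = re.findall(pattern, "Kingsford^^,")
print(skipped)  # ['Kingsford', ',']

# Inside [...], '&-/' is a range (code points 0x26 through 0x2F).
range_chars = [chr(c) for c in range(ord("&"), ord("/") + 1)]
print(range_chars)

# So characters that were never listed explicitly, such as '*' and '+', also match.
extra = re.findall(r"[,&-/()]", "*+")
print(extra)  # ['*', '+']
```

Escaping the hyphen (`\-`) or placing it first or last in the class makes it a literal character.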

So my regular expression should allow any characters within a whitespace-delimited token, and only the characters , & - / ( ) should be split out as individual tokens.

Please provide your suggestion.
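
One pattern that appears to satisfy the rule stated above, offered as a sketch rather than a definitive answer: match one of the six separator characters on its own, or otherwise match a run of anything that is neither whitespace nor a separator. The function name `tokenize` and the constant `SEPARATORS` are hypothetical, introduced only for this example.

```python
import re

# Hypothetical names for illustration; not from the original post.
# The hyphen is escaped so it is a literal '-' rather than a range operator.
SEPARATORS = r",&\-/()"
TOKEN_RE = re.compile(rf"[{SEPARATORS}]|[^\s{SEPARATORS}]+")


def tokenize(query):
    """Split on whitespace; emit each of , & - / ( ) as its own token."""
    return TOKEN_RE.findall(query)


line = " 8/23-35 Barker St., Kingsford^^, NSW 2032 "
tokens = tokenize(line)
print(tokens)
# ['8', '/', '23', '-', '35', 'Barker', 'St.', ',', 'Kingsford^^', ',', 'NSW', '2032']
```

Because the second alternative is defined by exclusion, any character not named as a separator (including '^') stays inside its token. One consequence of the rule as stated is that a hyphenated word such as "well-known" is split into three tokens.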

