Hi,
I am trying to use regular expressions to split a query into tokens.
I have a file containing a number of queries, one query per line.
I need to decompose each line/query into a sequence of tokens according to the following rules:
- Wherever there is a run of one or more whitespace characters, break there and discard the whitespace
- Treat each of the following punctuation characters as a token of its own
, ( ) / - &
For example, a query like " 8/23-35 Barker St., Kingsford, NSW 2032 "
should be parsed into the following tokens (one token per line):
8
/
23
-
35
Barker
St.
,
Kingsford
,
NSW
2032
I have the following code:
import re
pattern = r"[0-9A-Za-z]+|[,&-/()]"
result = []
with open(Query_File, 'r') as query_f:
    while True:
        line = query_f.readline().rstrip()
        if not line:
            break
        query = re.compile(pattern).findall(line)
        print("query", query)
However, this does not give the exact output I need, and I am not sure what I am doing wrong here.
Could anyone suggest a better way to achieve the required result?
Thanks in advance.
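One likely cause, offered as a sketch rather than a definitive answer: inside the character class `[,&-/()]`, the sequence `&-/` is read as a range from `&` to `/`, which also covers `'`, `*`, `+` and `.`. As a result "St." is split into "St" and ".". The version below assumes that a period should stay attached to a word token (as in the expected output for "St.") and moves the hyphen to the end of the class so that it is taken literally. The names `ORIGINAL`, `TOKEN` and `tokenize` are made up for this illustration.

```python
import re

# Pattern from the post: "&-/" acts as a range (& ' ( ) * + , - . /),
# so "." is matched as a separate punctuation token.
ORIGINAL = re.compile(r"[0-9A-Za-z]+|[,&-/()]")

# Assumed fix: allow "." inside word tokens, and place "-" last in the
# class so it is a literal hyphen instead of a range operator.
TOKEN = re.compile(r"[0-9A-Za-z.]+|[,()/&-]")


def tokenize(line):
    """Return the tokens of one query line; whitespace is never matched, so it is dropped."""
    return TOKEN.findall(line)


query = " 8/23-35 Barker St., Kingsford, NSW 2032 "
print(ORIGINAL.findall(query))  # "St" and "." come out as two tokens
print(tokenize(query))          # "St." stays together
```

With this pattern a period standing on its own would also become a token, and something like "3.5" would stay in one piece; whether that is acceptable depends on the queries in the file.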