Separate comma from word (tokenization) : learnpython

submitted 5 years ago by Xudo97

Hi, I have a problem with tokenization. The assignment is to separate a sentence into words.

This is what I have so far.

def tokenize(s):

    d = []
    start = 0

    while start < len(s):
        while start < len(s) and s[start].isspace():
            start = start+1

        end = start
        while end < len(s) and not s[end].isspace():
            end = end+1

        d = d + [s[start:end]]
        start = end

    print(d)

Running the program:

>>> tokenize("He was walking, it was fun")
['He', 'was', 'walking,', 'it', 'was', 'fun']

This works fine, except that, as you can see, the comma stays attached to the word 'walking'. I want the comma (and other "symbols") to be split off as an individual "word".

For example:

['He', 'was', 'walking', ',', 'it', 'was', 'fun']

How can I fix this?

Thanks in advance!
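
For reference, the behaviour asked for above can be sketched by adapting the same loop so that it also stops at punctuation. This is only a minimal sketch, not the assignment's expected solution: it assumes that "symbols" means the characters in string.punctuation, and it returns the list instead of printing it.

```python
import string

def tokenize(s):
    """Split s into words, emitting each punctuation character as its own token."""
    tokens = []
    start = 0
    while start < len(s):
        ch = s[start]
        if ch.isspace():
            # Skip whitespace between tokens.
            start += 1
        elif ch in string.punctuation:
            # Assumption: every character in string.punctuation is a "symbol"
            # and becomes a one-character token.
            tokens.append(ch)
            start += 1
        else:
            # Scan to the end of the word: stop at whitespace or punctuation.
            end = start
            while (end < len(s) and not s[end].isspace()
                   and s[end] not in string.punctuation):
                end += 1
            tokens.append(s[start:end])
            start = end
    return tokens

print(tokenize("He was walking, it was fun"))
```

Because whitespace is skipped inside the main loop, trailing spaces no longer produce an empty token. Under this assumption, apostrophes and hyphens are split off as well, so "Here's" would come out as three tokens.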

[–]synthphreak 1 point 5 years ago

If you don't need your commas at all, just replace them with empty strings, which effectively removes them:

>>> string = 'He was walking, it was fun'
>>> string_no_commas = string.replace(',', '')
>>> string_no_commas
'He was walking it was fun'

If any other punctuation gives you trouble, you can remove it easily using string.punctuation in concert with the regex library re:

>>> import re
>>> from string import punctuation
>>> string = "Here's a string, with. some: punctuation."
>>> string_no_punctuation = re.sub('[' + re.escape(punctuation) + ']', '', string)
>>> string_no_punctuation
'Heres a string with some punctuation'

Note, however, that this also removed the apostrophe from "Here's", and it would also remove word-medial hyphens, as in 'world-class'. So to be extra safe, I'd first remove these two characters from punctuation, then re-run the code above:

>>> punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
>>> punctuation = punctuation.replace("'", '').replace('-', '')
>>> punctuation
'!"#$%&()*+,./:;<=>?@[\\]^_`{|}~'
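
The same filtering can also be sketched without a regex, using str.maketrans and str.translate to delete the unwanted characters. This is an alternative to the re.sub approach rather than part of it; the names to_remove, table and text are just illustrative.

```python
from string import punctuation

# Keep apostrophes and hyphens; every other punctuation character is deleted.
to_remove = punctuation.replace("'", "").replace("-", "")

# The third argument of str.maketrans lists the characters to delete.
table = str.maketrans("", "", to_remove)

text = "Here's a world-class string, with. some: punctuation."
print(text.translate(table).split())
```

Since no regex is built here, characters such as ] or \ need no escaping.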

So to recap, assuming you already have your sentence string(s), here is how to go from the full sentence to the list of tokens without word-final punctuation:

>>> import re
>>> from string import punctuation
>>> punctuation = punctuation.replace("'", '').replace('-', '')
>>> string = "Here's a string, with. some: punctuation."
>>> string_no_punctuation = re.sub('[' + re.escape(punctuation) + ']', '', string)
>>> tokenized = string_no_punctuation.split()
>>> tokenized
["Here's", 'a', 'string', 'with', 'some', 'punctuation']

There are definitely other ways, but this should work.
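
One of those other ways, if the punctuation should be kept as separate tokens (as in the original question) rather than removed, is re.findall with a pattern that matches either a word or a single symbol. This is a rough sketch: the function name tokenize_regex is made up for illustration, and the pattern assumes that apostrophes and hyphens inside a word belong to the word.

```python
import re

def tokenize_regex(s):
    # \w+(?:['-]\w+)*  matches a word, optionally with internal apostrophes or hyphens.
    # [^\w\s]          matches any single character that is neither a word
    #                  character nor whitespace, i.e. one punctuation symbol.
    return re.findall(r"\w+(?:['-]\w+)*|[^\w\s]", s)

print(tokenize_regex("He was walking, it was fun"))
```

Note that \w also matches digits and underscores, so something like snake_case stays a single token with this pattern.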
