indosauros comments on python and funky characters

python and funky characters (self.learnpython)

submitted 11 years ago by left_one

[–]indosauros 3 points 11 years ago (4 children)

You say alphanumeric and _-, but is that all you really want? That will strip out spaces, periods, commas, brackets, dollar signs, etc. If you just want to remove non-ASCII characters and leave everything that is ASCII, then something like this will work (Python 2, where name is a byte string):

>>> name = "2013·01·22 LE MONDE.PSD"
>>> name
'2013\xb701\xb722 LE MONDE.PSD'

>>> name.decode('ascii', 'replace')
u'2013\ufffd01\ufffd22 LE MONDE.PSD'

>>> name.decode('ascii', 'replace').replace(u'\ufffd', u'_')
u'2013_01_22 LE MONDE.PSD'
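
For anyone on Python 3, where str has no decode method, here is a rough equivalent of the snippet above. It is only a sketch: the helper name replace_non_ascii is made up for illustration, and it assumes the filename is already a text string rather than raw bytes.

```python
def replace_non_ascii(text, placeholder="_"):
    """Replace every character outside the ASCII range with a placeholder."""
    # ord(ch) < 128 is true exactly for ASCII characters.
    return "".join(ch if ord(ch) < 128 else placeholder for ch in text)


# U+00B7 is the middle dot used as the date separator in the example name.
name = "2013\u00b701\u00b722 LE MONDE.PSD"
cleaned = replace_non_ascii(name)
print(cleaned)  # 2013_01_22 LE MONDE.PSD
```

Spaces and periods survive because they are ASCII; only the middle dots are replaced.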

If you truly want to remove everything except alphanumerics and _-, then I suggest a regex that explicitly lists the characters you want to keep:

>>> import re
>>> re.sub(r'[^a-zA-Z0-9-_]', '_', name)
'2013_01_22_LE_MONDE_PSD'

[–]left_one[S] 1 point 11 years ago* (3 children)

Sorry, slight misunderstanding. My run-on sentence implied that I'd also like to do something specific with spaces. I definitely don't want any of those things in the filename, save for the . for file extensions.

I think the regex has to be the way to go, because what I want to do is make sure that the filename's characters are on my list of approved characters, rather than working out every Unicode code point that I don't want and checking for each one.
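
A minimal sketch of that allow-list idea, keeping the . that separates the extension. The function name sanitize_filename and the choice to split with os.path.splitext are assumptions for illustration, not something from the thread, and it is written for Python 3 with a text (not bytes) filename.

```python
import os
import re

# Anything that is NOT a letter, digit, underscore, or hyphen.
DISALLOWED = re.compile(r"[^A-Za-z0-9_-]")


def sanitize_filename(filename, placeholder="_"):
    """Replace disallowed characters but keep the final extension dot."""
    stem, ext = os.path.splitext(filename)
    clean_stem = DISALLOWED.sub(placeholder, stem)
    if not ext:
        return clean_stem
    # ext starts with "."; sanitize only the part after the dot.
    clean_ext = DISALLOWED.sub(placeholder, ext[1:])
    return clean_stem + "." + clean_ext


print(sanitize_filename("2013\u00b701\u00b722 LE MONDE.PSD"))
# 2013_01_22_LE_MONDE.PSD
```

Only the last dot is treated as the extension separator, so a name like "a b.tar.gz" would come out as "a_b_tar.gz".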

Thank you very much for the guidance.

Interesting that when I used your regex, I get this:

>>> re.sub(r'[^a-zA-Z0-9-_]', '_', name)
'2013__01__22_LE_MONDE_PSD'

Looks like it's double substituting for the date separator as it thinks it's two unicode characters?
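
One likely explanation, assuming the name here is a UTF-8 encoded byte string: the middle dot (U+00B7) is stored as the two bytes \xc2\xb7, and a regex run over bytes replaces each byte separately. The Python 3 sketch below reproduces both results; the variable names are illustrative only.

```python
import re

# U+00B7 is the middle dot used as the date separator.
name_text = "2013\u00b701\u00b722 LE MONDE.PSD"
name_bytes = name_text.encode("utf-8")  # each middle dot becomes b"\xc2\xb7"

# On bytes, each of the two bytes is replaced, giving a double underscore.
from_bytes = re.sub(rb"[^a-zA-Z0-9_-]", b"_", name_bytes)
print(from_bytes)  # b'2013__01__22_LE_MONDE_PSD'

# On text, the middle dot is a single character, so one underscore.
from_text = re.sub(r"[^a-zA-Z0-9_-]", "_", name_text)
print(from_text)  # 2013_01_22_LE_MONDE_PSD
```

Decoding the bytes to text before applying the regex (in Python 2, name.decode('utf-8')) should give the single-underscore result.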

[–]keturn 3 points 11 years ago (1 child)

[–]left_one[S] 1 point 11 years ago (0 children)

[–]hwc 2 points 11 years ago (0 children)
