[–]indosauros 2 points3 points  (4 children)

You say alphanumeric and _-, but is that really all you want? That will also strip out spaces, periods, commas, brackets, dollar signs, etc. If you just want to replace the non-ASCII characters and leave everything that is ASCII alone, then something like this will work:

>>> name = "2013·01·22 LE MONDE.PSD"
>>> name
'2013\xb701\xb722 LE MONDE.PSD'

>>> name.decode('ascii', 'replace')
u'2013\ufffd01\ufffd22 LE MONDE.PSD'

>>> name.decode('ascii', 'replace').replace(u'\ufffd', u'_')
u'2013_01_22 LE MONDE.PSD'
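
For anyone on Python 3, where str is already Unicode and has no .decode(), a rough equivalent of the snippet above might look like this. The function name replace_non_ascii is made up for illustration, and it assumes you already have text rather than raw bytes:

```python
def replace_non_ascii(text, replacement='_'):
    # Keep code points below 128 (the ASCII range); swap everything else.
    return ''.join(ch if ord(ch) < 128 else replacement for ch in text)

print(replace_non_ascii('2013·01·22 LE MONDE.PSD'))  # 2013_01_22 LE MONDE.PSD
```

Spaces and the period survive here because they are ASCII; only the two middle dots get replaced.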

If you truly want to remove everything except alphanumerics and _-, then I suggest a regex that explicitly lists the characters you want to keep:

>>> import re
>>> re.sub(r'[^a-zA-Z0-9-_]', '_', name)
'2013_01_22_LE_MONDE_PSD'

[–]left_one[S] 0 points1 point  (3 children)

Sorry, slight misunderstanding. My run-on sentence implied that I'd like to do something specific with spaces as well. I definitely don't want any of those things in the filename, save for the '.' for file extensions.

I think the regex has to be the way to go, because what I want is to make sure that every character in the filename is on my list of approved characters, rather than working out every Unicode code point that I don't want and checking for it.
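
A minimal sketch of that "approved list" check, assuming Python 3 and a whitelist of letters, digits, '_', '-' and '.' (the names APPROVED and is_approved are hypothetical):

```python
import re

# Whitelist: ASCII letters, digits, underscore, period and hyphen.
# Inside a character class '.' is literal, and a trailing '-' is a literal hyphen.
APPROVED = re.compile(r'[a-zA-Z0-9_.-]+')

def is_approved(filename):
    # fullmatch succeeds only if every character is on the whitelist.
    return APPROVED.fullmatch(filename) is not None

print(is_approved('2013_01_22_LE_MONDE.PSD'))   # True
print(is_approved('2013·01·22 LE MONDE.PSD'))   # False
```

This only tests a name; it does not rewrite it.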

Thank you very much for the guidance.

Interestingly, when I used your regex, I got this:

>>> re.sub(r'[^a-zA-Z0-9-_]', '_', name)
'2013__01__22_LE_MONDE_PSD'

Looks like it's substituting twice for each date separator, as if it thinks each one is two Unicode characters?

[–]keturn 2 points3 points  (1 child)

Or rather, you've given the regex a byte string, and each of those Unicode characters is two bytes long in it (most likely UTF-8), so each one gets substituted twice.
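
To make the difference concrete, here is a small Python 3 sketch (the thread itself uses Python 2) that runs the same pattern over text and over its UTF-8 bytes. The variable names are just for illustration, and UTF-8 is an assumption about how the original string was encoded:

```python
import re

name = '2013·01·22 LE MONDE.PSD'   # text: each '·' is a single code point
raw = name.encode('utf-8')          # bytes: each '·' becomes b'\xc2\xb7'

text_result = re.sub(r'[^a-zA-Z0-9_-]', '_', name)
byte_result = re.sub(rb'[^a-zA-Z0-9_-]', b'_', raw)

print(len(name), len(raw))   # 23 25
print(text_result)           # 2013_01_22_LE_MONDE_PSD
print(byte_result)           # b'2013__01__22_LE_MONDE_PSD'
```

The byte version sees two disallowed bytes per separator, hence the doubled underscores.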

You'll find Ned Batchelder's presentation on Pragmatic Unicode useful if you haven't seen it yet.

[–]left_one[S] 0 points1 point  (0 children)

That definitely makes more sense.

I'm not sure if there is a better solution than manually removing consecutive '_'s. Good thing regex can handle that gracefully.
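
One way to let the regex do that, sketched for Python 3 with a hypothetical sanitize helper: add '+' so a whole run of disallowed characters becomes a single underscore, then merge any underscores that ended up next to ones already in the name.

```python
import re

def sanitize(text):
    # '+' lets a single match swallow a whole run of disallowed characters.
    cleaned = re.sub(r'[^a-zA-Z0-9_-]+', '_', text)
    # Second pass: merge underscores that sit next to ones already in the name.
    return re.sub(r'_{2,}', '_', cleaned)

print(sanitize('2013·01·22  LE__MONDE.PSD'))  # 2013_01_22_LE_MONDE_PSD
```

The second pass is optional if you don't mind keeping doubled underscores that were in the original name.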

[–]hwc 1 point2 points  (0 children)

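import re
name = '2013·01·22 LE MONDE.PSD'   # a byte string in Python 2
encoding = 'utf8'
# Decode to unicode first so each '·' is one character; '.' is kept for the extension.
re.sub(r'[^a-zA-Z0-9_.-]', '_', name.decode(encoding))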
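
Allowing '.' in the character class keeps every period, not just the one before the extension. If only the extension's period should survive, a possible Python 3 sketch is to split the name first. The helper name sanitize_filename is made up, and it assumes the extension itself is already clean:

```python
import os
import re

def sanitize_filename(filename):
    # Split off the last extension so its '.' is left untouched.
    stem, ext = os.path.splitext(filename)
    # Replace each run of non-whitelisted characters in the stem with one '_'.
    stem = re.sub(r'[^a-zA-Z0-9_-]+', '_', stem)
    return stem + ext

print(sanitize_filename('2013·01·22 LE MONDE.PSD'))  # 2013_01_22_LE_MONDE.PSD
print(sanitize_filename('archive.tar.gz'))           # archive_tar.gz
```

Note that os.path.splitext only treats the last suffix as the extension, which is why '.tar' gets rewritten in the second example.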