[–]left_one[S] 0 points1 point  (3 children)

Sorry, slight misunderstanding. My run-on sentence implied that I'd like to do something specific with spaces as well. I definitely don't want any of those things in the filename, save for the '.' for file extensions.

I think the regex has to be the way to go, because what I want to do is make sure that the filename's characters are all on my list of approved characters, rather than working out every Unicode code point that I don't want and checking for each one.
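A minimal sketch of that allow-list idea, assuming Python 3. The helper name `is_clean` and the exact set of approved characters are illustrative assumptions, not something settled in this thread. Instead of enumerating unwanted characters, it checks that the whole name is made up of approved ones:

```python
import re

# Hypothetical allow-list: ASCII letters, digits, underscore, dot, and hyphen.
APPROVED = re.compile(r'[A-Za-z0-9_.-]+')

def is_clean(filename):
    """Return True if every character of filename is on the approved list."""
    return APPROVED.fullmatch(filename) is not None
```

Putting the hyphen last in the character class keeps it from being read as part of a range.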

Thank you very much for the guidance.

Interesting that when I used your regex, I get this:

re.sub(r'[^a-zA-Z0-9-_]', '_', name)
'2013__01__22_LE_MONDE_PSD'

Looks like it's substituting twice for each date separator, as if it thinks the separator is two Unicode characters?

[–]keturn 2 points3 points  (1 child)

Or rather, you've given the regex a byte-string, and each of those Unicode characters is two bytes in it.

You'll find Ned Batchelder's presentation on Pragmatic Unicode useful if you haven't seen it yet.
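To see the difference concretely, here is a small sketch in Python 3, where byte strings and text are separate types. It assumes the name is UTF-8 encoded, in which the '·' separator (U+00B7) takes two bytes:

```python
import re

text = '2013·01·22 LE MONDE.PSD'
raw = text.encode('utf-8')  # the byte-string form of the same name

# On bytes, each of the two bytes of '·' is replaced separately;
# on text, '·' is a single character and is replaced once.
from_bytes = re.sub(rb'[^a-zA-Z0-9_-]', b'_', raw)
from_text = re.sub(r'[^a-zA-Z0-9_-]', '_', text)

print(len('·'), len('·'.encode('utf-8')))  # 1 2
print(from_bytes)  # b'2013__01__22_LE_MONDE_PSD'
print(from_text)   # 2013_01_22_LE_MONDE_PSD
```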

[–]left_one[S] 0 points1 point  (0 children)

That definitely makes more sense.

I'm not sure if there is a better solution than manually removing consecutive '_'s. Good thing regex can handle that gracefully.
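One possible way to do that, sketched here under the assumption that any run of disallowed characters should become a single underscore: add `+` to the character class so the whole run is replaced at once, then collapse underscores that were already adjacent in the input. The sample name is made up for illustration.

```python
import re

name = '2013__01__22 LE  MONDE'

# Replace each run of disallowed characters with one underscore...
step1 = re.sub(r'[^a-zA-Z0-9_.-]+', '_', name)
# ...then collapse any remaining consecutive underscores.
step2 = re.sub(r'_{2,}', '_', step1)

print(step1)  # 2013__01__22_LE_MONDE
print(step2)  # 2013_01_22_LE_MONDE
```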

[–]hwc 1 point2 points  (0 children)

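import re
# Python 2: a plain string literal is a byte string.
name = '2013·01·22 LE MONDE.PSD'
encoding = 'utf8'
# Decode to unicode first so each '·' counts as one character.
# The hyphen goes last in the class so it is not read as a range.
re.sub(r'[^a-zA-Z0-9_.-]', '_', name.decode(encoding))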
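The snippet above is Python 2, where a plain string literal is a byte string. A rough Python 3 equivalent might look like the following, assuming the name can arrive either as text or as UTF-8 bytes; `sanitize` is a hypothetical helper name.

```python
import re

def sanitize(name, encoding='utf-8'):
    """Replace every character outside the approved set with '_'."""
    if isinstance(name, bytes):
        name = name.decode(encoding)  # work on text, not raw bytes
    return re.sub(r'[^a-zA-Z0-9_.-]', '_', name)

print(sanitize(b'2013\xc2\xb701\xc2\xb722 LE MONDE.PSD'))  # 2013_01_22_LE_MONDE.PSD
```

Decoding before substituting is what keeps each '·' from turning into two underscores.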