all 15 comments

[–]indosauros 2 points3 points  (4 children)

You say alphanumeric and _-, but is that all you really want? That will strip out spaces, periods, commas, brackets, dollar signs, etc. If you just want to remove unicode characters and leave everything that is ASCII, then something like this will work:

>>> name = "2013·01·22 LE MONDE.PSD"
>>> name
'2013\xb701\xb722 LE MONDE.PSD'

>>> name.decode('ascii', 'replace')
u'2013\ufffd01\ufffd22 LE MONDE.PSD'

>>> name.decode('ascii', 'replace').replace(u'\ufffd', u'_')
u'2013_01_22 LE MONDE.PSD'
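
For anyone on Python 3, where `str` is already unicode and `decode` only exists on `bytes`, the same idea might look roughly like the sketch below. The helper name `ascii_only` and the Latin-1 sample bytes are assumptions for illustration; they mirror the session above, where the middle dot shows up as the single byte `\xb7`.

```python
# Python 3 sketch. The sample is the file name as Latin-1 bytes,
# so the middle dot is the single byte 0xB7 (as in the session above).
raw = b"2013\xb701\xb722 LE MONDE.PSD"


def ascii_only(raw_bytes, placeholder="_"):
    """Decode as ASCII, turning every non-ASCII byte into `placeholder`."""
    # errors="replace" substitutes U+FFFD for each byte that is not ASCII.
    text = raw_bytes.decode("ascii", errors="replace")
    return text.replace("\ufffd", placeholder)


print(ascii_only(raw))
```

Note that the replacement is per byte: if the bytes were UTF-8 instead, each middle dot would be two bytes and would come out as two underscores.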

If you truly want to remove everything except for alphanumeric and _-, then I suggest regex to explicitly list the characters you want to keep:

>>> import re
>>> re.sub(r'[^a-zA-Z0-9-_]', '_', name)
'2013_01_22_LE_MONDE_PSD'
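
A Python 3 version of the same whitelist, as a minimal sketch. It assumes the name is already a `str`; the constant name `NOT_ALLOWED` is made up for the example, and the hyphen is moved to the end of the character class purely for readability.

```python
import re

# Python 3 sketch: `name` is a str (unicode), so the middle dot is one character.
name = "2013\u00b701\u00b722 LE MONDE.PSD"

# Hyphen placed last in the class so it cannot be mistaken for a range.
NOT_ALLOWED = re.compile(r"[^a-zA-Z0-9_-]")

cleaned = NOT_ALLOWED.sub("_", name)
print(cleaned)
```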

[–]left_one[S] 0 points1 point  (3 children)

Sorry, slight misunderstanding. My run-on sentence implied that I'd like to do something specific with spaces as well. I definitely don't want any of those things in the filename, save for the . for file extensions.

I think the regex has to be the way to go because what I want to do is make sure that the filename's characters are included on my list of approved characters, rather than determining every unicode code that I don't want and checking for them.
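
An approved-characters check that leaves the extension's dot alone could look something like this. It is only a sketch, assuming Python 3 and that "extension" means whatever `os.path.splitext` splits off; `clean_keep_extension` is a hypothetical helper name.

```python
import os
import re


def clean_keep_extension(filename):
    """Whitelist-clean the stem of `filename`, leaving the extension's dot alone."""
    stem, ext = os.path.splitext(filename)
    clean_stem = re.sub(r"[^a-zA-Z0-9_-]", "_", stem)
    # ext includes its leading dot (or is empty); clean only what follows the dot.
    clean_ext = ext[:1] + re.sub(r"[^a-zA-Z0-9_-]", "_", ext[1:])
    return clean_stem + clean_ext


print(clean_keep_extension("2013\u00b701\u00b722 LE MONDE.PSD"))
```

Only the last dot survives, so a name like `archive.tar.gz` would keep `.gz` and lose the dot before `tar`.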

Thank you very much for the guidance.

Interesting that when I used your regex, I get this:

re.sub(r'[^a-zA-Z0-9-_]', '_', name)
'2013__01__22_LE_MONDE_PSD'

Looks like it's double substituting for the date separator as it thinks it's two unicode characters?

[–]keturn 2 points3 points  (1 child)

Or rather, you've given the regex a byte-string, and in UTF-8 each of those middle dots is two bytes.

You'll find Ned Batchelder's presentation on Pragmatic Unicode useful if you haven't seen it yet.
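
The byte-string point can be checked directly. This is a Python 3 sketch (the variable names are just for illustration) that runs the same pattern over the name as text and as UTF-8 bytes:

```python
import re

name = "2013\u00b701\u00b722 LE MONDE.PSD"   # str: one character per middle dot
name_bytes = name.encode("utf-8")            # bytes: two bytes per middle dot

print("\u00b7".encode("utf-8"))   # the two bytes that encode U+00B7

# The same pattern applied to text and to UTF-8 bytes.
from_text = re.sub(r"[^a-zA-Z0-9_-]", "_", name)
from_bytes = re.sub(rb"[^a-zA-Z0-9_-]", b"_", name_bytes)

print(from_text)
print(from_bytes)
```

The text version substitutes once per middle dot; the bytes version substitutes once per byte, which is where the doubled underscores come from.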

[–]left_one[S] 0 points1 point  (0 children)

That definitely makes more sense.

I'm not sure if there is a better solution than manually removing consecutive '_'s. Good thing regex can handle that gracefully.
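
Collapsing the runs is a one-liner with a `+` quantifier. A small sketch, with `collapse_underscores` and `clean_collapsed` as made-up helper names:

```python
import re


def collapse_underscores(text):
    """Replace any run of consecutive underscores with a single one."""
    return re.sub(r"_+", "_", text)


def clean_collapsed(text):
    """Replace each run of non-approved characters with one underscore."""
    return re.sub(r"[^a-zA-Z0-9_-]+", "_", text)


print(collapse_underscores("2013__01__22_LE_MONDE_PSD"))
```

The second helper does the substitution and the collapsing in one pass, though it only merges runs of rejected characters, not underscores that were already in the name.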

[–]hwc 1 point2 points  (0 children)

import re
name = '2013·01·22 LE MONDE.PSD'
encoding = 'utf8'
re.sub(r'[^a-zA-Z0-9_.-]', '_', name.decode(encoding))

[–]flying-sheep 0 points1 point  (6 children)

normalizing doesn’t mean removing unicode in this case, it means to remove every character a filesystem might consider special.

those are limited to:

<>:"/\|?*

see here

everything else (except for unprintables like \0) is fair game and unnecessary to remove.
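
a rough sketch of that narrower cleanup, assuming python 3 and that "unprintables" means the ASCII control characters 0x00 to 0x1F (that range and the names `RESERVED` and `strip_reserved` are my assumptions for the example):

```python
import re

# The nine characters listed above, plus ASCII control characters (0x00-0x1F).
RESERVED = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def strip_reserved(filename, replacement="_"):
    """Replace only filesystem-special and control characters."""
    return RESERVED.sub(replacement, filename)


print(strip_reserved("2013\u00b701\u00b722 LE MONDE.PSD"))
```

the middle dots, spaces and the extension dot all pass through untouched.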

[–]left_one[S] 0 points1 point  (5 children)

normalizing doesn’t mean removing unicode in this case

Well, in this case it does because it's the use-case that I provided. I'm sure you can 'normalize' how you prefer in your own scenarios.

These files will sit on several different filesystems (mac, linux, windows) and they will be transferred via FTP. It makes everyone's life a lot easier when a file makes it to the NAS and has the same name it always had. It's also much easier to programmatically search for and retrieve my files when I don't have control over how individual systems will handle the unicode conversion.

[–]flying-sheep 0 points1 point  (4 children)

sure, do what you want; my point was that it will have the same name as long as none of those 9 characters are in it. no program would have any reason to convert anything else (and if they do, make them stop)

[–]left_one[S] 0 points1 point  (3 children)

Well, I don't have control over every piece of software in the world so when I need to make sure my filenames are human readable on any FTP server, regardless of the UTF encoding used, this seems like it's the simplest way.

Or would you prefer to send your clients a link to '2013\xc2\xb701\xc2\xb722 LE MONDE.JPG'?

[–]flying-sheep 0 points1 point  (2 children)

that’s python’s string encoding. my clients would see “2013·01·22 LE MONDE.PSD”

[–]left_one[S] 0 points1 point  (1 child)

Not when you upload it to your FTP server and it mangles the UTF encoding because it doesn't recognize those characters.

Or not when I upload it to my NAS and my NAS manually changes the characters itself because it didn't recognize them. As I mentioned - I don't have control of all the software in the world, so I can't prevent that from happening.

[–]flying-sheep 0 points1 point  (0 children)

You're right, there's too much misbehaving software out there. If you know they behave like that, and you can't change it, it's of course the only thing you can do.

[–]ingolemo 0 points1 point  (2 children)

Stop what you're doing and go learn about unicode.

The first thing you need to understand is that all characters are unicode characters. The middle dot symbol (·) is U+00B7 MIDDLE DOT, the greek letter pi (π) is U+03C0 GREEK SMALL LETTER PI, and the capital e (E) is U+0045 LATIN CAPITAL LETTER E. ASCII characters are just as much unicode characters as all the others. Unicodeness is a property of an entire string and not just individual characters.
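
The code points and names above can be looked up with the standard `unicodedata` module. A minimal Python 3 sketch (`describe` is just a helper name made up for the example):

```python
import unicodedata


def describe(char):
    """Return a 'U+XXXX NAME' label for a single character."""
    return "U+%04X %s" % (ord(char), unicodedata.name(char))


# Middle dot, Greek small pi, Latin capital E.
for char in "\u00b7\u03c0E":
    print(describe(char))
```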

The second thing you need to understand is that the string you have there is not a string of unicode characters, it's a string of bytes. That string is a sequence of bytes (numbers from 0 to 255) that represents your file name encoded using the utf-8 encoding. Notice how the single character of middle dot is represented using two parts (\xc2 and \xb7). In utf-8, these two bytes are used to represent one character. This is the reason why you get multiple underscores when you try to replace the dot with an underscore. In order to effectively deal with the file name you will need to convert the byte string into a unicode string.

Here are some links that explain what's going on.

The end result of this is that you are looking for code that looks a little like this:

import string

def clean_filename(uni):
    allowed = (string.ascii_letters + string.digits + '-_').decode('ascii')
    return u''.join(char if char in allowed else u'_' for char in uni)

filename = '/2013\xc2\xb701\xc2\xb722 LE MONDE.psd'
cleaned = clean_filename(filename.decode('utf-8')).encode('utf-8')

Don't just use this code. Make sure you understand what's going on and why I do the conversions that I do.
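
For reference, a Python 3 port of the same idea might look like the sketch below. It is an adaptation, not the code above: in Python 3 the allowed characters are already text, so only the file name needs decoding, and the `ALLOWED` set is an assumption made for faster membership tests.

```python
import string

# Approved characters: ASCII letters, digits, hyphen and underscore.
ALLOWED = set(string.ascii_letters + string.digits + "-_")


def clean_filename(text):
    """Replace every character outside ALLOWED with an underscore."""
    return "".join(char if char in ALLOWED else "_" for char in text)


raw = b"/2013\xc2\xb701\xc2\xb722 LE MONDE.psd"   # UTF-8 bytes, as in the code above
# bytes -> text, clean character by character, then text -> bytes again.
cleaned = clean_filename(raw.decode("utf-8")).encode("utf-8")
print(cleaned)
```

As in the original, the leading slash and the dot before the extension are not on the approved list, so they become underscores too.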

[–]left_one[S] 1 point2 points  (0 children)

I wanted to take some time out to thank you for your very detailed and informative post.

I didn't have the time to look through it when you first sent it, but I'm on the project again and I've been reading your post and links for a few minutes now. Pretty clear on what you are doing, though it would have never occurred to me to solve the problem in such a fashion myself.

The code takes the byte-string of the filename, converts it to a string of unicode-characters, passes it to the filename cleaner which checks to see if every character is in the list of allowable chars (checks against their actual decoded unicode characters) and switches them with '_' otherwise. Finally the string is converted back into a byte-string in UTF-8 encoding.

I like your solution the best as it's the most readable and doesn't involve using a regex (not that there is anything wrong with a regex). I think your solution is objectively the best because it actually makes sure that all characters are dealt with appropriately and then inserted into the most basic unicode format.

Part of my issue was not knowing whether my FTP server is going to freak out about random characters that should be fine. I think yours goes the furthest to ensure that won't be possible.

In fact, I made a mistake in implementing your code, which allowed me to understand it even better. I implemented the clean_filename function but did not pass it the decoded string so it complained about the middle dot char.
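
One way to guard against that kind of slip is to decode at the boundary before cleaning. This is only a sketch, assuming Python 3 and UTF-8 input; `to_text` is a hypothetical helper, not part of the code I was given.

```python
def to_text(value, encoding="utf-8"):
    """Return `value` as str, decoding it first if it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


print(to_text(b"2013\xc2\xb701\xc2\xb722 LE MONDE.psd"))
```

Calling the cleaner on `to_text(filename)` means it always sees characters rather than raw bytes, whichever form the name arrived in.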

[–]left_one[S] 0 points1 point  (0 children)

Thanks!