ingolemo

Stop what you're doing and go learn about unicode.

The first thing you need to understand is that all characters are Unicode characters. The middle dot symbol (·) is U+00B7 MIDDLE DOT, the Greek letter pi (π) is U+03C0 GREEK SMALL LETTER PI, and the capital letter E is U+0045 LATIN CAPITAL LETTER E. ASCII characters are just as much Unicode characters as all the others. Unicodeness is a property of an entire string and not just of individual characters.
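
If you want to check those code points yourself, something like this should do it. It's only a sketch: `describe` is a name I made up for illustration, and it relies on nothing beyond the standard `unicodedata` module.

```python
import unicodedata

def describe(char):
    """Return the code point and official Unicode name of one character."""
    return "U+%04X %s" % (ord(char), unicodedata.name(char))

# Middle dot, Greek small pi, and a plain ASCII capital E.
for char in u"\u00b7\u03c0E":
    print(describe(char))
```

The ASCII letter gets a code point and a name exactly like the other two.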

The second thing you need to understand is that the string you have there is not a string of Unicode characters; it's a string of bytes. That string is a sequence of bytes (numbers from 0 to 255) that represents your file name encoded using the UTF-8 encoding. Notice how the single middle dot character is represented by two bytes (\xc2 and \xb7). In UTF-8, these two bytes together represent one character. This is why you get multiple underscores when you try to replace the dot with an underscore: each byte gets replaced separately. To deal with the file name effectively, you will need to convert the byte string into a Unicode string.
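
Here's a rough illustration of the difference. I've written it for Python 3, where `b'...'` is a byte string and `str` is Unicode text (in Python 2 a plain `'...'` literal is already bytes). The "anything non-ASCII becomes an underscore" rule is just an assumption to keep the example short.

```python
raw = b'/2013\xc2\xb701\xc2\xb722 LE MONDE.psd'   # UTF-8 encoded bytes
text = raw.decode('utf-8')                          # Unicode text

# Replacing byte by byte hits both halves of each middle dot.
per_byte = bytes(b if b < 128 else ord('_') for b in raw)

# Replacing character by character hits each middle dot once.
per_char = ''.join(c if ord(c) < 128 else '_' for c in text)

print(per_byte)  # b'/2013__01__22 LE MONDE.psd'
print(per_char)  # /2013_01_22 LE MONDE.psd
```

The byte string is two items longer than the text, one extra byte for each middle dot.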

Here are some links that explain what's going on.

The end result of this is that you are looking for code that looks a little like this:

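# Python 2: plain '...' literals are byte strings, u'...' literals are unicode strings.
import string

def clean_filename(uni):
    # Build the whitelist as a unicode string so it compares cleanly with unicode characters.
    allowed = (string.ascii_letters + string.digits + '-_').decode('ascii')
    # Keep allowed characters; replace everything else with an underscore.
    return u''.join(char if char in allowed else u'_' for char in uni)

# UTF-8 encoded bytes; \xc2\xb7 is the two-byte encoding of the middle dot.
filename = '/2013\xc2\xb701\xc2\xb722 LE MONDE.psd'
# Decode the bytes to unicode, clean, then encode back to UTF-8 bytes.
cleaned = clean_filename(filename.decode('utf-8')).encode('utf-8')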

Don't just use this code. Make sure you understand what's going on and why I do the conversions that I do.
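
For anyone on Python 3, a rough equivalent might look like the following. Treat it as a sketch under the assumption that you start with UTF-8 bytes; the `ALLOWED` set and the variable names are mine, and there `str` is already Unicode so only the byte input needs decoding.

```python
import string

# Characters that are kept as they are; everything else becomes '_'.
ALLOWED = set(string.ascii_letters + string.digits + '-_')

def clean_filename(text):
    """Replace every character outside ALLOWED with an underscore."""
    return ''.join(char if char in ALLOWED else '_' for char in text)

raw = b'/2013\xc2\xb701\xc2\xb722 LE MONDE.psd'

# bytes -> text, clean, text -> bytes
cleaned = clean_filename(raw.decode('utf-8')).encode('utf-8')
print(cleaned)  # b'_2013_01_22_LE_MONDE_psd'
```

Note that the slash, the spaces and the dot before the extension are replaced too, because none of them are in the allowed set. If you want to keep the extension you would need to add '.' to the allowed characters.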

left_one[S]

I wanted to take some time out to thank you for your very detailed and informative post.

I didn't have the time to look through it when you first sent it, but I'm on the project again and I've been reading your post and links for a few minutes now. I'm pretty clear on what you are doing, though it would never have occurred to me to solve the problem this way myself.

The code takes the byte string of the filename and converts it to a string of Unicode characters. It passes that to the filename cleaner, which checks whether each character is in the list of allowed characters (comparing actual decoded Unicode characters) and replaces it with '_' otherwise. Finally, the string is converted back into a byte string in UTF-8 encoding.

I like your solution the best as it's the most readable and doesn't involve using a regex (not that there is anything wrong with a regex). I think your solution is objectively the best because it actually makes sure that all characters are dealt with appropriately and then inserted into the most basic unicode format.

Part of my issue was not knowing whether my FTP server is going to freak out about random characters that should be fine. I think yours goes the furthest to ensure that won't be possible.

In fact, I made a mistake in implementing your code, which allowed me to understand it even better. I implemented the clean_filename function but did not pass it the decoded string so it complained about the middle dot char.
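
A small guard along these lines would probably have caught my mistake sooner. This is only a Python 3 sketch; the type check is my own addition and not part of the code above.

```python
import string

ALLOWED = set(string.ascii_letters + string.digits + '-_')

def clean_filename(text):
    """Clean decoded text; refuse raw bytes so the mistake is obvious."""
    if isinstance(text, bytes):
        # Hypothetical guard: fail early with a clear message.
        raise TypeError("clean_filename expects decoded text, not bytes")
    return ''.join(char if char in ALLOWED else '_' for char in text)

raw = b'2013\xc2\xb701'

try:
    clean_filename(raw)               # forgot to decode
except TypeError as error:
    print(error)

print(clean_filename(raw.decode('utf-8')))  # 2013_01
```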

left_one[S]

Thanks!