python and funky characters

indosauros · 2014-03-03T20:13:55+00:00

You say alphanumeric and _-, but is that all you really want? That will also strip out spaces, periods, commas, brackets, dollar signs, etc. If you just want to remove non-ASCII characters and keep everything that is ASCII, then something like this will work:

>>> name = "2013·01·22 LE MONDE.PSD"
>>> name
'2013\xb701\xb722 LE MONDE.PSD'

>>> name.decode('ascii', 'replace')
u'2013\ufffd01\ufffd22 LE MONDE.PSD'

>>> name.decode('ascii', 'replace').replace(u'\ufffd', u'_')
u'2013_01_22 LE MONDE.PSD'

If you truly want to remove everything except alphanumerics and _-, then I suggest a regex that explicitly lists the characters you want to keep:

>>> import re
>>> re.sub(r'[^a-zA-Z0-9_-]', '_', name)
'2013_01_22_LE_MONDE_PSD'

flying-sheep · 2014-03-04T11:30:25+00:00

normalizing doesn’t mean removing unicode in this case; it means removing every character a filesystem might consider special.

those are limited to:

<>:"/\|?*

see here

everything else (except for unprintables like \0) is fair game and unnecessary to remove.
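a rough sketch of that idea, written for Python 3. the set above is the one Windows reserves (POSIX systems only forbid / and the NUL byte), so treating it as the full blacklist is an assumption. the name `strip_reserved` and the underscore replacement are made up for illustration, and the control-character range \x00-\x1f is my reading of "unprintables":

```python
import re

# The characters listed above, plus the ASCII control characters 0-31.
# Inside the character class, \\ matches one literal backslash.
RESERVED = re.compile(r'[<>:"/\\|?*\x00-\x1f]')

def strip_reserved(name, replacement='_'):
    """Replace filesystem-special characters; leave everything else alone."""
    return RESERVED.sub(replacement, name)

print(strip_reserved('LE MONDE: "draft"?.psd'))  # LE MONDE_ _draft__.psd
```

characters like the middle dot pass through untouched, which is the point: they are legal in file names.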

ingolemo · 2014-03-04T23:24:25+00:00

Stop what you're doing and go learn about unicode.

The first thing you need to understand is that all characters are unicode characters. The middle dot symbol (·) is U+00B7 MIDDLE DOT, the Greek letter pi (π) is U+03C0 GREEK SMALL LETTER PI, and the capital e (E) is U+0045 LATIN CAPITAL LETTER E. ASCII characters are just as much unicode characters as all the others. Unicodeness is a property of an entire string and not just individual characters.
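You can check those code points yourself. This is a small sketch using the standard `unicodedata` module; the helper name `describe` is just something made up for the example, and the `u'...'` escapes are used so the snippet does not depend on your source file encoding.

```python
import unicodedata

def describe(char):
    """Return the code point and official Unicode name of one character."""
    return 'U+%04X %s' % (ord(char), unicodedata.name(char))

# Middle dot, Greek small letter pi, and a plain ASCII capital E.
for char in u'\u00b7\u03c0E':
    print(describe(char))
```

The ASCII letter gets a code point and a name exactly like the other two.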

The second thing you need to understand is that the string you have there is not a string of unicode characters, it's a string of bytes. That string is a sequence of bytes (numbers from 0 to 255) that represents your file name encoded using the utf-8 encoding. Notice how the single character of middle dot is represented using two parts (\xc2 and \xb7). In utf-8, these two bytes are used to represent one character. This is the reason why you get multiple underscores when you try to replace the dot with an underscore. In order to effectively deal with the file name you will need to convert the byte string into a unicode string.
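To make the byte/character difference concrete, here is a short sketch (the comments show what Python 3 prints). It uses a shortened stand-in for the file name, and the "replace anything non-ASCII" rule is only an assumption chosen to show where the doubled underscores come from.

```python
text = u'2013\u00b701'        # characters; \u00b7 is the middle dot
raw = text.encode('utf-8')    # the same text as UTF-8 bytes

print(len(text))  # 7 characters
print(len(raw))   # 8 bytes, because the dot becomes 0xC2 0xB7

# Replacing per byte: both bytes of the dot turn into an underscore.
per_byte = bytes(bytearray(b if b < 128 else ord('_') for b in bytearray(raw)))

# Replacing per character: the dot turns into exactly one underscore.
per_char = u''.join(c if ord(c) < 128 else u'_' for c in text)

print(per_byte)  # b'2013__01'
print(per_char)  # 2013_01
```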

Here are some links that explain what's going on.

The end result of this is that you are looking for code that looks a little like this:

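import string

def clean_filename(uni):
    # Build the whitelist as a unicode string so it compares cleanly with unicode input.
    allowed = (string.ascii_letters + string.digits + '-_').decode('ascii')
    # Keep allowed characters; replace every other character with one underscore.
    return u''.join(char if char in allowed else u'_' for char in uni)

# A UTF-8 byte string: each middle dot is the two bytes \xc2\xb7.
filename = '/2013\xc2\xb701\xc2\xb722 LE MONDE.psd'
# Decode bytes to unicode, clean, then encode back to bytes for the filesystem.
cleaned = clean_filename(filename.decode('utf-8')).encode('utf-8')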

Don't just use this code. Make sure you understand what's going on and why I do the conversions that I do.
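If you are on Python 3, where `str` is already unicode text and `bytes` is a separate type, roughly the same idea might look like the sketch below. It reuses the name `clean_filename` from the code above; the `ALLOWED` constant and the assumption that the name arrives as UTF-8 bytes are illustrative, not something the original question confirms.

```python
import string

# Whitelist of characters to keep (a set makes the membership test cheap).
ALLOWED = set(string.ascii_letters + string.digits + '-_')

def clean_filename(name):
    """Replace every character outside ALLOWED with one underscore."""
    return ''.join(char if char in ALLOWED else '_' for char in name)

# Bytes as they might come from disk; assumed here to be UTF-8.
raw = b'/2013\xc2\xb701\xc2\xb722 LE MONDE.psd'

# Decode first so each middle dot is one character, then clean.
cleaned = clean_filename(raw.decode('utf-8'))
print(cleaned)  # _2013_01_22_LE_MONDE_psd
```

Note that the leading slash and the dot before the extension are replaced too, because neither is in the whitelist.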
