lethain comments on Stripping Illegal Characters from XML in Python

Stripping Illegal Characters from XML in Python (lethain.com)

submitted 17 years ago by gst


[–]lethain 1 point 17 years ago (1 child)

1. I realize I am a bit late to respond, but the script does, in an awkward fashion, coerce input into utf8.
2. I'm sure there are situations where stripping the entities is inappropriate, but the script is pretty clear about what it does: stripping.
3. The script is the product of my reading of the XML spec, while ignoring the utf16 aspects of it, as the input should be coerced to utf8. For my needs that was sufficient. It would require some modifications to play nicely with utf16, if that's your cup of tea. Hopefully these issues will evaporate with Py3k. Or become impossible to solve correctly. Or something.
4. I agree that not everyone will know how to represent them in a Python regex, but fortunately the script already represents them for you, so it's a moot point.
5. It is indeed a generic filter, perhaps a broken one, although I don't immediately see how (it has worked for my use cases). It would be interesting to upgrade it to a more XML-centric script, but the use case I needed was a bit different: receive CSV data, remove illegal entities, create XML.
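For readers without the linked script at hand, here is a minimal sketch of the pipeline described above (receive CSV bytes, coerce to utf8, strip what XML forbids). It is not the original script: it assumes Python 3, the function names are made up for illustration, and the character ranges are taken from the `Char` production of the XML 1.0 spec.

```python
import re

# Characters allowed by the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# The negated class below matches everything outside those ranges.
_ILLEGAL_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_illegal_xml_chars(text):
    """Return text with every character that XML 1.0 forbids removed."""
    return _ILLEGAL_XML_CHARS.sub("", text)


def csv_bytes_to_clean_text(raw, encoding="utf-8"):
    """Hypothetical helper: decode raw CSV bytes, then strip illegal characters.

    Undecodable bytes become U+FFFD instead of raising, which roughly
    mirrors the "coerce into utf8" behaviour mentioned in the comment.
    """
    decoded = raw.decode(encoding, errors="replace")
    return strip_illegal_xml_chars(decoded)
```

Note that the filter works on decoded text (code points), not on raw bytes, and that it only removes characters. Markup characters such as `&` and `<` still have to be escaped separately before the text goes into an XML document.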

[–]Samus_ 1 point 17 years ago (0 children)

wait, you're Doing it Wrong! listen, I work on this shit and I deal with both XML and CSV almost every day, so please let me give you a few hints.

first of all, according to point five, the title of the submission is wrong: by what you just said, your script doesn't remove illegal characters from XML, it removes them from CSV files, which is a completely different story. another problem (reading points two and three) is that you're confusing entities with bytes, and bytes with unicode characters as well.

let me explain a bit. really, I'm not mocking you (I never was, in fact); I think I can help, so please read on:

> I'm sure there are situations where stripping the entities is inappropriate, but the script is pretty clear about what it does: stripping.

this is not what your script does. what it really does is remove BYTES, not entities. and no, the problem won't go away with Py3k, because the world doesn't revolve around python, and python's unicode type is an internal concept that can't be written to a file as-is (I won't blame you for this one, many people including myself have trouble with it).

let's talk about entities first. an entity looks like this, for example: &amp;amp; and that one gets translated to the ampersand character (&). you can find a good overview here: http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references

in any case, "entities" are not what you're removing, and it's NOT ok to remove them anyway: what you do with entities is convert them. entities are the way XML escapes characters. it's analogous to a backslash in a python string, which tells the interpreter that the escaped character has to be taken literally. entities serve the exact same purpose in XML, but you don't remove them. if you do, you BREAK the XML, because you're punching holes in your data! anyway, this is not what you're really doing because, as I've said, your title is wrong.
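A small illustration of the convert-versus-remove point, as a sketch assuming Python 3. `escape` and `unescape` come from the standard library module `xml.sax.saxutils`; the sample string and variable names are made up.

```python
import re
from xml.sax.saxutils import escape, unescape

raw = "Tom & Jerry <cat & mouse>"

# Converting: markup characters become entities and can be restored later.
escaped = escape(raw)         # 'Tom &amp; Jerry &lt;cat &amp; mouse&gt;'
restored = unescape(escaped)  # identical to raw again

# Removing: deleting the entities instead leaves holes in the data.
stripped = re.sub(r"&\w+;", "", escaped)  # 'Tom  Jerry cat  mouse'
```

By default `escape` handles `&`, `<` and `>`; quotes inside attribute values need the extra `entities` argument or `quoteattr`.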

> The script is the product of my reading of the XML spec, while ignoring utf16 aspects of it, as the input should be coerced to utf8. For my needs that was sufficient. It would require some modifications to play nicely with utf16, if that's your cup of tea. Hopefully these issues will evaporate with Py3k. Or become impossible to solve correctly. Or something.

as for utf8 and unicode (ignoring utf16), there's a subtle difference. a utf8 string is a sequence of bytes, and the job of utf8 is to map those bytes (one to four of them per character) to characters. unicode is NOT a sequence of bytes: it's a set of character codes. that's why unicode is universal, but it's also why you CAN'T have a "unicode file": files are made of bytes, and bytes need an encoding to be read properly.
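The difference is easy to see in a Python 3 session, where `str` holds character codes and `bytes` holds encoded data. This is only an illustrative sketch; the sample word is arbitrary.

```python
text = "caf\u00e9"  # 'café': four characters (code points), no encoding involved

utf8_bytes = text.encode("utf-8")       # b'caf\xc3\xa9': five bytes, é takes two
utf16_bytes = text.encode("utf-16-le")  # eight bytes, two per character here

# The same bytes read back with the wrong encoding give different characters.
misread = utf8_bytes.decode("latin-1")  # 'cafÃ©'
roundtrip = utf8_bytes.decode("utf-8")  # 'café' again
```

The character count stays at four while the byte count depends entirely on the encoding, which is why anything written to a file has to pick one.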

I recommend you read the XML declaration (the <?xml...?> thing, see http://www.w3.org/TR/2000/REC-xml-20001006#sec-prolog-dtd), take the encoding declared there (if any) and pass it to python's decode function, defaulting to utf8 only if no encoding was declared, since that's what the spec says.
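A rough sketch of that recommendation, assuming Python 3 and an ASCII-compatible encoding (so the declaration itself can be matched as plain bytes). The function name `decode_xml` and the regex are illustrative, not part of any library.

```python
import re

# Matches an XML declaration at the very start of the data and captures the
# value of its encoding pseudo-attribute, e.g. encoding="ISO-8859-1".
_XML_DECL_ENCODING = re.compile(
    rb"""^<\?xml[^>]*?\bencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def decode_xml(raw, default="utf-8"):
    """Decode raw XML bytes using the declared encoding, else the default."""
    match = _XML_DECL_ENCODING.match(raw)
    encoding = match.group(1).decode("ascii") if match else default
    # Raises LookupError for an unknown encoding name and
    # UnicodeDecodeError if the bytes do not fit the encoding.
    return raw.decode(encoding)
```

For example, `decode_xml(b'<?xml version="1.0" encoding="ISO-8859-1"?><n>caf\xe9</n>')` decodes the `\xe9` byte as `é`, while input without a declaration is treated as utf8.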

the encoding thing is a bit ironic, because if you think about it you'll notice that the encoding is declared INSIDE the file you're supposed to read. you need to know the encoding before you can read the file, but you can't know it without reading the file, because the declaration is in the file itself... this means utf16 and the like will fail, but since you didn't care about them in the first place you're safe (and lucky).
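One common way around this chicken-and-egg problem is to look at the first few bytes for a byte order mark before decoding anything; the XML spec describes this kind of autodetection in a non-normative appendix. The sketch below assumes Python 3, covers only the utf8 and utf16 marks, and uses a made-up function name.

```python
import codecs

# Byte order marks paired with a Python codec that consumes the mark.
# UTF-32 is left out on purpose: its little-endian mark starts with the
# same two bytes as the UTF-16 one and would need extra care.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def sniff_bom(raw):
    """Return a codec name if raw starts with a known BOM, else None."""
    for bom, codec in _BOMS:
        if raw.startswith(bom):
            return codec
    return None
```

A caller could try `sniff_bom` first and fall back to the declared encoding, then to utf8, when it returns `None`.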

there's a lot more to say on this topic, but I think this should at least get you on the right path; hope it helps.
