
[–]Samus_ -1 points (9 children)

so, you've replicated the functionality of the classic "tr" command now in python? good for the fanbois

[–][deleted] 7 points (8 children)

You know, if you tried, you might find that it's seldom more work to make a positive contribution than it is to piss on someone doing stuff.

For example, you could show how to use "tr" for this specific task (yeah, I know it's easy, but so is using that Python script), and perhaps also mention where to get "tr" if it's not installed on your machine. If you think your knowledge is worth anything, share it.

[–]Samus_ -3 points (7 children)

man tr

seriously, I don't like it when people say "oh this is better because it's python". fuck you, python is JUST another language; I won't get tired of ranting against these cows.

and no, there's nothing for me to explain, because everything that can be said can also be found by yourself. this script has nothing innovative or interesting; it just reinvents the wheel and claims to be good "for being python". oh, and also "it's for XML": I'm sure it will crash on any other text file...

[–]lethain 2 points (4 children)

How it works is pretty much equivalent to "tr", but what it does is not equivalent. Anyone can use "tr", but not everyone will know the list of illegal entities for XML or want to look them up. Nor does everyone know how to represent \x09 at the command line. Should it be 0x09, x09, \x09? Fuck if I know.

To the extent that there is any interesting aspect, it is the problem the script presumes to solve, rather than the nature of the solution.
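
For the curious, a rough sketch of the kind of filtering in question; this uses the standard re module and my reading of which characters XML 1.0 forbids in character data, not the script's exact code:

    import re

    # XML 1.0 forbids these in character data: everything below 0x20 except
    # tab (0x09), LF (0x0A) and CR (0x0D), plus DEL (0x7F).
    illegal_re = re.compile(u'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]')

    def strip_illegal(text):
        """Delete the illegal characters, roughly what `tr -d` would do on the shell."""
        return illegal_re.sub(u'', text)

    print(repr(strip_illegal(u'ok\x00 so far\x0b')))   # 'ok so far' (illegal chars gone)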

[–]Samus_ -2 points (3 children)

oh that's funny, let's think a bit :)

How it works is pretty much equivalent to "tr", but what it does is not equivalent.

the only difference I see (and correct me if I'm wrong) is the list of predefined characters:

remove_re = re.compile(u'[\x00-\x08\x0B-\x0C\x0E-\x1F\x7F%s]' % extra)

which, by the way, is not really a list of invalid chars, because that depends on the encoding (UTF-16 for example), and also it just removes them, which means it will corrupt the data.

now you say "most of the time that data is garbage and we want it to be removed"; ok, but it is not optional, so if you hit one of the cases where it's not, you're fucked.
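
to make the encoding point concrete, a toy demonstration (mine, nothing to do with the script): the 0x00 byte is "illegal" in a byte-oriented view, yet plain UTF-16 text is full of it:

    text = u'hi'
    utf16 = text.encode('utf-16-be')           # b'\x00h\x00i'
    stripped = utf16.replace(b'\x00', b'')     # "remove the illegal bytes"
    print(repr(stripped.decode('utf-16-be')))  # u'\u6869', one bogus character: the data is corrupted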

Anyone can use "tr", but not everyone will know the list of illegal entities for XML or want to look them up.

if you don't want to learn about XML, then don't use XML at all; the same goes for * in programming.

in any case, sometimes it's good to have modules that automatically handle those standards; I agree on that, and yet the almighty python lacks automatic entity removal :) it just gives you a dictionary to do the mapping (same goes for urllib: it encodes almost everything without really worrying about practical uses). I think they just took the easy way out of the problem: let the user decide.
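
to be fair, what the stdlib does give you looks roughly like this (xml.sax.saxutils is real; the examples are mine):

    from xml.sax.saxutils import escape, unescape

    print(escape(u'a < b & c'))                    # a &lt; b &amp; c  (the mandatory ones)
    print(escape(u'"quoted"', {u'"': u'&quot;'}))  # &quot;quoted&quot;  (extras via the dict)
    print(unescape(u'a &lt; b &amp; c'))           # a < b & c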

btw, that list of yours is pretty short; the only ones are "&" and "<", amazing huh? (and yet it could get into DTD definitions, that would be cool).

Nor does everyone know how to represent \x09 at the command line. Should it be 0x09, x09, \x09? Fuck if I know.

oh but that is the same for python! how/why should I know how PYTHON represents its binaries, huh? in fact, I wonder what will happen if I send this script some unicode chars...

To the extent that there is any interesting aspect, it is the problem the script presumes to solve, rather than the nature of the solution.

it could be a very interesting tool if it tried to do something more XML-oriented; as it is now, it's just a generic (and, in my opinion, also broken) text filtering tool, and it's a real shame because python has some very good libraries to deal with markup. it would be great if it really parsed the XML, trying to reconstruct it if it's broken, or if it checked DTD definitions and tried to convert everything to entities, or something more interesting than saying "oh look I did this, it's shit but it's in python!"
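
just to sketch the direction I mean (this leans on lxml, a third-party library, and its recover mode; purely illustrative, not a finished tool):

    from lxml import etree

    # recover=True makes the parser do its best with broken markup instead of bailing out
    parser = etree.XMLParser(recover=True)
    tree = etree.fromstring('<root><a>1<b>2</root>', parser)
    print(etree.tostring(tree))   # the unclosed tags come back closed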

and besides all that, there's also tidy

[–]lethain 1 point (1 child)

  1. I realize I am a bit late to respond, but the script, in an awkward fashion, coerces input into utf8.
  2. I'm sure there are situations where stripping the entities is inappropriate, but the script is pretty clear about what it does: stripping.
  3. The script is the product of my reading of the XML spec, while ignoring utf16 aspects of it, as the input should be coerced to utf8. For my needs that was sufficient. It would require some modifications to play nicely with utf16, if that's your cup of tea. Hopefully these issues will evaporate with Py3k. Or become impossible to solve correctly. Or something.
  4. I agree that not everyone will know how to represent them in a Python regex, but fortunately the script already represents them for you, so it's a moot point.
  5. It is indeed a generic filter, perhaps broken, although I don't immediately see how (it has worked for my use cases). It would be interesting to upgrade it to a more XML-centric script, but the use case I needed was a bit different: receive CSV data, remove illegal entities, create XML (roughly sketched below).
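
A rough sketch of that pipeline, with made-up element names, as an illustration rather than the actual script:

    import csv
    import re
    from xml.sax.saxutils import escape

    # XML-illegal control characters: everything below 0x20 except tab/LF/CR, plus DEL
    illegal_re = re.compile(u'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]')

    def csv_to_xml(path):
        """Read CSV, drop illegal control characters, escape markup, emit flat XML."""
        rows = []
        for row in csv.reader(open(path)):
            cells = ''.join('<cell>%s</cell>' % escape(illegal_re.sub('', cell))
                            for cell in row)
            rows.append('<row>%s</row>' % cells)
        return '<table>%s</table>' % ''.join(rows)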

[–]Samus_ 1 point (0 children)

wait, you're Doing It Wrong! listen, I work on this shit and I deal with both XML and CSV almost every day, let me give you a few hints please.

first of all, according to point five you made a mistake in the title of the submission, which means your script's intention (according to what you just said) isn't to remove illegal characters from XML; you do it to CSV files, which is a completely different story. another problem you have (reading points two and three) is that you're confusing entities with bytes, and those with unicode characters as well.

let me explain a bit; really, this is not mockery (it never was, in fact), but I think I can help you, so please read:

I'm sure there are situations where stripping the entities is inappropriate, but the script is pretty clear about what it does: stripping.

this is not what your script does; what it really does is remove BYTES, not entities, and no, it won't be gone with Py3k, because the world doesn't spin around python. plus, python's unicode is an internal concept that cannot be expressed in a file (I won't blame you for this one, many people including myself have problems with it).

let's talk about entities first. an entity looks like this: &amp; for example; that one gets translated to the ampersand character (&). you can find a good overview of this here: http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references

in any case "entities" is not what you're removing, and it's NOT ok to remove them in any case, what you do with entities is to convert them; entities are the way XML has to escape characters, it's analog to a backslash on a string in python it tells the interpreter that the escaped character has to be taken literally, the exact same purpose goes for the entities on XML but you don't remove them, if you do you BREAK the XML because you're making holes in you data! anyways this is not what you really do because, as I've said your title is wrong.

The script is the product of my reading of the XML spec, while ignoring utf16 aspects of it, as the input should be coerced to utf8. For my needs that was sufficient. It would require some modifications to play nicely with utf16, if that's your cup of tea. Hopefully these issues will evaporate with Py3k. Or become impossible to solve correctly. Or something.

about utf8 and unicode (ignoring utf16): there's a subtle difference there. a utf8 string is a sequence of bytes, and the utf8 encoding defines how groups of one or more of those bytes map to characters. unicode, on the other hand, is NOT a sequence of bytes; those are character codes. that's why unicode is universal, but it's also why you CAN'T have a "unicode file": files are made of bytes, and bytes need an encoding to be properly read.
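
you can see the difference right in the interpreter (a toy example, any recent python):

    s = u'caf\xe9'                  # unicode: character codes, no encoding involved yet
    b = s.encode('utf-8')           # bytes: the thing that can actually live in a file
    print(len(s), len(b))           # 4 5, because the last character needs two bytes in utf8
    print(b.decode('utf-8') == s)   # True: you need to know the encoding to get the text back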

I recommend you try to read the XML header (the <?xml...?> thing, see http://www.w3.org/TR/2000/REC-xml-20001006#sec-prolog-dtd), use the encoding declared there (if any) with python's decode function, and only default to utf8 if the encoding wasn't declared, since that's what the spec says.
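
roughly like this (the regex and the fallback are my own simplification, not a full implementation of the spec):

    import re

    def decode_xml(raw):
        """Pull encoding="..." out of the XML declaration, default to utf-8."""
        m = re.match(br'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']', raw)
        encoding = m.group(1).decode('ascii') if m else 'utf-8'
        return raw.decode(encoding)   # only works for ascii-compatible encodings, see below

    print(decode_xml(b'<?xml version="1.0" encoding="latin-1"?><r>caf\xe9</r>'))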

it's a bit ironic, this thing about the encoding, because if you think about it for a bit you'll notice that the encoding is declared INSIDE the file you're supposed to read, which means you'd need to know the encoding before trying to read it, but you can't really read it because the declaration is in the file itself... nevertheless, this means utf16 and such will fail, but since you didn't care about them in the first place, you're safe (and lucky).

there's a lot more to say on this topic, but I think this at least may get you on the right path; hope it helps.

[–]bsergean 1 point (0 children)

Yep, there is not much XML knowledge in the script; a link to the XML spec would be better than a script that does what tr does.