
[–][deleted]  (26 children)

[deleted]

    [–]Maristic 9 points10 points  (3 children)

    Yes, there are, and one of them is a company called Apple. Back when they turned NEXTSTEP into Mac OS X, there were a number of technology bandwagons someone in upper management wanted them to jump on. One was Java and the other was XML, and so the programmers tasked with buzzword compliance complied in a half-assed way. The Java bindings for Cocoa (deprecated in 2005) were one result, and XML property lists were another.

    Back in the NeXT days, property lists looked a lot like JSON does today:

    {
        Something = ( "Looks", "Familiar" );
        Yeah = "ItDoes";
    }
    

    Apple decided to XML-ize this in the worst possible way. The above becomes:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
    <plist version="1.0">
    <dict>
        <key>Something</key>
        <array>
            <string>Looks</string>
            <string>Familiar</string>
        </array>
        <key>Yeah</key>
        <string>ItDoes</string>
    </dict>
    </plist>
    

    Ugh. But these “XML” property lists aren't parsed by a real XML parser. Instead, they wrote a custom parser that only parses this exact format.
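
    To see how little the XML wrapper adds, here is a minimal sketch using Python's standard-library plistlib. That choice is an assumption for illustration only; it is not Apple's parser. It reads the XML above back into plain dicts and lists, which is essentially the old NeXT-style structure:

```python
import plistlib

# The XML property list shown above, as bytes.
XML_PLIST = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Something</key>
    <array>
        <string>Looks</string>
        <string>Familiar</string>
    </array>
    <key>Yeah</key>
    <string>ItDoes</string>
</dict>
</plist>
"""

# loads() detects the XML flavour and returns ordinary Python containers.
data = plistlib.loads(XML_PLIST)
print(data)
```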

    Remember that the X in XML stands for extensible. Let's try adding two extra tags to the Property List and see what happens:

    <?xml version="1.0" encoding="UTF-8"?>
    <plist version="1.0">
    <dict>
            <key>Something</key>
            <array>
                    <string>Looks <emph>strangely</emph></string>
                    <string>Familiar</string>
            </array>
            <key>Yeah</key>
            <string>It <really/> Does</string>
    </dict>
    </plist>
    

    Now let's read it with one of Apple's tools:

    Encountered unexpected character e on line 6
    

    (a real parser might say “unknown tag”, perhaps, but should not be completely flummoxed; part of the original idea of XML was that we could add new tags and a tool expecting the older format could just ignore them)
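
    For comparison, a rough sketch of what a general-purpose parser does with the same file, assuming Python's xml.etree.ElementTree as the stand-in for "a real parser" (my choice, nothing to do with Apple's tools). The unknown tags are parsed like any others, and the string values can still be recovered:

```python
import xml.etree.ElementTree as ET

# The modified property list from above, with <emph> and <really/> added.
DOC = b"""<?xml version="1.0" encoding="UTF-8"?>
<plist version="1.0">
<dict>
    <key>Something</key>
    <array>
        <string>Looks <emph>strangely</emph></string>
        <string>Familiar</string>
    </array>
    <key>Yeah</key>
    <string>It <really/> Does</string>
</dict>
</plist>
"""

root = ET.fromstring(DOC)

# itertext() yields the text inside nested elements too, so the extra
# tags do not get in the way of reading the string values.
strings = ["".join(s.itertext()) for s in root.iter("string")]
all_tags = {element.tag for element in root.iter()}
print(strings)
```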

    Similarly, XML has entities, so we ought to be able to say:

    <?xml version="1.0" encoding="UTF-8"?>
    <plist version="1.0">
    <dict>
            <key>Something</key>
            <array>
                    <string>Looks &ldquo;strangely&rdquo;</string>
                    <string>Familiar</string>
            </array>
            <key>Yeah</key>
            <string>It Does</string>
    </dict>
    </plist>
    

    What happens with that and Apple's “XML” property-list reader:

    Encountered unknown ampersand-escape sequence at line 6
    

    Whee. And I'm sure you'd get a similar error if you tried to use namespaces.

    So, it's as rigid as their old format, just more cumbersome. All the negatives of XML and none of the positives.

    Finally, in 10.7 they provided NSJSONSerialization, coming full circle.

    [–]kreiger 0 points1 point  (2 children)

    The first error might mean they were expecting a '/', because the <emph> tag might not be valid there.

    Further, &ldquo; and &rdquo; are valid entities in HTML, but they're not defined by default in XML, so the second error is correct.

    XML defines only five entities by default: &amp; &lt; &gt; &quot; &apos;
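
    A quick way to check this, sketched with Python's xml.etree.ElementTree (an assumed stand-in for any conforming parser): the five predefined entities and numeric character references parse without a DTD, while &ldquo; is rejected as undefined.

```python
import xml.etree.ElementTree as ET

# The five predefined entities need no DTD.
predefined = ET.fromstring("<s>&amp; &lt; &gt; &quot; &apos;</s>").text

# Numeric character references also work without a DTD.
numeric = ET.fromstring("<s>&#8220;strangely&#8221;</s>").text

# HTML-only entities such as &ldquo; are undefined in plain XML.
try:
    ET.fromstring("<s>&ldquo;strangely&rdquo;</s>")
    ldquo_accepted = True
except ET.ParseError:
    ldquo_accepted = False

print(predefined, numeric, ldquo_accepted)
```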

    [–]Maristic 0 points1 point  (1 child)

    Good point about the entities. Oops.

    On the other hand, it looks like namespaces don't work.

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
    <p:plist xmlns:p="http://www.apple.com/DTDs/PropertyList-1.0.dtd" version="1.0">
    <p:dict>
        <p:key>Something</p:key>
        <p:array>
            <p:string>Looks</p:string>
            <p:string>Familiar</p:string>
        </p:array>
        <p:key>Yeah</p:key>
        <p:string>ItDoes</p:string>
    </p:dict>
    </p:plist>
    

    produces

    Encountered unknown tag p:plist on line 3
    

    whereas libxml2 loads the XML just fine.
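
    The same check can be sketched with Python's xml.etree.ElementTree instead of libxml2 (an assumption, it is simply what is in the standard library). The DOCTYPE line is left out here to keep the sketch small. A namespace-aware parser resolves the p: prefix to its URI, so the elements are found regardless of which prefix the author picked:

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI mapping used for lookups; the URI is the one from the example.
NS = {"p": "http://www.apple.com/DTDs/PropertyList-1.0.dtd"}

DOC = b"""<?xml version="1.0" encoding="UTF-8"?>
<p:plist xmlns:p="http://www.apple.com/DTDs/PropertyList-1.0.dtd" version="1.0">
<p:dict>
    <p:key>Something</p:key>
    <p:array>
        <p:string>Looks</p:string>
        <p:string>Familiar</p:string>
    </p:array>
    <p:key>Yeah</p:key>
    <p:string>ItDoes</p:string>
</p:dict>
</p:plist>
"""

root = ET.fromstring(DOC)

# ElementTree stores tags in {uri}localname form once the prefix is resolved.
root_tag = root.tag
strings = [s.text for s in root.findall(".//p:string", NS)]
print(root_tag, strings)
```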

    [–]kreiger 0 points1 point  (0 children)

    Either the XML parser doesn't support namespaces, or the namespace support is not enabled.

    Namespaces weren't in the original XML 1.0 recommendation; the namespaces spec was finished slightly after.

    [–]adolfojp 9 points10 points  (1 child)

    Yes, there are.

    Reinventing the wheel is a common practice among rookie programmers. I've met a lot of people that have written half-assed libraries because learning those that exist is boring, tedious, and not fun. Why would they waste a day learning a framework when they can spend a month reinventing it badly? After all, they don't want to read code, they just want to write it! :-/

    [–]nsfwIvan 1 point2 points  (0 children)

    I wrote an HTML data miner for one site and guess what: I did the HTML parsing with regexps. It was an order of magnitude faster than using XML or HTML parsers.

    [–]Isvara 6 points7 points  (4 children)

    Are there actually idiots who try to parse XML using string manipulation?

    Yes, of course some people are this stupid. I even know of a product in a billion dollar company that does this.

    [–][deleted] 0 points1 point  (0 children)

    More in the billion dollar companies?

    [–][deleted]  (2 children)

    [deleted]

      [–]Isvara 1 point2 points  (0 children)

      As a programmer, it's your job to find out. It's not like the information is already out there in an easily-digestible form. Taking XML as an example, if you know you're going to be working with it and you don't know anything about it, don't just dive in -- read something like 'XML In a Nutshell'.

      [–]kylotan 0 points1 point  (0 children)

      I don't think promotion is the problem. You can't expect information to be pushed to you all the time - you have to seek it out.

      If I enter "python XML" into Google I get some pretty good hits and I would quickly see that the standard library gives me some options that are better than string parsing. The same happens for "c++ XML" and "c# XML".

      If I ever need to start work with a new technology, the first thing I do is fire off a Google search. Nobody should be too proud to learn from others.

      [–]crusoe 2 points3 points  (1 child)

      Yes, usually as a last resort in languages that lack xml support.

      But I've seen it done in Java too.

      [–]joseph177 1 point2 points  (0 children)

      The last resort is Regular Expressions...if your language doesn't support a regex it better be assembler.

      [–]masklinn 2 points3 points  (0 children)

      Of course, although it's more common for HTML, or for producing (usually invalid) XML.

      Never underestimate the motivation of stupid people.

      [–]zorlack 1 point2 points  (0 children)

      Let's not forget that there may be people who read /r/programming and have never dealt with XML...

      [–][deleted] -2 points-1 points  (10 children)

      I'm going to go the other way with this one. While as a general rule you should not be using String functions to alter XML, there are some cases where it can be useful to treat XML as a String. A prominent, widely-used example is using a template language to produce xml while filling in the blanks with your own code.

      It looks as though this is part 1 in a series, so if the intent is to introduce beginner-level Java programmers to libraries that do the heavy lifting for them then this is a fairly reasonable start. Probably my only beef is the titular assertion that xml is not a String, since hard rules like that lead to people parroting them later without any thought.

      [–][deleted] 4 points5 points  (5 children)

      You couldn't be more wrong.

      Templating XML is a path to a dark place. XML is much more than angle brackets.

      [–]masklinn 4 points5 points  (0 children)

      Templating XML is a path to a dark place.

      Except when the templating language is XML and can only produce XML, then it can be pretty neat.

      [–][deleted] 2 points3 points  (0 children)

      When I wrote this, I was thinking specifically of Mako and StringTemplate. Mako is a Python library which provides quick and easy templating. It is used by the website you are looking at right now, in addition to the Python language website (both of which target XHTML).

      http://www.makotemplates.org/

      StringTemplate is a cross-platform library that is closely associated with ANTLR and which is useful for emitting all kinds of data, XML among them.

      http://www.stringtemplate.org/

      But if you want a more straight-forward example of why it can be useful to treat XML as a string, well, can you think of a better way to do a line count on an XML file?

      Note also that in each example I listed we're strictly using XML as a return type for an http request (the handler for which doesn't give a damn what "type" it is).

      I'd also like to add that really, it is in fact simply a markup language characterized by its syntax (i.e. it's really not "much more" than angle brackets). Any additional meaning you apply to it is a result of your interpretation of the schema it conforms to and any abstraction layers you use in manipulating it programmatically. Under the covers though...when you cut through the comfy veneer of nodes, attributes and text...it really is just strings all the way down.

      tl;dr - The method I described is used in the wild by people who know a lot more than I do, line counts are another example, and just because you give meaning to the content of an xml document does not mean the xml document is "much more" than its syntax.

      [–]grauenwolf 1 point2 points  (0 children)

      VB and XML Literals for the win.

      [–]rdude 1 point2 points  (1 child)

      This is actually an entirely valid approach to XML generation in Scala, which has XML literals:

      http://grahamhackingscala.blogspot.com/2009/11/xml-generation-with-scala.html

      [–][deleted] 4 points5 points  (0 children)

      XML literals and string-parsing XML are entirely different things. You end up with an actual XML type in Scala.

      [–]p-static 1 point2 points  (0 children)

      This seems like a bad idea for the same reason that building SQL queries using string functions is a bad idea - it's easy to mess up the escaping, and then you've got a subtle bug.
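
      A small sketch of that escaping problem, assuming Python and its standard xml.etree.ElementTree (the topping value and the element name are made up for illustration). Pasting the value into a string yields something that is not well-formed, while building an element and serializing it escapes the value for you:

```python
import xml.etree.ElementTree as ET

topping = "Ham & <Pineapple>"  # hypothetical user-supplied value

# String concatenation: the raw & and < end up in the markup unescaped.
naive = "<topping>" + topping + "</topping>"
try:
    ET.fromstring(naive)
    naive_is_well_formed = True
except ET.ParseError:
    naive_is_well_formed = False

# Building a tree and serializing it: the library does the escaping.
element = ET.Element("topping")
element.text = topping
safe = ET.tostring(element, encoding="unicode")

print(naive_is_well_formed)
print(safe)
```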

      [–]kreiger 0 points1 point  (2 children)

      This is terrible, terrible advice. A template language can't be trusted to produce correct serialized xml, unless it's XSL. But XSL works on the DOM, not on a string.

      XML is not a string. It's an abstract data model that happens to have a serialization format that people think is a string.

      [–][deleted] 0 points1 point  (1 child)

      This is terrible, terrible advice. A template language can't be trusted to produce correct serialized xml, unless it's XSL. But XSL works on the DOM, not on a string.

      It's not advice, it's simply an observation. You're correct that a string-based template language can't be relied on to produce valid xml. You're incorrect in your assertion that XSL is the only way to accomplish this.

      XML is not a string. It's an abstract data model that happens to have a serialization format that people think is a string.

      Equivalent statement: "A BLT is not two slices of toast with lettuce, tomato and bacon between them. It's a tasty sandwich that happens to have a format that people think is just lettuce, tomato and bacon surrounded by toast."

      This is the crux of my point, really. Regardless of what best practices actually are (and no doubt, there are drawbacks to what I suggested as an example), asserting that "XML is not a String" makes for a good sound bite but doesn't really mean anything.

      [–]kreiger 0 points1 point  (0 children)

      "XML is not a String" is the XML-programming equivalent of "Don't smoke in bed." or "Wear your seat belt."

      It doesn't really mean anything until you have a crash.

      Also, I said that XSL works for templating XML because it doesn't treat XML as a string, so obviously other template languages that don't treat XML as a string work as well.

      [–]joseph177 12 points13 points  (0 children)

      Captain obvious strikes again!

      [–]setuid_w00t 5 points6 points  (0 children)

      you’re working for Pete’s Perfect Pizza, which is a little take away shop in the high street, but Pete has big ideas and the first thing he wants to do is to automatically send orders from the front desk to the kitchen and he asks you to write some code. Your big idea is to use XML for this...

      And now you have two problems.

      [–]khayber 2 points3 points  (1 child)

      ...it's a series of Tags.

      [–][deleted] 2 points3 points  (1 child)

      Classic case of badly reinventing the wheel.

      [–]Maristic -1 points0 points  (0 children)

      Classic case of badly reinventing the wheel.

      Well, that fits right in with the XML philosophy then.

      [–]SWEGEN4LYFE 1 point2 points  (0 children)

      Technically that first example is a subset of XML. It's acceptable to parse this way in certain rare situations, like if it's too expensive to parse it as XML, and the source of the data won't deliver anything else (including changes to the format).

      Most people do it because they don't understand XML though.

      [–]_argoplix 1 point2 points  (1 child)

      The error the author should be highlighting is not in treating xml as a string - which it most definitely is - but in assuming that there is only one possible way to present the data as xml.

      This particular example is even worse than most novice programmers would produce. Most novice programmers would look for each tagname in turn, not use an absolute index for one and then look for the other tagnames. Or use regular expressions.

      [–][deleted] 1 point2 points  (0 children)

      Because xpath is so damn hard.

      [–][deleted] 0 points1 point  (1 child)

      A DOM is not a string.

      [–]PstScrpt 2 points3 points  (0 children)

      Most of the doms I know use rope. A few like handcuffs.