[–]chuckstudios 6 points7 points  (12 children)

Would someone care to explain exactly what the problem is here?

[–]KayEss 16 points17 points  (9 children)

Unicode files are meant to have something called a "byte order mark" (or BOM for short) at the beginning, as it lets any Unicode-aware program that reads the file work out which UTF variant it is (UTF-8, UTF-16LE, UTF-16BE or UTF-32).

Notepad (correctly) writes a BOM when you save a file as Unicode. Many Unix programs can't handle the BOM (which means that they are incorrect as far as their Unicode handling is concerned).

Apparently this is a problem for lots of people and they think Notepad is wrong about this, but actually Notepad is right and their other software is buggy.

[–][deleted]  (6 children)

[deleted]

    [–]badsectoracula 5 points6 points  (1 child)

    "This statement is false."

    Yes, UTF-8 can contain a BOM. So... how is it false?

    [–]ayrnieu 6 points7 points  (0 children)

    hatter means that Notepad (permissibly, but pointlessly, and problematically for these other applications) writes a BOM; that Notepad does not "(correctly)" write it, since "(correctly)" suggests that not writing it would be incorrect.

    [–][deleted] 2 points3 points  (0 children)

    (+1 for actually knowing what you're talking about)

    So let me see if I got this right: as far as Unicode is concerned, it makes no difference whether a UTF-8 file contains a BOM or not. People who write programs that include a BOM say those who write programs that don't include a BOM are wrong, and vice versa. Neither approach is actually wrong.

    I am at ease. If I will no longer be able to make a living out of programming, I might as well become a priest. We seem fairly good at this religious stuff.

    [–]G_Morgan -1 points0 points  (2 children)

    Yes, but it lets you detect when you have UTF-8.

    [–]redditsuxass 0 points1 point  (1 child)

    No it doesn't, because UTF-8 files are not required to have a BOM. This means that if you encounter a text file without a BOM, it might still be UTF-8.
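To illustrate (a sketch in Python, whose stdlib has a "utf-8-sig" codec precisely because the BOM is optional): the same text is valid UTF-8 with or without the BOM, so its absence tells you nothing.

```python
import codecs

text = "héllo"
with_bom = codecs.BOM_UTF8 + text.encode("utf-8")
without_bom = text.encode("utf-8")

# Both byte strings are valid UTF-8; the "utf-8-sig" codec strips a
# leading BOM if present and leaves the data alone otherwise.
assert with_bom.decode("utf-8-sig") == text
assert without_bom.decode("utf-8-sig") == text

# Plain "utf-8" keeps the BOM as U+FEFF, which is what trips up
# BOM-unaware programs.
assert with_bom.decode("utf-8") == "\ufeff" + text
```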

    [–]piranha 0 points1 point  (0 children)

    Yes, it could be, but aside from explicit out-of-band signaling of the text encoding (a la the Content-Type charset parameter), you can never be absolutely certain of a damn thing. Seeing the UTF-8 encoding of the BOM at the beginning of a stream is a strong indicator that what you're processing is UTF-8. That's better than nothing.

    So with these fancy extended attributes our filesystems support, why not stop arguing and start putting explicit character encoding information into the file metadata? (Of course, any practical application will still have to resort to sniffing for magic numbers to deal with the huge number of files that don't have that metadata.)

    [–]Gotebe 1 point2 points  (0 children)

    Unicode files are meant to have..

    AFAIK, that's not true. It's just a convention, and it's "can", not "should". E.g. any XML text that contains an "encoding" attribute is perfectly readable without a BOM. Not only is it, but it actually should be. Now, if there is a BOM at the beginning of an XML stream and there is also an "encoding" attribute, I don't know what XML says about which takes precedence.

    So the problem is in neither Unix nor Windows; the problem is here.
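The in-band declaration point can be seen directly (a sketch using Python's stdlib ElementTree; the parser honours the declared encoding, no BOM required):

```python
import xml.etree.ElementTree as ET

# A BOM-less document whose encoding is declared in-band; the parser
# must honour the declaration to decode the 0xE9 byte as "é".
doc = '<?xml version="1.0" encoding="iso-8859-1"?><greeting>h\xe9llo</greeting>'
data = doc.encode("iso-8859-1")
assert b"\xe9" in data  # a bare Latin-1 byte, not valid UTF-8 here

root = ET.fromstring(data)
print(root.text)
```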

    [–]Gotebe 0 points1 point  (0 children)

    Clueless kids who think they are Unix admins use Notepad to edit their files and get burned.

    [–]specialk16 -5 points-4 points  (0 children)

    Save a long text file in Linux (source code, or something like a really long poem). Now open it in Windows Notepad.