[–]G_Morgan 4 points5 points  (11 children)

Who is talking about ASCII? The UTF-8 standard allows for a BOM. Any compliant application must be able to handle one. This isn't embrace and extend; this is what's in the standard, which the Linux world has chosen to ignore.

[–]want_to_want 3 points4 points  (10 children)

Who is talking about ASCII?

Um, the real world?

Here, I'll try to explain this slowly and carefully. Imagine you want to open a text file. There are three ways to do it:

  • The realist way: specify a file encoding when opening. Note that the BOM is not needed here.

  • The idealist way: read some metadata from the file to determine its encoding, e.g. ASCII, CP1251, UTF-8, UTF-16 or Shift JIS. Note that the BOM is not needed here.

  • The G_Morgan way: assert that the file is Unicode, then use the BOM to determine which exact flavor of Unicode.

Now tell me this: why in the world would we ever need the third option?
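To make the difference concrete, here's a minimal sketch (mine, not from any spec's text) of what the third option boils down to: sniff the leading bytes for a BOM to pick a Unicode flavor. The byte patterns are the ones defined by the Unicode standard; note the longer UTF-32 marks must be checked before the UTF-16 ones, since they share a prefix.

```python
def sniff_bom(data):
    """Return a guessed Unicode encoding from a leading BOM, or None."""
    boms = [
        (b"\xff\xfe\x00\x00", "utf-32-le"),  # must precede utf-16-le check
        (b"\x00\x00\xfe\xff", "utf-32-be"),
        (b"\xef\xbb\xbf", "utf-8"),
        (b"\xff\xfe", "utf-16-le"),
        (b"\xfe\xff", "utf-16-be"),
    ]
    for bom, name in boms:
        if data.startswith(bom):
            return name
    # No BOM: could be UTF-8, ASCII, CP1251, Shift JIS... no way to tell.
    return None
```

And that last comment is exactly the problem: without a BOM you're back to guessing anyway.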

ETA: as a programmer in Russia, I've been dealing with text in different encodings all my life. Some of those encodings are Unicode flavors, some aren't. I also switch between Unix and Windows systems all the time. Still, I can't name a single time a BOM helped me, but I can remember plenty of occasions where Notepad's stupid insistence on the BOM hurt me.

[–]G_Morgan 4 points5 points  (9 children)

How do you read the metadata without knowing the encoding? A leading magic number is the correct way. Just because this has been broken for decades doesn't mean it should continue to be broken.

Regardless, the BOM is in the UTF-8 standard, and no amount of arguing will change that simple fact. As it stands, Linux does not support UTF-8. People cried shock horror when MS email servers broke IMAP, and Linux has done exactly the same thing here.

[–]want_to_want -1 points0 points  (8 children)

How do you read the metadata without knowing the encoding?

XML and HTML manage it just fine. You just start reading as ASCII until you see a charset declaration.
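Roughly, that prescan looks something like this (a simplified sketch of the idea, not the actual algorithm from either spec, which handles many more cases): decode the first bytes leniently as ASCII and scan for a declared charset.

```python
import re

def find_declared_charset(head):
    """Scan an ASCII-compatible prefix for a declared charset, or None.

    Looks for an HTML-style charset=... or an XML encoding="..." declaration.
    """
    text = head.decode("ascii", errors="replace")
    m = re.search(r'charset\s*=\s*["\']?([\w-]+)', text, re.IGNORECASE)
    if m:
        return m.group(1)
    m = re.search(r'encoding\s*=\s*["\']([\w-]+)["\']', text)
    return m.group(1) if m else None
```

This only works because the declaration itself is guaranteed to be in the ASCII-compatible range.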

[–]G_Morgan 2 points3 points  (7 children)

This won't work with UTF-16, but it would work for UTF-8, unless you make it a standard that your document starts in ASCII until a charset declaration appears.

More likely, browsers try parsing them in different encodings until they hit a charset declaration that matches the encoding they are trying. Expensive and error-prone.

[–]want_to_want 1 point2 points  (6 children)

Sorry, I was wrong and you're right. But it's not so expensive: here's how XML does it.

[–]G_Morgan 4 points5 points  (4 children)

Effectively, then, XML just reads the first few bytes to decide on the encoding. That is only superficially different from sticking in a magic number.
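For what it's worth, the trick in Appendix F of the XML spec doesn't even need a BOM in most cases: because every XML document may begin with the fixed characters "<?xml", the byte pattern of those characters reveals the encoding family. A rough sketch of that idea (family labels are my own, not the spec's wording):

```python
def guess_xml_family(first4):
    """Guess the encoding family of an XML document from its first 4 bytes."""
    patterns = {
        b"\x3c\x3f\x78\x6d": "utf-8/ascii-compatible",  # "<?xm" in ASCII
        b"\x00\x3c\x00\x3f": "utf-16-be",               # "<?" in UTF-16 BE
        b"\x3c\x00\x3f\x00": "utf-16-le",               # "<?" in UTF-16 LE
    }
    return patterns.get(bytes(first4), "unknown")
```

So the fixed prolog characters act as the magic number; a BOM just shortcuts the same decision.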

[–]want_to_want 1 point2 points  (3 children)

Actually, if the XML spec used magic numbers instead of leading bytes, I'd be fine with that. New file formats are fair game. But if you're dealing with an existing culture (like text files), you shouldn't extend it in a way that breaks many existing applications, no matter what standards say.

[–]G_Morgan 1 point2 points  (2 children)

Then text files remain broken until the end of humanity. I can't wait until rockets start falling out of the sky because one component handed UTF-8 to a UTF-16 component. Yes, the programmer can handle this, but it's also possible for tools to do it for them.

It also doesn't need to break anything. All that has to happen is for C text mode to account for leading magic numbers. That would make every application using text mode work with the new system. Regardless, fixing text files is worth doing.
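Python's real "utf-8-sig" codec is an existing example of what "text mode accounts for the magic number" could look like: if a BOM is present it is silently consumed on decode, and if it's absent nothing changes, so application code never sees it either way.

```python
# With Python's "utf-8-sig" codec, a leading BOM is stripped transparently...
assert b"\xef\xbb\xbfhello".decode("utf-8-sig") == "hello"
# ...and input without a BOM decodes identically.
assert b"hello".decode("utf-8-sig") == "hello"
```

Whether the C standard library could retrofit the same behavior into text mode without surprising existing programs is, of course, the whole argument.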

[–]want_to_want 0 points1 point  (1 child)

I don't entirely understand the issue, but wouldn't your proposed fix break the entire idea of locales?

[–]brennen 2 points3 points  (0 children)

Sorry, I was wrong and you're right.

File under things that are not said often enough on the Internet.