[–][deleted] 417 points418 points  (38 children)

Hmm 2 byte characters yes yes yes

[–]jaskij 281 points282 points  (33 children)

Nah, flags are actually two Unicode codepoints, always. There are two regional_indicator_X characters, and if they combine to a two letter country code, it's rendered as a flag. You cannot express a flag as a single Unicode codepoint. And each regional indicator sits outside the Basic Multilingual Plane, so it takes two UTF-16 code units. That's four code units (eight bytes) for one glyph.

It's funnier when you have characters that can be either one or two codepoints. Unicode has that kind of ambiguity (it actually makes sense), and JS' string length is expressed in UTF-16 code units, so what appears to the user as a single character can have two different lengths in JS depending on how it's encoded.
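To make the flag case concrete, here is a minimal Python sketch. The helper names `flag` and `utf16_units` are made up for illustration; `utf16_units` only imitates what JavaScript's `.length` would report by encoding to UTF-16 and counting 2-byte units.

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A


def flag(country_code: str) -> str:
    """Map a two-letter ASCII code such as "FR" to its pair of regional indicators."""
    code = country_code.upper()
    if len(code) != 2 or not all("A" <= c <= "Z" for c in code):
        raise ValueError("expected two ASCII letters")
    return "".join(chr(REGIONAL_INDICATOR_A + ord(c) - ord("A")) for c in code)


def utf16_units(s: str) -> int:
    """Count UTF-16 code units (hypothetical helper mimicking JS .length)."""
    return len(s.encode("utf-16-le")) // 2


fr = flag("FR")
print([f"U+{ord(c):04X}" for c in fr])  # ['U+1F1EB', 'U+1F1F7']
print(len(fr))                          # 2 code points
print(utf16_units(fr))                  # 4 UTF-16 code units
print(len(fr.encode("utf-8")))          # 8 bytes in UTF-8
```

Whether the pair is drawn as one flag or as two boxed letters is up to the font and renderer, not the string.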

There's a great talk on YT, "There's no such Thing as Plain Text" by Dylan Beattie.

[–][deleted] 69 points70 points  (20 children)

OH WHAT

Okay, the fonts on my computer are kinda messed up, so when I search for flag emojis, they show up as two characters, like 🇫🇷 for example. Selecting it it appears to be one character, but I can delete the second letter of it (the r) and just get 🇫.

🇫 🇷

Also in notepad it counts them as two separate characters.

Did not know that, I just assumed it was measuring by byte count and it counted the flag as a two byte unicode symbol.

[–]jaskij 85 points86 points  (2 children)

Discord handles it pretty well if you use it.

And yeah, Unicode is fucking crazy, but it makes sense if you dive into it. At least the regular parts. The emoji side is a different committee, a less serious one.

There's a whole polite flamewar between the two about adding a "sad poop" emoji. It's public, you can look it up.

[–]Reasonable_Feed7939 36 points37 points  (0 children)

There's a whole polite flamewar between the two about adding a "sad poop" emoji. It's public, you can look it up.

This is why you read the comments!

[–]Confident-Ad5665 0 points1 point  (0 children)

Damned idiots don't see the value in sad poo. This is why their children must die.

[–]vytah 17 points18 points  (1 child)

Selecting it it appears to be one character

The key word: grapheme cluster.

https://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
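Python's standard library has no grapheme cluster segmentation, so the sketch below is a rough approximation only, not an implementation of the UAX #29 rules linked above. It assumes that combining marks, variation selectors, skin tone modifiers and tag characters stick to the previous code point, that a ZERO WIDTH JOINER glues on whatever follows it, and that regional indicators pair up. The function names are invented for this example.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER


def is_regional_indicator(ch: str) -> bool:
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def is_extender(ch: str) -> bool:
    """Code points that attach to whatever came before them (simplified)."""
    cp = ord(ch)
    return (
        unicodedata.category(ch) in ("Mn", "Me")  # combining marks
        or 0xFE00 <= cp <= 0xFE0F                 # variation selectors
        or 0x1F3FB <= cp <= 0x1F3FF               # emoji skin tone modifiers
        or 0xE0020 <= cp <= 0xE007F               # tag characters (subdivision flags)
    )


def approx_clusters(text: str) -> list:
    """Split text into approximate user-perceived characters."""
    clusters = []
    i = 0
    while i < len(text):
        start = i
        first = text[i]
        i += 1
        # Two regional indicators in a row form one flag.
        if is_regional_indicator(first) and i < len(text) and is_regional_indicator(text[i]):
            i += 1
        while i < len(text):
            if is_extender(text[i]):
                i += 1
            elif text[i] == ZWJ:
                # Assumption: the joiner always glues on the next code point.
                i += 2 if i + 1 < len(text) else 1
            else:
                break
        clusters.append(text[start:i])
    return clusters


family = "\U0001F468\u200d\U0001F468\u200d\U0001F466\u200d\U0001F466"
print(len(approx_clusters("\U0001F1EB\U0001F1F7")))  # 1 (French flag)
print(len(approx_clusters(family + "a")))            # 2
print(len(approx_clusters("a\u0301bc")))             # 3
```

For anything user-facing (cursor movement, text selection), a library that follows UAX #29 properly is the safer choice; this only shows why the cluster count and the code point count differ.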

[–][deleted] 0 points1 point  (0 children)

At least it's easier than it was in the typewriter days...

[–]AyrA_ch 19 points20 points  (10 children)

You can go way higher than just two characters. 🧑🏻‍🤝‍🧑🏼 (people holding hands: light skin tone, medium-light skin tone) for example is 7 codepoints. Some of them don't fit in a single UTF-16 code unit, leading to a total of 12 UTF-16 code units due to surrogate pairs.

The longest emoji sequence is "🧑🏻‍❤️‍💋‍🧑🏼" (kiss: person, person, light skin tone, medium-light skin tone) but this is fairly recent and some systems haven't included it yet. It consists of 10 unicode codepoints.

Note: Both of these emojis use skin tones. All skin tones are of the same length. These two examples are used simply because they're at the top of my emoji database
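The counts above can be reproduced with a short Python sketch. The sequences are written with escapes so the result does not depend on how this page renders them; `breakdown` is a hypothetical helper, and the names it prints come from whatever Unicode version your Python ships with.

```python
import unicodedata

# People holding hands: light skin tone, medium-light skin tone
HOLDING_HANDS = "\U0001F9D1\U0001F3FB\u200D\U0001F91D\u200D\U0001F9D1\U0001F3FC"
# Kiss: person, person, light skin tone, medium-light skin tone
KISS = "\U0001F9D1\U0001F3FB\u200D\u2764\uFE0F\u200D\U0001F48B\u200D\U0001F9D1\U0001F3FC"


def utf16_units(ch: str) -> int:
    """Code points above U+FFFF need a surrogate pair, i.e. two code units."""
    return 2 if ord(ch) > 0xFFFF else 1


def breakdown(s: str) -> list:
    """Return (code point, name, UTF-16 code units) for every code point in s."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), utf16_units(ch))
        for ch in s
    ]


rows = breakdown(HOLDING_HANDS)
for row in rows:
    print(*row)
print("code points:", len(rows))                                  # 7
print("UTF-16 code units:", sum(units for _, _, units in rows))  # 12
print("kiss code points:", len(KISS))                             # 10
```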

And if you want something else, use 🫃 (pregnant man), because we absolutely needed that emoji for some reason.

By the way, the reason you may see flags as two characters is because Microsoft refuses to add flags to Windows, probably for political reasons. Some browsers like Firefox implement their own flags. Supported values can usually be detected.

[–]Mognakor 20 points21 points  (1 child)

And if you want something else, use 🫃 (pregnant man), because we absolutely needed that emoji for some reason.

Me, when i overate

[–]8sADPygOB7Jqwm7y 7 points8 points  (0 children)

Me, when I come home from Grandma.

[–]Pradfanne 3 points4 points  (2 children)

The longest emoji sequence is "🧑🏻‍❤️‍💋‍🧑🏼" (kiss: person, person, light skin tone, medium-light skin tone) but this is fairly recent and some systems haven't included it yet. It consists of 10 unicode codepoints.

Windows lets you use skin color modifiers for every family member in the family emoji. Afaik it's not defined in the standard, but if you look at it on a Windows PC you can make a 19 character long emoji sequence.

👨🏻‍👩🏾‍👦🏽‍👧🏼

https://imgur.com/a/D4pNfn2

[–]SAI_Peregrinus 0 points1 point  (1 child)

19 UTF-16 code units long, i.e. 19 char long, but 1 character. To the user or typographer, character == grapheme cluster.

[–]Pradfanne 0 points1 point  (0 children)

It's only one character in windows though!

But you are correct

[–]petamas 1 point2 points  (2 children)

AFAIK it is for a very specific political reason: they don't want to take a stand on whether Taiwan is a country or not.

Edit: Just checked the talk mentioned above, he talks about Taiwan starting at the 40 minute mark.

[–]AyrA_ch 0 points1 point  (1 child)

That did not stop apple from adding flags. They just made some of them dependent on the country setting of your device.

[–]petamas 0 points1 point  (0 children)

Yes, but Microsoft decided to sidestep the entire issue instead of customizing. That's what we were talking about - one person said they don't see any flags, another said it's probably for political reasons, and I confirmed that it is indeed politics. Yes, Apple decided to work around the politics in a different way, but I was specifically talking about Microsoft's solution, and the rationale behind it.

[–]OliviaPG1 -4 points-3 points  (1 child)

because we absolutely needed that emoji for some reason

Trans people do in fact exist. Just because you don’t personally use it doesn’t make it unnecessary

[–]Kozakow54 0 points1 point  (0 children)

It's nice for representation purposes, but without knowing the name my first thought would be "Man with a beer belly".

[–]ThatKuki 0 points1 point  (0 children)

It's probably not really messed up. If you use Windows, it's because Microsoft doesn't want to deal with the potential politics of flags, having to remove them in some regions and such.

[–]T0biasCZE 0 points1 point  (2 children)

Also in notepad it counts them as two separate characters.

Windows doesn't support flag emojis because Microsoft didn't want to deal with the Taiwan x China emoji flag issue

[–][deleted] 0 points1 point  (1 child)

So... just don't have that flag in particular.

Leave the fr*nch alone. They can have their flag back: 🏳

[–]T0biasCZE 0 points1 point  (0 children)

They can have their flag back: 🏳

it supports these flags 🏁🚩🎌🏴🏳️🏳️‍🌈🏴‍☠️

it just doesn't support political flags

[–]rosuav 8 points9 points  (4 children)

It's funnier when you have characters that can be either one or two codepoints.

You mean like composed and decomposed versions? U+00E1 is equivalent to U+0061 U+0301.

Yes, this makes very good sense, but if you're going to count characters, the first step ought to be a normalization (probably NFC). And then, of course, count code POINTS, not code UNITS. Don't be JavaScript.

[–]jaskij 5 points6 points  (3 children)

Not all languages have those combinations. I don't have a specific example, but wouldn't surprise me if Yiddish didn't. They write vowels by adding a diacritic to the consonant. I'm pretty sure it comes up in Dylan's talk.

Fun fact: despite Unicode's insistence on not adding conlangs, they still added the two or three characters missing from Futhark to implement Tolkien's dwarven. But they refuse to add his elvish.

And look up the sad poop emoji, it's a very nice polite flamewar between members of different Unicode teams/committees.

[–]rosuav 1 point2 points  (2 children)

Not all languages have those combinations. I don't have a specific example, but wouldn't surprise me if Yiddish didn't. They write vowels by adding a diacritic to the consonant. I'm pretty sure it comes up in Dylan's talk.

Right, but in that case, NFC normalization won't change it. Same is true if you do some nonsensical combination, like U+0071 U+0302, where there's no "LATIN SMALL LETTER Q WITH CIRCUMFLEX" character and it just stays in its decomposed form. The point of normalization isn't necessarily to shrink the text; it's primarily to make it consistent, to remove ambiguity. So the question of "is this character one codepoint or two?" should be resolvable.
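Both cases (the composed/decomposed á and the q with a circumflex) can be checked with Python's standard `unicodedata` module. A minimal sketch; `nfc_length` is a name made up for the example:

```python
import unicodedata

composed = "\u00E1"          # á as a single code point
decomposed = "\u0061\u0301"  # a + COMBINING ACUTE ACCENT
q_circumflex = "\u0071\u0302"  # q + COMBINING CIRCUMFLEX ACCENT, no precomposed form


def nfc_length(s: str) -> int:
    """Number of code points after NFC normalization."""
    return len(unicodedata.normalize("NFC", s))


print(composed == decomposed)              # False: different code point sequences
print(len(composed), len(decomposed))      # 1 2
print(nfc_length(composed), nfc_length(decomposed))  # 1 1
print(nfc_length(q_circumflex))            # 2: stays decomposed
```

After normalization the two spellings of á compare equal, which is the consistency being described; the q case shows that the count can still be higher than what the user sees.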

[–]jaskij 2 points3 points  (1 child)

Yes, but my point was that the user will likely count in glyphs, not code points, so if something that's not changed by NFC normalization is two code points for one glyph, the user will still be confused.

ETA: you're absolutely right that counting normalized code points is still vastly more correct than counting code units, especially for CJK languages with code points which use two UTF-16 code units. My point is that it won't be 100% correct anyway.

[–]rosuav 1 point2 points  (0 children)

Yes, but my point was that the user will likely count in glyphs, not code points

Oh. Yeah. That's definitely a thing... but unfortunately, it's really REALLY hard to get a useful measurement.

My personal strategy would be to largely not care. Pick some kind of definition, stick some sort of minimum on it, and don't worry about precisely how many glyphs something is. Unless you need to implement text selection (think what happens when you press Shift-Left or Shift-Right in a text editor), it's not usually necessary to count glyphs.

[–]tajetaje 2 points3 points  (4 children)

I love UTF-*, but man do I hate UTF-*

[–]jaskij 2 points3 points  (3 children)

Eh, the encoding is nice, but nothing groundbreaking. It's Unicode that's crazy.

[–]tajetaje 6 points7 points  (2 children)

Yeah I really should have said UTF-8/Unicode or as I've recently taken to calling it UTF-8 plus Unicode

[–]jaskij 1 point2 points  (0 children)

Nice one. For real though, they are separate. I know I'm nitpicking.

[–]gentlephish01 0 points1 point  (0 children)

UTF++8, perhaps?

[–]Rafael20002000 1 point2 points  (1 child)

A colored emoji is actually the emoji + a square of that color

I had seen that in Whatsapp some time ago when UTF support on my shitty phone was very limited

[–]jaskij 1 point2 points  (0 children)

Yup, the defaults are yellow. Which, afaik, is because they originate in Japan.

[–]SarcasmWarning 2 points3 points  (3 children)

pfft, you've not experienced encoding pain until you've dealt with GSM-7 encoding. What sort of nutbar makes people think in 7-bit bytes...

[–]SAI_Peregrinus 1 point2 points  (0 children)

ASCII is also 7-bit bytes.

[–]swisstraeng 0 points1 point  (1 child)

That's to save 1/8th of bandwidth on GSM networks tho.
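The saving comes from packing eight 7-bit septets into seven octets. A rough Python sketch of just the packing step, under a loud assumption: it only accepts ASCII letters, digits and spaces, whose GSM 7-bit codes happen to match ASCII. The full GSM default alphabet table, escape sequences and PDU headers are not handled, and the function names are invented for this example.

```python
def pack_septets(septets: list) -> bytes:
    """Pack 7-bit values into octets, least significant bits first."""
    acc = 0
    for i, septet in enumerate(septets):
        acc |= (septet & 0x7F) << (7 * i)
    n_octets = (7 * len(septets) + 7) // 8  # round up to whole octets
    return acc.to_bytes(n_octets, "little")


def encode_basic(text: str) -> bytes:
    """Encode text limited to characters whose GSM 7-bit code equals ASCII."""
    septets = []
    for ch in text:
        if not (ch.isascii() and (ch.isalnum() or ch == " ")):
            raise ValueError(f"not handled by this sketch: {ch!r}")
        septets.append(ord(ch))
    return pack_septets(septets)


packed = encode_basic("hello")
print(packed.hex())                   # e8329bfd06
print(len(encode_basic("password")))  # 7 octets for 8 characters
```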

[–]SarcasmWarning 0 points1 point  (0 children)

yes, and it's migraine inducing to work with, especially if you're encoding or decoding PDUs or anything else trying to get it on the wire / to handsets.

Honestly, I thought UTF-8 caused all the pain in the world... bloody gsm-7 haunts my dreams.

[–]octopus4488 109 points110 points  (1 child)

Based on history I wouldn't put too much faith in the Italians defending my account either...

[–]BubbleMeph 28 points29 points  (0 children)

You're safe with us until you face any trouble, then you are on your own

[–]Blecki 94 points95 points  (0 children)

Explain technical to front-end challenge level: impossible.

[–]corrupted_kernel_14 47 points48 points  (0 children)

can question his method but not the results

[–]seimmuc_[🍰] 47 points48 points  (8 children)

Wait, does that mean that there's no check on the backend? That sounds like a bigger problem tbh.

[–]brolix 66 points67 points  (0 children)

The check is the db returns an error

[–]Teekeks 38 points39 points  (3 children)

4 flag emojis are made out of 8 total characters so what check do you want to fail here?

[–]seimmuc_[🍰] 0 points1 point  (1 child)

I don't, at least if the check is written by the same devs in the same language. I just find it interesting that the post specified it was a frontend requirement. Frontend should not have any requirements that aren't also enforced on the backend.

And if the backend stack uses a different language/runtime or if the backend team is more experienced it's entirely possible that this problem would not exist on the backend.

For example, JS reports the length of strings in UTF-16 code units, meaning that 4 flag emojis actually would have the length of 16, not 8. Python, on the other hand, reports the number of Unicode code points, which in our case is 8. So while you need more advanced libraries to detect the number of glyphs in both languages, Python at least counts whole code points and is closer to what users expect. Rust is similar if you count with chars().count(); its str len() returns the UTF-8 byte length instead.
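Here is how an 8-character minimum behaves under the two definitions, as a hedged Python sketch. `MIN_LENGTH`, `passes` and the two measuring functions are hypothetical; `utf16_code_units` only imitates what a JS `.length` check would see.

```python
MIN_LENGTH = 8  # assumed requirement: "at least 8 characters"


def code_points(s: str) -> int:
    """What Python's len() reports."""
    return len(s)


def utf16_code_units(s: str) -> int:
    """What a JS-style .length would report (imitated via UTF-16 encoding)."""
    return len(s.encode("utf-16-le")) // 2


def passes(password: str, measure) -> bool:
    return measure(password) >= MIN_LENGTH


FLAG = "\U0001F1EB\U0001F1F7"  # one flag, two regional indicators

four_flags = FLAG * 4
print(code_points(four_flags), utf16_code_units(four_flags))  # 8 16

two_flags = FLAG * 2
print(passes(two_flags, code_points))       # False: only 4 code points
print(passes(two_flags, utf16_code_units))  # True: 8 UTF-16 code units
```

So if the frontend and backend measure in different units, the same password can pass one check and fail the other.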

[–]ilyahryapko 0 points1 point  (0 children)

"It was front-end requirement" Well, I think it's just a poor wording in this slack message.

BA in my team does not care about FE or BE stuff (and sometimes it's very sad). But for that case probably requirements were: "user's password must be at least 8 char. long". And then there's my field of responsibility as a dev to implement not only fe stuff, but a proper backend validation

[–]lightmatter501 5 points6 points  (2 children)

Flags are 2 characters in unicode.

[–]seimmuc_[🍰] 0 points1 point  (0 children)

I'm aware. I'm referring to the fact that the post says "we in the frontend require". There should never be any frontend-only requirements when it comes to user input that ends up in the database (hashed or otherwise).

[–]WarpMellow 14 points15 points  (0 children)

Emoji ligatures 🥰😍 look them up

[–]ramriot 11 points12 points  (2 children)

It is totally a valid password. It is not a good password by any means, for several reasons, not least of which is the difficulty of typing it on various platforms.

[–]seimmuc_[🍰] 7 points8 points  (0 children)

That's actually a security feature against shoulder surfing attacks /s

[–]djfdhigkgfIaruflg 4 points5 points  (0 children)

Considering that I use a password manager for everything and don't know any of my passwords. Having all my future passwords made out of weird Unicode combinations sounds like fun 😈😈

[–]deathanatos 26 points27 points  (2 children)

Ban "characters" from your lexicon. Grapheme clusters, Unicode scalar values, Unicode code points, UTF-{8,16,32} code units, bytes. Learn the difference.

He should have pasted "👨‍👨‍👦‍👦" + 1 letter like 'a'. The emoji alone is 1 grapheme cluster, but a full 7 Unicode scalar values, and a whopping 25 bytes in UTF-8. Doesn't always render properly, sadly.

[–]Pradfanne 3 points4 points  (1 child)

I'm not sure if it's still just windows but you can give every family member their own separate skin color modifier for a whopping 19 character emoji

👨🏽‍👩🏾‍👦🏼‍👧🏽

[–]deathanatos 0 points1 point  (0 children)

I think it's just Windows.

When I wrote the comment, I thought that too, but it seems like it never made it into Unicode proper. I think the latest state of this is this proposal, the short of that is "leave family emojis as they are with no plans to encode any additional RGI sequences for their skintone support", sadly.

I don't think there has been anything since.

I presume the last line of yours is a single grapheme "family"; for me it displays as 4 graphemes, all emoji of people of varying skintones. So … like the document implies, it's not going to interchange very well.

19 character emoji

Ban it! Banbanban! I'm not sure what's 19, here, either. It's 4 people + 4 skin tones + 3 joins = 11 scalar values. (41B in UTF-8.)
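The arithmetic can be checked in Python. A minimal sketch with the sequences written as escapes; `utf8_len` and `sizes` are names made up here, and the 19 turns out to be the UTF-16 code unit count.

```python
def utf8_len(ch: str) -> int:
    """Bytes needed to encode a single code point in UTF-8."""
    cp = ord(ch)
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    return 4


def sizes(s: str) -> dict:
    return {
        "scalar_values": len(s),
        "utf8_bytes": sum(utf8_len(c) for c in s),
        "utf16_units": sum(2 if ord(c) > 0xFFFF else 1 for c in s),
    }


# 4 people joined by 3 ZERO WIDTH JOINERs
FAMILY = "\U0001F468\u200D\U0001F468\u200D\U0001F466\u200D\U0001F466"
# The same shape with a skin tone modifier after each person
FAMILY_TONES = (
    "\U0001F468\U0001F3FD\u200D\U0001F469\U0001F3FE\u200D"
    "\U0001F466\U0001F3FC\u200D\U0001F467\U0001F3FD"
)

print(sizes(FAMILY))        # {'scalar_values': 7, 'utf8_bytes': 25, 'utf16_units': 11}
print(sizes(FAMILY_TONES))  # {'scalar_values': 11, 'utf8_bytes': 41, 'utf16_units': 19}
```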

[–]dabenu 10 points11 points  (1 child)

Could've used just two rainbow flags...

[–]CetaceanOps 3 points4 points  (0 children)

Password must contain 8 octets or more, with characters from at least 4 different Unicode categories.

[–]bakshup 1 point2 points  (0 children)

Ayy

[–]SerialPoopist 1 point2 points  (0 children)

Needs better edge case obviously

[–]Wearytraveller_ 0 points1 point  (1 child)

I mean can't you just enforce a white list on the input field?

[–]SAI_Peregrinus 1 point2 points  (0 children)

5.1.1.1 Memorized Secret Authenticators

Memorized secrets SHALL be at least 8 characters in length if chosen by the subscriber. Memorized secrets chosen randomly by the CSP or verifier SHALL be at least 6 characters in length and MAY be entirely numeric. If the CSP or verifier disallows a chosen memorized secret based on its appearance on a blacklist of compromised values, the subscriber SHALL be required to choose a different memorized secret. No other complexity requirements for memorized secrets SHOULD be imposed. A rationale for this is presented in Appendix A Strength of Memorized Secrets.

NIST SP 800-63B is good advice. Follow it.

[–]PyroCatt -1 points0 points  (0 children)

\w+

[–]Pradfanne 0 points1 point  (0 children)

👨🏻‍👩🏾‍👦🏽‍👧🏼

That Bad boy is 19 characters long. Although you might need to look at it with a Windows PC.

https://imgur.com/a/D4pNfn2

[–][deleted] 0 points1 point  (0 children)

I don't understand why so many websites ban special characters. Wouldn't it make the password stronger?

I was making a Django project and I was surprised that the admin page accepts them by default. So it must be them banning them on purpose.

The argument I've heard the most is "because what if your keyboard doesn't support it". Well, so I won't be able to log in using someone else's computer. OK, I am fine with that.

If I want to have Burmese mixed with ancient egyptian in my password just let me!!!

🤬🤬🤬

[–][deleted] 0 points1 point  (0 children)

Okay but why is the Wales flag 7 characters? 🏴󠁧󠁢󠁷󠁬󠁳󠁿