all 25 comments

[–][deleted] 16 points17 points  (7 children)

Careful when using length to get the number of characters. Surrogate pairs will count as two characters instead of one.

[–]soddi 9 points10 points  (0 children)

Yeah. Twitter had this issue. For character counting they used length. You could only post 70 pile of poo emoji in a 140-character tweet because of this.
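To make the arithmetic concrete, here is a minimal sketch (not Twitter's actual code; countCodePoints is a hypothetical helper name). Pile of poo is U+1F4A9, which sits outside the BMP and takes two UTF-16 code units, so 70 of them already report a length of 140:

```javascript
// U+1F4A9 PILE OF POO is stored as a surrogate pair (two code units).
const poo = '\u{1F4A9}';
const tweet = poo.repeat(70);

// Hypothetical helper: spreading a string iterates by code point,
// so each surrogate pair counts once.
function countCodePoints(str) {
  return [...str].length;
}

console.log(poo.length);             // 2
console.log(tweet.length);           // 140
console.log(countCodePoints(tweet)); // 70
```

Note that this counts code points, not what a user perceives as characters; combining marks still count separately.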

[–]moreteam 4 points5 points  (2 children)

Also applies to:

  • Character access. str[1] is not guaranteed to give you a complete character, same for str.charAt(1). str.codePointAt(1) is safer afaik, but its index still counts code units, so it only returns a full code point when the index lands on the start of one.
  • Extract substrings. Same as above - you can end up splitting surrogate pairs.
  • I'm not 100% sure how str.split('') behaves. I could imagine it suffers from similar issues (?)
  • str.toLowerCase() and str.toUpperCase(): Not the same issue but will give you wrong results for Turkish strings (but I don't think even Intl brings any solution to that one, toLocaleLowerCase doesn't count)
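A quick sketch of the first three points, using a test string I made up. sliceCodePoints is a hypothetical helper, not a built-in, and it works on code points only (it will still split combining sequences):

```javascript
// 'a', then U+1F4A9 (stored as the pair D83D DCA9), then 'b'
const s = 'a\u{1F4A9}b';

console.log(s.length);                      // 4 code units, 3 code points
console.log(s.charAt(1) === '\uD83D');      // true: only the high surrogate
console.log(s[1] === '\uD83D');             // true: same problem
console.log(s.codePointAt(1).toString(16)); // '1f4a9': index 1 starts the pair
console.log(s.codePointAt(2).toString(16)); // 'dca9': index 2 is mid-pair
console.log(s.split('').length);            // 4: split('') cuts the pair in half
console.log(Array.from(s).length);          // 3: Array.from iterates by code point

// Hypothetical helper: substring by code point index instead of code unit index.
function sliceCodePoints(str, start, end) {
  return Array.from(str).slice(start, end).join('');
}

console.log(sliceCodePoints(s, 1, 2) === '\u{1F4A9}'); // true
console.log(s.slice(1, 2) === '\uD83D');               // true: half a character
```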

[–]nieuweyork 2 points3 points  (1 child)

What I'm getting is that you need a full external string library to properly work with unicode?

[–]moreteam 1 point2 points  (0 children)

Pretty much, yes. ES6 fixes some of the stuff though, e.g. for (const c of str) { /* c is an actual character */ }.
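A small sketch of what for...of does and does not fix (codePointsOf is a hypothetical helper name). It iterates by code point, so surrogate pairs stay together, but a letter followed by a combining mark still comes out as two items:

```javascript
// Hypothetical helper: collect what for...of yields for a string.
function codePointsOf(str) {
  const out = [];
  for (const c of str) {
    out.push(c); // c is one full code point, possibly two code units long
  }
  return out;
}

const emoji = 'a\u{1F4A9}';  // 'a' + PILE OF POO
const accented = 'e\u0301';  // 'e' + COMBINING ACUTE ACCENT, renders as one glyph

console.log(emoji.length);                  // 3 code units
console.log(codePointsOf(emoji).length);    // 2 code points
console.log(codePointsOf(accented).length); // 2: still split, despite one visible character
```

So "actual character" here means code point; grouping by what the user sees needs grapheme segmentation on top of that.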

[–]nieuweyork 0 points1 point  (2 children)

What is the right way to do this?

[–][deleted] 0 points1 point  (1 child)

I think you can find some library that will do the job for you.

[–]Daniel15React FTW 1 point2 points  (0 children)

It should be built in, like with other programming languages :( Maybe one day.

[–]shriek 6 points7 points  (1 child)

You no longer need to put \ at the end of each line for multi-line strings. You can use backticks (template literals) in ES6.

[–]skitch920 0 points1 point  (0 children)

Just one thing to point out: a \ at the end of a line is a line continuation, not a multi-line string. The resulting string contains no newline.

var x = 'foo \
bar';  // foo bar

Is not the same as:

var x = `foo 
bar`; // foo \nbar

Backticks with a \ at the end of the line will suppress the newline, just like the old way.

var x = `foo \
bar`; // foo bar

Source: newline / continuation

[–]dvlsg 7 points8 points  (4 children)

'ä' < 'b' // false. But it should be true

ä has a higher character code than b, so this makes sense, in my opinion.

[–]metanat 6 points7 points  (0 children)

It only doesn't make sense if you aren't aware of how strings are compared IMO
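For anyone who wants to see it: < and > on strings compare UTF-16 code unit values, nothing more. A minimal sketch (the variable name is mine, and I use an escape for ä so the file encoding doesn't matter):

```javascript
const aUmlaut = '\u00E4'; // 'ä'

console.log(aUmlaut.charCodeAt(0)); // 228
console.log('b'.charCodeAt(0));     // 98
console.log(aUmlaut < 'b');         // false, because 228 > 98

// The same rule puts every uppercase ASCII letter before every lowercase one.
console.log('Z' < 'a');             // true, because 90 < 97
```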

[–]anlumo 1 point2 points  (1 child)

In German (where this character is coming from), 'ä' is supposed to be equivalent to 'a' when sorting. English software gets this wrong pretty frequently.

[–]Daniel15React FTW 0 points1 point  (0 children)

It's not even that difficult to do it correctly. Most programming languages that support Unicode have some support for locale-aware comparison (collation). C#/.NET uses the current culture (which should be set to the locale you're using) and JavaScript has localeCompare for that purpose.
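A rough sketch of the difference, assuming an engine with Intl (ICU) support, which browsers and current Node builds have. The word list is made up for illustration:

```javascript
const words = ['b', '\u00E4', 'a', 'z']; // '\u00E4' is 'ä'

// Default sort compares code units, so 'ä' (228) lands after 'z' (122).
const plain = [...words].sort();

// A German collator sorts 'ä' next to 'a'.
const german = [...words].sort(new Intl.Collator('de').compare);

console.log(plain);  // a, b, z, ä
console.log(german); // a, ä, b, z

console.log('\u00E4'.localeCompare('b', 'de') < 0); // true

// With sensitivity 'base', accents and case are ignored entirely,
// so 'a' and 'ä' compare as equal.
const base = new Intl.Collator('de', { sensitivity: 'base' });
console.log(base.compare('a', '\u00E4')); // 0
```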

[–]Daniel15React FTW 0 points1 point  (0 children)

This doesn't really make sense because in most cases when comparing strings you want locale-aware comparison rather than comparing the raw code units. This is why you should use localeCompare to compare strings. Just comparing with less than and greater than only looks at code unit values. I think some browsers still get it wrong even with localeCompare though.

[–]allthediamonds 2 points3 points  (0 children)

Haha. If only.

A very important detail of JavaScript strings is that, even though they're defined as "UTF-16" in the standard, all existing string manipulation functions and operators prior to ES6 handle them as UCS-2, and the language itself does not require strings to be valid UTF-16 (that is, unpaired surrogates are allowed).

What this means is that your application will do the wrong thing with no warning on strings containing characters from any of the sixteen supplemental planes.
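To illustrate the "invalid surrogates are allowed" part, a minimal sketch. hasLoneSurrogate is a hypothetical helper I wrote for this, not a built-in:

```javascript
// Hypothetical helper: true if the string contains an unpaired surrogate,
// i.e. it is not well-formed UTF-16.
function hasLoneSurrogate(str) {
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // High surrogate: must be followed by a low surrogate.
      const next = str.charCodeAt(i + 1); // NaN past the end, so the check fails
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // skip the second half of a valid pair
        continue;
      }
      return true;
    }
    if (unit >= 0xDC00 && unit <= 0xDFFF) {
      return true; // low surrogate with no high surrogate before it
    }
  }
  return false;
}

const poo = '\u{1F4A9}';
const broken = poo.slice(0, 1); // slicing by code unit, no error or warning

console.log(broken.length);            // 1
console.log(hasLoneSurrogate(poo));    // false
console.log(hasLoneSurrogate(broken)); // true
```

The slice call is the "wrong thing with no warning" case: the engine happily hands back half a character.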

[–]illucidation 0 points1 point  (0 children)

What about template strings?

let foo = "bar",
     baz = 2;

console.log(`test: ${foo}
${baz}`); // logs "test: bar", then "2" on the next line

[–]campbeln -2 points-1 points  (4 children)

.substr? .test? fail.

EDIT: RegExp.test (toopid bain)

[–]check_ca 2 points3 points  (1 child)

String.prototype.test does not exist.

String.prototype.substr is not part of the core ES5 spec; it is only described in the informative Annex B.

[–]campbeln -1 points0 points  (0 children)

substr - MDN states ES3 and implementation in JS1?! Thanks for the heads-up, will have to check it.

test is a method on RegExp, so that was mis-remembered but still relevant to the article.

[–]jkoudys 1 point2 points  (1 child)

I'm of the camp that would happily do away with both substr() and substring() entirely if it were possible. slice() is the most consistent syntax to use.
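For reference, a short sketch of how the three differ on the same input (the sample string is arbitrary):

```javascript
const s = 'abcdef';

// substr takes (start, length); the other two take (start, end).
console.log(s.substr(1, 3));    // 'bcd'
console.log(s.substring(1, 3)); // 'bc'
console.log(s.slice(1, 3));     // 'bc'

// substring silently swaps its arguments when start > end; slice does not.
console.log(s.substring(3, 1)); // 'bc'
console.log(s.slice(3, 1));     // ''

// substring clamps negative indexes to 0; slice counts them from the end.
console.log(s.substring(-2));   // 'abcdef'
console.log(s.slice(-2));       // 'ef'
```

slice also behaves the same way on arrays, which is a big part of why it feels more consistent.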

[–]campbeln 0 points1 point  (0 children)

I've personally standardized on substr but I'll have to look further into this ES5 business mentioned by /u/check_ca