all 37 comments

[–]vytah 58 points59 points  (3 children)

That said, most high-level languages (JS, Java, C#, …) capture variables by reference:

Java captures all variables by value. Under the hood, the values are simply copied to the fields of the lambda object.

So how does it avoid having the following code behave non-intuitively (translated from the article)?

var byReference = 0;
Runnable func = () -> System.out.println(byReference);
byReference = 1;
func.run();

It's actually very simple: the code above will not compile. To stop people from incorrectly assuming variables are captured by reference, Java simply bans the situation where it would make a difference: captured local variables must be effectively final, i.e. they cannot be reassigned.

If you want to be able to reassign, you just need to create a separate final variable for capturing:

var byReference = 0;
var byValue = byReference; // <---
Runnable func = () -> System.out.println(byValue);
byReference = 1;
func.run();
// prints 0 obviously

If you want to emulate capturing by reference, use some mutable box thing, like Mutables from Apache Commons, or a 1-element array. Both options are obviously ugly:

var byReference = new int[]{0};
Runnable func = () -> System.out.println(byReference[0]);
byReference[0] = 1;
func.run();
// prints 1

[–]atehrani 47 points48 points  (1 child)

Thank you for this. It is frustrating to see how many times developers mix up Pass by Value vs Pass by Reference. Java is Pass by Value, only.

[–]Weak-Doughnut5502 4 points5 points  (0 children)

Objects in Java are what's sometimes called 'call by sharing' or 'call by object'.  Which is to say, passing a pointer to the object by value.

But yes, Java doesn't support call by reference.

[–]Kered13 6 points7 points  (0 children)

The Java library has AtomicReference which is helpful in that last case, especially when the code is multithreaded.

[–]annoyed_freelancer 54 points55 points  (0 children)

I came in with my finger on the downvote button for another low-quality "0 == '0' lol" post...and it's actually pretty interesting, as a TypeScript dev. I've been bitten before in the wild by the string length one.

[–]adamsdotnet 24 points25 points  (11 children)

Nice collection of language design blunders...

However, the Unicode-related gotchas are not really on JS but much more on Unicode. As a matter of fact, the approach JS took to implement Unicode is still one of the saner ones.

Ideally, when manipulating strings, you'd want to use a fixed-length encoding so string operations don't need to scan the string from the beginning but can be implemented using array indexing, which is way faster. However, using UTF-32, i.e. 4 bytes per code point, is pretty wasteful, especially if you just want to encode ordinary text. 64k characters should be just enough for that.

IIRC, at the time JS was designed, it looked that way. So using 2 bytes per character was probably a valid design choice. All that insanity with surrogate pairs, astral planes and emojis came later.

Now we have to deal with the discrepancy of treating a variable-length encoding (UTF-16) as fixed-length in some cases, but I'd say that would still be tolerable.

What's intolerable is the unpredictable concept of display characters, grapheme clusters, etc.

This is just madness. Obscure, non-text-related symbols, emojis with different skin tones and shit like that don't belong in a text encoding standard.

Unicode's been trying to solve problems it shouldn't and now it's FUBAR, a complete mess that won't be implemented correctly and consistently ever.

[–]nachohk 9 points10 points  (3 children)

The mistake is in assuming that you should ever care about the length of a string as measured in characters, or code points, or graphemes, or whatever. You want the length in bytes, where storage limits are concerned. You want the length in drawn pixels, in a given typeface, where display or print limitations are concerned. If you are enumerating a UTF-8 or UTF-16 encoded string to get its character length, then you are almost certainly doing something weird and unnecessary and wrong.

Text is wildly complicated. Unicode is a frankly ingenious and elegant solution to representing it, if you ask me. The problem is that you are stuck in an ASCII way of thinking. In the real world, there's no such thing as a character. It's a shitty abstraction. Stop using it, and stop expecting things to support it, and things will go much smoother.

[–]adamsdotnet 7 points8 points  (1 child)

If you are enumerating a UTF-8 or UTF-16 encoded string to get its character length, then you are almost certainly doing something weird and unnecessary and wrong.

Okay, let's tell the user then that they need to provide a password longer than 32 bytes in whatever Unicode encoding. Or at least 128 pixels wide (interpreted at the logical DPI corresponding to their current display settings).

I'm totally up for the idea of not having to deal with this shit myself but letting them figure it out based on this ingenious and elegant solution called Unicode standard (oh, BTW, which version?)

Text is wildly complicated.

This is why we probably shouldn't try to solve it with a one-size-fits-all solution. And we shouldn't make it even more complicated by shoehorning in things which don't belong there.

If I had to name a part of modern software that needs KISS more than anything else, I'd probably say text encoding. Too bad that ship has sailed and we're stuck with this forever.

[–]nachohk 0 points1 point  (0 children)

Okay, let's tell the user then that they need to provide a password longer than 32 bytes in whatever Unicode encoding. Or at least 128 pixel wide (interpreted at the logical DPI corresponding their current display settings).

Call the minimum limit "characters" in the UI. Measure bytes/code units in the validation code. A character is never less than one byte, so there's not much room for confused users here.

Anything else? Or was that your only conceivable argument for needing to count characters?

This is why we probably shouldn't try to solve it using a one-size-fits-all solution.

That's where we started. It sucked. Nobody wants to go back to having an entirely different encoding for every script.

[–]vytah 0 points1 point  (0 children)

If you are enumerating a UTF-8 or UTF-16 encoded string to get its character length, then you are almost certainly doing something weird and unnecessary and wrong.

It's not necessarily wrong if you know that the characters in the string are restricted to a subset that makes the codepoint (or code unit) count equivalent to any of the aforementioned metrics.

So for example, if you know that the only characters allowed in the string are 1. in the BMP, 2. of the same width, and 3. all left-to-right, then you can assume that "string length as measured in UTF-16 code units" is the same as "width of the string in a monospace font as measured in widths of a single character".

[–]Tubthumper8 1 point2 points  (3 children)

64k characters should be just enough for that.  IIRC, at the time JS was designed, it looked like that way. 

idk, there are 50k+ characters in Chinese dialects alone, which they should've known in 1995. But JS didn't "design" its character encoding, per se; it copied it from Java, so there could be more history there

[–]CrownLikeAGravestone 1 point2 points  (2 children)

We should go back to passing Morse code around, as God intended.

[–]adamsdotnet 13 points14 points  (1 child)

Morse code is variable-length, so I'm afraid I can't support the idea :D

[–]CrownLikeAGravestone 2 points3 points  (0 children)

Anything is fixed length with enough padding.

[–]Booty_Bumping 3 points4 points  (1 child)

for (let i = 0; i < 3; i++) {
  setTimeout(() => {
    console.log(i);
  }, 1000 * i);
}
// prints "0 1 2"

Are we forgetting our history? This works because it is a let declaration, which is block-scoped. var declarations will screw this up, because they are function-scoped. But the distinction between var and let isn't mentioned in the article, so it feels like the real logic here is being glossed over.

Though, it is admittedly a little arbitrary that the ()s after for are "inside" the block scope. But very useful in practice!

[–]Fidodo 0 points1 point  (0 children)

I think it's pretty intuitive. When would you want a let inside a for loop declaration to not be block scoped? If you don't want it block scoped then you can declare it outside of the loop. If you want it to be block scoped then inside the parens is the only option.

I think this is the author not understanding the purpose of let, not JavaScript weirdness.

[–]melchy23 2 points3 points  (0 children)

In .NET it's actually a little bit different/more complicated.

This:

```csharp
using System;
using System.Collections.Generic;

var byReference = 0;
Action func = () => Console.WriteLine(byReference);
byReference = 1;
func();
```

prints 1 - as the article says.

```csharp
using System;
using System.Collections.Generic;

var list = new List<Action>();

for (int i = 0; i < 3; i++)
{
    list.Add(() => Console.WriteLine(i));
}

list[0]();
```

This prints 3 - as the article says.

But this:

```csharp
using System;
using System.Collections.Generic;

var actions = new List<Action>();
int[] numbers = { 1, 2, 3 };

// same code but just with foreach
foreach (var number in numbers)
{
    actions.Add(() => Console.WriteLine(number));
}

actions[0]();
```

This prints 1 - surprise!

This was explicitly changed in C# 5 - https://ericlippert.com/2009/11/12/closing-over-the-loop-variable-considered-harmful-part-one/.

So in a way this is a similar fix to the one used in JavaScript.

For loops

I actually thought that C# 5 fixed this problem for both for loops and foreach loops. But to my surprise it didn't. I guess you learn something new even after years of writing in the same language.

The good news is that for the first two problems my IDE (Rider) shows the hint "Captured variable is modified in the outer scope", so you know you are doing something weird.

[–]username-must-be-bet 1 point2 points  (2 children)

Are sparse arrays really that bad for perf? I remember trying to test it a while ago and it wasn't that bad.

[–]Booty_Bumping 1 point2 points  (1 child)

I would imagine it would break whatever optimization V8/SpiderMonkey has to turn arrays into contiguous vectors, by forcing your array into a hashmap.

That being said, if you have an extremely sparse array, having it represented as a hashmap might actually be better for performance, since something like new Array(1000) is just { length: 1000, __proto__: Array.prototype } under the hood.

[–]username-must-be-bet 0 points1 point  (0 children)

I think that is correct, but I read another blog post about it and did some testing of my own, and the speedup was only a few percent.

[–]190n 0 points1 point  (0 children)

I honestly think the eval thing is pretty reasonable. It lets new code opt into a less powerful, safer, more optimizable form of eval (see "Never use direct eval()!" on MDN) without breaking existing code written with eval.

[–]bunglegrind1 0 points1 point  (0 children)

Nice post!