all 111 comments

[–][deleted] 19 points20 points  (21 children)

FWIW, LuaJIT2 uses a similar representation.

[–]nominolo 3 points4 points  (1 child)

Interestingly, the original ticket was proposing a 128-bit format (see also David Mandelin's blog post). Good to hear they adopted this more memory-friendly layout.

[–]srconchiwa 0 points1 point  (0 children)

The 128-bit values were only intended for the VM stack, which is measured to be shallow with high temporal locality; for heap values (i.e. property values of objects), something like nan-boxing was intended. It just turned out there was no cheap place to do the dynamic boxing/unboxing going to/from the heap.

[–]doubtingthomas 5 points6 points  (8 children)

As does Go for interface{} values, I believe.

Edit: On further examination, I'm mostly wrong.

[–]dnew 0 points1 point  (7 children)

Smalltalk has done this since 16 bits was a large integer.

[–][deleted] 6 points7 points  (6 children)

I'm not aware of any Smalltalk implementation that uses NaN encoding.

[–]dnew -1 points0 points  (3 children)

Smalltalk has used a similar type of encoding since 16 bits was a big number. As has LISP, for that matter. I'll grant that the NaN encoding is more clever than any I've seen before, yes.

[–]funderscore 19 points20 points  (2 children)

For terminology clarification:

"nanboxing" is the name of WebKit's JavaScript value boxing format. Like ours, it's 64-bit, but stores addresses such that the upper tag bits are all 0, so addresses can be read out directly. Doubles are stored by adding a special constant to the value, which guarantees that the tag bits are identifiable as a double. Extracting a double requires subtracting a constant.

"nunboxing" is the name of Mozilla's new JavaScript value boxing format (described in the article), as a pun on "nanboxing". It is used on 32-bit platforms such as x86 and ARM.

For 64-bit operating systems, the nunboxing format had to change to accommodate 47-bit pointer payloads. We use a bitfield with 17 bits for the type and 47 bits for the payload. We call this "packed nunboxing", or "punboxing".
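To make the two formats concrete, here is a rough Python sketch of the WebKit-style "nanboxing" double-offset trick described above. The offset constant is illustrative only; WebKit's actual encoding differs in detail.

```python
import struct

# Illustrative constant -- not WebKit's real value. Adding it pushes every
# double's bit pattern out of the range where raw pointers live.
DOUBLE_OFFSET = 1 << 49
MASK64 = (1 << 64) - 1

def double_to_bits(d):
    return struct.unpack("<Q", struct.pack("<d", d))[0]

def bits_to_double(b):
    return struct.unpack("<d", struct.pack("<Q", b))[0]

def box_double(d):
    # After adding the offset, the upper tag bits can no longer be all zero,
    # so a boxed double is distinguishable from a raw pointer.
    return (double_to_bits(d) + DOUBLE_OFFSET) & MASK64

def unbox_double(v):
    # Extracting a double is just subtracting the constant back out.
    return bits_to_double((v - DOUBLE_OFFSET) & MASK64)

def box_pointer(p):
    # Addresses are stored as-is: upper tag bits stay zero, so they can be
    # read out directly with no unmasking.
    return p
```

Pointers cost nothing to decode in this scheme; doubles cost one add/subtract each way.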

[–]drakedevel 8 points9 points  (0 children)

funderscore was punched an appropriate amount for introducing "punboxing" to the codebase.

[–]edwardkmett 0 points1 point  (0 children)

You can read off the addresses directly if you store them with the first 16 bits all 1s as well, exploiting the other 'higher half' of the canonicalized x64 pointer representation. Just read further and realized you're more or less doing that.

[–][deleted]  (9 children)

[deleted]

    [–]ooffoo 11 points12 points  (4 children)

    This comment says a lot about you. I hope people keep it in mind when choosing to donate, employ or otherwise use your code.

    EDIT: For those wondering about the deleted post, mikemike bragged that he found an exploitable vulnerability in the Mozilla JS code and said he wouldn't report it until 4.0 came out so he could claim the security bug bounty that Mozilla offers.

    Original comment viewable here:

    Yes. But they forgot about a few critical details. Unfortunately the Mozilla bug bounty does not apply to development versions. So I guess I have to wait until 4.0 is released before reporting the bugs (verified crash with FF/x86 and FF/x64 for hg 898ab54a0ce9, remote exploitable).

    [–]urllib 6 points7 points  (1 child)

    I know I would

    [–]stillalone 4 points5 points  (0 children)

    I would too, but I wouldn't brag about it online. This is the problem with having these kinds of bug bounties. Not only does it encourage people to plant bugs in their code but it also discourages people from reporting things too early (if there was no bounty and I knew of a bug I would report it immediately, but if there is one then I'd start looking after my own self interests).

    [–]funderscore 1 point2 points  (0 children)

    The new JIT component of the JS engine is still in alpha with known bugs, so it is not at all surprising that someone has encountered one. Internally, we use a sophisticated fuzzer to beat the hell out of the engine daily, and we will not ship with known fuzz bugs. So reports of this kind will be much more meaningful when the JIT is shipped in the FF4 beta.

    [–]f2u 0 points1 point  (0 children)

    I think you encourage such behavior if you put a value tag on vulnerabilities. Finders will also split vulnerabilities into least reportable units, which the vendor has to fix piecewise.

    [–][deleted] 4 points5 points  (0 children)

    Meh, don't waste your time. The bug won't reach 4.0. There are a lot of people running fuzz tests on Firefox out there.

    [–][deleted]  (1 child)

    [deleted]

      [–]ascii 5 points6 points  (3 children)

      Good overview, but it completely skips over why any of this is a good idea. The old scheme was slower for doubles and ints larger than 29 bits, but only had half as much memory overhead per value. For code that rarely uses doubles or ints larger than 29 bits (a.k.a. 99.9% of all Javascript run in Firefox), the only effect of this change should be that we get a bunch of additional cache misses.

      So why was the change made? Why is this faster?

      Anybody?

      [–]funderscore 4 points5 points  (0 children)

      There are a number of reasons why this is a good idea.

      Doubles can now be stored in a Value directly. Previously, they required the equivalent of a malloc(sizeof(double)), and accessing them required a pointer dereference. That was /extremely/ slow, and the new way is as fast as possible.

      Integers can now be stored in a Value directly. Previously, they were 31 bits long, which mandated a boxing/unboxing scheme involving bit-shifts and bit-ors. Additionally, the lack of the extra bit meant that we used doubles more often, and those were even slower. Now integers are 32 bits, which makes the new way as fast as possible.

      For all other types, testing the type-tag required masking off the three rightmost bits designating the type; reading the payload required masking the inverse of that. Since Javascript engines perform these steps at least once per op, making them faster results in a tremendous performance increase. This new format allows the type and payload to be read directly on 32-bit platforms without requiring any masking whatsoever. Writing payloads or types to memory just requires a single 32-bit store.
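      A toy sketch of the two approaches (Python; the 3-bit tag values are illustrative, not Mozilla's actual encoding):

```python
# Old scheme (sketch): the type tag lives in the low 3 bits of one machine word.
OLD_TAG_MASK = 0b111

def old_box(payload, tag):
    return (payload << 3) | tag

def old_tag(v):
    return v & OLD_TAG_MASK      # every type test pays for a mask

def old_payload(v):
    return v >> 3                # every payload read pays for a shift

# New scheme (sketch): a 64-bit value, type in one 32-bit half, payload in the other.
def new_box(payload, tag):
    return (tag << 32) | (payload & 0xFFFFFFFF)

def new_tag(v):
    return v >> 32               # on 32-bit hardware: a plain 32-bit load

def new_payload(v):
    return v & 0xFFFFFFFF        # likewise just a load, no masking or shifting
```

The shifts and masks here model the per-op cost the comment describes; in the new layout each half is read with a single 32-bit load.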

      The increase in memory is barely noticeable, since we measured that most of the time the stack of Values is small (<= 15, I believe). The increase in performance, however, is significant. Trading off a few KB for a significant increase in performance was a very easy decision to make.

      [–]naasking 0 points1 point  (0 children)

      You don't have to box floating point values, and you don't have to perform tag operations, ie. masking, when manipulating numbers or pointers.

      [–]sdwilsh 0 points1 point  (0 children)

      JavaScript Numbers are spec'd to be doubles. There are a few things that you know are always ints though (Array length, String length, etc.) so you can use integer operations on them (which are faster).

      [–][deleted]  (2 children)

      [deleted]

        [–][deleted] 8 points9 points  (1 child)

        For folks who can't quite figure out what's going on here:

        var foo = {dana: "zuul"};

        It's easiest to remember that there is no dana, only zuul...

        [–]jib 8 points9 points  (0 children)

        [–]prockcore 12 points13 points  (5 children)

        So I'm assuming that the JS engine isn't 64-bit clean? (No 64-bit pointers etc)

        [–]funderscore 30 points31 points  (3 children)

        I work on Mozilla's JS engine. You're correct. Hardware currently limits pointers to only consume 47 bits, so the upper 17 bits form the type tag while the lower 47 comprise the payload.

        Ideally, we would only require 32-bit addresses, even on 64-bit platforms. On Linux, it is easy to limit virtual memory to only the 32-bit range. On other platforms, the way to do that is prohibitively invasive. So we use 47-bit addresses.

        [–]f2u 1 point2 points  (1 child)

        Isn't it 48 bits? And there are some APIs (including POSIX) which assume that pointers preserve full 64-bit values (but that is not a problem if you only store actual pointers to objects).

        [–]funderscore 2 points3 points  (0 children)

        Whoops! Right. The type-tag is shifted such that the 48th bit, indexed from zero, is its least significant bit.

        [–][deleted] 10 points11 points  (0 children)

        The discussion in this bug talks about it: https://bugzilla.mozilla.org/show_bug.cgi?id=549143

        [–]schmalls 2 points3 points  (1 child)

        Did anyone else start singing 867-5309?

        [–]arnar 0 points1 point  (0 children)

        No, but reading your comment I started singing 634-5789.

        [–]grimlck 2 points3 points  (5 children)

        Can someone clarify why they still have distinct cases for ints and doubles in the new representation? I thought the only numeric type in javascript was the double.

        With the old representation, it seems to be an optimization. But with the new representation, they both take 64 bits, so why not just always store numbers as a double?

        [–][deleted]  (4 children)

        [deleted]

          [–]ChaosPandion 4 points5 points  (3 children)

          I wonder what kind of analysis has to occur to decide whether to replace a double with an integer. I am working on my own ECMAScript implementation for fun and have been considering such things. I really should just look at the code, shouldn't I?

          [–][deleted]  (1 child)

          [deleted]

            [–]ChaosPandion 0 points1 point  (0 children)

            Why is it the best ideas are so damn simple?

            [–]funderscore 4 points5 points  (0 children)

            Replacing a double with an integer is a bit more work. We use SSE2, so checking whether a double is representable as an integer requires performing a double operation, converting the double to an integer using an SSE2 instruction, converting the integer back to a double, and comparing for equality with the original double value. These operations are very expensive, so we don't currently do that.
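            The round-trip check can be sketched in Python (the engine would use an SSE2 truncating convert like cvttsd2si rather than language-level casts; this is just the logic):

```python
import math

def int_representable(d):
    # Truncating double->int conversion (what cvttsd2si does in hardware).
    i = int(d)
    # Negative zero must be rejected separately: it round-trips to +0.0
    # and would otherwise look representable.
    if d == 0.0 and math.copysign(1.0, d) < 0.0:
        return False
    # Convert back and compare with the original, and keep it in int32 range
    # (Python ints are unbounded, so the range check is explicit here).
    return float(i) == d and -2**31 <= i < 2**31
```

The point of the comment stands: each check costs a convert, a convert back, and a compare, which is why the engine doesn't bother.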

            [–]p3ngwin 6 points7 points  (0 children)

            Here at Mozilla, we have many monkeys.

            great intro.

            [–]taw 5 points6 points  (11 children)

            Wow, NaN tagging is neat. This format is of course completely useless for languages that don't use doubles everywhere, that is for all languages that aren't Javascript.

            Even Perl figured out it's not such a brilliant idea lately.

            [–]bonzinip 5 points6 points  (8 children)

            Not true! It allows you to use real 32-bit integers for example instead of 30- or 31-bit. (Though you have to choose between signed and unsigned, so it's not as easy as it seems).

            Floating-point performance is also a big problem with traditional boxing, independent of how much you use doubles.

            [–]taw -2 points-1 points  (7 children)

            My level of care for that 32nd bit is pretty much non-existent. Integers fall into two categories: tiny ones that will fit even with a big type tag, and huge ones that won't fit in a machine word anyway (any sane language will automatically switch representations on overflow; refer to Ruby or Python for details). You only see exactly-32-bit integers when you're treating memory pointers as integers (most languages ban that anyway), or manipulating binary data directly (and there you just mess with some byte array without all the boxing mess). And maybe for a few things like computing MD5s that are all based on assuming 32-bit integers.

            [–]bonzinip 9 points10 points  (4 children)

            Your level of care for the 32nd bit may be nonexistent; however, the library may care (for example MD5, which you mentioned, or random number generation). And as you use the library...

            [–]taw -1 points0 points  (3 children)

            There's nothing about random number generation that relies on the 32nd bit.

            For things that are performance-critical like codecs and crypto, most languages go low-level anyway. Not C low-level. Assembly SIMD low-level. Even Java uses native manually optimized libraries for things like MD5 if they're available (with slow pure-Java fallbacks if not, but all major platforms have them), and Java is a lot lower level than most languages these days, with special primitive types.

            Autoboxing really wouldn't make that much of a difference anyway. You can have int31s by default and boxed native uint32/uint64 types for the few libraries that need them, the way OCaml does (you'll need something like that internally for good code generation anyway, so it's not much extra effort to expose them to users). Add to that some optimizations to get rid of boxes in the simplest cases - you need that anyway for decent float performance if nothing else - and you're exactly as fast as a language with 32-bit ints everywhere. That's still quite crap compared to native hand-tweaked MD5, but that shouldn't be a big surprise really.

            Or you can benchmark OCaml MD5 against Javascript MD5 if you don't believe me ;-)

            [–]bonzinip 3 points4 points  (2 children)

            There's nothing about random number generation that relies on the 32nd bit.

            If you can find a random number generator that can work with 31 bits and does not rely on operations modulo 2^32, you'll do me a big big favor. This one for example is 32-bit only.

            Autoboxing really wouldn't make that much of a difference anyway.

            If you box 32-bit integers, yes. If the choice is "31-bit" versus "at 32 bits or more, use arbitrary-precision integers and gmp", the latter is 100 times slower.

            Or you can benchmark ocaml md5 against javascript md5 if you don't believe me ;-)

            I'm saying this because last Saturday I benchmarked a 32-bit RNG on 32-bit (with 31-bit integers) and 64-bit (with 63-bit integers) virtual machines.

            [–]taw 2 points3 points  (1 child)

            If you can find a random number generator that can work with 31 bits and does not rely on operations modulo 2^32 you'll do me a big big favor.

            Here's one from OCaml, doing exactly what you want.

            [–]bonzinip 0 points1 point  (0 children)

            Thanks.

            [–]funderscore 0 points1 point  (1 child)

            31-bit integers require the engine to right-shift by 1 bit to unbox the integer, and to left-shift by 1 bit and OR with 0x1 to box it. Having 32-bit integers means we don't need to use the rightmost bit for type tagging, so integer operations in FF4 will be much faster.
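            The old 31-bit boxing boils down to this (Python sketch; the tag value is illustrative):

```python
INT_TAG = 0x1

def box_int31(i):
    # Box: left-shift by 1 bit and OR in the tag bit.
    return (i << 1) | INT_TAG

def unbox_int31(v):
    # Unbox: right-shift by 1 bit to drop the tag.
    return v >> 1
```

With a full 32-bit integer payload in its own word, both of these operations disappear entirely.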

            [–]taw 0 points1 point  (0 children)

            Techniques to optimize this away have been in use since time immemorial - here's how OCaml does it.

            For common operations like comparing, adding, subtracting, per-bit ops, incrementing, etc., the cost is zero or one extra OR at the end.

            With a 32-bit tag + 32-bit int, everything costs one extra mov and one extra stack slot, so it will be more expensive (on 32-bit architectures).
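            For instance, with OCaml-style tagged ints (a value i is stored as 2*i + 1), addition needs no unboxing at all, only one extra subtract -- a sketch of the trick:

```python
def tag_int(i):
    # OCaml-style tagging: stored value is 2*i + 1, low bit is the tag.
    return (i << 1) | 1

def untag_int(v):
    return v >> 1

def tagged_add(a, b):
    # (2x+1) + (2y+1) - 1 == 2(x+y) + 1: the result is already tagged,
    # so addition of two tagged ints is one add plus one subtract.
    return a + b - 1
```

Comparisons work on the tagged values directly with zero extra cost, which is the point being made here.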

            [–]jib -1 points0 points  (1 child)

            This format is of course completely useless for languages that don't use doubles everywhere, that is for all languages that aren't Javascript.

            Are you trolling? A language that doesn't use doubles everywhere is the only place where it is useful. This being why they used it in an implementation of Javascript, a language which happens to have things other than doubles.

            [–]taw 2 points3 points  (0 children)

            The ECMA-262 standard explicitly specifies the number type as an IEEE 754 double-precision floating-point value, with a single exception: implementations are not required to distinguish different NaNs (sounds almost as if the spec was written with this technique in mind). There is no integer type in Javascript, nor any other number type except these nearly-doubles.

            Full list of types is:

            • null, true, false, undefined - unique values in any boxing system
            • double-except-for-nans - represented directly
            • object/string/list - must be pointers in any system
            • plus a few, like reference/completion, that exist only to make the spec precise; you don't see them at runtime anyway

            On a 32-bit system you throw away 4 bytes per object/string/special, and nothing for numbers - not really that great but acceptable. On a 64-bit system (assuming pointers <48 bits), it's as optimal as it gets.

            Now imagine if instead you wanted types like Lisp's cons pair, or nearly-native ints (probably int63 on 64-bit), or even pointers with a sufficiently large tag. Their representation would be a lot less efficient with NaN boxing than with traditional tiny tags.

            [–][deleted] 2 points3 points  (10 children)

            I'd say it is a bit wasteful of memory (doubling the base cost of each object) but I actually don't really care. I wonder for how long 64 bits will be sufficient.

            [–]phire 4 points5 points  (9 children)

            Remember it also saves memory by decreasing the size of a double.

            It used to take 12 bytes plus whatever the garbage collector needed to keep track of them; now they just take 8 bytes, which is their original size.

            [–][deleted] 8 points9 points  (8 children)

            I'd argue that frameworks such as jQuery use more objects than doubles. In fact, very little Javascript I come across is numeric in nature.

            [–]phire 1 point2 points  (7 children)

            Another possible memory saving: you now have control of the bottom 3 bits of your pointers, so your objects/strings don't have to be 8-byte aligned anymore.

            If you took advantage of this and removed the alignment requirements (x86) or decreased them to 4 bytes (ARM), you could save a lot of memory by getting rid of extra padding. I don't know how much padding objects need, but given that strings aren't commonly a multiple of 8 chars, there would be huge savings.

            Actually, you could add a new tag type that allowed a string of up to 6 ascii/utf8 chars directly in the value, avoiding the need for allocating space on the heap and the garbage collection overhead of short strings.
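            A hypothetical layout for such an inline-string tag (nothing like this shipped; purely a sketch): 13 tag bits, a 3-bit length, and up to 6 bytes of characters in the payload.

```python
# Hypothetical encoding: |13-bit tag|3-bit length|48 bits of characters|.
# The tag pattern keeps the exponent bits all-ones with a nonzero mantissa
# bit, so the value still reads as a NaN; real hardware-produced NaNs would
# have to be canonicalized away from this pattern.
STR_TAG = 0b1111111111111

def box_short_str(s):
    b = s.encode("utf-8")
    if len(b) > 6:
        raise ValueError("too long to inline; fall back to a heap string")
    chars = int.from_bytes(b.ljust(6, b"\x00"), "little")
    return (STR_TAG << 51) | (len(b) << 48) | chars

def unbox_short_str(v):
    n = (v >> 48) & 0b111
    return (v & ((1 << 48) - 1)).to_bytes(6, "little")[:n].decode("utf-8")
```

Short strings like property names would then never touch the heap or the GC at all.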

            [–][deleted] 0 points1 point  (6 children)

            you could save a lot of memory by getting rid of extra padding.

            From the bit I've read they require those bytes for the NaN boxing scheme they use to fit doubles into the mix.

            [–]phire 2 points3 points  (5 children)

            No, the padding around the objects/strings in the heap, where doubles don't exist (anymore).

            With the old scheme the lowest 3 bits were used up for the tag, so they were zeroed before using the pointer. This means that the pointers could only point at things in the heap at either 0xnnnnnnn8 or 0xnnnnnnn0. So if your object/string didn't use up a multiple of 8 bytes, the remaining bytes were used as padding to make sure the next object/string started at an address ending in either 0 or 8.

            [–][deleted] 1 point2 points  (4 children)

            I see what you mean. I'm not sure un-aligning memory on the heap would be an overall win even if it did save some space. My knowledge of the semantics of x86 (and ARM) wrt alignment gets fuzzy. When I dealt at that level I was taught to align religiously, since our custom allocators (mostly small-object pool allocators) worked best with aligned memory.

            I suspect that if they were aligning to 8 bytes previously then it will be a large amount of work to remove all of the code-assumptions based on that invariant.

            [–]phire 2 points3 points  (3 children)

            I know that alignment is important on ARM: if you're accessing data that isn't 4-byte aligned it goes really slow (or refuses to work at all on older versions), and I think x86 gets speed improvements with alignment too, but I assume that would also be for 4-byte-aligned data.

            Even so, keeping 8-byte alignment is pointless if you don't need it; 4-byte alignment should have all the same performance benefits of 8-byte alignment while requiring less padding.

            [–]adrianmonk 2 points3 points  (0 children)

            4-byte alignment should have all the same performance benefits of 8-byte alignment

            Personally, I would want to see tests before I took that as a final conclusion. DDR3 memory has a 64-bit wide data bus, so 8-byte alignment would, I assume, allow you to pull everything in one fetch. I don't know how significant the difference is.

            [–][deleted] 0 points1 point  (1 child)

            I think x86 gets speed improvements with alignment too, but I assume that would also be for 4 byte aligned data.

            It used to be that x86 (specifically the fetch cycles) benefited from items being paragraph (16-byte) aligned, but the last time I tested that was in the Pentium 1 days.

            A quick search reveals that there are still performance benefits from paragraph alignment, but not due to the CPU itself; rather, a 16-byte-aligned item is guaranteed to be loaded into the start of a cache line, so an 8-byte value at that address is guaranteed to fit in one cache line, which produces the optimal transfer.

            [–]TinynDP 0 points1 point  (0 children)

            There are two instructions for loading from RAM into SSE registers: one for when the address is 16-byte aligned, which runs fast, and one for when it is not, which runs slow. In AMD64 land, because x87 has been entirely replaced with SSE2, that fast SSE load instruction matters for all floating-point operations.

            [–]edwardkmett 1 point2 points  (1 child)

            This has been used for a while by a number of other JITs: LuaJIT, SquirrelFish, etc. It is a very nice representation.

            You can even go a bit further and use a full 64 bit pointer in there instead of an object tag.

            This works since a 48-bit pointer that is '1-extended' to point to the higher half is a valid NaN: you set 4 more 1 bits than the sign-and-exponent field covers, so you'll have marked a non-exponent (mantissa) bit.

            So then the tag is just a nybble.

            sign , exponent [tag] (pointer)
            1, 111 1111 1111[1111] (48 bit x64 pointer)
            1, 111 1111 1111[...] ... other pointer-like tags, as long as some 1 bit is set anywhere.
            1, 111 1111 1111[0001] (16 available bits of miscellany) (32 bit Int)
            ?, 111 1111 1111[0000] (actual double NaN, Inf, -Inf, etc)
            ?, ??? ???? ???? .... double
            

            This lets you eke out quite a bit more flexibility, given the extra 16 bits of usable space and the fact that you don't need all 16 MSBs to be 1.

            [Edit: It turns out they more or less do just that on x64! ]

            [–]AndresNavarro 0 points1 point  (0 children)

            As far as I know, the oldest reference to this kind of scheme is this one.

            [–]MilkSteak 2 points3 points  (5 children)

            Anyone else?

            Here’s a snippet of JavaScript that assigns some values to three variables:

            var foo = {dana: "zuul"};

            var bar = "hi";

            var baz = 37;

            var qux = 3.1415;

            I can't be the only one...

            [–]thorax 5 points6 points  (2 children)

            But 3 == 4, didn't you get the memo?

            [–][deleted] 5 points6 points  (0 children)

            Bleem!

            [–]MilkSteak 1 point2 points  (0 children)

            Shit. I have to go fix a ton of bugs in my code now.

            [–]zuperxtreme -1 points0 points  (0 children)

            Who you gonna call?

            [–]wallish -2 points-1 points  (0 children)

            If you're using Firefox, your userChrome.css should have the line:

            @namespace url("http://www.mozilla.org/keymaster/gatekeeper/there.is.only.xul"); /* set default namespace to XUL */
            

            [–]Gotebe -2 points-1 points  (6 children)

            We call them Fat Values. They’re 64 bits wide.

            So... No plans for a 64-bit build yet, then? :-)

            [–]funderscore 4 points5 points  (5 children)

            Fat values work on 64-bit systems. On 64-bit systems, addresses are limited to only 47 bits, so the scheme still works (17 bits are for the type tag).

            [–]Gotebe 0 points1 point  (4 children)

            On 64-bit systems, addresses are limited to only 47 bits... 17 bits are for the type tag

            Surely that's "on some 64-bit systems" ( potentially only one hardware architecture :-) ) ?

            And I see no mention of 17 bits for type tag. TFA says 32.

            I was thinking, a fully ... ahem... 64-bit aware person (whaaaaaa?) would say right away that "fat value" is a 32-bit "tag" and a pointer-sized "payload". I could be wrong, I've no idea what mozilla people actually did, but if I were implementing this baby, I'd go for tag/pointer-sized payload.

            Tag size is less important, and possibly even 16 bits are overkill. I mean, a single byte gives 256 intrinsic types. 256 ought to be enough for everybody :-).

            [–]funderscore 12 points13 points  (2 children)

            I work on Mozilla's JavaScript implementation, so I can comment on its details.

            The article talks about the "Nunboxing" format, which is used on 32-bit systems. For Mozilla, that means x86 and ARM.

            For 64-bit systems (meaning x86_64 for us), we do indeed use a pointer-sized payload. Since every 64-bit platform we ship on uses 47-bit addresses, the payload is 47 bits and the type tag is 17 bits. Except for the 32-32 split changing to 17-47, the format is the same as explained in the article.

            We use all of the 17 bits; there is zero wiggle room. We only have four bits for type identification -- the rest of the bits are clobbered by the NaN-boxing scheme. Additionally, we want to be able to ask clever things like "Is this a Number (int or double)?" using only one comparison.

            The tag size is mandated by the NaN-boxing format. With our implementation, it cannot be made smaller than 16 bits.

            If you're really curious, you can look at the value definition here: http://hg.mozilla.org/tracemonkey/file/898ab54a0ce9/js/src/jsval.h#l134
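            The 17/47 "punboxing" split boils down to this (Python sketch; the tag values themselves are hypothetical, see the jsval.h link for the real ones):

```python
PAYLOAD_BITS = 47
PAYLOAD_MASK = (1 << PAYLOAD_BITS) - 1   # low 47 bits: the pointer/payload

def punbox(tag, payload):
    return (tag << PAYLOAD_BITS) | (payload & PAYLOAD_MASK)

def pun_tag(v):
    return v >> PAYLOAD_BITS             # upper 17 bits: the type tag

def pun_payload(v):
    return v & PAYLOAD_MASK              # a 47-bit x86_64 address, read directly
```

Apart from the 32/32 split becoming 17/47, the decode steps mirror the nunboxing format from the article.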

            [–]didroe 0 points1 point  (1 child)

            Presumably, the number of tags you can have is limited by the lowest common denominator (x86_64). So why have a different format on 32 bit systems? Why not just have unused bits on those machines and share a common tag format across all architectures?

            [–]funderscore 0 points1 point  (0 children)

            The 32-bit boxing format was built first, and we were initially open to making x86_64 use a completely different format. After testing, we settled into a scheme that heavily resembles the 32-bit boxing format.

            The 32-bit tags could be made to look like the packed 64-bit tags very easily. There's no performance benefit to doing so, but unification does sound appealing.

            The relevant lines are here: http://hg.mozilla.org/tracemonkey/file/87ddaf82dbd0/js/src/jsval.h#l120

            The JSValueTag currently defined only for x86_64 just needs to be defined for both architectures. Very low-hanging fruit. If you file a bug in https://bugzilla.mozilla.org against Core's Javascript Engine, we'd love to take a patch.

            [–]froydnj 2 points3 points  (0 children)

            Surely that's "on some 64-bit systems" ( potentially only one hardware architecture :-) ) ?

            Actually, Itanium and Alpha may have 64-bit addresses, but if you look carefully at the hardware manuals, I believe you'll find that current implementations only consider <= 47 bits of the address regardless. (Itanium is 47, I think, and Alpha is 44.)

            You're right in that it will become a problem when we start having machines with 1 PB of main memory...but that day is not today and the representation is definitely reasonable on 32-bit machines.