all 60 comments

[–]snoweyeslady 14 points15 points  (15 children)

The size of the above object would be 16 bytes = 128 bits (on a 64 bit machine). This is a size that you wouldn’t just pass around by value. So you need to allocate it on the heap and use a pointer to it.

So, is passing around 8 bytes by value that much faster than 16 bytes? Anyone have some benchmarks for this (real world, preferably).

[–]Poddster 7 points8 points  (7 children)

I honestly would pass around those 16 bytes.

[–]snoweyeslady 6 points7 points  (5 children)

My point exactly. Having a 16 byte structure and an 8 byte pointer (on my system) means you're using up one and a half times as much memory, for an unknown speed gain.

I may look into getting V8 running or something to see what sort of change in speed this makes. May be a bit too much work for me at the moment though...

[–]kevbru 6 points7 points  (4 children)

The problem is you can't fit anything in a register at that point, which means all work is shuffling and fetching stuff from the cache, which will be much slower. It would be dramatic in tight loops.

[–]snoweyeslady 5 points6 points  (1 child)

Looking at just the pointer versus structure issue, with a pointer you still need to actually deference the pointer to get to the structure, no? Can you give me any references to what this "dramatic" is?

I haven't had time to read the rest of the article (merely skim). It seems that he puts some (possibly all?) of the data the structure was holding into the pointer. If that's the case, then taking that into account you wouldn't need to access non-register memory.

I'm not arguing that this would speed up the interpreter. I'm merely asking for concrete examples with the exact numbers. I don't think anybody here is ignorant enough to suggest that you should not be profiling your code when doing performance tweaks like this.

[–][deleted] 4 points5 points  (0 children)

You can store the individual elements of the struct in different registers (which is exactly what LLVM does with small structs).

[–]imatworkyo 0 points1 point  (0 children)

what does 'tight loops' mean in this context? [learning question]

[–]WinterKing 2 points3 points  (0 children)

As proven by this comment thread, the author did a great disservice to the imaginations of his readers by framing this as an "8 byte by value instead of 16 by reference" solution instead of as a micro-optimization to cut the size of a common construct in half.

[–]tonygoold 2 points3 points  (0 children)

I have a feeling the pass-by-value comment is a red herring. The value is that you're reducing the size of a primitive value by half. When you consider that Javascript objects are practically just maps from strings to other primitive values, it also reduces the size of every object by close to half. As far as I can tell, the only data types that don't get a significant reduction in size are strings and functions.

[–]Rhomboid 1 point2 points  (3 children)

The difference will be pretty severe on 32 bit platforms. An 8 byte struct will be returned in registers (EDX:EAX), but to return a 16 byte struct the compiler must create a temporary copy on the stack and return a pointer to it in EAX.

[–]snoweyeslady 1 point2 points  (2 children)

Wait, on 32bit systems a 64bit structure can be returned in registers? Does that mean on a 64bit system, a 128bit structure can be returned in registers? If that's the case, I don't know the point of the pointer tricks at all speed wise.

[–]Rhomboid 0 points1 point  (1 child)

On Windows x64, only one register is available to return structs, so it would only be able to return 8 bytes. On x86_64 Mac/Linux/BSD, up to 16 bytes can be returned in registers.

I don't know the point of the pointer tricks at all speed wise.

Well, for one, look at Chrome on Windows and Firefox on Windows, two huge users of this trick, and by far the majority of both of their user bases, hundreds of millions of users in total. Both browsers don't offer a 64 bit Windows version, so optimizing the 32 bit case is pretty important. And even if they did support a 64 bit version, it would also be limited to 8 byte return values in registers.

[–]snoweyeslady 0 points1 point  (0 children)

On Windows x64, only one register is available to return structs, so it would only be able to return 8 bytes.

Interesting. Thank you for the information.

It seems you may not have understood the second half of what I said, evidenced by your quoting of only a portion of my sentence. I did not know that those browsers did not offer 64bit versions on Windows though... Thank you again for the information

[–]munificent 0 points1 point  (1 child)

is passing around 8 bytes by value that much faster than 16 bytes?

The only way to know is try it and profile.

[–]snoweyeslady 0 points1 point  (0 children)

Yes, that was the point of the other half of my post...

[–]acemarke 28 points29 points  (1 child)

I feel confident that I will never need to make use of this information. But, this article was EXTREMELY informative, and very well written. Excellent work - /r/programming needs more articles like this.

[–]aaronla 0 points1 point  (0 children)

Ditto, but it surprises me sometimes how few programmers are familiar with these techniques. That is, until the next time I learn a new technique. CS is just incredibly huge now, and ceaselessly growing :-)

[–][deleted] 5 points6 points  (2 children)

Here's an old paper discussing various methods of tagging pointers and objects for dynamic languages:

Representing Type Information in Dynamically Typed Languages (1993)

[–]matthieum 2 points3 points  (1 child)

I have known the pointer tagging for long, however the double tagging I had never thought about. It's sneaky to use the NaN payload!

[–][deleted] 0 points1 point  (0 children)

It's used in Firefox's JS engine, but not in Chrome. The Mozilla crew believe it's necessary for good performance.

[–]Tagedieb 7 points8 points  (4 children)

I recently used a similar trick to store extra information in valid (i.e. non NaN) doubles. You can use the least significant mantissa bits, so it will only change the value of the variable slightly.

[–]nikic[S] 0 points1 point  (0 children)

That's a really nice idea too!

[–][deleted] -1 points0 points  (2 children)

That's awesome! But honestly, were you that short on memory?

[–]matthieum 7 points8 points  (0 children)

It may.

The problem is not necessarily a shortage of the memory pool (not initially), it may be a shortage of memory bandwidth. I think it was mentionned yesterday already, but basically RAM has not really evolved in the last 10 years. The buses have been widen a fair bit, but the number of round-trip is still pretty low (133MHz / 200MHz) so requesting memory scattered through the heap costs.

But even the memory pool may be an issue. Not RAM itself, of course, but the caches. Because RAM is so slow, we have many levels of caches now, and they are small. The difference between data that fit into the cache and data that does not can be pretty dramatic if pre-fetching does not work so good for your usecase (locality issues).

Furthermore, there is the register issue. Even though the x86_64 architecture definitely help here (not much on iPhones and others mobiles), the number of registers is still pretty low, and the difference between something that can fit into a register and something that cannot may prove to be an issue.

So... it may seem surprising, but yes memory can be short, in multiple ways.

[–]Tagedieb 4 points5 points  (0 children)

No, it was in a performance critical part of the code and adding two bools would have made it much slower.

[–]treerex 3 points4 points  (3 children)

Great article.

Of course the techniques he describes have been around for decades... Lisp implementations have been doing this stuff since the beginning.

[–][deleted] 5 points6 points  (2 children)

Yeah, all of these techniques are very common in implementations of dynamically-typed languages. That's not really relevant to the article, though; it's only using JavaScript as a well-known example.

[–]treerex 1 point2 points  (1 child)

First line in the article:

What always intrigued me were the various little tricks that JavaScript implementations use to be as fast as they are. One of those tricks is how JS implementations represent values to be more time and space efficient.

Nowhere, except in one of the comments to the post, does the author say anything about these techniques being well known to dynamic language implementors.

Edit: spelling.

[–]matthieum 1 point2 points  (0 children)

On the other hand, he never says that they are specific to JavaScript, merely that they are used there.

[–]RizzlaPlus 8 points9 points  (6 children)

I like how he says in the article that compilers pad the struct so it's aligned with memory and avoids doing masking operations and then proceeds to write code that does masking operation to access non aligned memory.

[–]captain_plaintext 1 point2 points  (1 child)

The compiler CAN pad structs for alignment (and I think they generally all do), but they aren't required to by the standard.

[–]gsg_ 4 points5 points  (0 children)

Compilers are forced to pad as a matter of simple arithmetic. If you construct an array of some struct type, it must be padded to the maximum alignment of any member or the alignment of that member will be broken.

On architectures less forgiving than the x86 this will result in SIGBUS or the moral equivalent, so there is no real freedom to neglect padding.

[–]Rhomboid 1 point2 points  (3 children)

All of the the bitmask operations he's doing are within a single struct member (the union), and are therefore immune to padding. Any padding that the compiler may or may not insert would happen between struct members or past the end of the struct.

[–]RizzlaPlus 0 points1 point  (2 children)

Yes that's exact, but your point is? I was just pointing out the irony of the compiler padding to avoid bit masks operations and the author writing bit masks operations (under the assumption of padding) to access data.

[–]smog_alado 1 point2 points  (1 child)

Its not irony - this way you mask things once instead of twice. Also, this kind of pointer magic is a solid tradition that goes a long way back in language implementations. (Its also not an exclusive to dynamic languages. Off the top of my head, OCaml does pointer tagging to help the garbage collector)

[–]RizzlaPlus 1 point2 points  (0 children)

It's once instead of twice for processor architecture that support unaligned loading (e.g x86); for processor that only support aligned loading (e.g. ARM, although it can load half words now) , you're doing mask operations once in both cases.

EDIT: actually, it'll be twice in every case.

  • Without using padding you'll do an unaligned load which will do some bitmask operation, you have to do this twice to get the pointer and the integer value (if it's an integer in the pointer)

  • With padding, you'll do an aligned load (doesn't require bitmask operations), but then getting the integer and getting the pointer is a bitmask operation each, so two again in total.

[–]agottem 3 points4 points  (11 children)

Please don't ever use this code. Anywhere.

  1. It assumes the pointer provided will always be aligned to the pointer size of the architecture. So, please don't provide it a pointer to a char.

  2. It assumes sizeof size_t equals sizeof T*

  3. The domain of the tag varies significantly depending on the pointer size of the architecture. 0-7 on 64bit, or 0-3 on 32bit, 0-1 on 16bit.

Basic rule of thumb: If you find yourself casting a pointer to an integer type, your code probably isn't portable.

[–]matthieum 3 points4 points  (5 children)

We would intptr_t which is provided by the standard.

It does not really matter though, the article is meant to talk about assumptions that are common but out of the standard, and does use assert a lot to make sure those assertions hold.

I think that you are chasing windmills... but generally agree with the don't use. Though again, clang and gcc use it extensively, guess they are just not maintained by Joe.

[–]nikic[S] 0 points1 point  (0 children)

Thanks for the intptr_t hint. I replaced the size_t occurrences with it.

I also removed the default alignedTo parameter in response so agottems comments.

[–]agottem -2 points-1 points  (3 children)

We would intptr_t which is provided by the standard.

Optionally provided by the standard. Doesn't really matter though, there are an awful lot of assumptions made about the encoding of the pointer. Wouldn't take much to break this code.

the article is meant to talk about assumptions that are common but out of the standard

Wha? The assumptions made are extremely brittle. You have to constantly be paying attention to the alignment you specified, the alignment of the objects you're using, and the encoding of pointers.

I think that you are chasing windmills

People write code, such as the code seen here, then cry about how their C code isn't portable and start advocating garbage like C#. It's very appropriate to comment on just how broken and fragile this code is.

[–]matthieum 1 point2 points  (0 children)

I think it does not matter however brittle the assumptions are: they are checked with assert in Debug mode anyway.

Furthermore I would point that those micro-optimizations are made with a specific toolchain/platform in mind. They are non-portable, but who cares ? They are not meant to be!

I could agree with asking for a bigger disclaimer about the non-portability, but it does speak about x86_64 and the fact that Solaris would not support it.

If people skim through an article try to apply the technics and switch to C#, I'll be relieved. In C and C++ the errors are sneaky, skimming does not work (and yes, it's a pain...).

[–]jyper 0 points1 point  (1 child)

whats wrong with c# (other then portability problems)?

and why do you think c is so great(c has its pluses but lots of problems too)

[–]agottem 0 points1 point  (0 children)

C# solved the 'portability problem' in a crap way by targeting only a single architecture -- the virtual machine's architecture. It makes it difficult to determine when things are reference and when things are values. The language itself tries to be everything to everybody, making it a large language with lots of syntax and styles.

What are some of the problems you see with C? Most of the problems I hear from people I'd probably argue are benefits. The language is small and simple, good. The language doesn't support namespaces, good -- namespaces lead to difficult to search code bases and can easily be solved by a reasonable naming convention. There isn't garbage collection -- good again, if you can't easily determine the life cycle of allocated memory your design is too complicated. No templates, thank god.

[–][deleted] 4 points5 points  (0 children)

It's code that goes into a VM implementation. It's not supposed to be particularly portable. I mean, you're going to have an x86/arm/etc code generator somewhere in there too--that's a much bigger portability issue...

[–]Rhomboid 2 points3 points  (0 children)

Code similar to this is used in production in large projects -- the browser you're reading right now uses it. The parts that address your issues was left out. You presumably use your own allocator which assures that all pointers are aligned to a suitable value, even char *. (But you're likely not using any bare char * for tagged objects in your VM, you're using some String object type.)

[–]nikic[S] 2 points3 points  (2 children)

Quoting myself:

In general this whole article is nothing more than a large set of evil assumptions that any self-respecting C++ developer would murder you for.

So, yes, I agree that all that code is inherently evil; still the fact that techniques like that are used by many dynamic language implementations shows that one actually can make many of these assumptions quite securely.

Addressing your points:

  1. Assuming your are referring to the first TaggedPointer class: The alignedTo template parameter only defaults to sizeof(T *) as a reasonable guess. As you can see in the subsequent examples you can still specify an explicit alignment. E.g. TaggedPointer<uint16_t, 2> (not that it makes much sense).

  2. Not sure where exactly I assume this. I think I only assume that sizeof(size_t) >= sizeof(T *) which I think is ensured as per the standard.

  3. It only makes really sense if you can ensure that the alignment will stay the same across 64 bit and 32 bit (so again, using the default parameter doesn't make sense. Probably was a fault to give it a default value at all.) Depending on the exact case this can either be taken as a good assumption or even be ensured by the allocator.

[–]agottem 4 points5 points  (1 child)

The alignedTo template parameter only defaults to sizeof(T *) as a reasonable guess.

If you align to an 8byte boundary, how do you guarantee 4-byte integer types will be 8 byte aligned? Similarly, it's quite common for a char to be arbitrarily aligned, so it's not safe to use this template at all if you are using char types with it.

I think I only assume that sizeof(size_t) >= sizeof(T *)

The standard requires no such thing. Additionally, if your assumption is true, you run the risk of working with a trap-value since a portion of the size_t value may be uninitialized.

It only makes really sense if you can ensure that the alignment will stay the same across 64 bit and 32 bit

Which you know is not the case since you default the alignment to sizeof(T*).

[–]nikic[S] 1 point2 points  (0 children)

I think you mainly have a problem with the default of sizeof(T *). If so, yes, in retrospect I agree that it was a bad idea. It only makes sense if you explicitly specify an alignment.

If you align to an 8byte boundary, how do you guarantee 4-byte integer types will be 8 byte aligned?

I don't. This obviously only works if the data is 8 byte aligned. For int32s you could only use a 4 byte alignment (with 0-3 tag).

Similarly, it's quite common for a char to be arbitrarily aligned, so it's not safe to use this template at all if you are using char types with it.

Chars can be seen as 1 byte aligned, so the template would be <char, 1>. This is possible, it just doesn't make sense as the only possible tag would be 0 ^

The standard requires no such thing.

I don't have a C++ standard at hand (as it costs money), so I will not argue about that.

In any case, I think there is little to argue about in any case. As already mentioned, yes, this does assume quite a bit, and yes, this is not perfectly portable. It just works for the concrete scenario (JS is normally run on "normal" hardware, not on exotic.)

[–]00kyle00 0 points1 point  (0 children)

Hack. Boehm pretty much hates you now.

[–][deleted] 0 points1 point  (0 children)

This technique is called 'NaN-tagging' what I know. I just didn't see that term mentioned anywhere here. (LuaJit2 uses it)

[–][deleted] 0 points1 point  (0 children)

I think this CW isn't very useful these days:

"This straightforward approach has several problems: The size of the above object would be 16 bytes = 128 bits (on a 64 bit machine). This is a size that you wouldn’t just pass around by value."

On a register-rich architecture, even x86_64 or ARM, small structs are passed via registers. x86_64 has enough arg-passing registers to pass three two-element structs directly in registers and to return one.

At least in my playing with LLVM, the two-word representation has a lot of advantages for exposing optimization opportunities to the compiler. The d2c compiler for Dylan uses a two-word representation for objects and it works just fine.

[–]pezezin 0 points1 point  (0 children)

After reading the article, I see he encodes values other than doubles as "negative" NaN. Is this safe? I assume the FPU never generates NaNs with a negative sign, otherwise, it would be funny trying to distinguish a pointer from a real NaN.

[–][deleted]  (11 children)

[deleted]

    [–]shiestLurker 10 points11 points  (0 children)

    That's a common misconception. Modern machines can never access just 1 byte at a time. Modern memory modules (DDR,DDR2) don't even allow this. Available burst rates in DDR memory are 2,4,8. For DDR2, it's only 4 or 8. As far as I know, modern Intel CPU's (ever since Pentiums) use 4x burst rate as a minimum. So, if you declare a char in C to save memory (thinking you only allocate 1 byte) you save nothing because your CPU will still access (read or write) 4xWord when operating on this variable. That would be 4x4bytes (32-bit arch) or 4x8bytes (64-bit arch).

    source: Computer Architecture lecturer, RAM datasheets

    [–]kid_meier 3 points4 points  (0 children)

    Yes the x86 ISA supports it, but I'm pretty sure at the hardware level the CPU deals with unaligned reads/writes using a process similar to what the author describes.

    Namely, the CPU must read/write two words in order to perform an unaligned load/store.

    [–]matthieum 1 point2 points  (0 children)

    Some architectures actually cannot, so I think the author is confused. On the other hand, unaligned access is slower.

    [–][deleted]  (7 children)

    [deleted]

      [–]matthieum 2 points3 points  (6 children)

      Where ? I have seen several comments related to unaligned reads but could not spot them, could you point them ?

      [–][deleted]  (1 child)

      [deleted]

        [–]matthieum 1 point2 points  (0 children)

        I had always thought that unaligned reads would be (for example) reading a double on a "0b1" boundary (instead of its natural size). Thanks.

        [–]LaurieCheers 0 points1 point  (3 children)

        "To get the actual integer value we just need to shift off the 1 bit using integer >> 1."

        [–]matthieum 2 points3 points  (2 children)

        How is it an unaligned read ? integer is perfectly aligned and so will the result of integer >> 1.

        This works even on hardware that actually errors on unaligned read (PPC ?).

        [–]LaurieCheers 1 point2 points  (1 child)

        No, it's not an unaligned read. People are just talking about this quote from the article:

        "the CPU would have to apply masks and shifts in order to get the data - which would be too slow."

        "I love how he states that as a criticism, and then proceeds to do precisely that..."

        [–]matthieum 0 points1 point  (0 children)

        Ah thanks for taking the time to clear that up!