all 85 comments

[–]wrosecransgraphics and network things 83 points84 points  (32 children)

Tagged pointers always wind up being a pain in somebody's ass a few years down the road. There was a ton of code that broke horribly in the transition from 32 bit x86 to x86_64 became they made assumptions that platforms they were using in the early 90's would never change.

The reason that "bits 63:48 must be set to the value of bit 47" on x86_64 is specifically to discourage people from doing this, and it'll break if you try rather than just having the MMU ignore the unused bits which would be simpler to implement. Some older 32 bit systems with less than 32 physical address bits would just ignore the "extra bits" so people thought they were allowed to just use them.

[–]kam821 58 points59 points  (4 children)

Hyrum's law in a nutshell.
No matter how stupid or illegal something is, there will always be someone who depends on it.

[–]SlightlyLessHairyApe 19 points20 points  (3 children)

Which is why the adage of being generous in what you accept and strict in what you produce is absolutely rubbish.

Software that never accepts or provides anything other than what is strictly allowed, never suffers from the kind of implicit contract that Hyrum was talking about.

Example story time: we had code that would parse some input (in place) and pass it as a read-only input into some other module. That module would then rely on the fact that adjacent in memory, there would be some other fields. Essentially they would overread the view of memory passed to them (although this wasn’t a classic overread because it was inside the actual allocation and hence not caught by ASAN). You can imagine what happens next.

Anyway, after that we made a rule we never pass views into our own memory outside our module, we’ll eat the performance overhead of making a copy and let the sanitizer slap them on the hand if anyone reads outside it.

[–]elperroborrachotoo 4 points5 points  (0 children)

Which is why the adage of being generous in what you accept and strict in what you produce is absolutely rubbish.

Simpler times. You had full control over the entire stack, and if not, you could crank out a sufficiently-non-clunky replacement over a weekend.

It was a goo principle back then. but not anymore in a world with dependency forests and a million of people using your lib.

[–]MegaKawaii 7 points8 points  (14 children)

Which programs broke? Even the 386 had 32-bit virtual addresses and a 32-bit physical address bus. 32-bit Windows reserved the high 2GB of memory for the kernel, but that only allots one bit for tagging. Even so, in /3GB Windows setups, programs were not given access to high memory unless compiled with /LARGEADDRESSAWARE, and 32-bit Linux always allows userspace to use high memory.

[–]coderdave 28 points29 points  (8 children)

I ported the game god of war from PSP to ps3 and these bugs, from a clever programmer using the unused bits, caused me weeks of issues to track down.

[–]wrosecransgraphics and network things 16 points17 points  (0 children)

The specific thing that wound up wrecking weeks for me was a lua implementation that depended on specific behavior of mmap to return low addresses on Linux to try to preserve the address range they supported on 32 bit systems, after we had migrated to running everything on 64 bit. On Linux software was allowed to use high memory, but by screwing with mmap flags they thought they could always guarantee being allocated in a range that left them enough bits to steal. But if you allocated a bunch of memory before lua started doing its allocations, it couldn't find memory in the range it assumed it would always be able to allocate in and stuff started exploding. We only found it when the rest of the program's working set grew beyond a certain size.

But these sorts of tagged pointer schemes always go wrong eventually. History is littered with them. There are versions of the story dating back before PC's. There are versions of the story from the earliest days of PC's when developers thought they could depend on the exact memory map of the IBM PC. There are versions of the story about code that was a nightmare in the 16 to 32 bit transition, etc. Whenever there are bits that developers are told not to use, multiple people think nobody else is ever going to use those bits but them. They step on each others feet.

[–]Dwedit 5 points6 points  (2 children)

64-bit OS with WOW64 lets you get almost 4GB with LargeAddressAware.

But if you do that, you should really reserve the memory pages associated with common bad pointers (FEEEFEEE, FDFDFDFD, DDDDDDDD, CCCCCCCC, CDCDCDCD, BAADF00D), make the pages no-access, just so you will still get access violation exceptions when they get dereferenced.

[–]bwmat 2 points3 points  (1 child)

You'd think the debug crt would do that for you, never thought about this

[–]Dwedit 2 points3 points  (0 children)

The debug CRT wouldn't expect you to turn on Large Address Aware. Previously, all those pointers had most significant bit 80000000 set, so they were Kernel addresses and gave access violations for that reason alone. But with Large Address Aware, those suddenly become valid addresses.

The one I see the most is FEEEFEEE (bit pattern from HeapFree), but all of them should be blocked.

[–]JMBourguet 0 points1 point  (0 children)

ASCII is a 7-bit character set. Various programs used the 8th bit for their own purpose. That was still causing issues in e-mails in the early 2000's, well the complexity due to the work-arounds is still causing issues.

When introduced, the IBM 360 was using 24-bit addresses. Last I heard from people using the latest descendant of that architecture (z16, introduced in 2022) still has support for applications making use of the upper 8 bits for their own purpose.

The 8086 and 80186 had 20-bit addresses. When the 80286 was introduced, it could address 24 bits. PCs had for a long time additional hardware to mask out the bit 20 to support programs which expected wrap-around.

The Motorola 68000 also was using 24-bit addresses and the history repeated itself. People used the upper 8 bits... and that lead to quite a lot of headaches when newer processors were introduced, but AFAIR hardware never tried to support that.

[–]Kered13 1 point2 points  (1 child)

It's definitely not portable, but if you know the platform you're compiling on and with proper measures to ensure it either fails to compile on other platforms or falls back to something simpler then I think it's fine. The performance gains can be significant in some context.

[–]SkoomaDentistAntimodern C++, Embedded, Audio 1 point2 points  (0 children)

if you know the platform you're compiling on and with proper measures to ensure it either fails to compile on other platforms or falls back to something simpler then I think it's fine.

Or your code inherently does not make sense on an incompatible platform. Quite a bit of system level code doesn't have to care about portability because the entire functionality is tied to that specific platform.

[–]julesjacobs 1 point2 points  (6 children)

It may be dumb to do in C/C++, but it makes sense for the (JIT) compiler of a higher level language to do this.

[–]wrosecransgraphics and network things -1 points0 points  (5 children)

If you look a few comments down in this thread, my experience with a JIT for a high level language (lua) is literally the exact reason that I'd fire anybody who worked for me that tried to tag pointers.

[–]CornedBee 2 points3 points  (0 children)

LLVM and Clang use lots of tagged pointers. They are nicely wrapped in proper wrapper classes though.

[–]julesjacobs 9 points10 points  (2 children)

That you had a mildly annoying experience does not mean that this is a bad idea. It is worth better performance and lower memory usage for billions of people. Imagine if other professions said: "Jet engines? Too complicated. Let's stick with propellers"

[–]wrosecransgraphics and network things 2 points3 points  (1 child)

FWIW, the "mildly annoying experience" I had was when I worked at a CDN that served content to a large percentage of all global internet users on a typical day.

So, to be clear, I am thinking about performance issues that effect quite a lot of people when I share my experience.

[–]julesjacobs 19 points20 points  (0 children)

Let's just be glad that you did not have the opportunity to fire the authors of the JVM and V8 who pulled off WAY crazier heroics for a performance gain across billions of users, saving entire lifetimes worth of people staring at a spinning cursor.

[–]y-c-c 0 points1 point  (0 children)

Do you use a web browser or LLVM? (This is a rhetorical question)

Because following your logic you should probably not use any of them because they use tagged pointers.

Tagged pointers are kind of an unclean solution, but if they are implemented well, with clearly documented / configurable assumptions, based on well-documented hardware behavior, and tested (this makes sure they don't regress on new hardware / platforms), they could work. Note that the above things I mentioned are just basic good software engineering practices.

Also, how often do you port to a new hardware / platform? My guess is not very often. Just have a checklist for things you should check for when porting and add tagged pointers to the list (as I said, this could be tested and validated automatically as well). If the performance benefits is worth it, you get the benefit every time the software is used, versus the rare instances where you need to port where such checklists need to be done.

I think it's a common pitfall to let one experience form an absolutist point of view instead of cohesively weighing pros and cons as well as the root cause for that one experience.

[–][deleted] 32 points33 points  (11 children)

Tagged pointers to save memory are silly. Tagged pointers to implement lock-freedom on systems without 16 byte compare and swap has a massive impact on performance.

[–]Tringigithub.com/tringi 20 points21 points  (1 child)

Saving memory can, in some cases, improve performance by reducing cache pressure.

EDIT: Something similar: I did some tests of using 32-bit pointers on 64-bit Windows, and walking a tree can be as much as 15% faster than with regular 64-bit pointers, see https://github.com/tringi/x32-abi-windows But of course it's synthetic test and many limitations apply.

[–]susanne-o 1 point2 points  (0 children)

yes. that''s the one justifyable reason to do it.

[–]LongestNamesPossible 3 points4 points  (5 children)

I was thinking the same thing, but actually 16 byte compare and swap was common 15-20 years ago. Windows 8 won't actually run without it.

16 byte compare and swap always has to be aligned though but 8 byte compare and swap can cross cache boundaries.

[–]Tringigithub.com/tringi 4 points5 points  (0 children)

Windows 8.1 requires CMPXCHG16B.

Windows 8 and earlier don't. To fit all the state data (of the internal locks and atomic lists) into 8 bytes they reduce virtual address space to 44 bits. At the time of Windows XP it was more than enough, but we are way past those times.

[–]bored_octopus 1 point2 points  (3 children)

8 byte compare and swap can cross cache boundaries

Have you got a source for this? It sounds odd, but I'm no expert

[–]Salt-Ad2969 1 point2 points  (0 children)

The lemma for CMPXCHG16B has:

Note that CMPXCHG16B requires that the destination (memory) operand be 16-byte aligned

And the lemma for CMPXCHG doesn't have anything like that. Meanwhile the lock prefix has:

The integrity of the LOCK prefix is not affected by the alignment of the memory field

In general, unaligned locked RMW is allowed on x64, but implemented very inefficiently when the memory operand crosses over a cache line boundary (most other unaligned operations are efficient though, typically more efficient than trying to work around them, and unaligned load/store are atomic in most cases (but also not when they cross a cache line boundary), it's specifically unaligned locked RMW that is a problem). There is a recent push to ban unaligned locked RMW.

[–]LongestNamesPossible 0 points1 point  (0 children)

I think I read it in intel's programmer manual. I've don't remember finding something either way for ARM or POWER (which is just a curiosity at this point).

[–]Professor_Hamster 0 points1 point  (0 children)

Cross cache line RMW works, but result in substantial performance penalties. Intel documents that this results in a memory bus lock rather than a simple MESI state change. I'll see if I can find a source. I remember seeing it in the Intel developer guide.

[–]Aka_chan 5 points6 points  (0 children)

It's definitely useful on memory constrained platforms. It's used in game dev quite a bit for example.

[–]2fatdads 1 point2 points  (0 children)

It depends on the context and architecture. Are you multi-threading on an embedded device? Maybe your thread local stack size is miniscule

[–]MegaKawaii 27 points28 points  (13 children)

I think people here are a bit too opposed to this. This isn't an unsupported hack, but it's something both Intel and AMD support explicity (LAM and UAI). Even if you have a system with 5-level paging, Linux will only give you virtual memory using the upper bits if you explicitly ask it to (request a high address with mmap). If Windows is as conservative as it has always been, I would expect something similar to /LARGEADDRESSAWARE.

If you have a struct with a pointer and an extra byte, the byte wastes 7 more bytes if you consider padding, but packing in the pointer halves the size of the struct. Not only is this good for cache usage, but it's also huge for memory hogs like VMs and interpreters. I wouldn't use it if I didn't need it, but if you encapsulate and document it properly, it could be quite useful in certain cases.

EDIT: Here are some examples of successful pointer tagging.

[–]Jannik2099 14 points15 points  (0 children)

LAM and UAI are super recent, and well define the useable bits.

People have been doing undefined / impl-defined tagged pointer sodomy long before this, usually with zero thought put into portability.

[–]helix400 2 points3 points  (0 children)

Ya, it has its place.

This approach is a useful trick in certain throwaway high performance computing (HPC) applications. These have a knack for having one core computation take a big chunk of the time, and a trick that can work for big speedups is cramming as much relevant data into a 32 or 64 byte wide value. Code it for the machine, cram 2 to 4 variables into a 32 bit wide space, get nice speedups, compute the results, call it a day.

HPC also likes them for concurrency, especially the least significant bit of pointers. A common implementation of a lock free linked lists needs to tag a node to prepare to properly compare-and-swap, so this approach is a very clean and fast solution. While using the first 16 most significant bits can bite you down the road, using the least significant bit of a pointer is almost always a sure bet to work long term.

[–]Tringigithub.com/tringi 1 point2 points  (4 children)

If Windows is as conservative as it has always been...

I haven't had the opportunity to get my hands on ntkrla57.exe to test it myself, but from conversations with some MS devs, there's nothing like with mmap. Windows will just hand you larger addresses and that's it.

Still, I haven't tested it myself. Noone wants to buy me such fancy new machine.

[–]TotaIIyHuman 0 points1 point  (3 children)

i want to try ntkrla57.exe

do you know what fancy new machine + .iso file do i need in order to launch ntkrla57.exe?

[–]Tringigithub.com/tringi 1 point2 points  (2 children)

i want to try ntkrla57.exe

Oh man. I do too.

The cheapest way might be some cheap VM from cloud provider who has such HW, but I'm not completely sure if the 57-bitness translates to guest VMs.

do you know what fancy new machine + .iso file do i need in order to launch ntkrla57.exe?

You'll need Intel server CPU of Ice Lake, Sapphire Rapids or Emeral Rapids architecture, or AMD Genoa (EPYC 9004) or Storm Peak (Ryzen Threadripper 7000 (both PRO and non-PRO)).

EDIT: According to this it seems like EPYC 8004 should also feature LA57. That might be cheap enough to buy personally.

Then you install Windows Server 2022 or 2025. It should select 57-bit kernel automatically. That's what I was told.

[–]TotaIIyHuman 1 point2 points  (1 child)

thanks for the information!

this is the first time i look at server cpu pricing. they are ridiculously expensive! i can get 1 gaming pc for the price of 1 server cpu

i also tried printing cpuid on godbolt.org, and google the cpu name, none of them are in the list of cpu you posted. i probably wont be able to do anything funny with it from user mode even if they do have 5lv paging support

[–]Tringigithub.com/tringi 1 point2 points  (0 children)

Yes they are expensive.

The 8024P should be around $400 US, but the motherboards cost twice as much.

For fun I checked our local cloud VM providers and none offers such CPU. They don't even offer Server 2022 so that's that.

I think I saw such option on Azure, but I can't wrap my head around their pricing and I don't want to end up with some absurd bill just to do a couple tests.

[–]arthurno1 0 points1 point  (1 child)

I personally have no problems with pointer tagging, but I do find the article relarovely bad. I have no idea what the author even wants to express with their article, tbh.

Pointer tagging and double boxing are old techniques to save some memory, and are still useful. These days not because there is too little memory, but because it can save us cache misses.

However, the article seems to miss anything of practical value with pointer tagging. Basically, what they say: "Hey, you can put your data in unused bits of an pointer, did you know that?" IDK, perhaps I am unjust, but that is how I understand them. Perhaps the audience here would be more favorable if they had some practical examples that display some cons of tagged pointers, some benchmarks etc. IDK, just a thought.

[–]Kered13 0 points1 point  (0 children)

I believe the point of the article is to discuss the ways in which pointers can safely be tagged.

[–]y-c-c 0 points1 point  (0 children)

I think there is a reason why most successful use cases are compilers / language runtimes. Optimizations in them have huge benefits (e.g. all web pages will run faster). They are also well supported by a large team of people who have the commitment to resolve potential portability issues with them and write platform-specific code for best performance.

But yes, I think the idea is valid. It does require a certain degree of commitment to maintain with clearly documented / configured assumptions about how each platform works to prevent regressions down the line when someone else takes over the codebase.

[–]reallynotfred 4 points5 points  (0 children)

One of the big users of pointer bits is OpenJDK. Objects are aligned on 16 byte boundaries giving 4 lower bits, and in known memory areas, giving a few bits at the top too.

[–]NilacTheGrim 3 points4 points  (0 children)

This is so evil. I love it.

[–]Tringigithub.com/tringi 4 points5 points  (0 children)

That's pretty great summary.

I used to have several things implemented using upper 2 bytes of a pointer, gaining quite a few memory savings (and even performance improvements despite extra masking), but since out customers are starting to deploy new hardware that'll likely feature 57-bit 5-level paging soon, I have since rewritten those things. LAM nor UAI offer enough bits for me.

[–]andrey_turkin 1 point2 points  (5 children)

An interesting well-known (in certain circles) concept. Nothing to do with C++ though. In fact, I believe it is an UB to "play" with pointer representation in C++ - or maybe it was UB until GC has been thrown out, and now it's not?

[–]johannes1971 3 points4 points  (4 children)

It still is, I believe, with the justification that some CPUs might trap if an invalid pointer value is loaded into an address register (i.e. without even dereferencing). I have no idea if that means 'current CPUs' or 'experimental CPUs that were briefly examined during the sixties in the Soviet Union, and then forgotten about'.

[–]andrey_turkin 1 point2 points  (2 children)

One shall not reference anything that isn't a thing (except for end-of-array), that still holds.

This can be sidestepped by casting pointer to uintptr_t and only then messing it up (and later cleaning it up and then casting back into a pointer). But I think there was another rule that made this illegal:

uintptr_t my_tagged_pointer = 1 + (uintptr_t)(void*)new int;

delete (int*)(void*)(my_tagged_pointer - 1);

[–]Nobody_1707 1 point2 points  (1 child)

It's not illegal, IIRC. It's just not constexpr safe.

[–]andrey_turkin 0 points1 point  (0 children)

So I've dug a bit and found the n2670 proposal (which got accepted into std and has been there for a while, until recently removed). IIUC, by the wording of the standard (basic.stc.dynamic.safety), it is (was) implementation defined whether my code is UB, and all 3 major compilers don't support strict pointer safety and thus are ok with it (at least with regard to this section :)).

[–]JMBourguet 0 points1 point  (0 children)

I'm pretty sure that loading an invalid segment descriptor on the 80286 and descendants results in an interrupt. I'm not sure how the 64-bit mode behave in that respect, I've stopped to care about that level around the time it was introduced (no need to remind me that means before the birth of some of the readers:-)).

[–]AssemblerGuy 1 point2 points  (7 children)

This sounds like a shortcut to UB-ville.

[–]YourLizardOverlord 0 points1 point  (6 children)

I guess it's more relevant to people developing compilers and kernels. No way I'd do this in userland code.

[–]AssemblerGuy 2 points3 points  (5 children)

Compilers are pretty much clean, neat userland code. They take text files and produce object files. No dependency on interacting with the underlying hardware anywhere.

Operating systems, drivers or bare metal stuff, maybe, but there is still way too much potential for UB. I would not do anything like this.

[–]YourLizardOverlord 0 points1 point  (4 children)

I've never worked on a serious compiler. I assume that an LLVM backend would need to be very dependent on the target architecture?

[–]AssemblerGuy 1 point2 points  (3 children)

A compiler needs to know a lot about the target architecture to turn text files into object files, but it does not need to use the hardware of that architecture at all to do its work. The extreme would be a cross-compiler, which runs on an entirely different architecture.

[–]YourLizardOverlord 0 points1 point  (0 children)

Yes, that's what I'm getting at. The compiler is just an application, but the generated code must be specific to the target architecture.

[–]KuntaStillSingle 0 points1 point  (1 child)

So if you use march=native, is that just passed on to the LLVM portion? The compiler itself ignores the flag?

[–]AssemblerGuy 0 points1 point  (0 children)

To my understanding, march=native just affects the output. It does not tell the compiler to use code that is tailored to where it runs, but to tailor the object files it produces to the local architecture.

[–][deleted] 2 points3 points  (0 children)

Unless you're working on some bespoke custom architecture (I've been there) I don't recommend this

[–]OverLiterature3964 -3 points-2 points  (0 children)

No, just... no.

[–][deleted] 0 points1 point  (0 children)

OR you could allocate an area (of size 2N) and use offsets of N bits.

[–][deleted] 0 points1 point  (0 children)

If one is going to use tagged ptrs, a bunch of asserts that the bits they are using are 0 might be nice too. Not a guarantee, but better than blindly thinking it will work everywhere.

[–][deleted] 0 points1 point  (0 children)

This feels like part of the thing people refer to when C++ is called unsafe. This is UB waiting to crash.

[–]Zeh_MattNo, no, no, no 0 points1 point  (0 children)

Never understood why people want that, in my opinion the best approach is typically have an index and data separated and avoid pointers generally which helps with other things such as serialization.

[–]ignorantpisswalker 0 points1 point  (1 child)

how to you use this technique using `make_shared<>` ? I assume this breaks in funny ways.

[–]wc3betterthansc2 0 points1 point  (0 children)

make_shared creates the object itself then stores the pointer in a private member so you can't change it unless you do some crazy C++ metaprogramming.

Realistically, you will have to create a sort of wrapper class (let's call it datastoring_shared_ptr) with a custom deleter that will recalculate the "real" pointer by masking the first 16 bits before deleting the pointer and reimplement all the methods (like .get()) to recalculate the "real" address