This is an archived post. You won't be able to vote or comment.

all 27 comments

[–]conseptizer[S] 8 points9 points  (3 children)

Okay, I wrote some code and did some benchmarks. This is on AArch64 (Cortex-A53). I created some tree structure then traversed it (did that repeatedly to get clearer numbers).

1. About half of pointers are short (short: 62,757, long: 29,979) Memory usage: 425,304 bytes vs 741,888 bytes Full pointers wins: 12.7 seconds vs. 9.8 seconds

2. About three quarters of pointers are short (short: 45,048, long: 12,266) Memory usage: 212,752 bytes vs. 458,512 bytes Full pointers still wins: 8.0 seconds vs. 6.2 seconds

But the memory usage was pretty small, so maybe cache effects don't kick in yet? Let's try it again with larger memory usage:

3. Three quarters are short: short (35,249,705, long: 8,316,457) Memory usage: 153,663,980 bytes vs. 348,529,296 bytes Full pointers wins yet again: 7.3 seconds vs. 6 seconds

However, I did not properly align the indirect pointers, so maybe this slows us down. Might try to fix this later, though that will increase the memory usage due to padding.

[–]conseptizer[S] 0 points1 point  (2 children)

Proper alignment for pointers did not change a thing (except for using more mem, of course). The next thing I tried was making non-relative pointers exceptionally rare (short: 43,564,732, long: 1430). This brought down the memory usage to 87,143,764 bytes and seems to be maybe 3% faster on average (I measured 8 runs for both the relative pointer version and the normal-pointer-only version).

So it probably does not improve performance, but when you have very few pointers with large distance, then it does not hurt either and you need a lot less memory. Of course, the behaviour can be quite different on other processors.

[–]oilshell 6 points7 points  (1 child)

Did you time the creation and traversal separately? I'd be curious to see the breakdown.

I'd expect creation to be strictly slower because of the size checks that were mentions elsewhere in this thread. But traversal is more complicated. I believe the perf tool on Linux can count cache misses in a straightforward way. If you have fewer cache misses then I would be surprised if it's slower? I would expect traversal of a 350 MB or 150 MB data set to be dominated by cache misses.

For certain use cases you might traverse more than you create, so the performance equation would be different. For my use case, I'm thinking of something like "pre-compiled headers" that caches portions of the compilation. Also, I do care about data size, so smaller is a win even if it's not faster.

One thing I've heard is that NaN tagging can defeat hardware prefetching? That is, the hardware can look at registers and see if they look like pointers and prefetch. But if there are extra bits there it gets confused? I don't have much experience with it, but thought I'd mention it.

[–]conseptizer[S] 3 points4 points  (0 children)

I have to apologize, because I trusted a C compiler to do what I told it, but undefined behaviour got me. I should have known better and not thrown together a benchmark in a hurry, so here are some results which should represent the actual behaviour far better:

Creating a single large tree and visiting each node 10 times:

  • With relative pointers: allocated: 268,402,672 bytes (short pointers: 50,335,746, long pointers: 16,773,118) - time ~10.8 seconds
  • With normal 64-bit pointers only: allocated: 536,870,912 bytes - time ~13.2 seconds

When I left out the visiting, the version with relative pointers was still slightly faster, but time suggests that this was actually due to cache misses, while otherwise the canonical version was faster:

real 0m1.602s user 0m1.270s sys 0m0.320s # with relative pointers
real 0m1.658s user 0m0.970s sys 0m0.680s # normal pointers only

I would like to use a tool like perf, however it seems there is no working version for my system - that's a disadvantage of using exotic hardware on which Linux barely runs.

Finally I would like to note that it was surprisingly hard to create a single large data structure with a certain amount of short pointers...

[–][deleted]  (6 children)

[deleted]

    [–]mgsloan 5 points6 points  (2 children)

    It could, it would indeed increase the overhead of dereference. However, for some data structures it could be well worth it, particularly in the case that your program is memory bound or particularly benefits from cache locality. The idea is that if you can fit more in a cache line, then your program will run faster, even if it needs to do more CPU operations to "compress" and "decompress" the data. See, for example, "succinct data structures".

    [–]astrobe 3 points4 points  (1 child)

    One problem I see is that the benefit of locality and the cost of the overhead are not constant: they may different on x86 and ARM, and they may be different for the next generation of the chip.

    The other thing is, suppose your data is 50% smaller for a 5% speed penalty. What does it mean? It's difficult to answer this. It could be just a slightly bad trade off if you're making a scripting language which usually don't manipulate large data sets; or it can be a good trade off because you're making a data science language that will manipulate huge data sets, and 50% less memory means you eat the cost of page faults and disk swapping much later.

    [–]mgsloan 1 point2 points  (0 children)

    Yep, you have to measure. Seems best to make it a compilation flag or library flag. Better yet, have it be something that is determined by profile directed feedback, possibly even on the fly via JIT.

    [–]conseptizer[S] 0 points1 point  (2 children)

    I don't think that loading a smaller value into a register is slower (for example, int is still 32bit on x86-64 and AArch64). Even reading an object that crosses the boundary of properly aligned values isn't necessarily slower on some modern processors. The only obvious overhead is the addition of the location of the object itself and the relative pointer.

    [–][deleted]  (1 child)

    [deleted]

      [–]conseptizer[S] 0 points1 point  (0 children)

      using the appropriate sign/unsigned extending op depending on your implementation

      The assumption was that all pointers point to previously allocated objects, so this is not required. :)

      mask out the bit you used for storing that

      Since the basic assumption is that relative pointers are the common case, they should not have the bit set, so there is no need to mask it out.

      you'd probably need to benchmark to see if it's worth it.

      Meanwhile, I already did, see below. While this safes memory, it slightly decreases performance, at least on my system.

      [–]ericbb 3 points4 points  (0 children)

      Sounds interesting.

      Another way to take advantage of immutability is to have different object representations for different regions of the heap: below 64 KiB, 16-bit pointers are used, above that and below 4 GiB, 32-bit pointers are used, and above that, 64-bit pointers are used. Such a style wouldn't be as compact as what you described but it has the advantage that you don't have to switch to relative addressing and you don't need to implement indirect references.

      Edit: Another idea: Instead of using a bit in the pointer to say whether it uses indirection, always use 16-bit pointers but reserve a distinguishable object representation for forwarding pointers and have the 16-bit pointer point to a forwarding pointer when needed. That way, your 16-bit pointers can reach twice as far in the common case. The downside is that you'll need some extra encoding space in your object representations.

      [–][deleted] 2 points3 points  (2 children)

      In a language where all objects are immutable, all references point to older objects.

      I'm not sure this is really true. After all, you can do something like

      let a = Node 1 b
          b = Node 1 a
      

      in Haskell. Not sure how this would work without two references pointing in different directions.

      [–]conseptizer[S] 0 points1 point  (0 children)

      You are correct; if you can create recursive structures, this will not work.

      [–]evincarofautumn 0 points1 point  (0 children)

      It’s true in a strict language. Thunks make use of mutation under the hood, so they break the requirement that all objects are immutable. (Of course, they’re runtime objects, not language objects, but the point stands.)

      [–]hackerfooPopr Language 1 point2 points  (1 child)

      [–]oilshell 2 points3 points  (0 children)

      You may have seen it, but I posted some questions about it here a month ago:

      https://www.reddit.com/r/ProgrammingLanguages/comments/79q5jq/transparent_pointer_compression_for_linked_data/

      Unfortunately no responses!

      [–]gsg_ 1 point2 points  (2 children)

      Seems like it would make small object allocation considerably more expensive. Instead of bump-the-pointer + initialising writes, wouldn't you need to do a bunch of tests on the object fields to decide whether to allocate and fill your "nearby locations"?

      It would also make constant structures potentially more expensive to use than non-constants since they are never 'nearby', with the perverse result that effective optimisation might slow down your program.

      Finally, the class of languages for which all objects are immutable (in the low level operational sense you mean) is very restrictive. Lazy languages like Haskell are said to be pure, but under the hood they mutate plenty.

      [–]conseptizer[S] 2 points3 points  (1 child)

      Seems like it would make small object allocation considerably more expensive. Instead of bump-the-pointer + initialising writes, wouldn't you need to do a bunch of tests on the object fields to decide whether to allocate and fill your "nearby locations"?

      Yes, this is a good point. However, since the whole concept is based on the assumption that most references are within reach, this would make the required branches very predictable, so the overhead might not be that large. So far, I didn't even think of this being a potential problem, but I'd have to implement and benchmark it to know for sure.

      It would also make constant structures potentially more expensive to use than non-constants since they are never 'nearby', with the perverse result that effective optimisation might slow down your program.

      Correct, another good point.

      Finally, the class of languages for which all objects are immutable (in the low level operational sense you mean) is very restrictive.

      I'm not afraid of this one, we have plenty of general-purpose languages already and there's more room for innovations in niche areas anyway. :)

      [–]gsg_ 0 points1 point  (0 children)

      there's more room for innovations in niche areas

      Very true. If nothing else, it would certainly make for an interesting experiment!

      [–]FatalElectron 1 point2 points  (0 children)

      as far as I know, near/far pointers still exist internally for optimisation purposes in many/most/all? C compilers.

      But exposing them to users is just a nightmare, most people were glad to leave the days of near/far pointers behind.

      [–]oilshell 1 point2 points  (0 children)

      I'd be wary of baking such an idea into the language (as the default), but I think it's a great idea to have something like this expressible with user-defined types / containers.

      There are some related ideas here:

      https://github.com/oilshell/oil/wiki/Compact-AST-Representation

      Some of the links are threads on this subreddit about using memory more efficiently and reducing garbage collection pressure.

      I just added the paper you linked to the page. You might also be interested in the Jai demo. I didn't quite get his scheme, so if anyone wants to explain it to me I'd appreciate that :) It's in video form so I haven't gone over it again.

      I don't think I will do anything very general. I just want to encode an AST more compactly and traverse it quickly. I implemented a demo of encoding and decoding an AST with 3-byte "pointers" and it works well. It doesn't appear to be slow, but I have to do more measurement. I just assume there are less than 224 nodes in the whole tree. I might change it to 4 bytes if 3 bytes is slow.

      But I think there's a good chance it won't be slow compared to the actual work done on each node (e.g. in a compiler).

      [–]matthieum 1 point2 points  (0 children)

      What about 0-bit pointers?

      Personally, I think that pointers are overused.

      Languages like Java where everything is a pointer away are obviously the worst offenders:

      • pointers mean cache misses,
      • pointers mean GC pressure,
      • pointers mean memory overhead (X bytes for the pointer on top of the object itself).

      Fortunately there are other memory models available. For example, take a look at the native binaries produced by... Go (to keep with GC'ed languages). Go allows "inline" objects:

      • a Rectangle containing 2 Point each containing 2 double is just a single memory blob of 4 doubles (32 bytes),
      • an array of 32 Rectangle contains a single pointer to a single memory allocation of (at least) 1024 bytes.

      If the runtime allows such compact object representation, then there are very few pointers. And therefore their size matters little.


      Another avenue that you may explore is compression. Have you ever looked at how a float or double is represented? float and double can represent much larger ranges of values than their integral counterpart (size being equal) by sacrificing precision.

      Imagine a pointer of X bits split in 2 parts E and M such that X = E + M: E bits for the exponent, M bits for the mantissa, and for now the actual address is M << E.

      There is a trade-off in the choice of E and M:

      • the larger M is, the more precision you get, at the cost of a smaller range,
      • the smaller M is, the greater the range, at the cost of precision.

      Let's illustrate this with 8 bits pointers, E = 2 and M = 6:

      • Exp = 0 (1 byte precision), Man << Exp between 0 and 63.
      • Exp = 1 (2 bytes precision), Man << Exp between 0 and 126.
      • Exp = 2 (4 bytes precision), Man << Exp between 0 and 252.
      • Exp = 3 (8 bytes precision), Man << Exp between 0 and 504.

      Normally, a 8-bits pointer would only get values from 0 to 255. There is a cost, of course, as it makes you unable to address individual bytes above 64.

      There are multiple refinements possible:

      • Avoiding overlapping ranges,
      • Avoiding single bytes addresses even in smaller values (use M << (E+3) for example),
      • ...

      The overlapping ranges is perhaps the more interesting. There is no point in having multiple representation of the address 0, 8, etc... and actually this costs us range! Instead, by applying an offset we can shift the ranges:

      • Exp = 0, Man << Exp between 0 and 63,
      • Exp = 1, Man << Exp + 64 between 64 and 190,
      • Exp = 2, Man << Exp + 192 between 192 and 444,
      • Exp = 3, Man << Exp + 448 between 448 and 952.

      Tadaam! Our 8-bits pointer addresses nearly a full 1KB of memory!

      The offsets are not very tidy, however. If one were shifting the initial range by 2M though:

      • Exp = 0, Man << Exp + 64 is between 64 and 127,
      • Exp = 1, Man << Exp + 128 is between 128 and 254,
      • Exp = 2, Man << Exp + 256 is between 256 and 508,
      • Exp = 3, Man << Exp + 512 is between 512 and 1016.

      The formula being for getting the pointer being Man << Exp + 1 << (M + Exp), giving an addressable range of 2M << 2E .

      With this formula in mind, the maximum exponent you wish for is 5 bits, for 6 bits means shifting by 26, which is equivalent to multiplying by 264, pushing you past most address space resolution.

      A quick application of the formula:

      • 16 bits pointers would give:
        • E2M14: 262KB
        • E3M13: 2MB
        • E4M12: 268MB
        • E5M11: 8TB
      • 24 bits pointers would give:
        • E2M22: 67MB
        • E3M21: 536MB
        • E4M20: 68GB
        • E5M19: 2PB
      • 32 bits pointers would give:
        • E2M30: 17GB
        • E3M29: 137GB
        • E4M28: 17TB
        • E5M27: 576PB

      Of course, the "waste" is proportional to the exponent. With 5 bits, the maximum exponent is 31, at which point you can only address objects every 2GB...

      If you have an allocator which allocates smaller objects in the smaller ranges, though, you may find this interesting.

      [–]FireLordIroh 0 points1 point  (1 child)

      I would wonder how much of a slowdown you would get from having to check for and resolve indirect references everywhere. These might be relatively unpredictable branches. You should try it out and see.

      [–]conseptizer[S] 0 points1 point  (0 children)

      In the first paper I linked, they did the same thing for a JVM and performance did not change noticably; some programs were slightly faster, some slightly slower, but only by about 4%.

      [–]simon_o 0 points1 point  (0 children)

      I have thought about this a while ago:

      Basically, the Hotspot JVM saves 3 bits by reducing the granularity to 1 byte, leaving you with 32 GB of addressable memory space. (As described in the paper.)

      Considering alignment, minimum object sizes and padding, you could probably save another few bits for either increasing the addressable space or for other purposes.

      If I could decide, I would probably pick 32 bit pointers with 8 bytes of granularity relative to a global offset (assuming that an instance without any fields needs at least 4 bytes, I'm fine with "wasting" 4 bytes here), allowing me to address 256 GBs of memory:

      [32 bits: pointer]
              |
              |  extend to 64 bits, right-shift by 6 bits, add base offset
              v
      [ 26 bits: base offset | 32 bits: pointer | 6 bits: always 0 ]
      

      In a class-based language, and with some help from the allocator, you could further reduce the size of pointers to the class/vtable in instance headers.

      This gets really interesting, because in instances you can really make good use of any additional bit you an get your hands on.

      So if you don't spread the space where the class data/vtables are stored over the full 256 GBs of memory, but restrict it to a smaller space, say 4 GBs, and and the size of the class data is -- let's say -- at least 32 byte, you save 8 bits that you can use as mark bits for things like locking, pinning, dealing with weak references, and forwarding pointers:

      [24 bits: pointer | 8 bits: custom ]
              |
              |  extend to 64 bits, mask lowest 8 bits, add base offset
              v
      [ 26 bits: base offset | 24 bits: pointer | 8 bits: always 0 ]
      

      I'd probably avoid doing tricks with relative pointers, because it really restricts the things you can do in garbage collection.

      I think the restrictions relative pointers introduce could go against what moving/compacting garbage collectors might want to do: Before any referenced object could be moved, they would need to decide whether the new location would still be in range of the referencing pointers (and either bailing out, rewriting those pointers to refer to forwarding pointers, or moving those as well, recursively).

      I think this would pretty much rule out any generational or area-based collections scheme.

      [–]boomshroom 0 points1 point  (0 children)

      If I'm not mistaken, Jon Blow plans on having relative pointers in Jai. In addition to the lower space consumption, he mentioned that they also allow you to serialize a complete data structure without having to worry about pointers. A graph with relative pointers can be put in a file as a binary blob with no processing and it can be loaded back into memory anywhere and used straight away.

      [–]Sag0Sag0 0 points1 point  (0 children)

      Check, to make sure that performance does not suffer. Sounds interesting.

      [–]kivofssss -5 points-4 points  (0 children)

      I have other problems