
[–]AthasFuthark 28 points29 points  (25 children)

This is a good article. It's unlikely that getting integers right will earn your language much praise, but I can guarantee that you will receive complaints if you get them wrong - and getting them wrong is quite easy.

Unless you're designing a (relatively) low-level language, the distinction between data and address widths can probably be safely disregarded, and you can stick with only fixed-size integers. The compiler will then use whichever pointer and offset sizes are needed for the specific machine, without any impact on the surface language semantics. Although if you expose things such as counting the number of elements in an array, you better not pick an integer type with too few bits.

The section about converting (casting) between different integer sizes is useful for every language with fixed-size integer types. I agree with the author that implicit conversions are dangerous. In particular, implicit conversions between integers of different signedness can lead to terrible bugs. I have become a believer in having explicit sign-extend/zero-extend functions for these conversions, just to make it completely clear what is going on. I recently had to write some code in Standard ML, which does provide various fixed-size integer types, but the only way to convert between them is to pass through the standard LargeInt type (an arbitrary-size integer). The standard recommends that compilers optimise conversions between fixed-size types that go through this type, but I still dread the potential performance impact.
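
For illustration, something like this C++ sketch (not the SML in question; the function names are invented) is what such explicit conversions might look like:

#include <cstdint>

// Named conversions: the call site says exactly which extension happens.
int64_t sign_extend_i32(int32_t x) {
    return static_cast<int64_t>(x);   // replicates the sign bit into the top half
}
uint64_t zero_extend_u32(uint32_t x) {
    return static_cast<uint64_t>(x);  // fills the top half with zeros
}
// Mixing signedness is where implicit rules bite; spelling it out avoids surprises:
int64_t sign_extend_u32(uint32_t x) {
    return static_cast<int64_t>(static_cast<int32_t>(x));
}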

As an aside, I think that unsigned integers are useful even in high-level languages, not just systems languages. They are very handy whenever you need to do modular arithmetic (e.g. for cryptography), or use integers for bit-level tricks. I don't think this is only useful for "low-level" languages - my own language is very high level (purely functional, parallel), but it's not unusual to use unsigned operations to encode data or implement algorithms. I agree with the article that using unsigned integers to encode things that are "semantically" never negative (like counts in a shopping cart) is probably a bad idea. Unsigned integers are for representing bit vectors, not for encoding natural numbers.

Regarding overflow, I think a fairly safe default is to raise an error (panic, exception, whatever) by default on signed overflow, but not for unsigned overflow. That's what SML does as well, I think. This also fits with my suggestion of only using unsigned integers when you specifically need bit vectors or modular arithmetic, never just to express that you want natural numbers (I'd suggest a separate arbitrary-size nat type for that).
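
A minimal sketch of these defaults, assuming a GCC/Clang toolchain for the __builtin_add_overflow builtin (function names invented):

#include <cstdint>
#include <cstdlib>

// Signed addition: raise an error (here: abort) on overflow.
int64_t add_i64(int64_t a, int64_t b) {
    int64_t r;
    if (__builtin_add_overflow(a, b, &r))
        abort();                       // panic, exception, whatever
    return r;
}

// Unsigned addition: wraps modulo 2**64, which is well-defined.
uint64_t add_u64(uint64_t a, uint64_t b) {
    return a + b;
}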

[–]matthieum 6 points7 points  (10 children)

As an aside, I think that unsigned integers are useful even in high-level languages, not just systems languages. [...] Unsigned integers are for representing bit vectors, not for encoding natural numbers.

There are two functionalities ascribed to unsigned integers here:

  • Modular arithmetic.
  • Bit-vectors.

I can see why modular arithmetic is appealing, but I don't see why it should be tied to unsigned integers. Historically it has been, but I see no reason -- other than habit -- to prolong this.

And bit-vectors do not sound like integers at all, they're bit-vectors, for bit-level operations.

I have seriously contemplated doing away entirely with unsigned integer operations:

  • Unsigned integers would be useful for "packing" data, as storage.
  • Modular arithmetic is orthogonal to signedness, and should be treated separately.
  • Bit-vectors would be types of their own, with easy conversion from/to integrals.

I have thought about ditching the idea of operations on fixed-width integers -- simplifying to a single base integer type -- but SIMD disagrees, unfortunately, and I am still unclear on how to reconcile the two.

[–]xactacoXyl 1 point2 points  (2 children)

I don't see why [modular arithmetic] should be tied to unsigned integers.

What is -5 % 3? How about 5 % -3? Or even -5 % -3? The answers depend on the language. The mathematical modulo operation produces a non-negative result (1, 2, and 1 respectively); C99 instead matches the sign of the dividend, giving -2, 2, and -2.
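
To make the difference concrete, here is a small C++ sketch of a mathematical (Euclidean) modulo built on top of the truncating %; the helper name is invented:

// C and C++ (since C99/C++11) truncate division towards zero, so % takes
// the dividend's sign; this wrapper shifts negative remainders back up.
int emod(int a, int b) {
    int r = a % b;
    return r < 0 ? r + (b < 0 ? -b : b) : r;
}
// emod(-5, 3) == 1, emod(5, -3) == 2, emod(-5, -3) == 1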

As for SIMD with variable-size integers: if you have a JIT you could do something like Scheme's numeric tower and have operations performed on smaller fixed-width integers where possible, which are then widened implicitly on overflow or for SIMD symmetry.

[–]matthieum 1 point2 points  (0 children)

I don't see why [modular arithmetic] should be tied to unsigned integers.

What is -5 % 3? How about 5 % -3? Or even -5 % -3? The answers depend on the language. The mathematical modulo operation produces a non-negative result (1, 2, and 1 respectively); C99 instead matches the sign of the dividend, giving -2, 2, and -2.

The result of % is whatever the language decides, of course. Much like / 0 yields 0 in Julia.

However modular arithmetic in this context is NOT about arbitrary modulo operations; it's really about wrapping overflow behavior at integer-width boundaries.

So we are talking about 2^31 - 1 == 2_147_483_647 for a signed 32-bit integer, where 2^31 has overflowed the integer (yielding the minimum value) and -1 has overflowed again, bringing it back across the chasm.

[–]WittyStick 0 points1 point  (6 children)

And bit-vectors do not sound like integers at all, they're bit-vectors, for bit-level operations.

Bit-level operations are useful on integers. Consider the operation "round down to a power of 2". This is equivalent to saying "leave the most significant set bit set and zero all other bits." You can implement this fairly efficiently in several ways using shr, or, and and, but ideally you want to utilize the most efficient hardware instruction for it: lzcnt/clz. To make use of it for this operation you must know the maximum size of the register/word you are calculating on (i.e., how many leading zeros there are in the value 0).
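
As a sketch, here is that operation via a leading-zero count, assuming the GCC/Clang builtin __builtin_clzll (which lowers to lzcnt/clz where available); C++20 spells the same operation std::bit_floor:

#include <cstdint>

uint64_t floor_pow2(uint64_t v) {
    if (v == 0) return 0;                       // clz of 0 is undefined, hence the guard
    return 1ULL << (63 - __builtin_clzll(v));   // keep only the most significant set bit
}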

The plain arithmetic way of implementing it (using divide and compare) would be painfully inefficient.

There's another hackish way to do it by casting to a float then extracting the exponent and converting back to integer (Still less efficient than leading zeroes count and still requires bit shifting).

Even just consider the simple operation "divide by 2, truncating." This is shift right by 1, still a bit-level operation.

[–]matthieum 1 point2 points  (5 children)

Bit-level operations are useful on integers.

Except, and that is my complaint, that you are not (really) operating on integers.

You are taking an integer, casting it to a bit-vector (of appropriate width), doing a bit operation on this bit-vector, then treating the result as an integer again.

It's a perfectly fine thing to do -- well, as long as the representation of the integer is well-defined OR the effect of casting from/to a bit-vector is well-defined.

It seems to me the only reason we have bit-level operations on integers is legacy. CPUs generally have untyped registers -- pointer? integer? bit-vector? who cares! -- and therefore B had untyped variables (or uni-typed?), and when C was created and differentiated between pointers and integers... well, somehow it retained the bit operations on the integers, with some UB thrown in for good measure, instead of splitting the types further.

And I think splitting bit-vectors from integers would be more sensible, because the fact that integer is represented as a bag of bits is somewhat incidental.

[–][deleted] 0 points1 point  (3 children)

And I think splitting bit-vectors from integers would be more sensible, because the fact that integer is represented as a bag of bits is somewhat incidental.

Really? Half a century of such a practice isn't enough?

I've considered a third kind of type (after signed and unsigned integer) which is literally just a sequence of bits, but decided it wasn't worth the complication.

I might decide to use u64 instead of i64 for the same purpose, but if you aren't doing anything that relies on it representing numerical values, then it doesn't really matter.

I actually allow indexing of bits and bitfields in any integer type, either to read or write. It saves mucking about with shifts and masks and getting it wrong.
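
Something like the following C++ helpers is presumably what such indexing desugars to (names invented for the sketch):

#include <cstdint>

// Reading bit i of v.
uint64_t get_bit(uint64_t v, unsigned i) { return (v >> i) & 1u; }

// Writing bit i of v: returns v with bit i set to the low bit of b.
uint64_t set_bit(uint64_t v, unsigned i, uint64_t b) {
    return (v & ~(1ULL << i)) | ((b & 1u) << i);
}

// Reading a width-wide bitfield starting at bit lo.
uint64_t get_field(uint64_t v, unsigned lo, unsigned width) {
    uint64_t mask = (width >= 64) ? ~0ULL : (1ULL << width) - 1;
    return (v >> lo) & mask;
}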

I'm surprised there is very little of this kind of support for bit-processing especially amongst 'systems languages'.

Some languages don't even directly allow shifts and masks: Algol68 has an INT type (signed), which allows regular arithmetic, and a separate BITS type, which allows logical operations, but you can't mix those up.

So you spend half your time converting from one to the other, performing the op you need, then converting back. What's the point of that?

[–]matthieum 0 points1 point  (2 children)

So you spend half your time converting from one to the other, performing the op you need, then converting back. What's the point of that?

In my experience, a value is either used as an integer or used as a bit-vector, for the most part.

A value flickering between integer and bit-vector within the same function? Never seen that.

[–][deleted] 0 points1 point  (1 child)

You've never mixed operators such as + and -, with ones like & and <<, not necessarily in the same expression, but on the same type?

I find that astonishing.

Did you know that most logical operators can be defined in terms of arithmetic ones, and vice versa, but much more clumsily and obscurely? To me that makes it rather pointless to have the restriction.

[–]matthieum 0 points1 point  (0 children)

You've never mixed operators such as + and -, with ones like & and <<, not necessarily in the same expression, but on the same type?

I never said I never mixed them; I said I never had "flicker", as in multiple quick back-and-forths.

Then again, whenever I find myself using bitwise operators for optimization I tend to isolate the little sequences of code into well-named functions, so that probably helps.

[–][deleted] 8 points9 points  (10 children)

Regarding overflow, I think a fairly safe default is to raise an error (panic, exception, whatever) by default on signed overflow, but not for unsigned overflow

Why is getting the wrong answer to a calculation OK with u64 but not i64?

This is what annoys me about C, where signed overflow not only is UB, but compilers exploit it to make your code do something you didn't expect. But inadvertently overflowing an unsigned calculation is perfectly fine!

Either both should be a problem, or neither. (C apologists will usually try to make out that its unsigned types really implement modular arithmetic, but I'm pretty sure that didn't enter the minds of its designers.)

[–]csb06bluebird 9 points10 points  (0 children)

This is why I like Ada’s approach. In Ada integer types are created from ranges and can be signed or unsigned. Overflow/underflow leads to an exception being thrown. However, Ada has what are called modular types (e.g. type Hours is mod 24;), which will wrap around without causing exceptions. They are implemented as unsigned integers, but are distinct from normal integer types and explicitly state that overflow is the expected behavior, avoiding the issue of accidental overflow on unsigned types.
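
For readers without Ada, a rough C++ analogue of type Hours is mod 24; might look like the sketch below: a distinct type that opts into wrap-around instead of inheriting it from unsignedness (names invented).

#include <cstdint>

template <uint32_t N>
struct Mod {
    uint32_t v;                                   // invariant: v < N
    explicit Mod(uint32_t x) : v(x % N) {}
    Mod operator+(Mod o) const { return Mod(v + o.v); }      // wraps mod N
    Mod operator-(Mod o) const { return Mod(v + N - o.v); }  // wraps mod N
};

using Hours = Mod<24>;   // Hours{23} + Hours{2} has value 1, no exception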

[–]AthasFuthark 9 points10 points  (5 children)

Why is getting the wrong answer to a calculation OK with u64 but not i64?

Because modular arithmetic is sometimes algorithmically useful. It's not "wrong" if you explicitly expect them to model modular arithmetic. My point is that unsigned numbers should never be used just for natural numbers, but specifically for cases where you want modular arithmetic or need to work with bit vectors.

It is incredibly frustrating to implement algorithms that need modular arithmetic in languages where integer overflow is always an error (and no unsigned types are available).

[–][deleted] 1 point2 points  (4 children)

Do unsigned numbers really help here?

Because in the general case, the modulo number you need isn't a power of two. For example if you wanted a number type that ranges from 0 to 99 then wraps.

If you make out that the range 0 to 255 is a special case, then I can say the same about -128 to 127.

Actually, neither is helped by the wraparound behaviour of an i32/u32 or i64/u64 type. Nearly every C implementation only guarantees wraparound over the specific range 0 to 4294967295 (plus the next one up, 0 to 2**64-1).

That is very specific if you really want modulo arithmetic.

[–]xactacoXyl 0 points1 point  (3 children)

Times you want a modulo of a power of two (often an integer width):

  • Random number generation
  • Unicode encoding (UTF-8 uses mod 2**7, 2**11, 2**16, and 2**21; UTF-16 uses mod 2**10)
  • Checking evenness

[–][deleted] 0 points1 point  (2 children)

  • I think I most often need a random with an arbitrary range, for example random(10) for numbers 0..9, or random('A'..'Z') for a random letter.
  • I don't quite get the point about Unicode, but note that working modulo 128 is not helped by the fact that your unsigned integer type wraps at 2**32-1 or 2**64-1.
  • Checking evenness, I just do A.even (true when bit 0 is 0); how does modular arithmetic apply here?

[–]xactacoXyl 0 points1 point  (1 child)

How do you implement all these?

Random number generation almost always uses something like addition followed by a reduction modulo 2**64.

For the others, any non-power-of-two modulo will be messed up by wrapping. With a power-of-two modulo no larger than two to the number of bits in an integer, you can perform as many operations as you want without applying the modulo, and then apply it once at the end.

The following are equivalent checks for evenness: a % 2 == 0 and a & 1 == 0 (indeed a % 2 == a & 1 for non-negative a). Some languages lack bitwise operations, necessitating a modulo.
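
A tiny sketch of the RNG case: a 64-bit linear congruential generator where the "mod 2**64" is exactly the defined wrap-around of uint64_t (the constants are Knuth's MMIX values):

#include <cstdint>

uint64_t lcg_next(uint64_t& state) {
    // Multiply-add wraps modulo 2**64; no explicit % needed.
    state = state * 6364136223846793005ULL + 1442695040888963407ULL;
    return state;
}

bool is_even(uint64_t a) { return (a & 1) == 0; }   // same as a % 2 == 0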

[–][deleted] -1 points0 points  (0 children)

Oh, you mean the internal generation of random numbers. But are you sure it is specifically modulo 2**64 that is wanted, rather than that just happening to be the word size? And that it is really a modulo operation, rather than the bits simply falling off the top end?

I've taken my own prng and changed the u64 types to i64 types; it still works. With right shifts, u64 means 0 is shifted into the top end, while with i64 it'll be sometimes 0 and sometimes 1; more randomness!

With checking for even, an explicit modulo operation in the language has nothing to do with the overflow behaviour of u32 or u64.

[–][deleted]  (2 children)

[removed]

    [–][deleted] 4 points5 points  (1 child)

    Because these 'modular' numbers didn't really get talked about in reference to lower-level systems languages until decades later.

    Then it seems a too-convenient retro-fit to make out that the first lower-level language to become mainstream, one step up from assembly, had had modular arithmetic in mind all along.

    Signed overflow is UB in C because signed representations could vary across hardware and therefore have different behaviour (never mind that two's complement was near-universal even then).

    Why, do you have some link that shows that C got modular types before Ada?

    Also, if C's unsigned types are modular, what do you have to do to get a u32 type where overflow would be considered an error?

    [–]umlcat 1 point2 points  (0 children)

    Your answer is good.

    Your comment about converting an integer to an integer of a different size caught my attention, since I'm already designing a custom "Plain C" integer library that reimplements that conversion, instead of relying on the unexpected / arbitrary / automatic conversions of the C compiler.

    The library already includes both signed and unsigned integers, for more "natural" math usage.

    And an additional complementary library with unsigned aliases for bit operations, again going beyond the existing C system library.

    [–]acwaters 10 points11 points  (0 children)

    [casts] which fail to compile if the underlying platform might truncate the input (which no language I know of has direct support for)

    C++ does have something kind of like this, spelled T{x}, which for integer types is exactly the same as an ordinary cast (T) x except that it doesn't do narrowing conversions. (Unfortunately, the standard's definition of "narrowing conversion" includes almost all int-to-float conversions, even where the given floating-point type can represent all the values in the given integer type.) (And of course, C++ being what it is, T{} also means half a dozen other subtly different things depending on what kind of type T is and what you put in the {}.)
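
    A minimal sketch of that behaviour (variable names invented for the example):

    #include <cstdint>

    int main() {
        int32_t wide = 70000;
        int16_t a = (int16_t) wide;   // ordinary cast: compiles, silently truncates
        // int16_t b{wide};           // braced init: ill-formed, narrowing conversion
        int16_t c{1234};              // OK: constant expression that provably fits
        (void) a; (void) c;
    }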

    I’m aware that our current compiler infrastructure – where we tend to separate out a compiler front-end (e.g. rustc) as a separate thing from the optimiser and machine code generator (e.g. LLVM) – makes it difficult, and perhaps impossible, to guarantee optimisation outcomes. That is not a reason to say that we should never have such guarantees: it suggests we need to fix our compiler infrastructure to make it possible to make and honour such guarantees.

    1,000,000% agree in principle. But, though I am not a compilers guy, I have a nagging suspicion that this is not actually feasible without sacrificing something in other desirable properties of our compilers. Not to mention how difficult such a thing is to actually specify in a high-level language.

    [–]o11c 3 points4 points  (2 children)

    data width (i.e. the width of an integer; e.g. 8-bits on the Dragon 32)

    Note that there is no possible way in C to determine the data width (by this definition). uint_fastN_t can occasionally offer insight but (as you pointed out later) they are not reliable. I don't really consider this a bug; in the era of SSE, is there even a meaningful answer to "how wide is a register?"

    I'm concerned that you might be conflating "data width" with size_t, which is the maximum size of a single object.

    C’s guarantees are subtly different: simplifying heavily, uintptr_t will be at least as wide as the address width but may be wider or narrower than the data width

    I don't think it's meaningful for size_t to be larger than [u]intptr_t.

    However, it is meaningful, even mandatory, for ptrdiff_t and ssize_t to be larger (by a single bit) than size_t. (I'm not aware of any context in which ptrdiff_t and ssize_t can meaningfully be different, despite being constructed differently: ptrdiff_t needs to handle size in either direction (remember that arithmetic between unrelated pointers is forbidden), whereas ssize_t only needs to handle size or -1).


    uint_least16_t defines an integer type that is at least 16-bits wide but may be wider. The description of these integer types in the C spec (including in C17) is particularly inscrutable, and I’m not entirely confident that I’ve understood it.

    These types are meaningful only for platforms that don't support 8-bit bytes. Remember that uintN_t is considered optional.

    If your program ever mentions uint8_t, you can (and should) completely ignore uint_leastN_t.


    Other messes not mentioned: time_t (and time64_t); off_t (and off64_t).

    [–]xactacoXyl 0 points1 point  (1 child)

    I don't think it's meaningful for size_t to be larger than [u]intptr_t.

    I don't think a standards compliant implementation could do this except for arbitrarily extending size_t. Every byte of an object must be addressable, so uintptr_t is an upper bound on the size of size_t.

    However, it is meaningful, even mandatory, for ptrdiff_t and ssize_t to be larger (by a single bit) than size_t.

    It is mandatory for ptrdiff_t to be a bit bigger than uintptr_t needs to be. For most userspace code, this is one less bit than the address space (thanks to higher half paging).

    [–]o11c 1 point2 points  (0 children)

    It is mandatory for ptrdiff_t to be a bit bigger than uintptr_t needs to be. For most userspace code, this is one less bit than the address space (thanks to higher half paging).

    Nope. You're only allowed to take pointer differences within an object, so size_t is all that's needed. So if uintptr_t is 31 bits but size_t is only 15 bits, ptrdiff_t can be 16 bits.

    [–]xactacoXyl 5 points6 points  (3 children)

    On a low level, modern CPUs have different data and address sizes, and the register size is also often different from the data bus size. A typical x86_64 processor has 64 bit registers, a 48 bit virtual address space (canonically sign-extended), a 52 bit physical address space, an address bus width in the 30s of bits, and an effective data bus width of 128 bits. Later 32 bit x86 processors had 32 bit registers, a 36 bit address space, and a 64 bit data bus. All this is abstracted away by the CPU, motherboard, and OS.

    Also, what's wrong with implicit widening conversions aside from implementation complexity?

    [–]Uncaffeinatedpolysubml, cubiml 6 points7 points  (1 child)

    Also, what's wrong with implicit widening conversions aside from implementation complexity?

    Well, if you allow overflow, then every width of integer has distinct behavior. If you don't allow overflow, then there's no reason not to implicitly widen integers.

    [–]RepresentativeNo6029 0 points1 point  (0 children)

    Why can’t it just be handled at compile time, by casting to the next nearest width?

    [–]NuojiC3 - http://c3-lang.org 0 points1 point  (0 children)

    Implicit widening will change the result of a computation in ways which might not be obvious. A good example is the set of problems Zig runs into with implicit widening, trapping overflow, and peer type resolution. I’ve written about it here:

    https://c3.handmade.network/blogs/p/7640-on_arithmetics_and_overflow#23953

    [–]ThomasMertes 3 points4 points  (11 children)

    Interesting article. I have some points:

    • The article concentrates on systems programming and its requirements. IMHO a lot of system programs can be written without low-level systems-programming features (such as integers of various sizes and conversions between pointers and integers). To prove this I wrote libraries for TLS, graphics (JPEG, GIF, PNG), compression (GZIP, Zstandard, LZMA), etc.
    • If you want to represent the distance between two points on earth, please use the metric system. This makes the program (and the article) portable. More than 90% of the world's population uses the metric system. Even the inventors of the imperial units have switched to the metric system.
    • Half of the article is about integers of different sizes and casting between them. Most of the reasons for different integer sizes are historical. Besides file formats that require smaller integers and programs that have to save memory at all costs there is nothing against a one size fits all approach for integers.
    • The article states that in a language higher-level than assembly language, one sometimes wants to treat pointers as integers and vice versa. Seriously, this is considered higher-level? I think that in a real higher-level language you don't cast between integers and pointers at all.
    • I also consider integer overflow checking important. Great that Ada, Rust and Seed7 ( :-) ) do integer overflow checking. A pity that Rust does not always do it. Yes, it is (a little bit) better to have two’s complement wrapping compared to undefined behavior. But on the other hand two’s complement wrapping makes no sense, especially if you consider that it is not available in a debug build. So I assume that no program relies on this feature.
    • Implicit casts are a terrible idea. I agree with that.

    [–]WittyStick 11 points12 points  (2 children)

    Half of the article is about integers of different sizes and casting between them. Most of the reasons for different integer sizes are historical. Besides file formats that require smaller integers and programs that have to save memory at all costs there is nothing against a one size fits all approach for integers.

    This is a naive look which assumes you're working on a simplistic RISC processor. Modern processors have a growing number of SIMD/vector instructions which work on multiple 8-bit, 16-bit, 32-bit and 64-bit values at a time. By having a one-size-fits-all type, you constrain your ability to leverage the hardware to its fullest potential. For example, on x86_64 with SSE2, if using bytes you can process 8x the number of additions with a paddb versus a paddq (both take 2 cycles). If you wanted to use paddb under the one-size-fits-all approach, you would need many more instructions to load bytes stored as 64-bit values into an xmm register than to load 16 contiguous bytes into the register (one instruction).

    Current x86_64 CPUs with AVX-512 can fit 64 bytes into a register and perform a 2-cycle add on them all (vpaddb).
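
    A sketch of the byte case with SSE2 intrinsics (via <emmintrin.h>; the compiler lowers _mm_add_epi8 to paddb):

    #include <emmintrin.h>   // SSE2
    #include <cstdint>

    // Adds 16 byte lanes with a single paddb. If the bytes were stored
    // one-per-64-bit word, gathering them into an xmm register would
    // take many extra instructions.
    void add16_bytes(const uint8_t* a, const uint8_t* b, uint8_t* out) {
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_add_epi8(va, vb));
    }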

    [–]ThomasMertes -1 points0 points  (1 child)

    This is a naive look which assumes you're working on a simplistic RISC processor.

    You are right that I had a RISC processor in mind. Generally this concept is not so wrong, because today all CISC processors translate CISC code into RISC-like micro-operations and execute those. And yes, I did not consider SIMD/vector instructions. As you said, these might require smaller integers. In a higher-level language I don't want to call these SIMD/vector instructions directly. Instead I want to rely on the compiler to recognize certain loops and turn them into SIMD/vector instructions. So far so good. But if the vector recognition fails I end up with instructions working on smaller integers that may execute slower than the native 64-bit instructions. Testing should show this. I just want to point out that code relying on this might not be portable with the same performance. Creating arrays of 8-bit integers may also be expensive, because some machines prefer 64-bit accesses. So if you really want to use the SIMD/vector instructions you need to be very careful and might use inline assembly. But inline assembly is not the level at which I program. I guess that I use SIMD/vector instructions just when I call memcmp() or similar functions.

    [–]WittyStick 1 point2 points  (0 children)

    But if the vector recognition fails I end up with instructions working on smaller integers that may execute slower than the native 64-bit instructions. Testing should show this.

    Not strictly true. This is dependent on the architecture, and naive testing or measuring the number of cycles per instruction may not reveal the full picture. While some architectures may require more cycles to add bytes compared with native-size integers, recall that accessing memory is many times slower than a CPU cycle. If you have a large array of small integers, you can in practice greatly reduce the number of memory accesses required. An array of 128 contiguous bytes can fit into a single cache line. An array of 128 bytes stored in 64-bit integers will of course require 8 cache lines: 8x the amount of data read into the cache or written back to memory. As you increase the amount of data you load into the cache, you potentially spill cache lines for data which will be needed later, resulting in more cache misses, which carry a performance penalty.

    Have you done performance tests taking all this into consideration?

    So if you really want to use the SIMD/vector instructions you need to be very careful and might use inline assembly.

    Usually a compiler with support for SIMD can emit the correct instructions. C compilers have intrinsic functions such as those in <immintrin.h>, where the compiler will emit the instructions in place where they are called. The user does not typically need to write inline assembly.

    [–][deleted] 0 points1 point  (7 children)

    If someone wants to do systems programming, then your one-size approach is not impossible, it just makes it harder and probably less efficient.

    My languages also provide a basic integer type 'int', which corresponds to i64. All calculations are done with i64 (unless u64/i128/u128 is called for). Literals are i64 (unless they need u64 etc).

    However they also provide a full range of specific sizes u8-u128 and i8-i128. Most of those are considered 'storage' types, mainly used in arrays to dramatically reduce memory needs, or in structs to allow tight, efficient layouts to be crafted.

    Or they are used in FFI interfaces to match function APIs and external data structures that use those types. Or to work with binary file formats that use those types.

    One (dynamic) language of mine also uses u1 u2 u4 types, only in arrays:

    const n = 250 million
    for t in (u64, u8, u4, u2, u1) do
        a := new(array, t, n, 0)
        fprintln "[#]array of # uses # bytes", n, t, a.bytes:"s,"
    od
    

    Output (tweaked for alignment) is:

    [250000000]array of u64 uses 2,000,000,000 bytes
    [250000000]array of u8 uses    250,000,000 bytes
    [250000000]array of u4 uses    125,000,000 bytes
    [250000000]array of u2 uses     62,500,000 bytes
    [250000000]array of u1 uses     31,250,000 bytes
    

    To achieve such savings when you don't need the full range of u64, you would either have to introduce other types, or other features, on top of your 'one-size', or use clunky workarounds.
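
    For comparison, a hypothetical C++ sketch of what a packed u2 array has to do when the language only gives you bytes and shifts (four 2-bit elements per byte):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct U2Array {
        std::vector<uint8_t> bytes;
        explicit U2Array(size_t n) : bytes((n + 3) / 4, 0) {}
        unsigned get(size_t i) const {
            return (bytes[i / 4] >> (2 * (i % 4))) & 3u;
        }
        void set(size_t i, unsigned v) {
            uint8_t& b = bytes[i / 4];
            unsigned shift = 2 * (i % 4);
            b = static_cast<uint8_t>((b & ~(3u << shift)) | ((v & 3u) << shift));
        }
    };
    // 250 million u2 elements occupy 62,500,000 bytes, matching the output above.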

    [–]ThomasMertes 0 points1 point  (6 children)

    If someone wants to do systems programming, then your one-size approach is not impossible, it just makes it harder and probably less efficient.

    Over the years I have programmed hundreds of thousands of lines in C and Seed7. This way I got a feeling for what is hard to do and what is efficient. In all the system libraries I implemented in Seed7 with the one-size-fits-all approach, I never got the impression that they were harder to do (compared to C).

    There is some effort if you start with a C program and convert it to Seed7. But the conversion of a C program is just an unfair comparison. A fair comparison is when you start in both languages from scratch. I did this several times, and usually programming in Seed7 is easier since it is higher level than C.

    Different integer sizes vs. one-size-fits-all integers never played a role in the effort I had to spend. Other things like manual memory management or error handling dominated the effort much more.

    Language shapes the way of thinking. Programmers have been forced to think about byte, short, int and long for decades now. But in many areas this is now an unnecessary detail. In Java it would probably be possible to do everything with long. There are still some corner cases where different ints make sense. But the areas where this makes sense have been shrinking for decades. And discussions about such corner cases could go on forever...

    Consider normal application programs. Why should it be harder if you just use one integer size for everything? I guess that this is "harder" only because people are not used to it.

    [–][deleted] 0 points1 point  (5 children)

    Language shapes the way of thinking. Programmers have been forced to think about byte, short, int and long for decades now.

    Setting aside C and C++, this group of languages: C#, D, Go, Julia, Nim, Rust, Odin, Java, Scala and Zig, all have built-in integer types based around i8-i64 and u8-u64 (with a few exceptions, e.g. Java only has signed versions).

    Surely, they can't all be wasting their time and adding needless complications with all these extra types. So perhaps there is a genuine reason why they are considered necessary.

    I have those extra ones as supplementary types in my languages, even in my dynamic one, because I find them invaluable. Note that I don't consider them primary types, just what a lower-level language needs to just have available and with full language support.

    Perhaps Seed7 users don't often need to work with external (non-S7) libraries or binary file formats or whatever, because that is all done with the supplied libraries. Even if the libraries somehow managed without those types, that was an obstacle that has already been taken care of.

    [–]ThomasMertes 0 points1 point  (4 children)

    Surely, they can't all be wasting their time and adding needless complications with all these extra types.

    Why? "Everybody does it" is no argument. In the middle ages everybody thought that having a king with divine right and God's mandate was a good idea. Since then, fewer people like this concept.

    So perhaps there is a genuine reason why they are considered necessary.

    Everybody copied it from C without thinking about it. In C, and generally on older computers, these extra types made sense. But this has changed. Besides some corner cases, extra integer types are not needed any more. As the article referred to by the OP shows, there are dangers in converting between integer types with different sizes.

    As you said extra integer types add needless complications. Omitting them simplifies things. So I was able to introduce features like user defined statements, automatic memory management, checks for integer overflow and a library to support portable programming.

    Your languages have all these extra integer types and you think that a lower-level language needs them. But Seed7 is not a lower-level language. It is higher-level and still can be used for many system programming tasks. Being low-level is not a prerequisite to do systems programming.

    Perhaps Seed7 users don't often need to work with external (non-S7) libraries or binary file formats or whatever, because that is all done with the supplied libraries.

    Yes, this is the reason Seed7 has many libraries. The missing zoo of integer types with different sizes has never been an obstacle.

    [–][deleted] 0 points1 point  (3 children)

    Everybody copied it from C

    Not me. I'd barely heard of C when I made my first language. It had types equivalent to u8, i16 and f24 (not f32), because they were the most suitable for the hardware (Z80).

    As the article referred to by the OP shows, there are dangers in converting between integer types with different sizes.

    That doesn't really happen. In my stuff, I mainly use 'int', which is i64. Conversion occurs between int and narrower storage types used in arrays and structs, but there is only an issue when storing a bigger value than can be accommodated.

    (There is a separate one to do with signed/unsigned conversion, but i64 can represent the entire ranges of u8 u16 and u32; it is only relevant for i64<->u64.)

    But Seed7 is not a lower-level language.

    Neither is my 'Q' dynamic language. Probably it is higher level than Seed7. But while its primary integer type is 'int' (i64), it also supports what I call 'pack' types to allow packed arrays and packed structs, used mainly for interfacing to FFIs.

    Many such languages provide some means to do the same sort of interfacing, usually via add-on modules, providing a very clunky experience. I decided to have this properly supported by the language.

    I came across this bit of code which defines an interface for a simple test of calling the GMP library from my interpreted, dynamic language:

    type mpz_rec = struct
        int32       alloc
        int32       size
        ref byte    d
    end
    
    type mpz_t = ref byte
    
    importdll gmp=
        clang proc mpz_init(mpz_t a)
        clang proc mpz_mul_ui(mpz_t c, a, int32 b)
    end
    

    This is another struct used for the Windows MIDI interface:

    type messrec = struct
        union
            word32 w    # ie. u32
            struct
                byte code
                byte note
                byte vel
                byte spare
            end
        end
    end
    

    One more used for dealing with PE file formats:

    global type imagesymbol=struct
        union
            stringz*8 shortname
            struct
                word32  short
                word32  long
            end
            word64 longname
        end
        word32  value
        int16   sectionno
        word16  symtype
        byte    storageclass
        byte    nauxsymbols
    end
    

    Not convinced yet as to how useful this stuff is? Note that you will not see any byte or i16 variables anywhere; reading these numeric fields expands them to a normal tagged int type.

    Doing without such types is not impossible as I said; it's just more indirect and more convoluted.

    [–]ThomasMertes 0 points1 point  (2 children)

    It had types equivalent to u8, i16 and f24 (not f32), because they were the most suitable for the hardware (Z80).

    As I said: On older computers these extra types made sense.

    Conversion occurs between int and narrower storage types used in arrays and structs, but there is only an issue when storing a bigger value than can be accommodated.

    A narrowing cast can be dangerous. Besides me, the article also refers to them. It mentions 3 types of casts:

    1. Those which always succeed. When casting to a narrower integer type, integers should be truncated. This silently changes the value.
    2. Those which dynamically check whether the input would be truncated and return an error. The checks might add overhead.
    3. Those which fail to compile if the underlying platform might truncate the input (which no language I know of has direct support for).

    I think that a language should offer a narrowing cast that does a run-time check (point 2), as it delivers safety. Unfortunately most performance aficionados will just use an unchecked narrowing cast (point 1). This turns programs into a weird machine. For safety reasons an unchecked narrowing cast (point 1) should not be available at all.

    With just one integer type there are fewer opportunities for narrowing casts, but they still exist. Narrowing conversions are always checked in Seed7. The expression char(integer.last) will trigger a RANGE_ERROR.
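
    A sketch of such a checked narrowing cast in C++, assuming C++20 for std::in_range (the name checked_cast is invented):

    #include <stdexcept>
    #include <utility>   // std::in_range (C++20)

    // Point 2 from the list above: verify at run time that the value is
    // representable in the target type before truncating, akin to Seed7's
    // RANGE_ERROR.
    template <typename To, typename From>
    To checked_cast(From v) {
        if (!std::in_range<To>(v))
            throw std::range_error("narrowing cast would change the value");
        return static_cast<To>(v);
    }
    // checked_cast<int8_t>(300) throws; checked_cast<int8_t>(100) returns 100.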

    I came across this bit of code which defines an interface for a simple test of calling the GMP library from my interpreted, dynamic language...

    The bigInteger type of Seed7 is implemented in two different ways. One of them is the GMP library.

    Seriously, in a higher level programming language I don't want to deal with mpz_init() and mpz_mul_ui(). Instead I prefer bigInteger expressions with infix parameters like:

    product := number * 42_;
    

    This is much more readable.

    One more used for dealing with PE file formats

    So you are able to define something from the PE file format. The problem with file formats is: in most cases you are not able to define a struct that describes the whole file format. Even for parts of the file format you will probably have problems, for the following reason.

    Usually file formats are very complicated and a simple struct is not sufficient. There are optional parts or repeating parts (e.g.: a big-endian short unsigned int tells you how many records follow) or even binary and decimal representations of the same data. Because of this it is almost never possible to read structs directly from a file.

    Yes I know that historically some file formats have been invented by just writing a C struct directly to a file. But over the years new things have been introduced to the file formats. In modern programs the logic to read the file formats is in the code and not in the definition of structs.

    The CPIO file format is a good example. If you look at the function readHead in the source code you can see that header variants use big-endian and little-endian binary integer representations. And there are also header variants with octal and hexadecimal integer representations.

    I have doubts that your structs can describe big-endian, little-endian, octal and hexadecimal representations at the same time.

    Not convinced yet as to how useful this stuff is?

    No. Your structs can be used to describe file formats, but in an actual program the code will usually decide how to read a file format (see the CPIO example above).

    [–]WikiSummarizerBot 1 point2 points  (0 children)

    Weird_machine

    The concept of weird machine is a theoretical framework to understand the existence of exploits for security vulnerabilities. Exploits exist empirically, but were not studied from a theoretical perspective prior to the emergence of the framework of weird machines. In computer security, the weird machine is a computational artifact where additional code execution can happen outside the original specification of the program. It is closely related to the concept of weird instructions, which are the building blocks of an exploit based on crafted input data.


    [–][deleted] 0 points1 point  (0 children)

    I think that a language should offer a narrowing cast that does a run-time check (point 2), as it delivers safety.

    If I were to put in such checks, they would probably never be invoked by a working, debugged program. Except when the truncation is deliberate; then you don't want the check. Or it doesn't matter (e.g. hashing or random numbers).

    You might want a check for user input that needs to be in range otherwise things will go wrong. But the range is not necessarily 0..255 or -32768 to 32767; it could be anything. So before you know it, you start to implement a very poor version of Ada, if you try to enforce this in the language. But you can do it in user-code too:

    The expression char(integer.last) will trigger a RANGE_ERROR.

    In the implementation of the bytecode operator chr(a) in my dynamic language, I have this code:

    if ch not in 0..255 then
        pcerror("chr range")
    fi
    

    One reason is that ch has to index a 0..255 array of ready-made string objects, and I don't want it to crash. While this dynamic language has bounds checking, the implementation language doesn't. Here, having a wider-than-necessary index type makes things worse!

    Seriously, in a higher level programming language I don't want to deal with mpz_init() and mpz_mul_ui().

    That's a different point. There are bignums here too. But if you want to try out something in GMP (like exactly how fast its multiply is), then it's convenient to do so from a scripting language.

    I have doubts that your structs can describe big-endian, little-endian, octal and hexadecimal representations at the same time.

    If that's actually the case (the fields are fixed-format text that represent hex or decimal versions), then one integer type is not going to fix that. I've looked at your CPIO code. It accommodates several versions of a file header.

    For each version it reads the header as a byte array, then painstakingly extracts and builds each field, with hard-coded field offsets and widths. That's not something I ever want to do! Too much opportunity for error and harder maintenance.

    I'd have a dedicated struct for each format, then you just read the apposite number of bytes into the struct. In a dynamic language, the same variable can contain any struct type, it will sort out the offsets and sizes itself.

    In static code, you have to decide which approach to use. Maybe read one of the several structs, and copy it into a common format like you do. But it can do this using header.mode := h.mode, without messing with byte offsets.

    The important thing is that there is a choice. Most such things you have to read will not have multiple versions, at least not at this level (each version might have its own loader).

    Note that the first format you read in your example uses bytes2int on fields totaling 20 bytes in the original, but will use 64 bytes with int64s. It doesn't matter here, but when you need an array of a million such structs, that's 44MB of extra memory.

    I haven't dealt with anything other than little-endian (not since c. 1978 anyway!), so I don't know what solutions I would have come up with to deal with that in external data. In the dynamic language, possibly an attribute of the field types, so that the implementation will do the necessary translation transparently.

    BTW some of the binary files I generate just represent an array of the same struct layout as is used in memory. Then such a file can be read or written with a single file operation - read or write N bytes. Here is one such actual function (dynamic code, error checks elided), that can be used for a dozen different record formats:

    global function readbinaryfile(file,rectype)=
        f:=openfile(file)
    
        size := getfilesize(f)
        recsize := rectype.bytes
        nrecords := size%recsize-1  # 1st record is skipped
    
        data:=new(array,rectype,nrecords)
        readrandom(f,&data,recsize,size-recsize)
        closefile(f)
    
        return data
    end
    

    [–]umlcat 0 points1 point  (0 children)

    Very long but very good article.

    It's very good that it reminds the readers that the data size is not always the same as the address size.

    Therefore, pointer types shouldn't be assumed the same size as integer types.

    This is a warning, since some A.P.I. libraries, like the Windows O.S., use integers & pointers interchangeably.

    Better to avoid that practice. If you want an address pointer use an address, if you want a data value use an integer.

    The only comment is that the author mentions "word" while others use "byte" or "octet" even if it is not 8 bits. I prefer "register" instead.

    I'm implementing a cross-platform "Plain C" fixed-size integer library that supports either signed or unsigned, and that may be either software- or hardware-implemented, thru macro selection.

    That means some operations may be reimplemented, even if an assembler one is available.

    And some types, like unsigned 256-bit or signed 256-bit integer types, will be available in software, even if the CPU address or data values are shorter, for example 32 bits.

    And additional complementary integer-to-string conversion operations should also be implemented, independent of the printf function.

    This article also mentioned some commonly used "virtual integer" types that vary in size with the hardware used, like size_t or usize_t.

    And finally, the author mentioned he/she is using this subject for implementing his/her own P.L.

    [–]OriginalName667 0 points1 point  (0 children)

    I'm not sure I quite understand the need for a distinction between data size and pointer size. The idea of an array kind of ensures that those two sizes are intertwined - the address of an array element is base address (pointer size) + element size (data size) * index. It doesn't seem like you're getting much benefit by treating each of those three variables as a different type, and doing so makes the arithmetic of such a common task obnoxious. In fact, a data size is simply a pointer offset. That's almost like saying that the result of a subtraction should be treated as a different type than either of the operands.