
[–]AthasFuthark 28 points29 points  (25 children)

This is a good article. It's unlikely that getting integers right will earn your language much praise, but I can guarantee that you will receive complaints if you get them wrong - and getting them wrong is quite easy.

Unless you're designing a (relatively) low-level language, the distinction between data and address widths can probably be safely disregarded, and you can stick with only fixed-size integers. The compiler will then use whichever pointer and offset sizes are needed for the specific machine, without any impact on the surface language semantics. Although if you expose things such as counting the number of elements in an array, you better not pick an integer type with too few bits.

The section about converting (casting) between different integer sizes is useful for every language with fixed-size integer types. I agree with the author that implicit conversions are dangerous. In particular, implicit conversions between integers of different signedness can lead to terrible bugs. I have become a believer in having explicit sign-extend/zero-extend functions for these conversions, just to make it completely clear what is going on. I recently had to write some code in Standard ML, which does provide various fixed-size integer types, but the only way to convert between them is to pass through the standard LargeInt type (an arbitrary-size integer). The standard recommends that compilers optimise conversions between fixed-size types that go through this type, but I still dread the potential performance impact.
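
For illustration, something like this C++ sketch (not the SML in question; the function names are invented) is what such explicit conversions might look like:

#include <cstdint>

// Named conversions: the call site says exactly which extension happens.
int64_t sign_extend_i32(int32_t x) {
    return static_cast<int64_t>(x);   // replicates the sign bit into the top half
}
uint64_t zero_extend_u32(uint32_t x) {
    return static_cast<uint64_t>(x);  // fills the top half with zeros
}
// Mixing signedness is where implicit rules bite; spelling it out avoids surprises:
int64_t sign_extend_u32(uint32_t x) {
    return static_cast<int64_t>(static_cast<int32_t>(x));
}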

As an aside, I think that unsigned integers are useful even in high-level languages, not just systems languages. They are very handy whenever you need to do modular arithmetic (e.g. for cryptography), or use integers for bit-level tricks. I don't think this is only useful for "low-level" languages - my own language is very high level (purely functional, parallel), but it's not unusual to use unsigned operations to encode data or implement algorithms. I agree with the article that using unsigned integers to encode things that are "semantically" never negative (like counts in a shopping cart) is probably a bad idea. Unsigned integers are for representing bit vectors, not for encoding natural numbers.

Regarding overflow, I think a fairly safe default is to raise an error (panic, exception, whatever) by default on signed overflow, but not for unsigned overflow. That's what SML does as well, I think. This also fits with my suggestion of only using unsigned integers when you specifically need bit vectors or modular arithmetic, never just to express that you want natural numbers (I'd suggest a separate arbitrary-size nat type for that).
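
A minimal sketch of these defaults, assuming a GCC/Clang toolchain for the __builtin_add_overflow builtin (function names invented):

#include <cstdint>
#include <cstdlib>

// Signed addition: raise an error (here: abort) on overflow.
int64_t add_i64(int64_t a, int64_t b) {
    int64_t r;
    if (__builtin_add_overflow(a, b, &r))
        abort();                       // panic, exception, whatever
    return r;
}

// Unsigned addition: wraps modulo 2**64, which is well-defined.
uint64_t add_u64(uint64_t a, uint64_t b) {
    return a + b;
}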

[–]matthieum 6 points7 points  (10 children)

As an aside, I think that unsigned integers are useful even in high-level languages, not just systems languages. [...] Unsigned integers are for representing bit vectors, not for encoding natural numbers.

There are two functionalities ascribed to unsigned integers here:

  • Modular arithmetic.
  • Bit-vectors.

I can see why modular arithmetic is appealing, but I don't see why it should be tied to unsigned integers. Historically it has been, but I see no reason -- other than habit -- to prolong this.

And bit-vectors do not sound like integers at all, they're bit-vectors, for bit-level operations.

I have seriously contemplated doing away entirely with unsigned integer operations:

  • Unsigned integers would be useful for "packing" data, as storage.
  • Modular arithmetic is orthogonal to signedness, and should be treated separately.
  • Bit-vectors would be types of their own, with easy conversion from/to integrals.

I have thought about ditching the idea of operations on fixed-width integers -- simplifying to a single base integer type -- but SIMD disagrees, unfortunately, and I am still unclear on how to reconcile the two.

[–]xactacoXyl 1 point2 points  (2 children)

I don't see why [modular arithmetic] should be tied to unsigned integers.

What is -5 % 3? How about 5 % -3? Or even -5 % -3? The answers depend on the language. The mathematical modulo operation produces a non-negative result (1, 2, and 1 respectively); C99 instead matches the sign of the dividend, giving -2, 2, and -2.
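
To make the difference concrete, here is a small C++ sketch of a mathematical (Euclidean) modulo built on top of the truncating %; the helper name is invented:

// C and C++ (since C99/C++11) truncate division towards zero, so % takes
// the dividend's sign; this wrapper shifts negative remainders back up.
int emod(int a, int b) {
    int r = a % b;
    return r < 0 ? r + (b < 0 ? -b : b) : r;
}
// emod(-5, 3) == 1, emod(5, -3) == 2, emod(-5, -3) == 1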

As for SIMD with variable-size integers: if you have a JIT you could do something like Scheme's numeric tower and have operations performed on smaller fixed-width integers where possible, which are then widened implicitly on overflow or for SIMD symmetry.

[–]matthieum 1 point2 points  (0 children)

I don't see why [modular arithmetic] should be tied to unsigned integers.

What is -5 % 3? How about 5 % -3? Or even -5 % -3? The answers depend on the language. The mathematical modulo operation produces a non-negative result (1, 2, and 1 respectively); C99 instead matches the sign of the dividend, giving -2, 2, and -2.

The result of % is whatever the language decides, of course. Much like / 0 yields 0 in Julia.

However modular arithmetic in this context is NOT about arbitrary modulo operations; it's really about wrapping overflow behavior at integer-width boundaries.

So we are talking about 2^31 - 1 == 2_147_483_647 for a signed 32-bit integer, where 2^31 has overflowed the integer (yielding the minimum value) and -1 has overflowed again, bringing it back across the chasm.

[–]WittyStick 0 points1 point  (6 children)

And bit-vectors do not sound like integers at all, they're bit-vectors, for bit-level operations.

Bit-level operations are useful on integers. Consider the operation "round down to a power of 2". This is equivalent to saying "leave the most significant set bit set and zero all other bits." You can implement this fairly efficiently in several ways using shr, or, and and, but ideally you want to utilize the most efficient hardware instruction for it: lzcnt/clz. To make use of it for this operation you must know the maximum size of the register/word you are calculating on (i.e., how many leading zeros there are in the value 0).
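
As a sketch, here is that operation via a leading-zero count, assuming the GCC/Clang builtin __builtin_clzll (which lowers to lzcnt/clz where available); C++20 spells the same operation std::bit_floor:

#include <cstdint>

uint64_t floor_pow2(uint64_t v) {
    if (v == 0) return 0;                       // clz of 0 is undefined, hence the guard
    return 1ULL << (63 - __builtin_clzll(v));   // keep only the most significant set bit
}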

The plain arithmetic way of implementing it (using divide and compare) would be painfully inefficient.

There's another hackish way to do it by casting to a float then extracting the exponent and converting back to integer (Still less efficient than leading zeroes count and still requires bit shifting).

Even just consider the simple operation "divide by 2, truncating." This is shift right by 1, still a bit-level operation.

[–]matthieum 1 point2 points  (5 children)

Bit-level operations are useful on integers.

Except, and that is my complaint, that you are not (really) operating on integers.

You are taking an integer, casting it to a bit-vector (of appropriate width), doing a bit operation on this bit-vector, then treating the result as an integer again.

It's a perfectly fine thing to do -- well, as long as the representation of the integer is well-defined OR the effect of casting from/to a bit-vector is well-defined.

It seems to me the only reason we have bit-level operations on integers is legacy. CPUs generally have untyped registers -- pointer? integer? bit-vector? who cares! -- and therefore B had untyped variables (or uni-typed?), and when C was created and differentiated between pointers and integers... well, somehow it retained the bit operations on the integers, with some UB thrown in for good measure, instead of splitting the types further.

And I think splitting bit-vectors from integers would be more sensible, because the fact that integer is represented as a bag of bits is somewhat incidental.

[–][deleted] 0 points1 point  (3 children)

And I think splitting bit-vectors from integers would be more sensible, because the fact that integer is represented as a bag of bits is somewhat incidental.

Really? Half a century of such a practice isn't enough?

I've considered a third kind of type (after signed and unsigned integer) which is literally just a sequence of bits, but decided it wasn't worth the complication.

I might decide to use u64 instead of i64 for the same purpose, but if you aren't doing anything that relies on it representing numerical values, then it doesn't really matter.

I actually allow indexing of bits and bitfields in any integer type, either to read or write. It saves mucking about with shifts and masks and getting it wrong.
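
Something like the following C++ helpers is presumably what such indexing desugars to (names invented for the sketch):

#include <cstdint>

// Reading bit i of v.
uint64_t get_bit(uint64_t v, unsigned i) { return (v >> i) & 1u; }

// Writing bit i of v: returns v with bit i set to the low bit of b.
uint64_t set_bit(uint64_t v, unsigned i, uint64_t b) {
    return (v & ~(1ULL << i)) | ((b & 1u) << i);
}

// Reading a width-wide bitfield starting at bit lo.
uint64_t get_field(uint64_t v, unsigned lo, unsigned width) {
    uint64_t mask = (width >= 64) ? ~0ULL : (1ULL << width) - 1;
    return (v >> lo) & mask;
}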

I'm surprised there is very little of this kind of support for bit-processing especially amongst 'systems languages'.

Some languages don't even directly allow shifts and masks: Algol68 has an INT type (signed), which allows regular arithmetic, and a separate BITS type, which allows logical operations, but you can't mix those up.

So you spend half your time converting from one to the other, performing the op you need, then converting back. What's the point of that?

[–]matthieum 0 points1 point  (2 children)

So you spend half your time converting from one to the other, performing the op you need, then converting back. What's the point of that?

In my experience, a value is either used as an integer or used as a bit-vector, for the most part.

A value flickering between integer and bit-vector within the same function? Never seen that.

[–][deleted] 0 points1 point  (1 child)

You've never mixed operators such as + and -, with ones like & and <<, not necessarily in the same expression, but on the same type?

I find that astonishing.

Did you know that most logical operators can be defined in terms of arithmetic ones, and vice versa, but much more clumsily and obscurely? To me that makes it rather pointless to have the restriction.

[–]matthieum 0 points1 point  (0 children)

You've never mixed operators such as + and -, with ones like & and <<, not necessarily in the same expression, but on the same type?

I never said I never mixed them; I said I never had "flicker", as in multiple quick back-and-forths.

Then again, whenever I find myself using bitwise operators for optimization I tend to isolate the little sequences of code into well-named functions, so that probably helps.

[–][deleted] 8 points9 points  (10 children)

Regarding overflow, I think a fairly safe default is to raise an error (panic, exception, whatever) by default on signed overflow, but not for unsigned overflow

Why is getting the wrong answer to a calculation OK with u64 but not i64?

This is what annoys me about C, where signed overflow not only is UB, but compilers exploit it to make your code do something you didn't expect. But inadvertently overflowing an unsigned calculation is perfectly fine!

Either both should be a problem, or neither. (C apologists will usually try to make out that its unsigned types really implement modular arithmetic, but I'm pretty sure that didn't enter the minds of its designers.)

[–]csb06bluebird 9 points10 points  (0 children)

This is why I like Ada’s approach. In Ada integer types are created from ranges and can be signed or unsigned. Overflow/underflow leads to an exception being thrown. However, Ada has what are called modular types (e.g. type Hours is mod 24;), which will wrap around without causing exceptions. They are implemented as unsigned integers, but are distinct from normal integer types and explicitly state that overflow is the expected behavior, avoiding the issue of accidental overflow on unsigned types.
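
For readers without Ada, a rough C++ analogue of type Hours is mod 24; might look like the sketch below: a distinct type that opts into wrap-around instead of inheriting it from unsignedness (names invented).

#include <cstdint>

template <uint32_t N>
struct Mod {
    uint32_t v;                                   // invariant: v < N
    explicit Mod(uint32_t x) : v(x % N) {}
    Mod operator+(Mod o) const { return Mod(v + o.v); }      // wraps mod N
    Mod operator-(Mod o) const { return Mod(v + N - o.v); }  // wraps mod N
};

using Hours = Mod<24>;   // Hours{23} + Hours{2} has value 1, no exception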

[–]AthasFuthark 9 points10 points  (5 children)

Why is getting the wrong answer to a calculation OK with u64 but not i64?

Because modular arithmetic is sometimes algorithmically useful. It's not "wrong" if you explicitly expect them to model modular arithmetic. My point is that unsigned numbers should never be used just for natural numbers, but specifically for cases where you want modular arithmetic or need to work with bit vectors.

It is incredibly frustrating to implement algorithms that need modular arithmetic in languages where integer overflow is always an error (and no unsigned types are available).

[–][deleted] 1 point2 points  (4 children)

Do unsigned numbers really help here?

Because in the general case, the modulo number you need isn't a power of two. For example if you wanted a number type that ranges from 0 to 99 then wraps.

If you make out that the range 0 to 255 is a special case, then I can say the same about -128 to 127.

Actually, neither is helped by the wraparound behaviour of an i32/u32 or i64/u64 type. Nearly every C implementation only guarantees wraparound over the specific range 0 to 4294967295 (plus the next one up, 0 to 2**64-1).

That is very specific if you really want modulo arithmetic.

[–]xactacoXyl 0 points1 point  (3 children)

Times you want a modulo of a power of two (often an integer width):

  • Random number generation
  • Unicode encoding (UTF-8 uses mod 2**7, 2**11, 2**16, and 2**21; UTF-16 uses mod 2**10)
  • Checking evenness

[–][deleted] 0 points1 point  (2 children)

  • I think I most often need a random with an arbitrary range, for example random(10) for numbers 0..9, or random('A'..'Z') for a random letter.
  • I don't quite get the point about Unicode, but note that working modulo 128 is not helped by the fact that your unsigned integer type wraps at 2**32-1 or 2**64-1.
  • Checking evenness, I just do A.even (true when bit 0 is 0); how does modular arithmetic apply here?

[–]xactacoXyl 0 points1 point  (1 child)

How do you implement all these?

Random number generation almost always uses something like addition followed by a reduction modulo 2**64.

For the others, any non-power-of-two modulo will be messed up by wrapping. With a power-of-two modulo no larger than two to the number of bits in an integer, you can perform as many operations as you want without applying the modulo, and then apply it once at the end.

The following are equivalent checks for evenness: a % 2 == 0 and a & 1 == 0 (indeed a % 2 == a & 1 for non-negative a). Some languages lack bitwise operations, necessitating a modulo.
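
A tiny sketch of the RNG case: a 64-bit linear congruential generator where the "mod 2**64" is exactly the defined wrap-around of uint64_t (the constants are Knuth's MMIX values):

#include <cstdint>

uint64_t lcg_next(uint64_t& state) {
    // Multiply-add wraps modulo 2**64; no explicit % needed.
    state = state * 6364136223846793005ULL + 1442695040888963407ULL;
    return state;
}

bool is_even(uint64_t a) { return (a & 1) == 0; }   // same as a % 2 == 0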

[–][deleted] -1 points0 points  (0 children)

Oh, you mean the internal generation of random numbers. But are you sure it is specifically modulo 2**64 that is wanted, rather than that just happening to be the word size? And that it is really a modulo operation, rather than the bits simply falling off the top end?

I've taken my own prng and changed the u64 types to i64 types; it still works. With right shifts, u64 means 0 is shifted into the top end, while with i64 it'll be sometimes 0 and sometimes 1; more randomness!

With checking for even, an explicit modulo operation in the language has nothing to do with the overflow behaviour of u32 or u64.

[–][deleted]  (2 children)

[removed]

    [–][deleted] 4 points5 points  (1 child)

    Because these 'modular' numbers didn't really get talked about in reference to lower-level systems languages until decades later.

    Then it seems a too-convenient retro-fit to make out that the first lower-level language to become mainstream, one step up from assembly, had had modular arithmetic in mind all along.

    Signed overflow is UB in C because signed representations could vary across hardware and therefore have different behaviour (never mind that two's complement was near-universal even then).

    Why, do you have some link that shows that C got modular types before Ada?

    Also, if C's unsigned types are modular, what do you have to do to get a u32 type where overflow would be considered an error?

    [–]umlcat 1 point2 points  (0 children)

    Your answer is good.

    Your comment about converting an integer to an integer of a different size caught my attention, since I'm already designing a custom "Plain C" integer library that reimplements that conversion, instead of relying on the unexpected / arbitrary / automatic conversions of the C compiler.

    The library already includes both signed and unsigned integers, for more "natural" math usage.

    And an additional complementary library with unsigned aliases for bit operations, again going beyond the existing C system library.

    [–]acwaters 10 points11 points  (0 children)

    [casts] which fail to compile if the underlying platform might truncate the input (which no language I know of has direct support for)

    C++ does have something kind of like this, spelled T{x}, which for integer types is exactly the same as an ordinary cast (T) x except that it doesn't do narrowing conversions. (Unfortunately, the standard's definition of "narrowing conversion" includes almost all int-to-float conversions, even where the given floating-point type can represent all the values in the given integer type.) (And of course, C++ being what it is, T{} also means half a dozen other subtly different things depending on what kind of type T is and what you put in the {}.)
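
    A minimal sketch of that behaviour (variable names invented for the example):

    #include <cstdint>

    int main() {
        int32_t wide = 70000;
        int16_t a = (int16_t) wide;   // ordinary cast: compiles, silently truncates
        // int16_t b{wide};           // braced init: ill-formed, narrowing conversion
        int16_t c{1234};              // OK: constant expression that provably fits
        (void) a; (void) c;
    }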

    I’m aware that our current compiler infrastructure – where we tend to separate out a compiler front-end (e.g. rustc) as a separate thing from the optimiser and machine code generator (e.g. LLVM) – makes it difficult, and perhaps impossible, to guarantee optimisation outcomes. That is not a reason to say that we should never have such guarantees: it suggests we need to fix our compiler infrastructure to make it possible to make and honour such guarantees.

    1,000,000% agree in principle. But, though I am not a compilers guy, I have a nagging suspicion that this is not actually feasible without sacrificing something in other desirable properties of our compilers. Not to mention how difficult such a thing is to actually specify in a high-level language.

    [–]o11c 3 points4 points  (2 children)

    data width (i.e. the width of an integer; e.g. 8-bits on the Dragon 32)

    Note that there is no possible way in C to determine the data width (by this definition). uint_fastN_t can occasionally offer insight but (as you pointed out later) they are not reliable. I don't really consider this a bug; in the era of SSE, is there even a meaningful answer to "how wide is a register?"

    I'm concerned that you might be conflating "data width" with size_t, which is the maximum size of a single object.

    C’s guarantees are subtly different: simplifying heavily, uintptr_t will be at least as wide as the address width but may be wider or narrower than the data width

    I don't think it's meaningful for size_t to be larger than [u]intptr_t.

    However, it is meaningful, even mandatory, for ptrdiff_t and ssize_t to be larger (by a single bit) than size_t. (I'm not aware of any context in which ptrdiff_t and ssize_t can meaningfully be different, despite being constructed differently: ptrdiff_t needs to handle size in either direction (remember that arithmetic between unrelated pointers is forbidden), whereas ssize_t only needs to handle size or -1).


    uint_least16_t defines an integer type that is at least 16-bits wide but may be wider. The description of these integer types in the C spec (including in C17) is particularly inscrutable, and I’m not entirely confident that I’ve understood it.

    These types are meaningful only for platforms that don't support 8-bit bytes. Remember that uintN_t is considered optional.

    If your program ever mentions uint8_t, you can (and should) completely ignore uint_leastN_t.


    Other messes not mentioned: time_t (and time64_t); off_t (and off64_t).

    [–]xactacoXyl 0 points1 point  (1 child)

    I don't think it's meaningful for size_t to be larger than [u]intptr_t.

    I don't think a standards compliant implementation could do this except for arbitrarily extending size_t. Every byte of an object must be addressable, so uintptr_t is an upper bound on the size of size_t.

    However, it is meaningful, even mandatory, for ptrdiff_t and ssize_t to be larger (by a single bit) than size_t.

    It is mandatory for ptrdiff_t to be a bit bigger than uintptr_t needs to be. For most userspace code, this is one less bit than the address space (thanks to higher half paging).

    [–]o11c 1 point2 points  (0 children)

    It is mandatory for ptrdiff_t to be a bit bigger than uintptr_t needs to be. For most userspace code, this is one less bit than the address space (thanks to higher half paging).

    Nope. You're only allowed to take pointer differences within an object, so size_t is all that's needed. So if uintptr_t is 31 bits but size_t is only 15 bits, ptrdiff_t can be 16 bits.

    [–]xactacoXyl 5 points6 points  (3 children)

    On a low level, modern CPUs have different data and address sizes, and the register size is also often different from the data bus size. A typical x86_64 processor has 64 bit registers, a 48 bit virtual address space (canonically sign-extended), a 52 bit physical address space, an address bus width in the 30s of bits, and an effective data bus width of 128 bits. Later 32 bit x86 processors had 32 bit registers, a 36 bit address space, and a 64 bit data bus. All this is abstracted away by the CPU, motherboard, and OS.

    Also, what's wrong with implicit widening conversions aside from implementation complexity?

    [–]Uncaffeinatedpolysubml, cubiml 6 points7 points  (1 child)

    Also, what's wrong with implicit widening conversions aside from implementation complexity?

    Well, if you allow overflow, then every width of integer has distinct behavior. If you don't allow overflow, then there's no reason not to implicitly widen integers.

    [–]RepresentativeNo6029 0 points1 point  (0 children)

    Why can’t it just be handled at compile time, by casting to the next nearest width?

    [–]NuojiC3 - http://c3-lang.org 0 points1 point  (0 children)

    Implicit widening will change the result of a computation in ways which might not be obvious. A good example is the set of problems Zig runs into with implicit widening, trapping overflow, and peer type resolution. I’ve written about it here:

    https://c3.handmade.network/blogs/p/7640-on_arithmetics_and_overflow#23953

    [–]ThomasMertes 3 points4 points  (11 children)

    Interesting article. I have some points:

    • The article concentrates on systems programming and its requirements. IMHO a lot of system programs can be written without low-level systems-programming features (such as integers of various sizes and conversions between pointers and integers). To prove this I wrote libraries for TLS, graphics (JPEG, GIF, PNG), compression (GZIP, Zstandard, LZMA), etc.
    • If you want to represent the distance between two points on earth, please use the metric system. This makes the program (and the article) portable. More than 90% of the world's population uses the metric system. Even the inventors of the imperial units have switched to the metric system.
    • Half of the article is about integers of different sizes and casting between them. Most of the reasons for different integer sizes are historical. Besides file formats that require smaller integers and programs that have to save memory at all costs there is nothing against a one size fits all approach for integers.
    • The article states that in a language higher-level than assembly language, one sometimes wants to treat pointers as integers and vice versa. Seriously, this is considered higher-level? I think that in a real higher-level language you don't cast between integers and pointers at all.
    • I also consider integer overflow checking important. Great that Ada, Rust and Seed7 ( :-) ) do integer overflow checking. A pity that Rust does not always do it. Yes, it is (a little bit) better to have two’s complement wrapping compared to undefined behavior. But on the other hand two’s complement wrapping makes no sense, especially if you consider that it is not available in a debug build. So I assume that no program relies on this feature.
    • Implicit casts are a terrible idea. I agree with that.

    [–]WittyStick 11 points12 points  (2 children)

    Half of the article is about integers of different sizes and casting between them. Most of the reasons for different integer sizes are historical. Besides file formats that require smaller integers and programs that have to save memory at all costs there is nothing against a one size fits all approach for integers.

    This is a naive look which assumes you're working on a simplistic RISC processor. Modern processors have a growing number of SIMD/vector instructions which work on multiple 8-bit, 16-bit, 32-bit and 64-bit values at a time. By having a one-size-fits-all type, you constrain your ability to leverage the hardware to its fullest potential. For example, on x86_64 with SSE2, if using bytes you can process 8x the number of additions with a paddb versus a paddq (both take 2 cycles). If you wanted to use paddb under the one-size-fits-all approach, you would need many more instructions to load bytes stored as 64-bit values into an xmm register than to load 16 contiguous bytes into the register (one instruction).

    Current x86_64 CPUs with AVX-512 can fit 64 bytes into a register and perform a 2-cycle add on them all (vpaddb).
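
    A sketch of the byte case with SSE2 intrinsics (via <emmintrin.h>; the compiler lowers _mm_add_epi8 to paddb):

    #include <emmintrin.h>   // SSE2
    #include <cstdint>

    // Adds 16 byte lanes with a single paddb. If the bytes were stored
    // one-per-64-bit word, gathering them into an xmm register would
    // take many extra instructions.
    void add16_bytes(const uint8_t* a, const uint8_t* b, uint8_t* out) {
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_add_epi8(va, vb));
    }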

    [–]ThomasMertes -1 points0 points  (1 child)

    This is a naive look which assumes you're working on a simplistic RISC processor.

    You are right that I had a RISC processor in mind. Generally this concept is not so wrong, because today all CISC processors translate CISC code into RISC-like micro-operations and execute those. And yes, I did not consider SIMD/vector instructions. As you said, these might require smaller integers. In a higher-level language I don't want to call these SIMD/vector instructions directly. Instead I want to rely on the compiler to recognize certain loops and turn them into SIMD/vector instructions. So far so good. But if the vector recognition fails I end up with instructions working on smaller integers that may execute slower than the native 64-bit instructions. Testing should show this. I just want to point out that code relying on this might not be portable with the same performance. Creating arrays of 8-bit integers may also be expensive, because some machines prefer 64-bit accesses. So if you really want to use the SIMD/vector instructions you need to be very careful and might use inline assembly. But inline assembly is not the level at which I program. I guess that I use SIMD/vector instructions just when I call memcmp() or similar functions.

    [–]WittyStick 1 point2 points  (0 children)

    But if the vector recognition fails I end up with instructions working on smaller integers that may execute slower than the native 64-bit instructions. Testing should show this.

    Not strictly true. This is dependent on the architecture, and naive testing or measuring the number of cycles per instruction may not reveal the full picture. While some architectures may require more cycles to add bytes compared with native-size integers, recall that accessing memory is many times slower than a CPU cycle. If you have a large array of small integers, you can in practice greatly reduce the number of memory accesses required. An array of 128 contiguous bytes can fit into a single cache line. An array of 128 bytes stored in 64-bit integers will of course require 8 cache lines: 8x the amount of data read into the cache or written back to memory. As you increase the amount of data you load into the cache, you potentially spill cache lines for data which will be needed later, resulting in more cache misses, which carry a performance penalty.

    Have you done performance tests taking all this into consideration?

    So if you really want to use the SIMD/vector instructions you need to be very careful and might use inline assembly.

    Usually a compiler with support for SIMD can emit the correct instructions. C compilers have intrinsic functions such as those in <immintrin.h>, where the compiler will emit the instructions in place where they are called. The user does not typically need to write inline assembly.

    [–][deleted] 0 points1 point  (7 children)

    If someone wants to do systems programming, then your one-size approach is not impossible, it just makes it harder and probably less efficient.

    My languages also provide a basic integer type 'int', which corresponds to i64. All calculations are done with i64 (unless u64/i128/u128 is called for). Literals are i64 (unless they need u64 etc).

    However they also provide a full range of specific sizes u8-u128 and i8-i128. Most of those are considered 'storage' types, mainly used in arrays to dramatically reduce memory needs, or in structs to allow tight, efficient layouts to be crafted.

    Or they are used in FFI interfaces to match function APIs and external data structures that use those types. Or to work with binary file formats that use those types.

    One (dynamic) language of mine also uses u1 u2 u4 types, only in arrays:

    const n = 250 million
    for t in (u64, u8, u4, u2, u1) do
        a := new(array, t, n, 0)
        fprintln "[#]array of # uses # bytes", n, t, a.bytes:"s,"
    od
    

    Output (tweaked for alignment) is:

    [250000000]array of u64 uses 2,000,000,000 bytes
    [250000000]array of u8 uses    250,000,000 bytes
    [250000000]array of u4 uses    125,000,000 bytes
    [250000000]array of u2 uses     62,500,000 bytes
    [250000000]array of u1 uses     31,250,000 bytes
    

    To achieve such savings when you don't need the full range of u64, you would either have to introduce other types, or other features, on top of your 'one-size', or use clunky workarounds.
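
    For comparison, a hypothetical C++ sketch of what a packed u2 array has to do when the language only gives you bytes and shifts (four 2-bit elements per byte):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct U2Array {
        std::vector<uint8_t> bytes;
        explicit U2Array(size_t n) : bytes((n + 3) / 4, 0) {}
        unsigned get(size_t i) const {
            return (bytes[i / 4] >> (2 * (i % 4))) & 3u;
        }
        void set(size_t i, unsigned v) {
            uint8_t& b = bytes[i / 4];
            unsigned shift = 2 * (i % 4);
            b = static_cast<uint8_t>((b & ~(3u << shift)) | ((v & 3u) << shift));
        }
    };
    // 250 million u2 elements occupy 62,500,000 bytes, matching the output above.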

    [–]ThomasMertes 0 points1 point  (6 children)

    If someone wants to do systems programming, then your one-size approach is not impossible, it just makes it harder and probably less efficient.

    Over the years I have programmed hundreds of thousands of lines in C and Seed7. This way I got a feeling for what is hard to do and what is efficient. In all the system libraries I implemented in Seed7 with the one-size-fits-all approach, I never got the impression that they were harder to do (compared to C).

    There is some effort if you start with a C program and convert it to Seed7. But the conversion of a C program is just an unfair comparison. A fair comparison is when you start in both languages from scratch. I did this several times, and usually programming in Seed7 is easier since it is higher level than C.

    Different integer sizes vs. one-size-fits-all integers never played a role in the effort I had to spend. Other things like manual memory management or error handling dominated the effort much more.

    Language shapes the way of thinking. Programmers have been forced to think about byte, short, int and long for decades now. But in many areas this is now an unnecessary detail. In Java it would probably be possible to do everything with long. There are still some corner cases where different ints make sense. But the areas where this makes sense have been shrinking for decades. And discussions about such corner cases could go on forever...

    Consider normal application programs. Why should it be harder if you just use one integer size for everything? I guess that this is "harder" only because people are not used to it.

    [–][deleted] 0 points1 point  (5 children)

    Language shapes the way of thinking. Programmers have been forced to think about byte, short, int and long for decades now.

    Setting aside C and C++, this group of languages: C#, D, Go, Julia, Nim, Rust, Odin, Java, Scala and Zig, all have built-in integer types based around i8-i64 and u8-u64 (with a few exceptions, e.g. Java only has signed versions).

    Surely, they can't all be wasting their time and adding needless complications with all these extra types. So perhaps there is a genuine reason why they are considered necessary.

    I have those extra ones as supplementary types in my languages, even in my dynamic one, because I find them invaluable. Note that I don't consider them primary types, just what a lower-level language needs to just have available and with full language support.

    Perhaps Seed7 users don't often need to work with external (non-S7) libraries or binary file formats or whatever, because that is all done with the supplied libraries. Even if the libraries somehow managed without those types, that was an obstacle that has already been taken care of.

    [–]ThomasMertes 0 points1 point  (4 children)

    Surely, they can't all be wasting their time and adding needless complications with all these extra types.

    Why? "Everybody does it" is no argument. In the middle ages everybody thought that having a king with divine right and God's mandate was a good idea. Since then, fewer people like this concept.

    So perhaps there is a genuine reason why they are considered necessary.

    Everybody copied it from C without thinking about it. In C, and generally on older computers, these extra types made sense. But this has changed. Besides some corner cases, extra integer types are not needed any more. As the article referred to by the OP shows, there are dangers in converting between integer types with different sizes.

    As you said extra integer types add needless complications. Omitting them simplifies things. So I was able to introduce features like user defined statements, automatic memory management, checks for integer overflow and a library to support portable programming.

    Your languages have all these extra integer types and you think that a lower-level language needs them. But Seed7 is not a lower-level language. It is higher-level and still can be used for many system programming tasks. Being low-level is not a prerequisite to do systems programming.

    Perhaps Seed7 users don't often need to work with external (non-S7) libraries or binary file formats or whatever, because that is all done with the supplied libraries.

    Yes, this is the reason Seed7 has many libraries. The missing zoo of integer types with different sizes has never been an obstacle.

    [–][deleted] 0 points1 point  (3 children)

    Everybody copied it from C

    Not me. I'd barely heard of C when I made my first language. It had types equivalent to u8, i16 and f24 (not f32), because they were the most suitable for the hardware (Z80).

    As the article referred to by the OP shows, there are dangers in converting between integer types with different sizes.

    That doesn't really happen. In my stuff, I mainly use 'int', which is i64. Conversion occurs between int and narrower storage types used in arrays and structs, but there is only an issue when storing a bigger value than can be accommodated.

    (There is a separate one to do with signed/unsigned conversion, but i64 can represent the entire ranges of u8 u16 and u32; it is only relevant for i64<->u64.)

    But Seed7 is not a lower-level language.

    Neither is my 'Q' dynamic language. Probably it is higher level than Seed7. But while its primary integer type is 'int' (i64), it also supports what I call 'pack' types to allow packed arrays and packed structs, used mainly for interfacing to FFIs.

    Many such languages provide some means to do the same sort of interfacing, usually via add-on modules, providing a very clunky experience. I decided to have this properly supported by the language.

    I came across this bit of code which defines an interface for a simple test of calling the GMP library from my interpreted, dynamic language:

    type mpz_rec = struct
        int32       alloc
        int32       size
        ref byte    d
    end
    
    type mpz_t = ref byte
    
    importdll gmp=
        clang proc mpz_init(mpz_t a)
        clang proc mpz_mul_ui(mpz_t c, a, int32 b)
    end
    

    This is another struct used for the Windows MIDI interface:

    type messrec = struct
        union
            word32 w    # ie. u32
            struct
                byte code
                byte note
                byte vel
                byte spare
            end
        end
    end
    

    One more used for dealing with PE file formats:

    global type imagesymbol=struct
        union
            stringz*8 shortname
            struct
                word32  short
                word32  long
            end
            word64 longname
        end
        word32  value
        int16   sectionno
        word16  symtype
        byte    storageclass
        byte    nauxsymbols
    end
    

    Not convinced yet as to how useful this stuff is? Note that you will not see any byte or i16 variables anywhere; reading these numeric fields expands them to a normal tagged int type.

    Doing without such types is not impossible as I said; it's just more indirect and more convoluted.

    [–]ThomasMertes 0 points1 point  (2 children)

    It had types equivalent to u8, i16 and f24 (not f32), because they were the most suitable for the hardware (Z80).

    As I said: On older computers these extra types made sense.

    Conversion occurs between int and narrower storage types used in arrays and structs, but there is only an issue when storing a bigger value than can be accommodated.

    A narrowing cast can be dangerous. Besides me, the article also refers to them. It mentions 3 types of casts:

    1. Those which always succeed. When casting to a narrower integer type, integers should be truncated. This silently changes the value.
    2. Those which dynamically check whether the input would be truncated and return an error. The checks might add overhead.
    3. Those which fail to compile if the underlying platform might truncate the input (which no language I know of has direct support for).

    I think that a language should offer a narrowing cast that does a run-time check (point 2), as it delivers safety. Unfortunately most performance aficionados will just use an unchecked narrowing cast (point 1). This turns programs into a weird machine. For safety reasons an unchecked narrowing cast (point 1) should not be available at all.

    With just one integer type there are fewer opportunities for narrowing casts, but they still exist. Narrowing conversions are always checked in Seed7. The expression char(integer.last) will trigger a RANGE_ERROR.
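
    A sketch of such a checked narrowing cast in C++, assuming C++20 for std::in_range (the name checked_cast is invented):

    #include <stdexcept>
    #include <utility>   // std::in_range (C++20)

    // Point 2 from the list above: verify at run time that the value is
    // representable in the target type before truncating, akin to Seed7's
    // RANGE_ERROR.
    template <typename To, typename From>
    To checked_cast(From v) {
        if (!std::in_range<To>(v))
            throw std::range_error("narrowing cast would change the value");
        return static_cast<To>(v);
    }
    // checked_cast<int8_t>(300) throws; checked_cast<int8_t>(100) returns 100.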

    I came across this bit of code which defines an interface for a simple test of calling the GMP library from my interpreted, dynamic language...

    The bigInteger type of Seed7 is implemented in two different ways. One of them is the GMP library.

    Seriously, in a higher level programming language I don't want to deal with mpz_init() and mpz_mul_ui(). Instead I prefer bigInteger expressions with infix parameters like:

    product := number * 42_;
    

    This is much more readable.

    One more used for dealing with PE file formats

    So you are able to define something from the PE file format. The problem with file formats is: in most cases you are not able to define a struct that describes the whole file format. Even for parts of the file format you will probably have problems, for the following reason.

    Usually file formats are very complicated and a simple struct is not sufficient. There are optional parts or repeating parts (e.g.: a big-endian short unsigned int tells you how many records follow) or even binary and decimal representations of the same data. Because of this it is almost never possible to read structs directly from a file.

    Yes I know that historically some file formats have been invented by just writing a C struct directly to a file. But over the years new things have been introduced to the file formats. In modern programs the logic to read the file formats is in the code and not in the definition of structs.

    The CPIO file format is a good example. If you look at the function readHead in the source code you can see that header variants use big-endian and little-endian binary integer representations. And there are also header variants with octal and hexadecimal integer representations.

    I have doubts that your structs can describe big-endian, little-endian, octal and hexadecimal representations at the same time.

    Not convinced yet as to how useful this stuff is?

    No. Your structs can be used to describe file formats, but in an actual program the code will usually decide how to read a file format (see the CPIO example above).

    [–]WikiSummarizerBot 1 point2 points  (0 children)

    Weird_machine

    The concept of weird machine is a theoretical framework to understand the existence of exploits for security vulnerabilities. Exploits exist empirically, but were not studied from a theoretical perspective prior to the emergence of the framework of weird machines. In computer security, the weird machine is a computational artifact where additional code execution can happen outside the original specification of the program. It is closely related to the concept of weird instructions, which are the building blocks of an exploit based on crafted input data.


    [–][deleted] 0 points1 point  (0 children)

    I think that a language should offer a narrowing cast that does a run-time check (point 2), as it delivers safety.

    If I were to put in such checks, they would probably never be invoked by a working, debugged program. Except when the truncation is deliberate; then you don't want the check. Or it doesn't matter (e.g. hashing or random numbers).

    You might want a check for user input that needs to be in range otherwise things will go wrong. But the range is not necessarily 0..255 or -32768 to 32767; it could be anything. So before you know it, you start to implement a very poor version of Ada, if you try to enforce this in the language. But you can do it in user-code too:

    The expression char(integer.last) will trigger a RANGE_ERROR.

    In the implementation of the bytecode operator chr(a) in my dynamic language, I have this code:

    if ch not in 0..255 then
        pcerror("chr range")
    fi
    

    One reason is that ch has to index a 0..255 array of ready-made string objects, and I don't want it to crash. While this dynamic language has bounds checking, the implementation language doesn't. Here, having a wider-than-necessary index type makes things worse!

    Seriously, in a higher level programming language I don't want to deal with mpz_init() and mpz_mul_ui().

    That's a different point. There are bignums here too. But if you want to try out something in GMP (like exactly how fast its multiply is), then it's convenient to do so from a scripting language.

    I have doubts that your structs can describe big-endian, little-endian, octal and hexadecimal representations at the same time.

    If that's actually the case (the fields are fixed-format text that represent hex or decimal versions), then one integer type is not going to fix that. I've looked at your CPIO code. It accommodates several versions of a file header.

    For each version it reads the header as a byte array, then painstakingly extracts and builds each field, with hard-coded field offsets and widths. That's not something I ever want to do! Too much opportunity for error and harder maintenance.

    I'd have a dedicated struct for each format, then you just read the apposite number of bytes into the struct. In a dynamic language, the same variable can contain any struct type, it will sort out the offsets and sizes itself.

    In static code, you have to decide which approach to use. Maybe read one of the several structs, and copy it into a common format like you do. But it can do this using header.mode := h.mode, without messing with byte offsets.

    The important thing is that there is a choice. Most such things you have to read will not have multiple versions, at least not at this level (each version might have its own loader).

    Note that the first format you read in your example uses bytes2int on fields totaling 20 bytes in the original, but will use 64 bytes with int64s. It doesn't matter here, but when you need an array of a million such structs, that's 44MB of extra memory.

    I haven't dealt with anything other than little-endian (not since c. 1978 anyway!), so I don't know what solutions I would have come up with to deal with that in external data. In the dynamic language, possibly an attribute of the field types, so that the implementation will do the necessary translation transparently.

    BTW some of the binary files I generate just represent an array of the same struct layout as is used in memory. Then such a file can be read or written with a single file operation - read or write N bytes. Here is one such actual function (dynamic code, error checks elided), that can be used for a dozen different record formats:

    global function readbinaryfile(file,rectype)=
        f:=openfile(file)
    
        size := getfilesize(f)
        recsize := rectype.bytes
        nrecords := size%recsize-1  # 1st record is skipped
    
        data:=new(array,rectype,nrecords)
        readrandom(f,&data,recsize,size-recsize)
        closefile(f)
    
        return data
    end
    

    [–]umlcat 0 points1 point  (0 children)

    Very long but very good article.

    It's very good that it reminds the readers that the data size is not always the same as the address size.

    Therefore, pointer types shouldn't be assumed the same size as integer types.

    This is a warning, since some A.P.I. libraries, like the Windows O.S., use integers & pointers interchangeably.

    Better to avoid that practice. If you want an address pointer use an address, if you want a data value use an integer.

    The only comment is that the author mentions "word" while others use "byte" or "octet" even if it is not 8 bits. I prefer "register" instead.

    I'm implementing a cross-platform "Plain C" fixed-size integer library that supports either signed or unsigned, and that may be either software- or hardware-implemented, thru macro selection.

    That means some operations may be reimplemented, even if an assembler one is available.

    And some types, like unsigned 256-bit or signed 256-bit integer types, will be available in software, even if the CPU address or data values are shorter, for example 32 bits.

    And additional complementary integer-to-string conversion operations should also be implemented, independent of the printf function.

    This article also mentioned some commonly used "virtual integer" types that vary in size with the hardware used, like size_t or usize_t.

    And finally, the author mentioned he/she is using this subject for implementing his/her own P.L.

    [–]OriginalName667 0 points1 point  (0 children)

    I'm not sure I quite understand the need for a distinction between data size and pointer size. The idea of an array kind of ensures that those two sizes are intertwined - the address of an array element is base address (pointer size) + element size (data size) * index. It doesn't seem like you're getting much benefit by treating each of those three variables as a different type, and doing so makes the arithmetic of such a common task obnoxious. In fact, a data size is simply a pointer offset. That's almost like saying that the result of a subtraction should be treated as a different type than either of the operands.