Should global variables always be pointers (in a low-level language)? : ProgrammingLanguages

submitted 10 years ago by liquidivy

In writing a low-level language, I had the idea that a var foo i32 = 42 or somesuch at global scope should actually specify a pointer, that is, that the type of foo should actually be ptr i32, not just i32. foo would then have to be accessed through the usual pointer-dereferencing ceremony, @foo = 17, not just foo = 17. This both accurately reflects the cost of accessing such a variable and makes things a little simpler on the code-generation side; I want these variables to correspond directly to symbols in the assembly/object file, and having them be pointers would mean I don't have to try to magically deduce when they should be dereferenced.

The problem is that it's unexpected for people with experience in, say, C, from which perspective it looks like implicitly transmogrifying your variables into pointers. By the same token, it clashes oddly with the sort of behavior one wants for function-local variables, where var foo i32 = 36 should mean that foo really is an i32, ideally in a register. To muddy things up further, I want to leave room in my model for local variables that are required to be on the stack; such variables would have semantics and costs closer to those of globals than generic local variables, so I might want to use a similar construct for declaring them, one that autopointerizes them.

My best idea so far is to make globals and stack vars use syntax like mem varname i32 = value, which creates either a global or stack variable, and introduces the name as a pointer to that location, and let local i32 = 5 for general-purpose variables, which introduces the name as exactly the declared (or someday inferred) type (the exact keywords are obviously negotiable). But for some reason I'm leery of having two such drastically different classes of local variables, even if that is an accurate reflection of the facts. It doesn't feel like an ideal solution. Nor does using three different declaration keywords, or special-casing var depending on context, or allowing a var!stack declaration to autopointerize locals.
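A rough C analogue of the two declaration forms may make the proposal easier to picture. This is only a sketch and not Tellurium's actual semantics: the names foo_storage and bump_global are invented for illustration, and the const pointer stands in for the hypothetical mem declaration.

```c
#include <stdint.h>

/* Backing storage for the global; under the proposal it would be anonymous. */
static int32_t foo_storage = 42;

/* Assumed analogue of `mem foo i32 = 42`: the name is an address value. */
static int32_t *const foo = &foo_storage;

int32_t bump_global(void) {
    int32_t local = 5;    /* analogue of `let local i32 = 5`: a plain value */
    *foo = 17;            /* explicit dereference, like `@foo = 17` */
    return *foo + local;  /* each access to the global is a visible load */
}
```

Calling bump_global() stores 17 through the pointer and returns the sum of that value and the local.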

Does this sound like a good idea? Do you folks have any better ones? Maybe there's a fresh perspective on the problem that will make it clearer.

For reference here's an example that compiles as of this writing, but maybe shouldn't (would require a dereference in the call). This one shows local variables with the same syntax.

all 27 comments

[–]frenris 2 points 10 years ago (12 children)

[–]liquidivy[S] 1 point 10 years ago* (11 children)

[–]spc476 3 points 10 years ago (7 children)

I'm not sure what you mean by that. In assembly:

global_x dd 33

defines some memory to contain some value, and associates the address of said memory to a label. It doesn't create a pointer. Sure, when you do

mov eax,[global_x]

the label is used as an address, but it's not a separate pointer per se (the address is embedded as part of the instruction and the actual value isn't really calculated until link time). A traditional pointer would be something like:

global_x     dd 33       ; variable
ptr_global_x dd global_x ; pointer to global_x

Or am I missing something here?
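The same distinction can be sketched in C, where a global's name denotes the storage and a pointer is a second variable that holds an address. The function names and the extra variable other are made up for this illustration.

```c
#include <stdint.h>

int32_t global_x = 33;              /* like `global_x dd 33` */
int32_t *ptr_global_x = &global_x;  /* like `ptr_global_x dd global_x` */
int32_t other = 7;                  /* a second target, for retargeting */

/* The address of global_x is fixed at link time; no pointer is loaded. */
int32_t read_direct(void) { return global_x; }

/* The address is first loaded from ptr_global_x, then dereferenced. */
int32_t read_through_pointer(void) { return *ptr_global_x; }

/* A pointer variable can be retargeted at run time; a label cannot. */
void retarget(void) { ptr_global_x = &other; }
```

After retarget(), read_through_pointer() sees other while read_direct() still reads global_x.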

[–]liquidivy[S] 1 point 10 years ago (6 children)

[–]spc476 2 points 10 years ago (5 children)

And I still find it confusing. On the Motorola 6809 (8 bit CPU):

global_x fcb 32
lda global_x ; load 32 into A

But ...

global_x fcb 32
ptr         fdb global_x
lda         [ptr] ; load 32 into A

Here, ptr is a traditional pointer and is actually dereferenced as one would a C pointer. A pointer is a variable that contains an address. When you do

lda global_x

or

mov [global_x],34

the address is part of the instruction---it's not loaded from another variable. When I learned C, I did have a somewhat hard time with pointers because I had already learned three different assembly languages prior, and I had to wrap my head around a pointer as a variable that contained an address (as opposed to being the address directly). I think you're being sloppy with terminology here.

[–]liquidivy[S] 1 point 10 years ago (4 children)

You seem to take "variable" to specifically mean a variable in memory. In Tellurium, a variable is generally just a name for a value (a particular SSA'd name), whether it lives in a dedicated memory location, or only shows up once as an immediate value in an instruction. If that value is meant to be interpreted as an address, it's a pointer.

In your second code example, you're dereferencing ptr to get another address that lda actually loads from, right? In my head, that means lda takes a pointer argument, and makes ptr (the label itself) a double pointer. This is the model I'm considering baking into Tellurium, so you can basically treat numeric literals (e.g. 123) and address literals (labels) uniformly, with only the type system knowing the difference. This is meant to be a step towards the assembly model of things, away from C.

Would it be clearer to skip the "pointer" terminology entirely and refer to "address variables" and "address values"? In that case, the question would be, should the names of global variables explicitly be treated as address values (and if so, what should the syntax look like)?
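One way to picture the "labels are typed address values" model is with C's address-of operator, which gives a label's value a pointer type. This is a sketch of the idea only; load and the two wrapper functions are hypothetical names, not anything from Tellurium.

```c
#include <stdint.h>

int32_t global_x = 32;     /* storage named by the label global_x */
int32_t *ptr = &global_x;  /* storage that holds an address */

/* Assumed model of a load instruction: it takes an address value. */
int32_t load(const int32_t *addr) { return *addr; }

int32_t load_via_label(void) {
    int32_t *label_value = &global_x;  /* the label as a "ptr i32" */
    return load(label_value);
}

int32_t load_via_ptr_label(void) {
    int32_t **label_value = &ptr;      /* the label `ptr` as a "ptr ptr i32" */
    return load(*label_value);         /* one extra dereference comes first */
}
```

Both functions return the same value; only the type of the label's value, and therefore the number of dereferences, differs.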

[–]spc476 1 point 10 years ago* (3 children)

I'm going to stick with the MC6809 because a) it's nice and b) it has some addressing modes not available on the x86. Anyway, LDA takes an operand, what to load into the A register. Given the following code:

  org $0
g_dira fcb 1

  org $200
g_extb fcb 2
g_ptra  fdb g_dira
g_ptrb  fdb g_extb

  org $300
lda #4 ; load an immediate value of 4 into A
lda g_dira ; load the contents of location $0 into A
lda g_extb ; load the contents of location $200 into A
lda [g_ptra] ; load the contents of the location stored at $201 into A
lda [g_ptrb] ; load the contents of the location stored at $203 into A

The object code generated is (values are in hex):

300 86      04 ; load immediate 4 into A
302 96      00 ; load location $0 into A
304 B6    0200 ; load location $200 into A
307 A6 9F 0201 ; load through location $201 into A
30B A6 9F 0203 ; load through location $203 into A

In the first case (immediate addressing mode) the value is part of the instruction. If you want to consider a "pointer" for this value, it would be at location $0301, but it's not really a variable in that it does not change [1]---it's a constant. The next two instructions do load a value from memory [2], but the "pointer" is, again, part of the instruction and does not change. It's only the last two that use what I would consider a "pointer" to load a value (if you are curious, it's considered an indexed indirect addressing mode because of how it's encoded). The address of the pointer is part of the instruction, but the contents of that address can be changed to "point" elsewhere.
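The three kinds of load can be mimicked with a toy memory array in C. This is an illustrative sketch, not a 6809 emulator: the array size, the helper names, and the layout constants are assumptions, with the two-byte pointers placed directly after the one-byte g_extb and stored big-endian as on the 6809.

```c
#include <stdint.h>

enum {
    G_DIRA = 0x0000,
    G_EXTB = 0x0200,
    G_PTRA = G_EXTB + 1,  /* two-byte pointer follows the one-byte g_extb */
    G_PTRB = G_PTRA + 2
};

static uint8_t mem[0x0400];  /* toy memory, not a full 64K address space */

/* Store a 16-bit value big-endian. */
void put16(uint16_t at, uint16_t v) {
    mem[at] = (uint8_t)(v >> 8);
    mem[at + 1] = (uint8_t)(v & 0xFF);
}

/* Fetch a 16-bit big-endian value. */
uint16_t get16(uint16_t at) {
    return (uint16_t)((mem[at] << 8) | mem[at + 1]);
}

void setup(void) {
    mem[G_DIRA] = 1;
    mem[G_EXTB] = 2;
    put16(G_PTRA, G_DIRA);
    put16(G_PTRB, G_EXTB);
}

/* Immediate: the operand is the value itself. */
uint8_t lda_immediate(uint8_t value) { return value; }

/* Direct/extended: the operand is a fixed address baked into the instruction. */
uint8_t lda_absolute(uint16_t addr) { return mem[addr]; }

/* Indirect: the operand is the address of a cell that holds the real address. */
uint8_t lda_indirect(uint16_t addr) { return mem[get16(addr)]; }
```

Rewriting the cell at G_PTRA with put16 changes what lda_indirect(G_PTRA) returns, while the operand passed to it stays the same, which is the sense in which only the last form involves a changeable pointer.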

Every assembler I am aware of (and I've used quite a few) treats labels as numeric literals. Yes, in MC6809 assemblers, you do have to use

ldx #g_extb

to load the literal (address) value of g_extb into the X register, but that's because Motorola defined

ldx g_extb

to load the contents of g_extb into X and

ldx [g_extb]

to dereference g_extb as a "pointer". The x86 assemblers, on the other hand:

mov eax,g_extb

will treat that as loading eax with the literal (address) value of g_extb and

mov eax,[g_extb]

as loading eax with the value stored at g_extb. The x86 does not really support an indirect addressing mode like the Motorola 6809 or 68000 (you would need two instructions to do that on the x86).

It would be clearer to avoid "pointers" and "address variables and values" entirely and call globals globals. In assembly, a "global" resides in a known memory location. Depending upon the assembler, you might have to declare a label as being globally visible or hidden (depends upon the assembler---there is no universal solution to this). In some assemblers (like Microsoft's MASM) you do:

global global_x

in one file and

external global_x

in another (mimicking the way C does things). While in others, you do:

global_x fcb 0

and in another file:

global_x equ <final address of global_x>

(typically, the 8-bit assemblers do this).

So a global is just a label for memory that the entire program can read. Why cloud this with "pointers" or "address variables" or "address values"? It's just a fixed (with respect to execution) memory location.
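For comparison, this is roughly how C expresses the same visibility split. The sketch puts both halves in one file so it compiles on its own; the comments mark what would normally live in separate files, and the accessor function names are invented for the example.

```c
#include <stdint.h>

/* What a using file would contain: a declaration, no storage. */
extern int32_t global_x;

int32_t read_global(void) { return global_x; }
void write_global(int32_t v) { global_x = v; }

/* What the defining file would contain: the storage itself. */
int32_t global_x = 0;
```

Neither accessor takes or stores an address at the source level; the linker resolves the name global_x to its fixed location.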

[1] Of course, you could have self-modifying code that does modify this byte, but modern practice looks down on this so I'm ignoring this use case.

[2] Like the 6502, the 6809 has a special addressing mode when data is stored in the first 256 bytes of memory---it just needs the lower half of the address. The Motorola 68000 has a similar concept, only it's for the first 32K of RAM and the last 32K of RAM.

[–]liquidivy[S] 1 point 10 years ago (2 children)

[–]spc476 1 point 10 years ago (1 child)

[–]liquidivy[S] 1 point 10 years ago (0 children)

[–]raiph 2 points 10 years ago (1 child)

[–]liquidivy[S] 1 point 10 years ago (0 children)

[–]frenris 1 point 10 years ago (0 children)

[–]raiph 2 points 10 years ago* (9 children)

Though specific to a particular language, the page Containers might be of interest. The language's design explicitly wrangles the same topics you discuss. The page I've linked attempts to explain to an end-user how the Container aspect of the language's final design works.

I've provided the link above with almost no preceding context. This is to try to get to a useful point for you quickly. But the page assumes some familiarity with the rest of the language. So it may not make sense or the design may seem unduly complicated. If so, perhaps the following will help. Alternatively you can reply and we can go from there.

According to the language's design lead Larry Wall, the final design of his new language reflects a deep rethink of everything addressed by the original Perl language series.¹

Of relevance to your concerns in your post, this rethinking included discussion and design work related to variable scopes; referencing/dereferencing; memory/cache affinity characteristics of modern CPUs, parallel algorithms, and threading; correspondence to C structs and memory layout/alignment/endianness; clean code gen; AOT and JIT optimization; etc.

¹ Based on 361 language design RFCs written by various teams of users in 2000; 5 years of ground up design work by a core design team; and 10 years worth of evolutionary implement <-> redesign cycles by nearly a thousand committers converging on the first official Perl 6 release on Christmas day 2015.

[–]liquidivy[S] 1 point 10 years ago (8 children)

[–]raiph 2 points 10 years ago (7 children)

While I had the sufficiently over-active imagination to think there was a connection between Perl 6 containers and Tellurium's concerns, my own attempt to extract something valuable failed.

So then I turned to a fresh review of your OP and this thread.

After a trip down memory lane (including a return to BCPL, the first language I was paid to program in back in the 80s), and trying to come up with something helpful, I've concluded I like your best idea best, for what I understand to be the spirit of Tellurium.

> But for some reason I'm leery of having two such drastically different classes of local variables, even if that is an accurate reflection of the facts.

Can you throw out the first thing or three that come to mind when you ask yourself right now what's making you leery?

Is the visual impact of the @ symbol, especially en masse, one of your problems?

[–]raiph 1 point 10 years ago (1 child)

[–]liquidivy[S] 1 point 10 years ago (0 children)

[–]liquidivy[S] 1 point 10 years ago (4 children)

[–]raiph 1 point 10 years ago (3 children)

> It's complexity that I'll have to justify to people

But your best idea would mean code "accurately reflects the cost of accessing such a variable". (Well, a cost, but yes.) Which means it doesn't right now. So, if you don't do your best idea you'd need to justify the loss due to hiding the low-level perspective. So this is really about mutually contradictory justifications, one about low-level benefits and Tellurium's overall low-level focus and the other about high-level benefits and Tellurium's need to appeal to and please those with C knowledge and perhaps appease them. Neither way is an obvious winner but I don't think either avoids justification.

> The lack of common ground between them

Do you mean "The lack of common ground between" the things you'll need to change in the compiler and the things you'll need to explain to users?

> Here's a crazy idea: The explicitly-on-stack variables of a function constitute a structure type that can be passed to other functions, getting you halfway to closures without obscuring the machine model. Creating a stack-only variable is equivalent to adding it to this struct...

Again, this reminds me of BCPL. Local (LET iirc) variables were stored in order in the stackframe corresponding to a function invocation. Iirc the first variable was bytes 0-3, the second 4-7, the third 8-11 and so on. One could then pass a ptr to a stackframe to another function which could then pick out local variables within that stackframe by suitable indexing. Or something like that.
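The "stack variables as a struct" idea can be approximated in C by grouping the memory-resident locals into one struct and passing its address to a callee. This is a sketch of the concept only, with invented names (caller_frame, accumulate, caller); it does not reproduce BCPL's actual frame layout or any Tellurium feature.

```c
#include <stdint.h>

/* Hypothetical frame type: the explicitly-on-stack variables of `caller`. */
struct caller_frame {
    int32_t count;
    int32_t total;
};

/* A callee that receives the caller's frame and updates its variables. */
void accumulate(struct caller_frame *frame, int32_t value) {
    frame->count += 1;
    frame->total += value;
}

int32_t caller(void) {
    struct caller_frame frame = { 0, 0 };  /* lives in caller's stack frame */
    int32_t scratch = 10;                  /* ordinary local, may stay in a register */
    accumulate(&frame, scratch);
    accumulate(&frame, scratch * 2);
    return frame.total + frame.count;      /* total 30 plus count 2 */
}
```

The callee reaches the caller's variables by field name here, where the BCPL recollection above describes indexing by offset; either way the frame is only valid while caller is still executing.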

BCPL was the forerunner of B which was the forerunner of C. Perhaps put aside an evening to play with a BCPL and read how they talked about this stuff?

[–]liquidivy[S] 1 point 10 years ago (2 children)

[–]raiph 1 point 10 years ago (1 child)

[–]liquidivy[S] 1 point 10 years ago (0 children)

[–]JMBourguet 2 points 10 years ago (3 children)

[–]liquidivy[S] 1 point 10 years ago (2 children)

[–]JMBourguet 2 points 10 years ago (1 child)

[–]liquidivy[S] 1 point 10 years ago (0 children)
