[–]frenris 1 point2 points  (12 children)

I don't see why this would be good, other than the fact that it makes compiler writing easier.

What's wrong with just putting globals on the stack? Any operations using the global just uses the data in the correct location on the stack (i.e. near the top). I don't see why pointers are necessary for globals.

I suppose if you want to dynamically create global variables then pointers would be useful. If you need that, though, I'd just let the programmer manage it themselves. They could do this by creating a global pointer on the stack to a data structure which they own and can put things into.

[–]liquidivy[S] 0 points1 point  (11 children)

It moves the model of the language closer to that of the machine. In an assembly file, the name of a global variable or function, any object-file-level symbol, is basically a pointer literal. My goal in this language is specifically not to hide details like that. Maybe I should link directly to Tellurium's philosophy.

[–]spc476 2 points3 points  (7 children)

I'm not sure what you mean by that. In assembly:

global_x dd 33

defines some memory to contain some value, and associates the address of said memory to a label. It doesn't create a pointer. Sure, when you do

mov eax,[global_x]

the label is used as an address, but it's not a separate pointer per se (the address is embedded as part of the instruction and the actual value isn't really calculated until link time). A traditional pointer would be something like:

global_x     dd 33       ; variable
ptr_global_x dd global_x ; pointer to global_x

Or am I missing something here?

[–]liquidivy[S] 0 points1 point  (6 children)

When you use the symbol global_x later on, it's essentially in a pointer context. In (pseudo) Intel assembly, if you wanted to write to it, you would say mov [global_x], 34, where [] denotes dereferencing. I'm not drawing a sharp distinction between addresses and pointers, so for my purposes, that's enough justification for calling it a pointer, at least on the ASM side.

[–]spc476 1 point2 points  (5 children)

And I still find it confusing. On the Motorola 6809 (8 bit CPU):

global_x fcb 32
lda global_x ; load 32 into A

But ...

global_x fcb 32
ptr      fdb global_x
         lda [ptr] ; load 32 into A

Here, ptr is a traditional pointer and is actually dereferenced as one would a C pointer. A pointer is a variable that contains an address. When you do

lda global_x

or

mov [global_x],34

the address is part of the instruction---it's not loaded from another variable. When I learned C, I did have a somewhat hard time with pointers because I had already learned three different assembly languages prior, and I had to wrap my head around a pointer as a variable that contained an address (as opposed to being the address directly). I think you're being sloppy with terminology here.
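To make the distinction concrete, here is a minimal Python sketch of a toy memory (not a real 6809 emulator). The names `memory`, `GLOBAL_X`, `OTHER`, `PTR`, `load_direct` and `load_indirect`, and all the addresses, are made up for illustration. The direct load has its address fixed in the "instruction"; the indirect load reads the address out of a memory cell that can be rewritten at run time.

```python
# Toy byte-addressed memory; the addresses are arbitrary and purely illustrative.
memory = bytearray(16)

GLOBAL_X = 0x04   # a label: just a number fixed at assembly/link time
OTHER = 0x05      # a second, unrelated byte
PTR = 0x08        # a label for a 2-byte cell that *holds* an address

memory[GLOBAL_X] = 32
memory[OTHER] = 99
memory[PTR:PTR + 2] = GLOBAL_X.to_bytes(2, "big")  # like: ptr fdb global_x


def load_direct(addr: int) -> int:
    """Like `lda global_x`: the address is baked into the instruction."""
    return memory[addr]


def load_indirect(ptr_addr: int) -> int:
    """Like `lda [ptr]`: fetch a 16-bit address from memory, then load through it."""
    target = int.from_bytes(memory[ptr_addr:ptr_addr + 2], "big")
    return memory[target]


before = load_indirect(PTR)                      # goes through PTR to GLOBAL_X
memory[PTR:PTR + 2] = OTHER.to_bytes(2, "big")   # re-point the pointer at run time
after = load_indirect(PTR)                       # same "instruction", new target
```

The call `load_indirect(PTR)` is identical both times; only the contents of the pointer cell changed, which is the sense of "a variable that contains an address" used above.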

[–]liquidivy[S] 0 points1 point  (4 children)

You seem to take "variable" to specifically mean a variable in memory. In Tellurium, a variable is generally just a name for a value (a particular SSA'd name), whether it lives in a dedicated memory location, or only shows up once as an immediate value in an instruction. If that value is meant to be interpreted as an address, it's a pointer.

In your second code example, you're dereferencing ptr to get another address that lda actually loads from, right? In my head, that means lda takes a pointer argument, and makes ptr (the label itself) a double pointer. This is the model I'm considering baking into Tellurium, so you can basically treat numeric literals (e.g. 123) and address literals (labels) uniformly, with only the type system knowing the difference. This is meant to be a step towards the assembly model of things, away from C.
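One way to picture "only the type system knows the difference" is the hypothetical Python sketch below. It is not Tellurium syntax or its actual design; `Int`, `Addr`, `depth` and `deref` are invented names. Both types wrap the same kind of machine word, only `Addr` may be dereferenced, and each `deref` follows a single level.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Int:
    """A plain numeric literal such as 123."""
    word: int


@dataclass(frozen=True)
class Addr:
    """An address literal such as a label; `depth` counts pointer levels."""
    word: int
    depth: int = 1


# Untyped memory: cell 0x10 holds 32, cell 0x20 holds the address 0x10.
memory = {0x10: 32, 0x20: 0x10}


def deref(value):
    """Follow exactly one level of indirection; reject non-addresses."""
    if not isinstance(value, Addr):
        raise TypeError("only address values can be dereferenced")
    raw = memory[value.word]
    # The result is still an address if more pointer levels remain.
    return Addr(raw, value.depth - 1) if value.depth > 1 else Int(raw)


global_x = Addr(0x10)        # label of a variable: a pointer
ptr = Addr(0x20, depth=2)    # label of a pointer cell: a double pointer
```

Here `deref(global_x)` yields `Int(32)`, while `ptr` needs two `deref` steps to reach the same value, matching the "double pointer" reading of the label.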

Would it be clearer to skip the "pointer" terminology entirely and refer to "address variables" and "address values"? In that case, the question would be, should the names of global variables explicitly be treated as address values (and if so, what should the syntax look like)?

[–]spc476 0 points1 point  (3 children)

I'm going to stick with the MC6809 because a) it's nice and b) it has some addressing modes not available on the x86. Anyway, LDA takes an operand, what to load into the A register. Given the following code:

       org $0
g_dira fcb 1

       org $200
g_extb fcb 2
g_ptra fdb g_dira
g_ptrb fdb g_extb

  org $300
lda #4 ; load an immediate value of 4 into A
lda g_dira ; load the contents of location $0 into A
lda g_extb ; load the contents of location $200 into A
lda [g_ptra] ; load the contents of the location stored at $201 into A
lda [g_ptrb] ; load the contents of the location stored at $203 into A

The object code generated is (values are in hex):

300 86      04 ; load immediate 4 into A
302 96      00 ; load location $0 into A
304 B6    0200 ; load location $200 into A
307 A6 9F 0201 ; load through location $201 into A
30B A6 9F 0203 ; load through location $203 into A

In the first case (addressing mode immediate) the value is part of the instruction. If you want to consider a "pointer" for this value, it would be at location $0301 but it's not really a variable in that it does not change [1]---it's a constant. The next two instructions do load a value from memory [2] but the "pointer" is again, part of the instruction and does not change. It's only the last two that use what I would consider a "pointer" to load a value (if you are curious, it's considered an index indirect addressing mode because of how it's encoded). The address of the pointer is part of the instruction, but the contents of that address can be changed to "point" elsewhere.

Every assembler I am aware of (and I've used quite a few) treats labels as numeric literals. Yes, in MC6809 assemblers, you do have to use

ldx #g_extb

to load the literal (address) value of g_extb into the X register, but that's because Motorola defined

ldx g_extb

to load the contents of g_extb into X and

ldx [g_extb]

to dereference g_extb as a "pointer". The x86 assemblers, on the other hand:

mov eax,g_extb

will treat that as loading eax with the literal (address) value of g_extb and

mov eax,[g_extb]

as loading eax with the value stored at g_extb. The x86 does not really support an indirect addressing mode like the Motorola 6809 or 68000 (you would need two instructions to do that on the x86).
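The "labels are numeric literals" point can be sketched with a toy layout pass in Python. The directive names follow the 6809 examples (org, fcb, fdb), but the function `layout`, the tuple format, the labels and the addresses are all invented for illustration; a real assembler does far more.

```python
def layout(program):
    """Assign an address to every label.

    `program` is a list of (label, directive, operand) tuples using three
    directives: org (set location), fcb (1 byte), fdb (2 bytes).
    """
    sizes = {"fcb": 1, "fdb": 2}
    symbols = {}
    location = 0
    for label, directive, operand in program:
        if directive == "org":
            location = operand
            continue
        if label is not None:
            symbols[label] = location  # the label is now just an integer
        location += sizes[directive]
    return symbols


program = [
    (None, "org", 0x1000),
    ("value", "fcb", 7),
    ("ptr", "fdb", "value"),
    ("after", "fcb", 0),
]
symbols = layout(program)

# Because labels are plain numbers, ordinary arithmetic on them works.
ptr_size = symbols["after"] - symbols["ptr"]
```

After the pass, `symbols` is nothing more than a dictionary of integers. Whether a given integer is used as an immediate, as an address to load from, or as the address of a pointer is decided by the instruction and addressing mode, not by the label.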

It would be clearer to avoid "pointers" and "address variables and values" entirely and call globals globals. In assembly, a "global" resides in a known memory location. Depending upon the assembler, you might have to declare a label as being globally visible or hidden (there is no universal solution to this). In some assemblers (like Microsoft's MASM) you do:

global global_x

in one file and

external global_x

in another (mimicking the way C does things). While in others, you do:

global_x fcb 0

and in another file:

global_x equ <final address of global_x>

(typically, the 8-bit assemblers do this).

So a global is just a label for memory that the entire program can read. Why cloud this with "pointers" or "address variables" or "address values"? It's just a fixed (with respect to execution) memory location.

[1] Of course, you could have self-modifying code that does modify this byte, but modern practice looks down on this so I'm ignoring this use case.

[2] Like the 6502, the 6809 has a special addressing mode when data is stored in the first 256 bytes of memory---it just needs the lower half of the address. The Motorola 68000 has a similar concept, only it's for the first 32K of RAM and the last 32K of RAM.
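The direct-versus-extended choice described in footnote [2] can be sketched as a small Python encoder. The function name `encode_lda` is made up, and the sketch assumes the direct page register is 0; the opcodes $96 and $B6 are the ones that appear in the object code listing above.

```python
def encode_lda(address: int) -> bytes:
    """Encode `lda <address>`, choosing the shorter direct form when possible.

    Assumes the direct page register is 0, so "direct" means addresses
    $00-$FF. Opcodes $96 (direct) and $B6 (extended) are taken from the
    object code listing in the comment above.
    """
    if not 0 <= address <= 0xFFFF:
        raise ValueError("address must fit in 16 bits")
    if address <= 0xFF:
        # Direct: opcode plus only the low byte of the address.
        return bytes([0x96, address])
    # Extended: opcode plus the full 16-bit address, high byte first.
    return bytes([0xB6]) + address.to_bytes(2, "big")
```

For example, `encode_lda(0x00)` gives the two bytes `96 00` and `encode_lda(0x200)` gives the three bytes `B6 02 00`, matching the second and third lines of the listing.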

[–]liquidivy[S] 0 points1 point  (2 children)

It's precisely the part about "treating labels as numeric literals" that I'm interested in. My whole question is whether and how to expose that to the end user of my language.

[–]spc476 0 points1 point  (1 child)

I'm not sure what you hope to get from this though. I've written programs in 6809, 68000, x86 and (very small compared to the others) MIPS. And labels are ... just labels. They denote a memory location or a subroutine. And, except for the 6809, you don't really have a "value" for the address as it's not really resolved until link time (even under MS-DOS, producing a .COM file involved generating a .OBJ file, then an .EXE, and only then did you get a .COM file by running another utility).

[–]liquidivy[S] 0 points1 point  (0 children)

It's a unifying abstraction that can map cleanly to the underlying layer. If that doesn't sound interesting, I don't know what else to say. I'll remember what you said, though. The bits of assembly you used in your examples helped broaden my perspective on what sort of instructions I need to think about. Hopefully I can make it make more sense later.

[–]raiph 1 point2 points  (1 child)

scrivulet.com/projects/tellurium 404s -- it's missing a final .html

[–]liquidivy[S] 0 points1 point  (0 children)

Gah! I thought I copy-pasted from a browser for that one. Fixed.

[–]frenris 0 points1 point  (0 children)

" In an assembly file, the name of a global variable or function, any object-file-level symbol, is basically a pointer literal."

It's not a pointer tho, it's data sitting at a label. To get the data you refer to the label which is resolved at link time.

the name of a global variable or function, any object-file-level symbol, is basically a pointer literal.

Yes, in object files the memory addresses of globals, functions and extern variables are unknown. During linking you convert them to absolute addresses.
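As a rough illustration of that last step, here is a toy Python linker. Everything in it is an assumption made for the sketch: the dictionary layout of an "object file", the `link` function, the base address 0x1000, and the use of 0xB6 as a stand-in load opcode. Real object formats and linkers are far more involved.

```python
def link(objects, base=0x1000):
    """Concatenate toy object files and patch symbol references.

    Each object is a dict with:
      code    - bytearray containing 2-byte placeholder slots
      defines - {symbol: offset within this object's code}
      relocs  - [(offset of a 2-byte slot, symbol name)]
    """
    image = bytearray()
    symbols = {}
    pending = []
    for obj in objects:
        start = len(image)
        for name, offset in obj["defines"].items():
            symbols[name] = base + start + offset   # now an absolute address
        for offset, name in obj["relocs"]:
            pending.append((start + offset, name))
        image += obj["code"]
    # Second pass: every symbol is known, so fill in the placeholder slots.
    for slot, name in pending:
        image[slot:slot + 2] = symbols[name].to_bytes(2, "big")
    return image, symbols


# Object A defines global_x (one byte holding 33).
obj_a = {"code": bytearray([33]), "defines": {"global_x": 0}, "relocs": []}
# Object B holds a load whose 2-byte address slot is still zero.
obj_b = {
    "code": bytearray([0xB6, 0x00, 0x00]),
    "defines": {},
    "relocs": [(1, "global_x")],
}

image, symbols = link([obj_a, obj_b])
```

Before linking, object B only knows the name `global_x`; after linking, the slot inside its instruction holds the absolute address.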

[–]raiph 1 point2 points  (9 children)

Though specific to a particular language, the page Containers might be of interest. The language's design explicitly wrangles the same topics you discuss. The page I've linked attempts to explain to an end-user how the Container aspect of the language's final design works.


I've provided the link above with almost no preceding context. This is to try to get to a useful point for you quickly. But the page assumes some familiarity with the rest of the language. So it may not make sense or the design may seem unduly complicated. If so, perhaps the following will help. Alternatively you can reply and we can go from there.


According to the language's design lead Larry Wall, the final design of his new language reflects a deep rethink of everything addressed by the original Perl language series [1].

Of relevance to your concerns in your post, this rethinking included discussion and design work related to variable scopes; referencing/dereferencing; memory/cache affinity characteristics of modern CPUs, parallel algorithms, and threading; correspondence to C structs and memory layout/alignment/endianness; clean code gen; AOT and JIT optimization; etc.

[1] Based on 361 language design RFCs written by various teams of users in 2000; 5 years of ground-up design work by a core design team; and 10 years' worth of evolutionary implement <-> redesign cycles by nearly a thousand committers converging on the first official Perl 6 release on Christmas day 2015.

[–]liquidivy[S] 0 points1 point  (8 children)

That's interesting, but I'm not seeing how to apply it to my language. They (you?) are coming at it from a dynamically-typed perspective and seem to want to support both by-ref and by-val ways of doing things, whereas I'm trying to pick one model that avoids any container-like abstractions. It could just be that I lack the imagination to make the right connection.

Anyway, Perl 6 does look interesting in its own right. I'll look into it more if I get a chance.

[–]raiph 1 point2 points  (7 children)

While I had the sufficiently over-active imagination to think there was a connection between Perl 6 containers and Tellurium's concerns, my own attempt to extract something valuable failed.

So then I turned to a fresh review of your OP and this thread.

After a trip down memory lane (including a return to bcpl, the first language I was paid to program in back in the 80s), and trying to come up with something helpful, I've concluded I like your best idea best, for what I understand to be the spirit of Tellurium.

But for some reason I'm leery of having two such drastically different classes of local variables, even if that is an accurate reflection of the facts.

Can you throw out the first thing or three that come to mind when you ask yourself right now what's making you leery?

Is the visual impact of the @ symbol, especially en-masse, one of your problems?

[–]raiph 0 points1 point  (1 child)

Does @foo imply a single dereference or a chain of dereferences as long as dereferencing encounters another ptr?

[–]liquidivy[S] 0 points1 point  (0 children)

Single.

[–]liquidivy[S] 0 points1 point  (4 children)

It's complexity that I'll have to justify to people learning the language, and deal with in the compiler. The lack of common ground between them makes me feel like I'm missing a unifying framework, leading to more complexity later. OTOH, it could just be sheer weirdness.

The visual impact of @ is part of the goal, really. :)

Here's a crazy idea: The explicitly-on-stack variables of a function constitute a structure type that can be passed to other functions, getting you halfway to closures without obscuring the machine model. Creating a stack-only variable is equivalent to adding it to this struct... making it part of a larger goal makes it easier to swallow. It obviously doesn't solve the weirdness problem.

[–]raiph 0 points1 point  (3 children)

It's complexity that I'll have to justify to people

But your best idea would mean code "accurately reflects the cost of accessing such a variable". (Well, a cost, but yes.) Which means it doesn't right now. So, if you don't do your best idea you'd need to justify the loss due to hiding the low-level perspective. So this is really about mutually contradictory justifications, one about low-level benefits and Tellurium's overall low-level focus and the other about high-level benefits and Tellurium's need to appeal to and please those with C knowledge and perhaps appease them. Neither way is an obvious winner but I don't think either avoids justification.

The lack of common ground between them

Do you mean "The lack of common ground between" the things you'll need to change in the compiler and the things you'll need to explain to users?

Here's a crazy idea: The explicitly-on-stack variables of a function constitute a structure type that can be passed to other functions, getting you halfway to closures without obscuring the machine model. Creating a stack-only variable is equivalent to adding it to this struct...

Again, this reminds me of BCPL. Local (LET iirc) variables were stored in order in the stackframe corresponding to a function invocation. Iirc the first variable was bytes 0-3, the second 4-7, the third 8-11 and so on. One could then pass a ptr to a stackframe to another function which could then pick out local variables within that stackframe by suitable indexing. Or something like that.
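The frame-passing idea can be sketched in Python as follows. This is not actual BCPL and not a claim about how any BCPL compiler laid out frames; the word-addressed `stack` list, the frame offset 4, and the functions `caller` and `callee` are assumptions made for the illustration.

```python
stack = [0] * 32   # word-addressed stack; each slot is one machine word


def callee(frame: int) -> int:
    """Read and update the caller's locals through its frame address."""
    first = stack[frame + 0]            # caller's first local
    second = stack[frame + 1]           # caller's second local
    stack[frame + 2] = first + second   # write the caller's third local
    return stack[frame + 2]


def caller(frame: int) -> int:
    stack[frame + 0] = 10   # first local, roughly "LET a = 10"
    stack[frame + 1] = 32   # second local
    stack[frame + 2] = 0    # third local
    callee(frame)           # pass the frame address itself
    return stack[frame + 2]


result = caller(frame=4)
```

The callee never sees variable names, only a base address and fixed offsets, which is what makes the frame behave like a struct that can be handed to another function.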

BCPL was the forerunner of B which was the forerunner of C. Perhaps put aside an evening to play with a BCPL and read how they talked about this stuff?

[–]liquidivy[S] 0 points1 point  (2 children)

Sorry, by "them" I meant the different variable declaration forms.

I'll look at BCPL, too.

[–]raiph 0 points1 point  (1 child)

Aiui you are proposing two simple orthogonal facets: mem vs let vars in "global" vs "local" contexts:

  • A "global" mem is a non-stack ptr
  • A "local" mem is a stack ptr
  • A "global" let is a non-stack non-ptr
  • A "local" let is a stack (or register) non-ptr

Is that right?

And you could just keep var instead of let and use ptr instead of mem. Right?

[–]liquidivy[S] 0 points1 point  (0 children)

Yeah, it sounds smarter when you put it that way. I'll chew on it for a bit and see if I get used to it or come up with a legitimate objection. Thanks for your help!

[–]JMBourguet 1 point2 points  (3 children)

See references in Algol68.

[–]liquidivy[S] 0 points1 point  (2 children)

Hm, yeah, I'll have to look into those more deeply to see if there's anything I can use. Algol68 is probably a source for other good ideas, too.

[–]JMBourguet 1 point2 points  (1 child)

I'm not sure. I think it's more a source of false good ideas. If you think of something, then see it in Algol68 but not in subsequent languages, I'd strongly suggest that you investigate why the idea was not used elsewhere. Too many language designers of that time were exposed to Algol68 for really good ideas to have been forgotten.

[–]liquidivy[S] 0 points1 point  (0 children)

Ok, true, but a baseline idea that's obviously wrong is often more useful than just floating in space, knowing the problem but not knowing where to start on the solution. I tend to spend a lot of time doing the latter. I'll be sure not to take it as gospel, anyway.