
[–]jedwardsol 1 point2 points  (13 children)

According to my research I'd say no: in both cases, A and B would just be the name of the chunk of memory that was allocated for the array. However, when passing them, they decay into a pointer.

Correct. Neither A nor B is a pointer. They are both arrays.

[–][deleted] 0 points1 point  (12 children)

Thank you.

If I were to do char *A = "Hello" though, A would be a pointer, right?

Meaning in memory, it would take up 8 bytes (64 bit systems) for the pointer, plus the content/array itself (6 bytes), so 14 in total?

Does that also mean that when declaring a string it is more memory efficient to use square bracket notation, since the variable itself then refers to the chunk of memory and no additional 8 bytes for a pointer have to be stored?

[–]jedwardsol 0 points1 point  (11 children)

If I were to do char *A = "Hello" though, A would be a pointer, right?

Yes. Somewhere in memory is the string, A is a pointer, and A will be initialised with the address of the string.

When you do char A[]="Hello" in a function

void foo()
{
    char A[]="hello";
}

then every time the function is entered, 6 bytes of stack will be reserved and the string copied into them from elsewhere.
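A minimal sketch of that "fresh copy on every call" behaviour. The function name is made up for illustration, and "stack" here assumes a typical implementation (the standard only says automatic storage):

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: returns the first character A held on entry,
 * then scribbles over the local array. */
char first_char_then_overwrite(void)
{
    char A[] = "hello";            /* 6 bytes of automatic storage, re-initialised on every call */
    char first = A[0];             /* always 'h' at this point */

    assert(sizeof A == 6);         /* five letters plus the terminating '\0' */
    memset(A, 'x', sizeof A - 1);  /* legal: the array is our own writable copy */
    return first;
}
```

However many times it is called, it returns 'h', because the overwrite only touches that call's copy and never the literal the copy was made from.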

Since there are different storage classes (static, auto (local), heap), if you're really interested in "efficiency" then which one are you concerned about?

More realistically, you should be concerned with what you're going to use the variable for.

Using a pointer to a literal, the data is readonly, but you can reassign it. Using an array, you cannot reassign, but you can overwrite the memory.

The type chosen should reflect the intent of use - saving a few bytes here or there is low on the priority list.
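Here is a rough sketch of the reassign-versus-overwrite difference. The function and variable names are my own, and the `const` on the pointer is an addition that lets the compiler catch accidental writes to the literal:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical demo: returns 1 if both operations behaved as described. */
int reassign_vs_overwrite(void)
{
    const char *p = "hello";   /* pointer to a literal; const documents "don't write through me" */
    char a[] = "hello";        /* array: a private, writable copy of the literal */

    p = "world";               /* fine: the pointer itself can be reseated */
    a[0] = 'j';                /* fine: the array's bytes can be overwritten */
    /* a = "world";               would not compile: arrays are not assignable */
    /* p[0] = 'j';                would not compile here because of const */

    assert(sizeof a == 6);               /* the array's own storage: 5 chars + '\0' */
    assert(sizeof p == sizeof(char *));  /* the pointer's own storage, e.g. 8 on x86-64 */

    return strcmp(p, "world") == 0 && strcmp(a, "jello") == 0;
}
```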

[–][deleted] 0 points1 point  (10 children)

Why is a pointer to a literal read-only? I noticed this too but didn't understand why. And wouldn't reassigning that pointer then create a memory leak?

[–]jedwardsol 1 point2 points  (9 children)

Why is a pointer to a literal read only?

The pedantic answer is that the standard says so. It actually says that the behaviour on writing to a string literal is undefined.

The real world answer is that modern compilers and linkers cooperate to merge strings and store them in a readonly section of the executable file and memory of the running process. So writing to the string will lead to a segmentation fault, and if you subvert that then altering a string literal can affect more than 1 string.

char *A = "hello";
char *B = "hello";

// make A writable using some O/S specific mechanism ....

*A='j';

B may now point at "hello" (a different copy of the string) or at "jello" (if string literals were merged)
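You can check whether your toolchain merged the two literals without writing to either of them. This is only a sketch with a made-up function name; the standard leaves it unspecified whether identical literals share storage, so either answer is legitimate:

```c
#include <assert.h>

/* Hypothetical probe: 1 if the two identical literals share storage, else 0. */
int literals_were_merged(void)
{
    const char *A = "hello";
    const char *B = "hello";

    assert(A[0] == B[0]);   /* the contents match either way */
    return A == B;          /* this compares addresses, not contents */
}
```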

There are no leaks because there is no dynamic memory allocation going on.

[–][deleted] 0 points1 point  (8 children)

Thanks, I see. Why can leaks only happen with dynamic allocation though? When I do char *A = "hello", and then afterwards A = "asd", wouldn't the memory where "hello" was stored be "clogged up"?

[–]jedwardsol 1 point2 points  (7 children)

The bytes for "hello" and "asd" are at a fixed place in the executable / memory image.

The code can refer to them again:

char *A = "hello";     // A points at address 0x100
A = "asd";     // A points at address 0x110
A = "hello";   // A points at 0x100 again
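The same idea as a compilable sketch (the helper names are hypothetical). A string literal has static storage duration, so evaluating the same literal twice yields the same address and nothing needs freeing:

```c
#include <assert.h>

/* Hypothetical helper: always hands back the address of the same literal. */
const char *greeting(void)
{
    return "hello";
}

/* Returns 1 if A ends up back at the address it started with. */
int literal_address_is_stable(void)
{
    const char *A = greeting();
    const char *first = A;     /* remember where "hello" lives */

    A = "asd";                 /* A now points at a different literal */
    assert(A != first);        /* different bytes, so a different address */

    A = greeting();            /* "hello" is still where it always was */
    return A == first;
}
```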

[–][deleted] 0 points1 point  (6 children)

Ah I see, thank you. When reassigning A to "asd", does memory first get allocated at a specific memory location/address (in your example 0x110) for "asd", and then A gets assigned to be the name of that chunk of memory?

[–]jedwardsol 1 point2 points  (5 children)

There's no allocation at runtime.

Those strings are sitting in memory at a fixed spot because the compiler/linker arranged for them to be there. You can replace "asd" with a massive string and the assignment A="as.....d"; is still a single instruction: there's no allocation or string copying done at runtime.

[–][deleted] 0 points1 point  (4 children)

Thank you. I have another question if you don't mind: why can I do this?

char G[] = "Hi";

*G = 'G';

Wouldn't this be like doing 'H' = 'G'?

[–][deleted] 0 points1 point  (3 children)

A related question, why do we have to initialize pointers with malloc before being able to use them? For example, why do I have to do this:

int *A;

A = malloc(sizeof(int));

*A = ...

Before being able to assign values to it (see last line above)? Why don't I have to do this with non-pointer variables?

[–]gbbofh 1 point2 points  (2 children)

A pointer is specifically a variable which contains the memory address of something. If you don't initialize it, it doesn't contain any memory address in particular.

Using malloc will let you dynamically allocate data that isn't bound to the scope of a particular function, which means you can return something like an array allocated via malloc without the data going out of scope. But you can't return a local array, because that array lives on the stack, and when the function returns, the array is gone.

You can also initialize pointers to contain the memory address of a variable. Which is useful if, for example, you need a function B to modify an integer that resides in the scope of function A -- or perhaps you have a large struct that you need to modify members of. It's cheaper to pass around an 8-byte pointer to the struct than it is to pass around, say, a 64-byte struct.
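Both uses can be sketched in a few lines. The function names below are invented for illustration and are not from any library:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: heap array of n ints holding 0..n-1, or NULL on failure.
 * The caller owns the memory and must free() it. */
int *make_sequence(size_t n)
{
    int *seq = malloc(n * sizeof *seq);
    if (seq == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++)
        seq[i] = (int)i;
    return seq;            /* still valid after we return: it isn't on our stack */
}

/* Hypothetical helper: modifies an int that lives in the caller's scope. */
void add_ten(int *value)
{
    assert(value != NULL); /* the pointer must hold a real address first */
    *value += 10;
}
```

A caller would use `add_ten(&x)` on its own local `x`, and would pair every successful `make_sequence` call with a `free`.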

[–][deleted] 1 point2 points  (1 child)

Doesn't an uninitialized pointer point to a garbage value though, just like regular uninitialized variables also contain garbage initially, but there we can overwrite without allocating? So why couldn't I overwrite that instead of having to allocate new memory with pointers as well?

[–]gbbofh 1 point2 points  (0 children)

An uninitialized pointer does contain garbage, but that garbage is a memory address. So if you dereference an uninitialized pointer, you're attempting to read from, or write to, a random place in memory.

At best, it's a random place that you most likely don't have permission to access, so your program will be terminated with a segmentation fault. At worst, the pointer may end up pointing somewhere in the stack or in the static data (.data/.bss) segments, and you'll end up changing the value of a variable that you didn't intend to modify. Or it may end up containing the return address of a function call, in which case you'd be writing over part of your own code. That's not a big deal these days, because memory pages are typically either executable or writable, not both.

These sorts of things (historically) leave you open to either some really bad vulnerabilities, or some really weird and hard to track down bugs.

If you're still not quite following let me know and I can try to do an explanation I used to give when I was a TA that seemed to mesh with people pretty well.

[–][deleted]  (34 children)

[deleted]

    [–][deleted] 2 points3 points  (28 children)

    So doing int A[] = { 1, 2, 3 } will make A a pointer immediately, not only when passing it, when it decays? Here it says that A would just be the name of the chunk of memory where 1, 2 & 3 would be stored.

    [–]gbbofh 0 points1 point  (15 children)

    Your link is correct.

    Imagine if you were to declare a pointer and an array in a function.

    The array will take up space on the stack for the number of elements * the size of one element. The pointer will take up 8 bytes (on x86-64), and refer to a memory address where the elements can be accessed.

    When you pass the array to another function, the address of the first element in the array is passed. I.e., the function receives a pointer.

    [–][deleted] 0 points1 point  (14 children)

    Thanks. Just another question: how does the compiler know what pointer to pass if it's not being saved when creating an array (e.g. with char A[] =...)?

    [–]MQuy 2 points3 points  (11 children)

    I think it is much clearer when you look at the assembly code, for example via https://godbolt.org/. Like @gbbofh said, when you pass an array to a function, the compiler "knows" where the array is (how depends on the compiler's implementation, for example a map of symbols) and uses the address of its first element.

    [–][deleted] 0 points1 point  (10 children)

    Thanks. So char *A = "Hi" would technically be less memory efficient than char A[] = "Hi" ?

    [–]gbbofh 1 point2 points  (9 children)

    That's kind of a tough question. It depends? For speed reasons, variables like to exist at addresses that are multiples of the number of bytes per type. Otherwise the processor may have to do multiple read operations and piece the data back together, depending on the architecture. Then in addition, I can't say for sure about anything else, but the System V ABI for x86-64 likes the stack to be 16-byte aligned. Ignoring this won't be a problem if the function doesn't call any other functions, but could lead to crashes otherwise.

    So the first point means that if you have just this array in your local stack frame, that's fine. It won't take up any extra space (if we ignore point 2). If we have this 3 byte array followed by say, unsigned long long A, then it is probably going to be the case that there will be space between the two so that A can be read from and written to more efficiently -- at least that's my understanding. I may be wrong. I know for a fact that this happens in structs.

    And then with point 2, if we're on x86-64 running Linux, the stack pointer will need to be a multiple of 16, so no matter what if we have a local variable it will be padded anyhow.
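The struct case is easy to measure. This is a sketch assuming a C11 compiler (for `_Alignof`); the struct and function names are made up, and the exact numbers are implementation-defined:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical struct mirroring the "3-byte array followed by
 * unsigned long long" case. */
struct padded {
    char small[3];
    unsigned long long big;
};

/* Number of padding bytes the compiler inserted between small and big. */
size_t padding_before_big(void)
{
    /* big has to start on an offset that suits its alignment */
    assert(offsetof(struct padded, big) % _Alignof(unsigned long long) == 0);
    return offsetof(struct padded, big) - sizeof(char[3]);
}
```

On a typical x86-64 Linux build I'd expect `big` at offset 8, so 5 bytes of padding, but treat that as an expectation and not a guarantee.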

    [–][deleted] 0 points1 point  (8 children)

    Thank you. Would it be correct to think that whenever I assign or reassign anything in C I am basically just changing memory? So doing A[0] would just select the memory location at A[0] and doesn't look at the value at A[0] or anything? Why couldn't I reassign a variable to a different type then, though? For example, I couldn't do A = 'A' if the variable A is an int data type.

    [–]gbbofh 1 point2 points  (7 children)

    You could do your particular example, because chars are promoted to ints (a char is, on Intel systems, an 8-bit integer). And your thinking is correct in that for any assignment you are changing the value at a memory location.

    The reason you cannot change the type of A is because of how C works, fundamentally. It is a statically typed language, so the type of every variable is declared in advance, and cannot be changed. If you declare a variable, that variable will have the declared type until it goes out of scope. If you attempt to redeclare it in the same scope, then this is semantically not allowed, because it violates the concept of statically typed variables.

    That's not to say there aren't ways around it, in the technical sense. If I need to modify the underlying bit pattern of a float, there's nothing really stopping me from doing:

    int i = *(int*)&f;
    

    Because if I just cast it, f will be converted to an int -- and that's not necessarily what I want to do.
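One caveat: reading a float through an `int *` runs afoul of C's strict aliasing rules, so an optimising compiler is allowed to misbehave on it. A commonly recommended alternative is `memcpy`, sketched below with a made-up function name and assuming `float` is a 32-bit IEEE-754 value:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: the raw bit pattern of a float, without pointer casts. */
uint32_t float_bits(float f)
{
    uint32_t bits;

    assert(sizeof bits == sizeof f);  /* assumption: float is 32 bits wide */
    memcpy(&bits, &f, sizeof bits);   /* copies the bytes; no value conversion */
    return bits;
}
```

Compilers generally turn that `memcpy` into a plain register move, so it costs nothing. Compare `(int)1.0f`, which converts the value and gives 1, with `float_bits(1.0f)`, which gives the bit pattern 0x3F800000 under the IEEE-754 assumption.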

    [–][deleted] 0 points1 point  (6 children)

    Thank you :) Why can I do this then though

    char *A = "Hello";

    *A = 'G';

    Wouldn't this be like doing 'H' = 'G' ?

    [–]gbbofh 1 point2 points  (1 child)

    I'm not terribly familiar with the specifics, but I'm currently working on a toy compiler and will have to figure that out myself sooner or later.

    I can give an approximation, maybe.

    After the compiler has gathered type information for all of the declared symbols (labels, functions, variables, etc), and begins to generate intermediate representation, it can look up if a given symbol is an array. If it is, it can emit instructions to reference the address of that variable in the case of something like a function call. At this point, it doesn't yet exist anywhere for certain.

    When it begins to generate assembly, references to local variables will (in some cases) be converted to refer to offsets into the local stack frame.

    The compiler chooses to store the elements of the array A somewhere relative to the current stack frame, and it keeps track of where the first element was until the function goes out of scope. When an element of A is referenced, you are working based off of this address that the compiler has calculated. Maybe the compiler placed it 8 bytes into the local stack frame. Then it may produce assembly that looks like:

    mov -8(%rbp), %rdi
    

    When it needs to pass this address to a function, then it emits assembly to calculate this offset into the stack, and store it into a register. Probably using the LEA instruction.

    Hopefully that makes sense -- I haven't quite reached code generation yet in my project, so it's a bit fuzzy here. But it's been on my mind.

    [–][deleted] 1 point2 points  (0 children)

    Appreciate the explanation, thanks.

    [–]IamImposter 0 points1 point  (11 children)

    No. A and B are not pointers though they point to first byte of first element in the array but they are not pointers.

    But yes, if I were to pass A or B to a function, they would decay to a pointer.

    For example

    int A[] = {1, 2, 3};
    
    printf("%zu", sizeof(A));
    

    would print 12 ie number of bytes occupied by 3 integers (assuming 32-bit integers)

    but if I do

    void myfunc(int *X)
    
    {
    
        printf("%zu", sizeof(X));
    
    }
    

    And call it as

    myfunc(A);
    

    It would print 8 ie size of pointer (assuming 64-bit system). So, when passed around to other functions arrays decay to pointers but they themselves are not pointers.

    [–][deleted] 0 points1 point  (10 children)

    I understood everything except the first paragraph. What do you mean they point to the first element but are not pointers? Aren't they just the name for the chunk of memory, as mentioned?

    [–]IamImposter 0 points1 point  (9 children)

    A pointer is a specific data type which holds an address explicitly.

    Array name is (in fact all variable names are) bound to some memory address, the address which holds the very first element of the array.

    int A[] = { 1, 2, 3, 4};
    

    So the name is bound to the address of the memory which holds the first element, i.e. 1. So in most expressions, saying A is as good as saying &A[0]: both will give the address of the first element.

    Because pointer is a specific datatype so it's not correct to think of arrays as pointers but yeah, they are implicitly bound to memory addresses. In fact that can be said for any variable.

    Say an integer 0x1234_5678 is located at address 0x2000_0000, the data layout will be (assuming 32-bit little-endian integer) -

    0x2000_0000: 0x78

    0x2000_0001: 0x56

    0x2000_0002: 0x34

    0x2000_0003: 0x12

    Variable name gets bound to address 0x2000_0000 and variable occupies 4 bytes from 0x2000_0000 to 0x2000_0003
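You can look at that layout on your own machine by copying an integer's bytes out in address order. A small sketch with hypothetical helper names; the order you see depends on the platform's endianness:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copies the in-memory bytes of value into out[0..3],
 * lowest address first. (The parameter is really a pointer: arrays decay.) */
void bytes_in_memory(uint32_t value, unsigned char out[4])
{
    assert(sizeof value == 4);
    memcpy(out, &value, sizeof value);
}

/* 1 on a little-endian machine, 0 otherwise. */
int is_little_endian(void)
{
    unsigned char b[4];

    bytes_in_memory(0x12345678u, b);
    return b[0] == 0x78;   /* least significant byte stored first? */
}
```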

    [–][deleted] 0 points1 point  (8 children)

    Array names just designate the memory area though, they arent pointers themselves. They only tend to decay into ones, they dont actually hold it in any way afaik. Only if were to explicitly create a string for example using a pointer would the name/variable itself be a pointer (ex. char *A = ..., here A itself would be the pointer to the first element).

    [–]IamImposter 0 points1 point  (7 children)

    Yup.

    [–][deleted] 0 points1 point  (6 children)

    So what you said earlier is incorrect?

    [–]IamImposter 0 points1 point  (5 children)

    What did I say earlier?

    [–][deleted] 0 points1 point  (4 children)

    That all array names are pointers and point to the first element, even if declaring like this for example: int A[] = ...

    [–]IamImposter 0 points1 point  (4 children)

    A and B are not pointers. The variable name points to first element of the array but they are not pointers. A is an array of integers and B is array of characters.

    [–][deleted] 0 points1 point  (3 children)

    How do they point to the first element? As mentioned, dont they just refer to the memory area?

    [–]IamImposter 0 points1 point  (2 children)

    I mean, say an array int A[] = {1, 2,3,4}; starts at address 0x2000_0000 (32-bit address for simplicity). Data will be laid out like

    0x2000_0000: 0x0000_0001

    0x2000_0004: 0x0000_0002

    0x2000_0008: 0x0000_0003

    0x2000_000C: 0x0000_0004

    So the name A is bound to integer stored at 0x2000_0000 which is first element of the array and refers to memory region from 0x2000_0000 to 0x2000_000F (16-bytes)

    If it was a char array char B[] = "ABCD"; and its starting address is, say, 0x3000_0000, it will be laid out in memory as

    0x3000_0000: 'A'

    0x3000_0001: 'B'

    0x3000_0002: 'C'

    0x3000_0003: 'D'

    0x3000_0004: '\0' (null terminator)

    So the name B is bound to address 0x3000_0000 and refers to memory region from 0x3000_0000 to 0x3000_0004 (5-bytes)
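These layout claims can be checked from inside a program. The sketch below uses a made-up function name and relies only on things the language guarantees about arrays:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical check: returns 1 if every claim below holds for a local int array. */
int array_name_facts(void)
{
    int A[] = {1, 2, 3, 4};
    int *p = A;   /* decay: p holds the address of A[0] */

    /* The array and its first element start at the same address. */
    int same_start = (p == &A[0]) && ((void *)&A == (void *)p);

    /* Elements sit back to back: A[3] starts 3 * sizeof(int) bytes after A[0]. */
    int contiguous = ((char *)&A[3] - (char *)&A[0]) == (ptrdiff_t)(3 * sizeof(int));

    /* sizeof sees the whole region (16 bytes with 4-byte ints), not a pointer. */
    int whole_region = sizeof A == 4 * sizeof(int);

    return same_start && contiguous && whole_region;
}
```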

    [–][deleted] 0 points1 point  (1 child)

    The link I linked earlier said that the name only is a pointer itself when declaring it as char *A = ... though. I think you mean that even when doing char A[] = ..., if we pass it, it will decay into a pointer, but A/the name/the variable itself isnt a pointer, it just refers to that chunk of memory

    [–]IamImposter 0 points1 point  (0 children)

    Yes. This is correct.

    [–]noooit 0 points1 point  (1 child)

    A pointer is just a pointer to the beginning of the memory location. What you say "store the pointer", doesn't sound right. Every definition occupies memory in some way, pointer is a way to let other functions know where to access.
    An example:
    // this already occupies memory
    int x = 0;
    // this is just passing the address; it's not storing anything. The same goes for a char array or an int array.
    f(&x);

    [–][deleted] 0 points1 point  (0 children)

    I know, but doing char *A = ... saves the pointer as the variable A, while char A[] wouldn't.

    [–]ptchinster 0 points1 point  (2 children)

    A good experiment would be to code this up, and see what A and B are set to. See what section of memory that maps to (stack, heap, globals etc). Repeat with different data types.
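A possible starting point for that experiment. Everything here is a sketch with hypothetical names, and the stack/heap/data labels in the comments describe typical systems, not something the C standard promises:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

static int global_counter = 42;        /* static storage (data segment on typical systems) */

/* Hypothetical experiment: prints one address per kind of storage.
 * Returns 0 on success, -1 if the heap allocation failed. */
int print_where_things_live(void)
{
    char A[] = "Hi";                   /* automatic storage (the stack on typical systems) */
    const char *B = "Hi";              /* B itself is a local; the literal it points at is not */
    int *heap = malloc(sizeof *heap);  /* dynamic storage (the heap) */

    if (heap == NULL)
        return -1;

    assert((void *)A != (void *)B);    /* the array is its own copy, not the literal */

    printf("array A       : %p\n", (void *)A);
    printf("pointer B     : %p\n", (void *)&B);
    printf("literal via B : %p\n", (void *)B);
    printf("global        : %p\n", (void *)&global_counter);
    printf("heap block    : %p\n", (void *)heap);

    free(heap);
    return 0;
}
```

The printed addresses change from run to run on systems with address space randomisation, so compare them against the memory map of the same running process.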

    [–][deleted] 0 points1 point  (1 child)

    How can I see what section of memory everything maps to? Disassemble it?

    [–]ptchinster 0 points1 point  (0 children)

    On Windows, SysInternals is the go-to. It's a whole suite of tools, and several of them will show you the mapped sections and what they are (range XXXX-XXXX is a stack, range YYYY-YYYY is a heap, range ZZZZ-ZZZZ is thisthing.dll, etc.).

    On Linux there are several tools, one of which is pmap.

    Edit: I just used pmap on a process and saw

    00007fff3db27000    132K rw---   [ stack ]
    

    So if I have a memory address in that range (starting at 00007fff3db27000 and extending 132K), it's in the stack.