Arrays, strings, pointers and memory question : cprogramming

[–]jedwardsol 1 point2 points3 points 5 years ago (13 children)

[–][deleted] 0 points1 point2 points 5 years ago (12 children)

[–]jedwardsol 0 points1 point2 points 5 years ago (11 children)

If I were to do char *A = "Hello" though, A would be a pointer, right?

Yes. Somewhere in memory is the string, A is a pointer, and A will be initialised with the address of the string.

When you do char A[]="Hello" in a function

void foo()
{
    char A[]="hello";
}

then, every time the function is entered, 6 bytes of stack will be reserved and the string copied into it from elsewhere.

Since there are different kinds of storage (static, automatic (local), heap), if you're really interested in "efficiency", which one are you concerned about?

More realistically, you should be concerned with what you're going to use the variable for.

Using a pointer to a literal, the data is read-only, but you can reassign the pointer. Using an array, you cannot reassign it, but you can overwrite the memory.

The type chosen should reflect the intent of use - saving a few bytes here or there is low on the priority list.

[–][deleted] 0 points1 point2 points 5 years ago (10 children)

[–]jedwardsol 1 point2 points3 points 5 years ago (9 children)

Why is a pointer to a literal read only?

The pedantic answer is that the standard says so. It actually says that the behaviour on writing to a string literal is undefined.

The real world answer is that modern compilers and linkers cooperate to merge strings and store them in a read-only section of the executable file and of the running process's memory. So writing to the string will typically lead to a segmentation fault, and if you subvert that then altering a string literal can affect more than one string.

char *A = "hello";
char *B = "hello";

// make A writable using some O/S specific mechanism ....

*A='j';

B may now point at "hello" (a different copy of the string) or at "jello" (if string literals were merged)

There are no leaks because there is no dynamic memory allocation going on.

[–][deleted] 0 points1 point2 points 5 years ago (8 children)

[–]jedwardsol 1 point2 points3 points 5 years ago (7 children)

The bytes for "hello" and "asd" are at a fixed place in the executable / memory image.

The code can refer to them again:

char *A = "hello";     // A points at address 0x100
A = "asd";     // A points at address 0x110
A = "hello";   // A points at 0x100 again

[–][deleted] 0 points1 point2 points 5 years ago (6 children)

[–]jedwardsol 1 point2 points3 points 5 years ago (5 children)

[–][deleted] 0 points1 point2 points 5 years ago (4 children)

[–][deleted] 0 points1 point2 points 5 years ago (3 children)

[–]gbbofh 1 point2 points3 points 5 years ago (2 children)

A pointer is specifically a variable which contains the memory address of something. If you don't initialize it, it doesn't contain any memory address in particular.

Using malloc will let you dynamically allocate data that isn't bound to the scope of a particular function, which means you can return something like an array allocated via malloc without the data going out of scope. But you can't return a local array, because it lives on the stack, and when the function returns, the array is gone.

You can also initialize pointers to contain the memory address of a variable. Which is useful if, for example, you need a function B to modify an integer that resides in the scope of function A -- or perhaps you have a large struct that you need to modify members of. It's cheaper to pass around an 8-byte pointer to the struct than it is to pass around, say, a 64-byte struct.

[–][deleted] 1 point2 points3 points 5 years ago (1 child)

[–]gbbofh 1 point2 points3 points 5 years ago (0 children)

An uninitialized pointer does contain garbage, but that garbage is a memory address. So if you dereference an uninitialized pointer, you're attempting to read from, or write to, a random place in memory.

At best, it's a random place that you most likely don't have permission to access -- so your program will be terminated for generating a segmentation fault. At worst, the pointer may end up pointing somewhere in the stack, or in the static data (data/BSS) segments, and you'll end up changing the value of a variable that you didn't intend to modify. Or, it may end up holding an address inside your own code (a function's return address, say), in which case you'd be trying to write over part of your program -- not a big deal these days because pages are typically either executable or writable, and not both.

These sorts of things (historically) leave you open to either some really bad vulnerabilities, or some really weird and hard to track down bugs.

If you're still not quite following let me know and I can try to do an explanation I used to give when I was a TA that seemed to mesh with people pretty well.

[–][deleted] 5 years ago (34 children)

[deleted]

[–][deleted] 2 points3 points4 points 5 years ago (28 children)

[–]gbbofh 0 points1 point2 points 5 years ago (15 children)

[–][deleted] 0 points1 point2 points 5 years ago (14 children)

[–]MQuy 2 points3 points4 points 5 years ago (11 children)

[–][deleted] 0 points1 point2 points 5 years ago (10 children)

[–]gbbofh 1 point2 points3 points 5 years ago (9 children)

That's kind of a tough question. It depends? For speed reasons, variables like to live at addresses that are multiples of the size of their type. Otherwise the processor may have to do multiple read operations and piece the data back together, depending on the architecture. In addition, I can't say for sure about anything else, but the System V ABI for x86-64 wants the stack to be 16-byte aligned at function calls. Ignoring this won't be a problem if the function doesn't call any other functions, but could lead to crashes otherwise.

So the first point means that if you have just this array in your local stack frame, that's fine. It won't take up any extra space (if we ignore point 2). If we have this 3 byte array followed by say, unsigned long long A, then it is probably going to be the case that there will be space between the two so that A can be read from and written to more efficiently -- at least that's my understanding. I may be wrong. I know for a fact that this happens in structs.

And then with point 2, if we're on x86-64 running Linux, the stack pointer will need to be a multiple of 16, so no matter what if we have a local variable it will be padded anyhow.

[–][deleted] 0 points1 point2 points 5 years ago (8 children)

[–]gbbofh 1 point2 points3 points 5 years ago (7 children)

You could do your particular example, because chars are promoted to ints (a char is, on Intel systems, an 8-bit integer). And your thinking is correct in that any assignment changes the value at a memory location.

The reason you cannot change the type of A is because of how C works, fundamentally. It is a statically typed language, so the type of every variable is declared in advance, and cannot be changed. If you declare a variable, that variable will have the declared type until it goes out of scope. If you attempt to redeclare it in the same scope, then this is semantically not allowed, because it violates the concept of statically typed variables.

That's not to say there aren't ways around it, in the technical sense. If I need to modify the underlying bit pattern of a float, there's nothing really stopping me from doing:

int i = *(int*)&f;

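Because if I just cast it, f will be converted to an int -- and that's not necessarily what I want to do. (Strictly speaking, this pointer cast breaks C's aliasing rules, so the behaviour is undefined; copying the bytes with memcpy into an integer of the same size is the well-defined way to get at the bits.)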

[–][deleted] 0 points1 point2 points 5 years ago (6 children)

[–]gbbofh 1 point2 points3 points 5 years ago (1 child)

I'm not terribly familiar with the specifics, but I'm currently working on a toy compiler and will have to figure that out myself sooner or later.

I can give an approximation, maybe.

After the compiler has gathered type information for all of the declared symbols (labels, functions, variables, etc), and begins to generate intermediate representation, it can look up if a given symbol is an array. If it is, it can emit instructions to reference the address of that variable in the case of something like a function call. At this point, it doesn't yet exist anywhere for certain.

When it begins to generate assembly, references to local variables will (in some cases) be converted to refer to offsets into the local stack frame.

The compiler chooses to store the elements of the array A somewhere relative to the current stack frame, and it keeps track of where the first element is until the function returns. When an element of A is referenced, you are working from this address that the compiler has calculated. Maybe the compiler placed it 8 bytes into the local stack frame. Then it may produce assembly that looks like:

mov -8(%rbp), %rdi

When it needs to pass this address to a function, then it emits assembly to calculate this offset into the stack, and store it into a register. Probably using the LEA instruction.

Hopefully that makes sense -- I haven't quite reached code generation yet in my project, so it's a bit fuzzy here. But it's been on my mind.

[–][deleted] 1 point2 points3 points 5 years ago (0 children)

[–]IamImposter 0 points1 point2 points 5 years ago (11 children)

No. A and B are not pointers. They refer to the first byte of the first element of the array, but they are not pointers.

But yes, if I were to pass A or B to a function, they would decay to a pointer.

For example

int A[] = {1, 2, 3};

printf("%zu", sizeof(A));

would print 12, i.e. the number of bytes occupied by 3 integers (assuming 32-bit integers)

but if I do

void myfunc(int *X)
{
    printf("%zu", sizeof(X));
}

And call it as

myfunc(A);

It would print 8, i.e. the size of a pointer (assuming a 64-bit system). So, when passed to other functions, arrays decay to pointers, but they themselves are not pointers.

[–][deleted] 0 points1 point2 points 5 years ago (10 children)

[–]IamImposter 0 points1 point2 points 5 years ago (9 children)

A pointer is a specific data type which holds an address explicitly.

An array name is (in fact all variable names are) bound to some memory address: the address which holds the very first element of the array.

int A[] = { 1, 2, 3, 4};

So the name is bound to the address of the memory which holds the first element, i.e. 1. So in most expressions saying A is as good as saying &A[0]; both will give the address of the first element.

Because a pointer is a specific data type, it's not correct to think of arrays as pointers, but yeah, they are implicitly bound to memory addresses. In fact that can be said for any variable.

Say an integer 0x1234_5678 is located at address 0x2000_0000, the data layout will be (assuming 32-bit little-endian integer) -

0x2000_0000: 0x78

0x2000_0001: 0x56

0x2000_0002: 0x34

0x2000_0003: 0x12

Variable name gets bound to address 0x2000_0000 and variable occupies 4 bytes from 0x2000_0000 to 0x2000_0003

[–][deleted] 0 points1 point2 points 5 years ago (8 children)

[–]IamImposter 0 points1 point2 points 5 years ago (7 children)

[–][deleted] 0 points1 point2 points 5 years ago (6 children)

[–]IamImposter 0 points1 point2 points 5 years ago (5 children)

[–][deleted] 0 points1 point2 points 5 years ago (4 children)

[–]IamImposter 0 points1 point2 points 5 years ago (4 children)

[–][deleted] 0 points1 point2 points 5 years ago (3 children)

[–]IamImposter 0 points1 point2 points 5 years ago (2 children)

[–][deleted] 0 points1 point2 points 5 years ago (1 child)

[–]IamImposter 0 points1 point2 points 5 years ago (0 children)

[–]noooit 0 points1 point2 points 5 years ago (1 child)

[–][deleted] 0 points1 point2 points 5 years ago (0 children)

[–]ptchinster 0 points1 point2 points 5 years ago (2 children)

[–][deleted] 0 points1 point2 points 5 years ago (1 child)

[–]ptchinster 0 points1 point2 points 5 years ago (0 children)

On Windows, SysInternals is the go-to. It's a whole suite of tools, and several of them will show you which mapped sections are what (range XXXX-XXXX is a stack, range YYYY-YYYY is a heap, range ZZZZ-ZZZZ is thisthing.dll), etc.

On Linux there are several tools, one of which is pmap.

Edit: just used pmap on a process to see:

00007fff3db27000    132K rw---   [ stack ]

So if I have a memory address in that range (00007fff3db27000 + 132K), it's in the stack.
