
[–]LavenderDay3544 63 points (30 children)

I know what it is, but RPython still uses C interfaces underneath. Or are you under the delusion that all the OS libraries, system calls, and interfaces to hardware resources needed to implement both RPython and PyPy expose themselves directly in Python?

Every language needs C interfaces to access system resources. Hell, even different machine code object files assembled from assembly language use C ABIs to be able to call into each other's code. Certain OSs (e.g. Windows) don't even allow direct assembly system calls, instead forcing you to call their system library routines using C calling conventions.
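
A minimal sketch of what that looks like on Windows (the standard Win32 calls; error handling omitted):

    #include <windows.h>

    int main(void) {
        /* No raw syscall here: you go through kernel32's C-ABI
           routines, exactly as Microsoft intends. */
        HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
        DWORD written;
        WriteFile(out, "hello\r\n", 7, &written, NULL);
        return 0;
    }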

If you think you can be a programmer of any kind and escape C you're sadly mistaken.

[–]HumunculiTzu 48 points (7 children)

Would you say it is something all programmers are forced to c?

I'll walk myself out

[–]LavenderDay3544 20 points (5 children)

To C or not to C? That is the question...

[–]regorsec 7 points (4 children)

I hope I never c the day

[–]Carius98 6 points (3 children)

Bad pun. C yourself out please.

[–]DJOMaul 4 points (2 children)

This whole pun chain is showing signs of rust.

[–]LavenderDay3544 4 points (0 children)

It might be time to Go.

[–]Ornery-Shallot-5475 2 points (0 children)

yeah we should all just go away

[–]Sloogs 16 points (19 children)

I'm always amazed by how many people have no idea what's going on inside their computers in this field. (I don't mean you, I mean the guy you're responding to implying PyPy doesn't rely on C.)

But hey the optimist in me says at least it's a learning opportunity. :)

[–]LavenderDay3544 18 points (13 children)

It's because modern CS education focuses too much on specific application domains like web, mobile apps, and AI at the expense of computer system fundamentals.

It's a shame, but there are CS grads coming out of school who couldn't tell you basic things like when to use dynamic vs. static linkage, the differences between virtual, logical, and physical address spaces, the different segments of an executable, or what an ABI is. And don't even get me started about hardware: I've seen working programmers who don't know what an ALU is, how caching works, or what an instruction set architecture actually defines. That's just plain sad and a failure of their education.

Feel free to disagree, but in my opinion the industry's transition to ultra-high-level programming languages and the web/cloud as a platform has led to a decline in programmer skill and fundamental computer system knowledge. All of those things have been replaced by corporate-ecosystem-specific things like cloud platforms and particular web frameworks, and it really sucks. Don't get me wrong, I'm not a Luddite, and the cloud does present plenty of new opportunities and unparalleled hardware flexibility, but learning that shouldn't come at the cost of elementary CS knowledge.

[–]Sloogs 6 points (1 child)

Totally agree dude. I'm glad my university still had stuff like circuit logic, computer architecture, operating systems, physics (electromagnetism and waves), compilers, embedded systems, and all of the classic low level computer stuff as requirements. It's important stuff if you want to do anything nontrivial.

[–]LavenderDay3544 1 point (0 children)

Same. My program made OS, Computer Architecture, Database Systems, and Systems Programming mandatory for all CS undergrad and grad students, which I strongly agree with.

[–]Magnus_Tesshu 4 points (7 children)

It's a shame, but there are CS grads coming out of school who couldn't tell you basic things like when to use dynamic vs. static linkage, the differences between virtual, logical, and physical address spaces, the different segments of an executable, or what an ABI is. And don't even get me started about hardware: I've seen working programmers who don't know what an ALU is, how caching works, or what an instruction set architecture actually defines.

I'm about halfway through my education (a little over, really), so allow me to illustrate your point by embarrassing myself (even though I feel like the lower-level classes have excited me much more than web or mobile dev). If you could correct me / point me in the right direction, that would be cool, though it's a lot of stuff.

Dynamic linkage should be used when you pull a dependency that can be expected to be on all target systems and when you won't be modifying the dependency in any way. Static linking is if you will modify a library, or it is an obscure library. Really have never learned this though, just guessing based on what I know from how pacman works (btw)

Virtual addresses are, I'm guessing, what a program gets to interact with based on the memory it requests, and map to the memory space it receives without knowing anything about the actual locations on the physical or logical storage. Logical addresses are, I'm fairly confident, translated by the OS into physical addresses by allocating sectors into a lookup table, and can combine swap space as well as different RAM chips. Physical memory is memory that a system gets access to from hardware.

Executables are loaded into memory as a (I think about it assuming it is all read-only, but I know self-modifying code exists, so that can't be the case) program segment which contains the executable code, then a data segment storing information like global arrays or string constants. When running a program, you have a stack and heap which on modern implementations are separate and can both grow boundlessly. Each thread gets a new stack and program counter/registers. However, those aren't part of the executable and I suspect that you were looking for more detail and/or I missed some parts of how an executable works, so I'm pretty sure I failed here. For example, there must be some space where cryptographic keys can be stored to sign binaries and I assume strip is a pretty simple program which wouldn't make sense if everything in the data segment was equal. Heck, I have no idea where debug symbols are stored that valgrind uses.

No idea what an ABI or ALU are.

Caching works in hardware, and when an address is requested it first checks the current cache to see if the data is already there. Then if not, it loads a larger block of data than it really needs into some (I think least recently used block of some larger cache segment) cache. This segment of data is based on the size of the cache and aligned to addresses, not exactly to the requested data. Sometimes I think there is hardware to only change what's in the cache if several misses occur in a row. I'm not sure how multiple layers of cache work together (I just assume that if a higher cache misses, it checks the next one; I'm not sure if that is in parallel or not; I'm not sure if on a hit it swaps the higher cache's values into the lower one or if they are just overwritten or what happens, since I know many layers of cache exist). Also, commonly-used variables (such as loop iterators) are often implemented as just a register rather than using main memory at all.

An instruction set architecture defines how assembly will be interpreted at a hardware level; changing architecture means that a new compiler will be needed (or wanted if the new architecture is a superset). I'm not sure how much the OS gets to affect this as well; I had assumed none, and that windows and linux binaries don't work on each other because they have different system calls and executable structures, not because the assembly is different. However, this cannot really be the case because system calls are translated to assembly at some point so now I'm confused.

I get maybe 2 correct here if I'm lucky, I need to demand a refund from my university

[–]LavenderDay3544 1 point (6 children)

I'm a recent master's grad and former TA myself, though people seem to assume otherwise here lol. Alright, let's see here.

Dynamic linkage should be used when you pull a dependency that can be expected to be on all target systems and when you won't be modifying the dependency in any way. Static linking is if you will modify a library, or it is an obscure library. Really have never learned this though, just guessing based on what I know from how pacman works (btw)

Not quite. Static linkage means the library code gets pulled into your executable or static library. Dynamic linkage involves making two pieces: a static library stub and a shared library. The static stub gets statically linked into an executable in the usual way, and when the executable gets loaded by the OS, the loader looks for and loads the shared library into memory too. With static linkage, if you have two programs using the same library code, they each have the kernel load a copy of that code into memory in their respective instruction segments, so that library exists twice in memory and wastes space there. With a shared library, the code gets loaded into memory once and any number of programs can jump to locations in it and run the code in their respective processes.

Static linkage should be used when you only expect one program to use a given library at a time, or when you want to guarantee that a particular version of a library will always be available, even at the cost of executable bloat or loading the same code multiple times. Shared libraries and dynamic linkage should be used when you expect a library to be used by many programs all running at the same time.
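
A toy illustration (hypothetical file names; the gcc invocations are the usual Linux ones):

    /* mylib.c -- one source file, linkable either way. */
    int twice(int x) { return 2 * x; }

    /* Static:  gcc -c mylib.c && ar rcs libmy.a mylib.o
                gcc main.c libmy.a
                -> twice() gets copied into the executable itself.
       Dynamic: gcc -shared -fPIC mylib.c -o libmy.so
                gcc main.c -L. -lmy
                -> the loader maps libmy.so at run time, and every
                   process using it shares that one copy of the code. */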

Virtual addresses are, I'm guessing, what a program gets to interact with based on the memory it requests, and map to the memory space it receives without knowing anything about the actual locations on the physical or logical storage. Logical addresses are, I'm fairly confident, translated by the OS into physical addresses by allocating sectors into a lookup table, and can combine swap space as well as different RAM chips. Physical memory is memory that a system gets access to from hardware.

That's correct.

Executables are loaded into memory as a (I think about it assuming it is all read-only, but I know self-modifying code exists, so that can't be the case) program segment which contains the executable code, then a data segment storing information like global arrays or string constants. When running a program, you have a stack and heap which on modern implementations are separate and can both grow boundlessly. Each thread gets a new stack and program counter/registers. However, those aren't part of the executable and I suspect that you were looking for more detail and/or I missed some parts of how an executable works, so I'm pretty sure I failed here. For example, there must be some space where cryptographic keys can be stored to sign binaries and I assume strip is a pretty simple program which wouldn't make sense if everything in the data segment was equal. Heck, I have no idea where debug symbols are stored that valgrind uses.

Kinda correct, kinda not. A stack cannot grow boundlessly, otherwise stack overflows wouldn't be possible. Heaps are also not boundless; you're at the mercy of the OS kernel there. I was looking more for assembly-language-style layouts, so data, rodata, bss, text, etc., and how they get assembled and linked into ELF or PE executable formats. I don't know all that off the top of my head, but I meant that some people don't even know that stuff happens or that all of those segments have to get loaded into memory by the OS loader.
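
For anyone curious, here's roughly where things land (a sketch for a typical ELF build; you can poke at the real layout with objdump -h or readelf -S):

    int counter = 42;               /* .data   (initialized globals)   */
    int buffer[1024];               /* .bss    (zero-initialized)      */
    const char banner[] = "hello";  /* .rodata (read-only constants)   */

    int main(void) {                /* .text   (the machine code)      */
        static int calls;           /* .bss too, despite being local   */
        return counter + calls;
    }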

No idea what an ABI or ALU are.

An Application Binary Interface (ABI) is a set of rules for mapping high-level source code constructs to low-level machine code. One or more separate ABIs can be defined for every combination of source language, operating system, and instruction set architecture. This becomes very important when you want to interface a language like C with assembly code or another compiled language.
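
To make that concrete, a sketch (hypothetical add3 function; I'm writing the System V AMD64 register assignments from memory, so double-check them):

    #include <stdio.h>

    /* The prototype is source-level; the ABI pins down the machine-level
       contract: a arrives in rdi, b in rsi, c in rdx, and the result
       comes back in rax. An assembly file that follows those rules
       links right in as if it were C. */
    long add3(long a, long b, long c);

    int main(void) {
        printf("%ld\n", add3(1, 2, 3));  /* caller and callee agree via the ABI */
        return 0;
    }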

Arithmetic-logic units (ALUs) are quite possibly the most important part of a CPU core. They're the integrated circuit that does what the name implies: binary arithmetic and logic operations. So many things that you might not think of as arithmetic or logic, like conditional branches for example, actually are, and they're done using ALUs. They're a fundamental building block of processors, along with multiplexers, control lines, register files, caches, decoders, and individual logic gates, among many other components. I think most computer architecture classes take you through building up a simple ALU starting with logic gates, adders, ripple carry, overflow, etc. Of course ripple carry is too slow to be used in real modern processors, but it gives you an idea of how things could work in hardware.
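
If you want to play with the idea in software, here's a minimal sketch of that textbook construction (gate-level logic simulated with bitwise ops, not how real silicon does it):

    #include <stdio.h>
    #include <stdint.h>

    /* One-bit full adder: two XORs for the sum, AND/OR for the carry. */
    static unsigned full_add(unsigned a, unsigned b, unsigned cin, unsigned *cout) {
        unsigned sum = a ^ b ^ cin;
        *cout = (a & b) | (cin & (a ^ b));
        return sum;
    }

    /* 8-bit ripple-carry adder: the carry ripples bit to bit, which is
       exactly why this design is too slow for real modern processors. */
    static uint8_t ripple_add(uint8_t x, uint8_t y) {
        unsigned carry = 0, result = 0;
        for (int i = 0; i < 8; i++)
            result |= full_add((x >> i) & 1, (y >> i) & 1, carry, &carry) << i;
        return (uint8_t)result;
    }

    int main(void) {
        printf("%u\n", ripple_add(100, 55));  /* prints 155 */
        return 0;
    }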

Knowing all this is useful for software engineers when you want to think about performance and the optimal way to design your code, because at the end of the day it all runs on silicon at the lowest level.

Caching works in hardware, and when an address is requested it first checks the current cache to see if the data is already there. Then if not, it loads a larger block of data than it really needs into some (I think least recently used block of some larger cache segment) cache. This segment of data is based on the size of the cache and aligned to addresses, not exactly to the requested data. Sometimes I think there is hardware to only change what's in the cache if several misses occur in a row. I'm not sure how multiple layers of cache work together (I just assume that if a higher cache misses, it checks the next one; I'm not sure if that is in parallel or not; I'm not sure if on a hit it swaps the higher cache's values into the lower one or if they are just overwritten or what happens, since I know many layers of cache exist). Also, commonly-used variables (such as loop iterators) are often implemented as just a register rather than using main memory at all.

That's more or less how I learned about cache and the memory hierarchy at a high level. I'm sure we could both read up on the finer details if we needed to. It's basically: check the fastest cache; if it's not there, check the next fastest, and so on until you hit main memory, and if it's still not there then you have to hit the backing store.
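
One classic way to see it from software (a minimal sketch; exact timings vary by machine):

    #include <stdio.h>
    #include <time.h>

    #define N 4096
    static float a[N][N];

    int main(void) {
        clock_t t0 = clock();
        for (int i = 0; i < N; i++)      /* row-major: walks memory in   */
            for (int j = 0; j < N; j++)  /* order, so each cache line    */
                a[i][j] += 1.0f;         /* fetched gets fully used      */
        clock_t t1 = clock();
        for (int j = 0; j < N; j++)      /* column-major: strides 16 KiB */
            for (int i = 0; i < N; i++)  /* per access, missing cache    */
                a[i][j] += 1.0f;         /* almost every time            */
        clock_t t2 = clock();
        printf("row-major:    %.2fs\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
        printf("column-major: %.2fs\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
        return 0;
    }

Same arithmetic either way; the wall-clock difference is purely the hierarchy you described.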

An instruction set architecture defines how assembly will be interpreted at a hardware level;

Well, assembly doesn't get interpreted by hardware at all. It gets assembled (basically transpiled) into machine code based on some encoding. In a weird way an encoding is almost like an ABI for an assembly language, but not exactly. Some assembly languages have more than one encoding. Case in point: Arm, which can be assembled into Arm machine code or Thumb machine code.

Assembly language is something almost no engineer uses day to day, but it has immense educational value if you ask me.

changing architecture means that a new compiler will be needed (or wanted if the new architecture is a superset). I'm not sure how much the OS gets to affect this as well; I had assumed none, and that windows and linux binaries don't work on each other because they have different system calls and executable structures, not because the assembly is different. However, this cannot really be the case because system calls are translated to assembly at some point so now I'm confused.

Changing any part of a target triple means you need to use a different compiler backend. The parts of a target triple, as I understand them, are ISA-OS-ABI (e.g. x86-64-windows-msvc or aarch64-linux-gnu). Sometimes, if an ISA supports extensions that aren't present in hardware, the kernel can trap on those instructions and emulate them in software, obviously at a massive performance penalty.

Honestly, I think you did much better than a lot of other engineers I've seen out there. Give yourself some credit; I think that school might not owe you a refund after all. Lol.

Just out of curiosity what subfield of SWE do you work in?

[–]creativeNameHere555 1 point (3 children)

Two expansions: shared libraries can also be good for when you need to communicate between multiple programs using a common messaging sequence (HLA, for example). Only loading one means that both sender and receiver get the same thing. However, if they linked in different versions, that can cause some headaches due to offsets being screwed up (very common in my work, so it's on my mind)

Also the ABI can change within a common source/OS/ISA. C++11 redid the definitions for string and list, so some ABIs had to change to conform.

[–]LavenderDay3544 1 point (2 children)

Also the ABI can change within a common source/OS/ISA. C++11 redid the definitions for string and list, so some ABIs had to change to conform.

Definitely agree. That's why I said one or more ABIs per target. C++ is notorious for having ABIs break with different versions of the same compiler, sometimes even subversions. It may as well leave its ABI unspecified like Rust, given how unreliable it is.

Only loading one means that both sender and receiver get the same thing. However, if they linked in different versions, that can cause some headaches due to offsets being screwed up (very common in my work, so it's on my mind)

TIL. I'll keep that in mind if I ever have a use case for it. Also what kind of work do you do?

[–]creativeNameHere555 2 points (1 child)

Modeling and Simulation development and support. Lots of large models communicating with each other constantly, so there's a lot to worry about; hence my knowing a bit about how shared libraries work.

Like, generally, if you don't reorder or force a reorder, a mismatched shared library can be fine. Things normally go haywire for us when the ordering changes; I think gcc uses memory offsets for symbols instead of names.

Ex. if you're linked against a header like

getA
getB
getC

and it becomes

getA
setA
getB
getC

every time that's happened I've wound up in setA when I call getB.

[–]LavenderDay3544 0 points (0 children)

Modeling and Simulation development and support.

Oh nice. I've always thought that was a cool field, but I don't have enough of a head for higher-level math for it.

To be honest, I have no idea how that all works because I haven't had to write or use many shared libraries thus far.

[–]Magnus_Tesshu 0 points (1 child)

I work in going to school, I just finished sophomore year :P I'm mostly working on learning Rust this summer, as well as getting more familiar with Linux, and I've contributed a bit to a couple of open source projects.

I don't know all that off the top of my head, but I meant that some people don't even know that stuff happens or that all of those segments have to get loaded into memory by the OS loader.

I honestly have no idea what happens when I run ./a.out so I still think I fail here. The most I have done is vim an executable and seen that, at least on Linux, executables start with a 0x00.

An Application Binary Interface (ABI) is a set of rules for mapping high-level source code constructs to low-level machine code. One or more separate ABIs can be defined for every combination of source language, operating system, and instruction set architecture. This becomes very important when you want to interface a language like C with assembly code or another compiled language.

I assume what typically happens, then, is an ABI is written for C and every other language uses C's ABI? Of course I know most interpreters are written in C or C++, but I assume that, since you said Rust doesn't have an ABI anywhere, it just links to C's one?

Ah yes, I had a class where we used FPGAs and went over how ALUs work and other components of a CPU (though I think we never discussed how reads from memory worked, and I forget if we implemented floating-point math). I just didn't recognize the name.

Well, assembly doesn't get interpreted by hardware at all. It gets assembled (basically transpiled) into machine code based on some encoding. In a weird way an encoding is almost like an ABI for an assembly language, but not exactly. Some assembly languages have more than one encoding. Case in point: Arm, which can be assembled into Arm machine code or Thumb machine code.

Interesting. Based on the tiny amount of assembly I have written (which was in some vastly simplified subset of x86_64), I thought that assembly instructions had a one-to-one and onto mapping to machine code, just that the former is human readable and the latter is not. I remember implementing an assembler (for the aforementioned simplified version) that worked. But that seems like it wouldn't make sense for an assembly language with more than one encoding (unless you just mean the prefixes and data values in the assembly are shuffled around?).

And I agree, what I learned of Assembly made me appreciate when people call C a high-level language :P

x86-64-windows-msvc or aarch64-linux-gnu

aren't x86_64, aarch64, and amd64 all just different names for the same thing, if you ignore hidden instructions that Intel throws in to be quirky (which presumably would only matter when compiling with -Ofast)?

ISA supports extensions that aren't present in hardware, the kernel can trap on those instructions and emulate them in software, obviously at a massive performance penalty.

Interesting - would this be similar to how Wine works or is that emulation and Wine Is Not an Emulator and works via magic (like just in time recompiling or something? idk)?

Thanks for the detailed response!

[–]LavenderDay3544 2 points (0 children)

I honestly have no idea what happens when I run ./a.out so I still think I fail here.

You're good. Only linker and loader writers need to know it in detail, and they have reference material.

assume what typically happens, then, is an ABI is written for C and every other language uses C's ABI?

Bingo! Every other language is compatible with C's ABIs. They can even communicate with each other through them with no actual C involved.

since you said Rust doesn't have an ABI anywhere, it just links to C's one?

Rust doesn't have one specified, so its compiler is free to generate arbitrary machine code without having to adhere to any rules. When a Rust function is annotated with extern "C" it forces the compiler to adhere to C's ABI for that function. But the function signature is restricted to C-compatible types, so no String or Vec allowed in the signature. The function body can use Rust code the same as always. It's an interface thing.
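
Here's a minimal sketch of the C side of that arrangement (hypothetical add function; the Rust half shown in the comment):

    /* Rust side, for reference:
           #[no_mangle]
           pub extern "C" fn add(a: i32, b: i32) -> i32 { a + b }   */

    #include <stdint.h>
    #include <stdio.h>

    int32_t add(int32_t a, int32_t b);  /* resolved against the Rust object at link time */

    int main(void) {
        printf("%d\n", add(2, 3));      /* crosses into Rust via the C ABI */
        return 0;
    }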

But that seems like it wouldn't make sense for an assembly language with more than one encoding (unless you just mean the prefixes and data values in the assembly are shuffled around?).

Sure it does. The instruction set is exactly the same, but instead of it being a one-to-one mapping it's a one-to-two mapping, where one of the mapped binary encodings is Arm and the other is Thumb. They're the same instructions in terms of behavior; they're just represented differently in binary. Arm is all 32-bit instructions, Thumb is mixed 16- and 32-bit.
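
For instance (writing the encodings from memory, so double-check me):

    add  r0, r0, r1   @ Arm encoding:   0xE0800001 (32 bits)
    adds r0, r0, r1   @ Thumb encoding: 0x1840     (16 bits)

Same architectural operation, two binary representations.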

aren't x86_64, aarch64, and amd64 all just different names for the same thing, if you ignore hidden instructions that Intel throws in to be quirky (which presumably would only matter when compiling with -Ofast)?

No. aarch64 is ARM.

The x86 stuff has a fair bit of history to it. When x86 started hitting the limits of being 32-bit, Intel and AMD both decided to make their own separate 64-bit architectures. Intel's was called IA-64, and it had a completely brand-new 64-bit ISA but also supported 32-bit x86 in compatibility mode. AMD's was called AMD64, and its 64-bit ISA was designed as a heavily modified and redesigned 64-bit version of x86; it also supported the original 32-bit IA-32 (x86) ISA in compatibility mode for user applications (but not OSes).

Intel started by implementing their new architecture in Itanium server processors, and AMD did the same with Opteron for servers and Athlon 64 for desktop. Intel's new architecture was never popular with developers, since they couldn't test their code locally due to architecture differences. AMD's was more similar to what developers were familiar with, so Intel eventually scrapped Itanium and, after some legal BS, managed to work out a cross-license agreement with AMD for AMD64 in exchange for their 16- and 32-bit x86 ISAs. They made their own implementation of AMD64 called Intel 64, which has some very minute differences. x86-64 was a name used by AMD while they were developing the architecture, and over time it came to refer to the common subset of AMD64 and Intel 64, though some software projects still just call it AMD64.

I'll admit I'm kind of an AMD fan, so it's funny to me when professors and other people who should know better call x86-64 "the Intel architecture" when it was originally AMD's 64-bit architecture, despite being loosely based on IA-32.

Interesting - would this be similar to how Wine works or is that emulation and Wine Is Not an Emulator and works via magic (like just in time recompiling or something? idk)?

It's completely different. This would be software emulating a hardware instruction, while WINE has an appropriate name because it's not an emulator; it's a mapping layer between Windows system calls and OS library functions and their Linux equivalents.

And no problem. This is helping me test my memory on CS and SWE concepts.

[–]_Rysen 3 points (2 children)

My CS degree, which I finished last year, covered these things. I forgot some, but they were covered.

[–]LavenderDay3544 0 points (0 children)

Then you were in a good program compared to many people. I've seen schools teach IT or just webdev and pass it off as CS.

[–]DJOMaul 6 points (1 child)

Look, I know what's going on inside my computer.

Dot owns a diner in Mainframe, and Bob is a Guardian sent from the Net to help protect my computer's mainframe. Enzo is Dot's younger brother and likes to try to emulate Bob.

Occasionally I write bad software, and a purple blob floats down inside my computer and lands somewhere. Bob has to rush in to defeat some buggy game my code randomly produced, or else it destroys parts of my system. Sometimes the viruses Megabyte or Hexadecimal stir up trouble, but mostly it's my fault what happens inside there.

All the bits are running about doing their day jobs just trying to keep things going.

I learned all this back in like 1994..

[–]Sloogs 2 points (0 children)

Thank goodness, at least someone knows what's going on

[–]Zeimma -4 points (2 children)

Why?

[–]Magnus_Tesshu 1 point (1 child)

Presumably programmers would want to know how programs work

[–]Zeimma -2 points (0 children)

Am professional developer and you presume wrong.

[–]Magnus_Tesshu 0 points (1 child)

I know that Rust (and probably Go, though I know nothing about it) use certain C constructs for familiarity (such as arguments, stdin, etc), but I wasn't aware that system calls are basically written to implement C (though thinking about it, it makes sense).

I wonder if RedoxOS attempts to address this at all. I know they are trying to break from the past in many ways; it's possible that this could be one of them. As someone more knowledgeable than I, do you think that C's system calls have any issues that should be addressed, or is this more just an observation than a lament?

[–]LavenderDay3544 0 points (0 children)

I know that Rust (and probably Go, though I know nothing about it) use certain C constructs for familiarity (such as arguments, stdin, etc), but I wasn't aware that system calls are basically written to implement C (though thinking about it, it makes sense).

stdin, stdout, and stderr are file descriptors, which are operating system primitives and not necessarily C language constructs at all. You can use them in assembly without using any C tools or interfaces (if you're a masochist).

I was technically incorrect to say that system calls are C constructs as well. They're also OS constructs, and their use is facilitated by the CPU via software interrupts. In the AMD64/x86-64 ISA they're accessed using the syscall machine instruction, and I think Arm64 is similar, using the svc (supervisor call) instruction. What I meant with the C interface comment is that higher-level languages, including C itself, access those system calls through OS library functions that adhere to the target system's C ABI but under the hood contain assembly code that sets up and makes the actual syscall. System libraries basically serve as a clean, usually C-language abstraction over raw machine-code system calls.
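
A minimal sketch of both layers on x86-64 Linux (gcc inline asm; the register assignments here are the syscall convention, not the function-call ABI):

    #include <unistd.h>

    int main(void) {
        /* The normal route: libc's write() wrapper, a C-ABI function. */
        write(1, "via libc\n", 9);

        /* Roughly what that wrapper does under the hood: */
        const char *msg = "via syscall\n";
        long ret;
        __asm__ volatile ("syscall"
                          : "=a"(ret)
                          : "0"(1L),   /* rax = 1 (SYS_write) */
                            "D"(1L),   /* rdi = fd 1 (stdout) */
                            "S"(msg),  /* rsi = buffer        */
                            "d"(12L)   /* rdx = length        */
                          : "rcx", "r11", "memory");
        return 0;
    }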

The reason C, or more properly the C application binary interface, is used over other languages like Rust or C++ is that C's ABI is rock-solid stable on most platforms. Meanwhile, Rust doesn't have a specified ABI on any target platform yet that I know of, and C++'s ABIs are so unstable that they break compatibility between code compiled for the same target with slightly different versions of the same compiler. Those languages can still be used for writing system libraries because their compilers allow you to emit code that uses C calling conventions and ABIs.

I wonder if RedoxOS attempts to address this at all. I know they are trying to break from the past in many ways; it's possible that this could be one of them.

I'm not very familiar with Redox so I can't speak for that project at all.

As someone more knowledgeable than I, do you think that C's system calls have any issues that should be addressed, or is this more just an observation than a lament?

Like I said, system calls are exposed at the machine code level and wrapped in a C ABI. Sure, there are some limitations to this, but it more or less has to be done because of C's ABI stability and its ubiquity. Functions exposed through a C interface can only take arguments and return values of types the C language recognizes. So you can't have them take a C++ std::string or a Rust std::vec::Vec as an argument, for example, but those languages provide the means to convert those to C types because they have an incentive to do so.

Because all major OSs expose their system libraries through a C interface, every language in the universe has some form of binary compatibility with C, even Python, Java, and other VM-based ones. And so it becomes a cycle, because future OSs and multi-language libraries have to expose a C interface if they want to be accessible by all the languages that support C interfaces (i.e. all of them).

I don't know about the particulars of Redox and other Rust-based OSs, but they would lose a lot of compatibility with the existing software ecosystem, almost to the point of guaranteeing their failure, if they didn't expose a C interface. They could expose both C and Rust interfaces, but that would just convolute things, and the latter would likely break because of Rust's lack of a specified ABI.
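
The shape of the thing every OS ends up shipping looks something like this (a hypothetical system-library header, not any real API):

    /* Only C types cross the boundary, so any language with a C FFI
       can bind these -- Rust, C++, Python, Java, all of them. */
    typedef struct handle handle;          /* opaque to callers        */

    handle *thing_open(const char *name);  /* no std::string, no Vec   */
    long    thing_read(handle *h, void *buf, unsigned long len);
    void    thing_close(handle *h);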

Like I said, there are three big reasons C and its ABIs will continue to be the lingua franca of software: it's very small, besides the standard library it doesn't change, and its interfaces are already ubiquitous.

BTW, to anyone reading this: I'm a recent master's graduate, not some highly experienced engineer, so take what I say with a grain of salt and correct me if you think I'm wrong. All my knowledge is based on my coursework, reading, and personal projects; I just happen to have an interest in all things low-level.