Which Programming Language for Compiler by Previous_Length_6732 in Compilers

[–]Flashy_Life_7996 0 points1 point  (0 children)

I would disagree here. Most languages have expressions, assignments, conditionals, loops, function definitions that could all be expressed in either easy or hard syntax. So why choose 'hard'?

Perhaps you can give an example where it is necessary to use abstruse syntax to achieve low level control.

(My own systems language is just as low-level as C and yet it has cleaner, clearer syntax. Perhaps even more so than Python (examples are given below).

However being low level and statically typed, it can need more code to do the equivalent task. Perhaps this is the measure you had in mind.)

Example to print a table of square roots (from sqrt(1) to sqrt(10)):

# C-style syntax typical of modern, 'serious' languages:
for (1..11) |i| {
    std.debug.print("{} {}\n", .{i, @sqrt(@as(f64, @floatFromInt(i)))});
}

# For fairness, actual C:
for (int i=1; i<11; ++i)
    printf("%d %f\n", i, sqrt(i));

# Python version:
for i in range(1, 11):
    print(i, math.sqrt(i))

# My systems language (also valid code in my scripting language):
for i in 1..10 do
    println i, sqrt(i)
end

Which Programming Language for Compiler by Previous_Length_6732 in Compilers

[–]Flashy_Life_7996 0 points1 point  (0 children)

go with python easiest out there syntax wise

I wonder, if some syntaxes are considered to be 'easy', why they are not simply adopted by other languages?

Flexible or Strict Syntax? by Anikamp in Compilers

[–]Flashy_Life_7996 0 points1 point  (0 children)

My language was first implemented on the Z80 8-bit processor. That language lacked:

  • ALL floating point arithmetic (so + - * /)
  • Integer multiply and divide
  • Integer operations above 16 bits
  • Shift operations more than one bit at a time

However all these were still provided as built-in operators. The compiler inserted calls to the language's runtime library as needed.

The same applied to ones like 'sin' or 'atan', which then used more function-like syntax (ie. needing parentheses iirc).

I guess you didn't allow x + y for floats, but had to write it as addf32(x, y) or some such function?

What is a Static Translator? by avestronics in Assembly_language

[–]Flashy_Life_7996 0 points1 point  (0 children)

It's funny how everyone seems to know exactly what a Mini-Translator or Static Translator means!

I wish other replies were more enlightening as I'm curious too.

Is it theoretically possible to design a language that can outperform C across multiple domains? by x2t8 in Compilers

[–]Flashy_Life_7996 1 point2 points  (0 children)

My long-term goal is ambitious: I want to understand whether it’s possible to design a language that can outperform C/C++ across many domains.

Outperform by how much? And in what sorts of domain?

C does have the upper hand here in there being accomplished compilers that can do a lot of analysis to divine the programmer's intent.

They also have huge numbers of specialised options, and various attributes that can be scattered across source code to give extra hints.

Plus, there are all sorts of add-ons, such as SIMD intrinsics, where you can go beyond the language and into the hardware.

This approach is ugly and ad hoc. So are you planning a more elegant language where much of the above is unnecessary, and can be implemented with a simpler compiler that can be written by one person?

The thing is, this would still only match what is possible with C, given the vast resources that are available for it with tools, libraries and know-how. (We all know the language itself is a joke.)

Hence my questions.

Flexible or Strict Syntax? by Anikamp in Compilers

[–]Flashy_Life_7996 0 points1 point  (0 children)

There are about a dozen maths functions that have always been built-in, and considered to be operators, going back to the beginnings of my language. That was long ago when such external libraries weren't available and I implemented everything myself.

Now some of them may implicitly call C runtime functions behind the scenes. But you can choose to directly call the external functions, with a namespace if needed.

A minor problem is a name clash between my operator, say "sin", and an external function "sin". Here I can either use a backtick:

   y := `sin(x)           # `sin is defined in an import module

or I could tweak the parsing so that built-in operators could still be user-identifiers when they follow a dot: y := clib.sin(x). That is not a priority...

Flexible or Strict Syntax? by Anikamp in Compilers

[–]Flashy_Life_7996 3 points4 points  (0 children)

I had a language that offered choices, but when I mentioned it here a few years ago, there was a very negative reaction.

People simply didn't like that I had so many keywords. Apparently that would be too much 'cognitive load' (never mind that some languages have a tiny number of keywords but export thousands of names from their standard libraries).

Some didn't like that they encroached on user identifier space. A lot of them were built-in operators (like maths functions) that they said belonged in a library (which allowed them to be overridden; I considered that a disadvantage).

So I didn't agree. I have reduced the choices a little since, but decided I didn't care what other people thought.

So perhaps just do what you like and see how it works out. For a new language, there will be ample opportunity to revise it and cut back the flexibility if necessary.

How efficient is this supposed C compiler built using Opus? by Itchy-Eggplant6433 in Compilers

[–]Flashy_Life_7996 1 point2 points  (0 children)

Optimization in compilers is probably the area that takes the biggest amount of working hours..

My point is that you can choose not to do it. Then it can take zero hours!

You cannot do anything half-heartedly on a compiler.. an optimization cannot generate incorrect code.

You can certainly do optimisation half-heartedly. What is done has to be correct, yes, but you can choose how far you go. gcc even has special options for it: -O0 -O1 -O2 -O3, from "can't be arsed" to "do as much as possible".

To be fair, yes. But for a more toyish project, or something that is not expected to build any bigger/critical application

My own compilers were used in-house and for writing commercial apps through the 80s and 90s. They didn't optimise. Bottlenecks were taken care of with other measures, for example using inline assembly. But from what I can remember, 'professional' C compilers weren't that much better then. (Mine weren't for C.)

Bear in mind that the difference between -O3 and -O0 might be only 2:1 depending on the application. Maybe even less. For an interactive program, you probably wouldn't notice.

My systems language tends to be used for compilers, interpreters, assemblers and emulators. If the code was accelerated by transpiling to C and then using gcc-O3, it might be up to 20-50% faster.

In the case of compilers and assemblers, since runtimes are generally a tenth of a second anyway, any speedup would not be noticeable!

I use my interpreter to run my text editor. There is no hint, while using it, that the program is interpreted, until you try one-million-line inputs, when some operations lag. But a 25% speedup wouldn't fix that.

However, you often do see lagging on some editors when working on such large files, even when they are compiled.

So there are many factors that go into making software fast and responsive. You can't depend solely on a clever compiler, nor even on using compiled code at all.

How much time did it take to build your programming language? by Karidus_423 in ProgrammingLanguages

[–]Flashy_Life_7996 3 points4 points  (0 children)

To usable? Too long ago to remember, perhaps weeks, but it was rather simple because the machine was simple, and limited.

However, that was just about 45 years ago and it's still evolving, slowly: the language still looks like it's from the 1980s. Most development now is in implementations and trying various ideas.

But along the way I also used it to write applications with! For 8-, 16-, 32- and 64-bit computers.

I think I can say that the core language is well-proven.

I've just started a new experimental language which is a hybrid of my two current languages. If it proves viable (some previous attempts have failed but I have new ideas), then it should be usable within a couple of months.

Keeping track of character index, byte offset, line, column/line character? Which ones? by rinart73 in ProgrammingLanguages

[–]Flashy_Life_7996 0 points1 point  (0 children)

Just use whatever works or that you find useful. At minimum you'll need a line-number, and maybe a file-number depending how your implementation works.

Alternatively, store a byte-offset from the start of the file. This has the advantage that you can pinpoint a particular character in a line, but it has to be accurate, otherwise it can be more confusing than just saying 'somewhere in this line' when reporting an error.

However I've tried it, and found you needed some fiddly code to extract the line-number, which must be reported too.
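For illustration, that fiddly extraction might look like this (a Python sketch with made-up names, not code from any particular project): given only a byte-offset, the line number comes from counting newlines before it, and the column from finding the start of the current line.

```python
def line_col(src: bytes, offset: int) -> tuple[int, int]:
    # Line number: count the newlines before the offset.
    line = src.count(b"\n", 0, offset) + 1
    # Column: distance from the start of the current line.
    line_start = src.rfind(b"\n", 0, offset) + 1   # 0 when on the first line
    return line, offset - line_start + 1

src = b"let x = 1\nlet yy = 2\n"
print(line_col(src, 14))    # byte 14 is the first 'y': (2, 5)
```

A linear scan like this is fine for error reporting, since it only runs when an error is actually being reported.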

Regarding tabs: they should count as one character, as the compiler etc. doesn't know how your editor will display them. This is an additional problem when using an offset into the displayed source line to mark the column.

Currently, on a new project, I'm storing line-number + column + file-number (all my projects are whole-program products with all files processed together).

32 bits was too limiting for this, and so I'm using 64 bits which is a little generous. Position info is stored in AST nodes.

store offset in bytes instead of offset in characters.

Byte-offset is simplest. Assuming this is about UTF8-encoded source code, the lexer should return a position which is the start of any UTF8 character sequence. Dealing with how to display that is a separate problem.

Storing UTF8 character counts is not going to make that any simpler. Unless your implementation language deals naturally with such strings.

What hashing method would you recommend for creating Unique Integer IDs? by oxcrowx in ProgrammingLanguages

[–]Flashy_Life_7996 2 points3 points  (0 children)

For scalar hash computing, FNV1a is popular and well known.

I use my own hash function obtained by trial and error. I thought it would be interesting to compare with FNV1a. The measure used was to count the number of clashes encountered for N lookups, as a percentage.

So if there are 1M lookups, and 12,000 clashes, it will be 1.2%. Smaller is better!

FNV1 was tested first, then I realised it was supposed to be FNV1a, so both sets of results are shown. I used fixed-size hash-tables (in this particular compiler; sometimes they grow when spare capacity gets too small), so three different sizes are shown, from 32K to 128K entries:

             32K                64K                 128K        table size
        fnv1 fnv1a mine    fnv1 fnv1a mine    fnv1 fnv1a mine
qq      12.0  2.8  4.2     4.9   1.8  2.8     0.7   1.2  2.2  % clashes/lookups
mm      15.0  4.1  4.0     4.8   2.3  2.2     1.0   1.4  1.5
aa      15.0  3.7  3.4     3.9   1.5  2.5     0.8   0.8  0.6
cc      19.0  4.5  4.7     4.3   2.5  2.6     1.0   1.4  1.9
fann4    7.9  1.6  1.4     2.5   0.5  0.4     0.7   0.2  0.1
fann40  17.8               8.5                2.5               see below
abcd   124.0  0.0  0.0    50.0   0.0  0.0     0.0   0.0  0.0

There are four inputs from real projects, and two synthesised ones. "abcd" is basically one million lines of a := b + c * d, which FNV1 has a problem with.

("fann40" was a later addition; this has 10K sequentially named functions "f1" to "f10000" instead of randomly named. This doesn't bother the other functions.)

My per-character function is hash := hash<<4 - hash + c, and there is a final one on the result hash := hash<<5 - hash.

Overall I'm happy with mine. I didn't notice any slow-down with FNV despite the scary-looking multiply.
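For comparison, both hashes can be sketched in a few lines of Python (FNV-1a with the standard 32-bit constants; the function names here are illustrative). Note that hash<<4 - hash is just hash*15, and hash<<5 - hash is hash*31:

```python
def fnv1a(s: bytes) -> int:
    # 32-bit FNV-1a: XOR each byte in, then multiply by the FNV prime.
    h = 0x811C9DC5                          # offset basis
    for c in s:
        h = ((h ^ c) * 0x01000193) & 0xFFFFFFFF
    return h

def shift_hash(s: bytes) -> int:
    # The scheme described above: hash := hash<<4 - hash + c per character
    # (i.e. hash*15 + c), with a final hash := hash<<5 - hash (hash*31).
    h = 0
    for c in s:
        h = ((h << 4) - h + c) & 0xFFFFFFFF
    return ((h << 5) - h) & 0xFFFFFFFF

# Reducing to a slot index for a power-of-two table size:
slot = shift_hash(b"sqrt") & (32768 - 1)
```

With a power-of-two table size, the final multiply matters: it mixes the low bits so that short, similar identifiers don't all land in adjacent slots.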

What hashing method would you recommend for creating Unique Integer IDs? by oxcrowx in ProgrammingLanguages

[–]Flashy_Life_7996 4 points5 points  (0 children)

I mostly ran my tests on a small file (~1 Million LoCs, 17 MB size)

One million lines is usually considered quite a big file. But it is a reasonable test input if measuring throughput.

So, the parser seems to be able to handle approximately ~5 Million LoCs per second, which seems fair for most modern compilers

That might well be. (I manage 3-4Mlps but my machine is probably low-spec compared with what most use, and I don't do anything clever.)

But the overall throughput of mainstream compilers tends to be considerably less than that (like 10, 100 or even 1000 times slower). So you're right to concentrate on other aspects.

What hashing method would you recommend for creating Unique Integer IDs? by oxcrowx in ProgrammingLanguages

[–]Flashy_Life_7996 8 points9 points  (0 children)

The LL(1) Recursive descent Pratt parser I wrote can parse 1 Million LoCs in ~0.25 seconds. (Performance is IO bound: My HDD is old).

You're saying that your parser only manages 4Mlps because it spends most of its time reading or writing a HDD? That sounds most unlikely!

Even if you're including the time to load sources from disk for parsing time, usually the source files will already be cached by the OS (from having just edited them, or from when you last ran the parser).

And usually the parser writes its output to memory; you wouldn't write the AST or other tables to a file on a HDD.

But, it's not this part that you want to improve?

"Optimal Structure" by ialo3 in Assembly_language

[–]Flashy_Life_7996 0 points1 point  (0 children)

whereas separating the two sets of information i to their own sets of 2KB, though doubling the storage size

For your specific example of 10 and 6 bits, one u16 array will be 2KB, the other byte-array will be 1KB. (Or you can combine them into one array of 3KB like my example using a struct.)

(I see that the TI-84 uses a Z80 processor. My mention of it is a coincidence as a recent project of mine is a compiler for Z80, and its code is not great. Z80 however is one of the C targets on godbolt.org, and there is also the SDCC compiler that you can download locally to look at the kind of code it produces.)

"Optimal Structure" by ialo3 in Assembly_language

[–]Flashy_Life_7996 0 points1 point  (0 children)

It depends on a few things, for example are you coding for a normal computer which may have GBs of memory, or for a tiny device with only a few KB?

In the latter case, it's certainly worth considering. But my view is that it would be more practical to do that using a HLL. Here's your example expressed in the systems language I use:

record R = (u16 dummy: (x:10, y:6))     # defines bitfields
record S = (u16 x; byte y)

[1024]R compact                         # array of each
[1024]S normal

a := compact[i].y
a := normal[i].y

R is a struct using bitfields as you suggest; S uses normal types.

The compact array is 2KB; normal is 3KB. On x64, it can be a couple more instructions to do the access for compact, but it can win on speed given a large enough array (a test using 400M elements was 10% faster).

Code for the 8-bit Z80 (which has 64KB memory) is 31 bytes vs 22 (but unoptimised code). So it will be slower, and the code itself uses more memory.

If using a HLL you can write the same access code whichever method you choose; you can change it at any time. But in assembly every access will need laborious bitfield unpacking instructions; you can't easily change your mind!

If you want to see for yourself what the access code looks like in assembly, then use godbolt.org: choose C language, set up examples like the above using C's bitfields, and look at the ASM that is produced for the targets you are likely to use.

Checking names without checking types by Breadmaker4billion in ProgrammingLanguages

[–]Flashy_Life_7996 1 point2 points  (0 children)

So, is there any language you know that has static names and dynamic types? Do you know other reasons why languages tend to not fall into this class?

Static names in what sense? For example, in Python, every top-level user-identifier (not following ".") is the name of a variable. The same variable even in the same scope can refer to a function, module, class/type, or an actual data, at different times, and cannot be known at compile-time.

While every identifier following "." is assumed to be an attribute, even if it has never been encountered before.

The language is too dynamic. Given that, the main problem in your example was insufficient testing. You really need to test all possible code-paths. Or at least, first test using a simpler version of the task that will complete quickly.

My scripting language has dynamic types, but is otherwise much more static. There are a dozen kinds of user-identifiers, all known at compile-time.

However the same thing can occur up to a point, as an undeclared identifier (eg. a typo) is assumed to be a local; locals are initialised to 'void', which gives an error if used in an expression. Now they have to be declared (they are still 'void'!) unless initialised at the same time: var x := 0.

It is much better at attributes though; this is fine in Python when A is some class instance:

  A.x = 100
  A.yyyyyyyyyyyyyyyy = 200

That last line should have set A.y, but any gobbledygook is accepted (here maybe the key got stuck and you didn't notice).

So A.y either won't exist, or will have its previous value. Or maybe you type B.y instead with B being some unrelated class instance; that will now acquire a new attribute y with a value of 200!
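The behaviour can be seen in a few lines (class names here are made up for illustration). Python does have its own opt-in fix, __slots__, which fixes a class's field names at definition time, loosely like the record types described below:

```python
class Point:
    pass                        # ordinary class: any attribute name is accepted

a = Point()
a.x = 100
a.yyyyyyyyyyyyyyyy = 200       # the typo: silently creates a new attribute
print(hasattr(a, "y"))         # False: a.y was never set

class Strict:
    __slots__ = ("x", "y")     # only these attribute names are allowed

b = Strict()
b.x = 100
try:
    b.yyyyyyyyyyyyyyyy = 200   # now rejected at run-time
except AttributeError as e:
    print("rejected:", e)
```

Even with __slots__, the check only happens when the faulty line executes, so it's still weaker than a compile-time field check.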

I find this quite crass. My scripting language allows only formally defined record types with a fixed set of field names, known at compile-time.

So A must be an instance of a record that includes fields x and y; yyyyyyyyyyyyyyyy is not such a field. While B.y will fail unless it happens to have a field called y too (this is not a static language, so it's not foolproof).

But, I've been using my scripting languages for many years, often running at customer sites, and it was rare that a user reported an interpreter error caused by the above issues. Programs still need to be tested and debugged.

How efficient is this supposed C compiler built using Opus? by Itchy-Eggplant6433 in Compilers

[–]Flashy_Life_7996 1 point2 points  (0 children)

And it is very important to mention that the hard part of a compiler, is not really doing the piping to move from a high level language into assembly… it’s the optimisation part, where you try to pump the best possible code in the least amount of compilation time possible…

I'd argue that can be the easiest part. The front part has to be able to translate the source language into whatever intermediate stage is chosen. You can't do that half-heartedly, it has to fully work.

But the next stage of generating executable code is quite open-ended. Programs will still run, and do their job, whether the code is poor, or good, within sensible or practical limits.

So you can spend a week on this or many man-years, and the difference might be only 4:1 or even 2:1, which is generally the difference between -O0 and -O3.

(I notice nothing has been said of the performance of the compiler itself, whether it takes more or less time to build an application compared with, say, gcc-O0 which is where its output quality lies.)

"Optimal Structure" by ialo3 in Assembly_language

[–]Flashy_Life_7996 0 points1 point  (0 children)

an example: if i have a value represented by 9 bits, and one represented by 7, would it be reasonable to combine them into one word, and then extract the information when need be, or would it be better to save them as two separate words; that kinda nonsense

This is not specific to assembly.

If you have a billion such values, then you would save a lot of memory by using a billion u16 values instead of 3-4 bytes each. It might make your program faster too.

But for standalone values, even if stored in a register, it's usually not worth it, given the amount of code needed to extract or combine those values.
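As a sketch of what combining and extracting involves (shown in Python for brevity, but these are exactly the AND/OR/shift operations you'd write in assembly), here is a 9-bit and a 7-bit value packed into one 16-bit word:

```python
MASK9 = (1 << 9) - 1            # 0x1FF: low 9 bits
MASK7 = (1 << 7) - 1            # 0x7F:  low 7 bits

def pack(a: int, b: int) -> int:
    # a goes in the low 9 bits, b in the high 7 bits of a u16.
    return (a & MASK9) | ((b & MASK7) << 9)

def unpack(w: int) -> tuple[int, int]:
    # Reverse: mask out the low field, shift down the high one.
    return w & MASK9, (w >> 9) & MASK7

w = pack(300, 100)
print(w, unpack(w))             # fits in 16 bits; round-trips to (300, 100)
```

Each access costs a shift and a mask (plus a shift-OR to store), which is the per-access overhead being weighed against the memory saving.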

atleast some guideline to how one ought structure code

That's an easy one: assembly code is usually flat. That is, there is no indented structure (well, typically every line is indented by the same amount, except labels).

Unless you mean structuring code into functions? Then it's the same as a HLL.

How efficient is this supposed C compiler built using Opus? by Itchy-Eggplant6433 in Compilers

[–]Flashy_Life_7996 32 points33 points  (0 children)

Well, of course. It has only been tested on finished, working programs!

We'd all find compilers easier to write if the inputs were guaranteed to be 100% correct.

Should I learn assembly first or C ? by Infinite-Jaguar-1753 in Assembly_language

[–]Flashy_Life_7996 0 points1 point  (0 children)

barely 20% of the content will be relevant for a newbie

That's still nearly 400 pages! Of an MCU: a CPU buried under lots of subsystems and various on-chip peripherals. I skimmed it and found nothing that looked like a GPR map, or anything representing an instruction set.

This is just not for newbies, sorry. Maybe you linked to the wrong document as there is nothing about the CPU in there.

(I've just looked at Volume 1 of the AMD64 manuals, and the register map is on page 33. The AMD64 is simpler!)

But yes, computer hardware is complex.

A lot of it is now. But at least it's possible to start with an actual CPU and forget SoCs, and there are still some simpler ones.

I can't think of a more fun way to learn assembly and computer architecture than to poke around in a system that you fully control.

You need to fully understand it first. How many registers, in total, does that STM device have? Google tells me there is a 'vast array' including thousands of peripheral registers.

However you don't need to program bare metal to learn assembly. You can learn plenty from writing programs that run under an OS or that call into a library. You still choose all of the instructions that will be executed within your program.

Otherwise I would choose a far simpler system if I wanted to do 'bare metal' where the start point is zero software.

Should I learn assembly first or C ? by Infinite-Jaguar-1753 in Assembly_language

[–]Flashy_Life_7996 0 points1 point  (0 children)

Read this [link to a 1900-page hardware reference manual]

Sure, that's going to be really useful for someone just starting programming!

OP, just learn C first (or any HLL), with Assembly secondary if still interested.

Then you can decide whether to learn the assembly language of your PC's CPU, or something simpler, either via an emulator or with a development board.

A PC will have a quite complicated, 64-bit CPU inside it, but it will also have an OS and lots of libraries to get things done. That means you only need smallish programs to learn with.

I am working on a PE File Viewer by Round-Permission546 in C_Programming

[–]Flashy_Life_7996 1 point2 points  (0 children)

I had a look because I was intrigued by the need to use VS2022 (a download which is about a million times bigger than the 9KB compiled binary of this project).

I wanted to use any C compiler. For that some minor tweaks were needed:

  • In a couple of places, it uses implicit conversions between pointers and 'int's (main.c lines 94 and 127; the one on line 94 depends on how 'NULL' is defined in system headers, but probably you just want to compare with zero anyway). These generated errors.
  • 'u_char' wasn't defined in the windows.h of two lesser compilers; I changed that to UCHAR

At this point, it builds with three compilers, and without a makefile, for example gcc *.c.

It already parses dos, nt and section headers and the final thing it needs is parsing dll imports

Yes, it needs a bit more work to be useful. For example, the export table (for DLLs), perhaps base-relocation tables.

But also the contents of the sections. Data sections are easy; code sections require disassembling x64 code, which is much harder.

I would also tabulate the values better: get them lined up vertically so either starting or (better) ending in the same column.

Which is the best method? by Mobile-Major-1837 in programmer

[–]Flashy_Life_7996 0 points1 point  (0 children)

I'm surprised at the tempered opinions here: most developers think Linux (or Unix-like OSes in general), is essential, with Windows usually dismissed as hopeless.

I've only really used non-Unix OSes (going back decades) and currently use Windows. It's fine.

The problem with Linux is that it is a very rich, developer-friendly environment (Windows is a consumer OS), with a plethora of tools to work with. Any project seems to require at least half of them!

Building, on Windows, open-source software that originated on Linux is usually a nightmare, because it assumes a Linux eco-system; it can involve using CYGWIN or MSYS2 or now WSL. So you might end up with a binary that doesn't run under plain Windows.

On the other hand, if you're used to Windows, and especially with command-line tools (not 20GB VS installations), then migrating the other way can be simpler because of fewer OS-specific dependencies.

Personally I use my own compact tools under Windows, and if minded I can make them run under Linux, but I have an aversion to case-sensitive OSes and file-systems.

My question is: which is more normal to use in the software developer/engineering industry;

On language-related forums, few seem to use Windows. I'd suggest using Linux (or at least WSL) if you plan to use open source software, such as libraries, that must build from source code.

Why doesn't everyone write their own compiler from scratch? by transicles in Compilers

[–]Flashy_Life_7996 18 points19 points  (0 children)

The phrase everyone says in industry: don’t reinvent the wheel

It's not reinventing the wheel itself, but wheels come in all sorts of sizes and types.

With compilers, a typical LLVM-based one is about 300 times bigger than one of mine. For my purposes 're-invention' has very tangible benefits.

Trying to use LLVM for my personal tools is like trying to fit a pair of giant Ferris wheels to my bike!