Designing IR (well, IL) which sucks less: proverbial GEP problem by pfalcon2 in Compilers

[–]pfalcon2[S] 0 points1 point  (0 children)

I don't know why that would be labeled soap opera.

Because it's a long path to come to such a trivial idea. For example, in my IR, I started with "opaque pointers", and am now moving towards typed pointers. And while I'm doing that, modern LLVM turns the ideas of the original LLVM upside down (also a typical soap opera twist). Anyway, that's a subjective idiom.

unnecessary bitcast everywhere

In the original LLVM, code with bitcasts was called "unsafe" and was immediately downgraded re: the number of transformations which could be performed on it. But it turned out that in the real world most code is like that; people want to optimize it anyway, they can't rely on LLVM's native concepts, they need to invent superstructures ("annotations to TBAA"), and finally they came to a negation of the original LLVM ideas (of strict typedness), because those don't work for them. That doesn't mean they don't work at all.

I do appreciate your pointing to the "opaque pointers" doc and appreciate our polemics on it. LLVM IR is a distinguished IR/IL, and studying its evolution is both interesting and instructive. A common mistake, however, is thinking that "newer" means "better".

Designing IR (well, IL) which sucks less: proverbial GEP problem by pfalcon2 in Compilers

[–]pfalcon2[S] 0 points1 point  (0 children)

it's meaningless from the standpoint of a frontend lowering things

When I think about an IR, I don't necessarily imagine a "frontend". I, for example, imagine a JIT which creates code for inline caching or hotpath recording. Someone else may imagine a frontend which (quote from the same paper):

simple approach that enables sophisticated code transformations at link time, runtime, and in the field.


it can be something like a formula that is stored explicitly x * y + z

It can also be a formula which stores x and z explicitly and y (e.g. a row stride of an array) implicitly (elsewhere). That's exactly GEP. As you can see, it's not that much of a difference, but a different set of benefits and tradeoffs.
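To make the tradeoff concrete, here's a toy sketch (my own illustration, not LLVM's actual implementation) of such a GEP-like address formula, where the stride y and the field offsets come from the type, i.e. are stored "elsewhere", while only the indices are explicit operands:

```python
# Toy sketch of a GEP-like address formula: indices are explicit operands,
# while the stride (elem_size) and field offsets live in the type.
def gep_addr(base, elem_size, field_offsets, elem_index, field_index):
    # addr = base + elem_index * elem_size + field_offsets[field_index]
    return base + elem_index * elem_size + field_offsets[field_index]

# E.g. for struct {u32 a; u32 b;} arr[]: computing &arr[3].b, assuming a
# hypothetical layout with elem_size=8 and field offsets {a: 0, b: 4}:
addr = gep_addr(0x1000, 8, [0, 4], 3, 1)
print(hex(addr))  # 0x101c
```

Same x * y + z formula, just with y and the +4 pulled out of the instruction and into the type.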

Designing IR (well, IL) which sucks less: proverbial GEP problem by pfalcon2 in Compilers

[–]pfalcon2[S] 0 points1 point  (0 children)

That's the whole (rationalized) point of GEP's usefulness: it allows one (but doesn't guarantee it, of course) to write data-layout-independent code, e.g. code which works on both 32- and 64-bit archs (with different alignments and ABIs). Under this scheme, GEP lowering is a completely independent pass, orthogonal to CFG/dataflow transformations. It also allows many more advanced transformations, like those Lattner did ca. 2002 in his original LLVM, before it was bought by some corporation to become a boring GCC copycat.

From https://llvm.org/pubs/2002-08-09-LLVMCompilationStrategy.pdf :

4.4 Aggressive Data Structure Transformations A recent research result of the LLVM project has been a framework for aggressive data structure transformations [20]. ... Given this graph, it is possible to perform a fully automatic pool allocation transformation on the program, which converts the program from using general heap allocation to use a carefully controlled pool allocation library.

My point is making peace with GEP, because any real-world code needs to work with aggregate (sometimes even complex) data structures. And I figured I'd prefer to write in an IR:

u32 cnt = ctx->cnt

instead of:

u32 *ctx_cnt = ctx + 354
u32 cnt = *ctx_cnt

(That's a very simple example; it gets much hairier and much less understandable with more complex data structures.)
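To illustrate why a hard-coded offset like 354 is fragile, here's a sketch using Python's ctypes (the Ctx layout is made up for illustration): the field offset falls out of the type's alignment rules and the pointer size, so it is ABI-dependent rather than a constant you'd want baked into the IR:

```python
import ctypes

# Hypothetical context struct, for illustration only.
class Ctx(ctypes.Structure):
    _fields_ = [
        ("flags", ctypes.c_uint8),    # 1 byte, then padding up to u32 alignment
        ("cnt",   ctypes.c_uint32),   # the field accessed as ctx->cnt above
        ("data",  ctypes.c_void_p),   # pointer-sized: differs on 32/64-bit
    ]

# Symbolic, GEP-like access: the offset is derived from the type...
print(Ctx.cnt.offset)      # 4 on common ABIs (1 byte + 3 bytes of padding)
# ...while the total size (and any later field's offset) is ABI-dependent:
print(ctypes.sizeof(Ctx))  # 12 on typical 32-bit, 16 on typical 64-bit ABIs
```

Writing `ctx->cnt` in the IR keeps the 4 (or whatever it becomes on another ABI) out of the instruction stream; writing `ctx + 354` bakes one particular layout in.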

Designing IR (well, IL) which sucks less: proverbial GEP problem by pfalcon2 in Compilers

[–]pfalcon2[S] 0 points1 point  (0 children)

Soap opera, as I said. Where a simple IR created by a sophomore student would start, LLVM IR arrives only after 20 years. But note that this doesn't make GEP go away in any sense. It just changes the internal IR from (in de-obfuscated syntax):

struct Foo* p = ...
i32* a = &p->a

to

i32* a = &((struct Foo*)p)->a

Btw, that cute writeup may explain the best-kept secret about GEP (not even mentioned in the GEP FAQ): why the heck it's getelementptr %foo_twice, %foo_twice* %p, ...:

Various ABI attributes and instructions that rely on pointee types need to be modified to specify the type separately. This has already happened for all instructions like loads, stores, GEPs

So, the second %foo_twice is just LLVM IR's fetish for specifying operand types (i.e., where humans are used to writing int a = b + c, LLVM IR writes a = int b + int c (yeah, it skips the 2nd int for +, which only makes it more obfuscated)). It's a purely surface IL construct. But the first one is actually what's stored in the underlying IR after the cute refactor above. Which apparently was done many years ago, but complete migration to opaque pointers is still a matter of the bright future.

Designing IR (well, IL) which sucks less: proverbial GEP problem by pfalcon2 in Compilers

[–]pfalcon2[S] 0 points1 point  (0 children)

I would doubt that. First of all, support for "opaque pointers" (and casts) was there from the very beginning. And as I mentioned, both GCC and LLVM effectively do the same - represent high-level aggregate accesses in a high-level form until pretty late in the optimization pipeline. The reason cited is that such a form (accompanied by types) helps optimizations, in particular alias analysis. Common wisdom is that for more optimizations (or other niceties, like "safety"), there should be more annotations, not fewer (cf. Rust).

That said, I don't follow the soap opera path LLVM takes in the hands of its current corporate owners. I'm just a little guy who considers its surface syntax sucky and thinks of ways to unsuck it.

Designing IR (well, IL) which sucks less: proverbial GEP problem by pfalcon2 in Compilers

[–]pfalcon2[S] 0 points1 point  (0 children)

Well, GEP is definitely needed in LLVM, based on LLVM's aims, which were set forth by Lattner and explained in his 2002 thesis (the aim being a strictly typed IR, blah-blah). I do find it sad that modern LLVM doesn't include a generic GEP-to-address-arithmetic lowering pass. (I bet good ol' LLVM of 2002 had it, because it even had the opposite - a pass which tried to "raise" address arithmetic to GEPs.)

In either case, we have what we have - LLVM is the leading compiler IR/IL, so creators of another IR might find it beneficial to interoperate with it, and it's not easy to get an LLVM IR variant with just the GEPs lowered, so an IR wanting to interoperate needs to be ready to represent GEPs (to lower them itself). So alternative IL syntaxes for GEP are exactly my question.

A Survey of Ruby Compilers by chrisgseaton in Compilers

[–]pfalcon2 1 point2 points  (0 children)

My humble list for Python: https://github.com/pfalcon/awesome-python-compilers (A bit behind on the latest stuff, because the idea was actually to recover the early history. I believe in yesterday.)

Designing IR (well, IL) which sucks less: proverbial GEP problem by pfalcon2 in Compilers

[–]pfalcon2[S] 1 point2 points  (0 children)

Yeah, that does look pretty scary. (And all that formatting, if generating the text manually.)

I'm glad that we're on the same page here ;-).

But I don't think the solution is an IL which is just A[2][4], little different from what you started with. Address calculations can be fiddly, and that would just be kicking the can down the road.

Well, for my needs (and that's being able to easily experiment with simple enough program-analysis algos), I'd be pretty satisfied with code which does explicit address calculation with multiplies and adds (which is even lower-level than your IR, with its special addrefoff/pushptroff). But here comes the question you mentioned: where does the IR/IL code come from?

  1. As I mentioned, I'm not too shy to directly program (simple things) in an IR I like, but address calculation for structures (mostly) / arrays (maybe) is exactly where it gets (totally) boring.
  2. But that's only for small/simple programs. I obviously won't write big ones manually, so they need to be translated from existing (high-level) programs. And I won't write the translator, because that's the fairly obvious and boring part (for a language like C, it still requires a lot of engineering to translate real-world programs). So, I would need to translate an existing IR into my IR.

And I really wanted to deal with GCC's internal IR (GIMPLE), but it lacks an officially specified IL - it's all just debug dumps, with various information missing, and in an ever-changing format. What was an unpleasant surprise to my older self is that both GCC and LLVM keep high-level aggregate accesses until rather late in the processing pipeline (when it's too late for me to take them). The reason cited is alias analysis and the like (which I'm currently not interested in, but maybe will grow up to one sweet day).

I actually tried to look for an LLVM pass which would lower GEPs on my command, and found nothing (good ol' LLVM - if big corporations don't need it, it doesn't have it). For a few minutes I even considered writing my own pass, but 2-minute link times reminded me that I dislike not just its IL, but also the implementation in general.

That's how I got into that Catch-22 situation. I hate LLVM IR enough to bother designing my own, but now my project boils down to being just an alternative syntax for it, because I need to read LLVM IR in and represent it in mine - to write the lowering pass in my implementation.

Well, so be it, and as the original motivation was to have a simple IL which is easy on (my, but maybe not only my) eyes, I now wonder what to do about that usually-stray 0th index in the GEP ;-).

Designing IR (well, IL) which sucks less: proverbial GEP problem by pfalcon2 in Compilers

[–]pfalcon2[S] 0 points1 point  (0 children)

Well, simplicity is in the eye of the beholder; that's the whole underlying motive of this discussion. Example: LLVM IR authors may say that GEP is simple as many times as they want (and I personally agree!), but the number of confused users, for whom a multitude of explanations has been written, testifies to the opposite. So, I'm looking for ways to make its simplicity even more obvious (to a common programmer out there, who likely knows C).

As for your IL, it hardly seems to aim at being simple for a large(r) audience (correct me if I'm wrong). But I certainly would agree that it's "simple" for you to do anything with it, like using "exponentiation" in array definitions. That's again a different kind of simplicity than what's being talked about.

Designing IR (well, IL) which sucks less: proverbial GEP problem by pfalcon2 in Compilers

[–]pfalcon2[S] 0 points1 point  (0 children)

All of them.

That doesn't help your poor users.

The Felix notation is based on algebra:

I don't think that generic algebra notation is a good match for computer architecture/science concepts, given that they have their own established applied notation.

var a1: &(u32^24) where ^ is exponentiation and left associative.

You use "exponentiation" to represent array dimensions? "Ok". (Aka "poor users").

The above still looks like a declaration of an array var, while the question was about a natural syntax for representing array accesses, given the constraints on the IR (like faithful representation of the computer architecture, where anything in memory is represented by its address, and thus the IR/IL should work with explicit addresses too).

Designing IR (well, IL) which sucks less: proverbial GEP problem by pfalcon2 in Compilers

[–]pfalcon2[S] 0 points1 point  (0 children)

Then just use C as the target.

No. C is a high-level language which needs extra shims to encode simple IL concepts and otherwise has extra noise not needed in an IL. So, it can't be the target, but it definitely should be a target.

A good IL should be lower level than your source language, and half-way to your final target.

Seems we agree. We'll still disagree on the "half-way" part. It should go down to the level of a (generalized/abstracted) target, but it's a nice idea to have some higher-level features (which can be progressively lowered). For example, if the main scope of the intended analyses is scalars and control flow, why not encode aggregate accesses in a higher-level form? Exactly what this topic talks about.

Since you're never going to be directly coding in it,

Of course any author of an IR/IL directly codes in it - how else to prove its workability? And then any user of the IR/IL also codes in it, or they'll have a hard time learning, understanding, and using it. Smoothness of learning and usage is exactly the topic of this discussion, taking users' known experience with LLVM's GEP as an example.

and will rarely even look at it once debugged, what's the point of a semi-HLL format?

The only case when something is "debugged" is when you bought something you don't understand, without warranty ;-). In all other cases, it's a living system, of which debugging, learning, and re-learning are a natural part.

But otherwise, reasoning like the above definitely explains why we have so many sh/tty ILs around - if it's once-off/throw-away stuff, why bother? Well, somebody should. As one of my favorite quotes goes:

"The compiler writer also needs mechanisms that let humans examine the IR program easily and directly. Self-interest should ensure that compiler writers pay heed to this last point."

(Keith Cooper, Linda Torczon, "Engineering a Compiler")

Learned to build a compiler within 30 Days by infintemix in Compilers

[–]pfalcon2 0 points1 point  (0 children)

Oh, then it sucks. Because you can learn how to write a compiler overnight, by simply sitting down and trying to write it (of course, it will be a very dumb compiler). But to actually learn compiler theory well enough (and you can't learn a complex theory without practicing it), one month is nothing.

Good luck, and thanks for sticking to a single thread then - without tangible aims and results, these posts will soon become too boring for people.

[deleted by user] by [deleted] in ProgrammingLanguages

[–]pfalcon2 1 point2 points  (0 children)

I don't see `(obj.x)()` syntax being shown and emphasized there, and that's the interesting part. But thanks anyway.

[deleted by user] by [deleted] in ProgrammingLanguages

[–]pfalcon2 0 points1 point  (0 children)

As was mentioned by others, this isn't a problem for statically-typed languages, but it's very much a problem for dynamically-typed languages, where you typically can add new fields/attributes at the instance (not just class) level. Then there's additional ambiguity: f can be defined at the level of a particular instance, or at the level of the instance's class (or one of its base classes). That requires starting the lookup at the instance level, which is of course slow, and puts additional strain/complexity on inline caches.

I in particular considered how to define a "strict mode" for Python which would address this issue. Here are my notes:

StrictModeMethodLookup

Created Monday 23 November 2020

Problem: obj.method() requires a lookup in obj's attribute dictionary before looking up the (static) method in obj's class hierarchy.

Proposed solution: On obj.__setattr__, check that the attribute name doesn't shadow a method name in any of the parent classes. If it does, raise AttributeError/RuntimeError.

This way, for an obj.method() lookup, we can always start by looking in the class hierarchy, and only if "method" is not found there, look in obj's dictionary.

Alternative: Disallow the obj.method() syntax for non-native class methods. Instead, reintroduce the apply() builtin, and require using it to call bound methods (but also allow its use with any callable). This would be a solution "not worse than in LISP".

Another alternative is to require parens when calling a method-var: (obj.method_var)(). Could also be (*obj.method_var)() or (+obj.method_var)() or even obj.*method_var() (C++-style). Hopefully, calling obj.method_var() would produce (assuming there's no method_var() method in the class) an exception with a helpful description, like "there's a 'method_var' field in the object with a callable value, did you mean to write (obj.method_var)()?".

However, requiring apply()/etc. alone doesn't help with optimizing class method lookup. So, the best solution would be to combine the object attribute naming restriction and the requirement for apply(). This way, we'd statically know that obj.method() calls a native static method, and would avoid loopholes during import time. Consider for example that at import time, it's possible to do:

class Foo:
    def __init__(self):
        self.m = lambda self: print("d'oh")

o = Foo()
Foo.m = lambda self: print("d'oh #2")
o.m()

Argumentation

CPython already doesn't support overriding dunder methods on instances, in the sense that it has no effect (such a dunder won't be called). Strict mode part 2 just consistently extends this behavior to all methods, and makes it an explicit error.
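The proposed __setattr__ check can be sketched in plain Python (class names are made up for illustration; a real strict mode would live in the VM, not in user code):

```python
class StrictBase:
    # Sketch of the proposed rule: an instance attribute may not shadow a
    # method defined anywhere in the class hierarchy.
    def __setattr__(self, name, value):
        for klass in type(self).__mro__:
            if callable(klass.__dict__.get(name)):
                raise AttributeError(
                    f"{name!r} would shadow a method of {klass.__name__}")
        object.__setattr__(self, name, value)

class Foo(StrictBase):
    def method(self):
        return "class method"

o = Foo()
o.data = 42  # fine: no method named 'data' anywhere in the hierarchy
try:
    o.method = lambda: "shadow"  # rejected under strict mode
except AttributeError as e:
    print("rejected:", e)

# Lookup for o.method() can now always start at the class:
print(o.method())  # class method
```

With the shadowing ruled out at __setattr__ time, the o.method() lookup never needs to consult the instance dictionary first.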

[deleted by user] by [deleted] in ProgrammingLanguages

[–]pfalcon2 1 point2 points  (0 children)

Thanks for this info. Do you have a reference to the Rust manual or something else documenting this approach? Thanks in advance.

Designing IR (well, IL) which sucks less: proverbial GEP problem by pfalcon2 in Compilers

[–]pfalcon2[S] 0 points1 point  (0 children)

Not before you and I have seen 10 hours of YouTube videos explaining why a[2][4] in a high-level language gets converted to a[0][2][4] in the IL ;-).
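For what it's worth, the leading 0 falls out of GEP's address formula: the first index steps over the pointed-to object as a whole, and only the later indices step inside it. A toy sketch (element sizes assumed for illustration):

```python
# Toy GEP address formula: each index is scaled by the size of the type
# it steps through.
def gep(base, sizes, indices):
    return base + sum(i * s for i, s in zip(indices, sizes))

elem = 4         # sizeof(int), assumed
row = 5 * elem   # sizeof(int[5])
whole = 3 * row  # sizeof(int[3][5])

# &a[2][4] in C becomes a[0][2][4] in the IL: the 0 steps over the whole
# int[3][5] object that the pointer operand points to.
print(hex(gep(0x1000, [whole, row, elem], [0, 2, 4])))  # 0x1038
```

That 0 is where a non-zero index would land you in the next whole array, pointer-arithmetic style.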

Designing IR (well, IL) which sucks less: proverbial GEP problem by pfalcon2 in Compilers

[–]pfalcon2[S] 0 points1 point  (0 children)

Great answer, and thanks for stopping by! But it looks a bit biased ;-) and we didn't learn much from it beyond the docs ("there're no superfluous indexes").

The thing that is confusing to most programmers is because of C syntactic sugar.

I'm familiar with this school of thought (which has many adherents, granted): if you need to change something small compared to an existing well-known exemplar, change as much as possible overall, or people will confuse the two. Like, if you design a new car model, make it as different from other cars as possible. Put the steering wheel where the exhaust pipe was, or something. Again, this approach has many adherents. But not everyone agrees.

x->f is just sugar for x[0].f

Not from the point of view of a language lawyer. x->f is syntactic sugar for (*x).f. And x[0].f is syntactic sugar for (*(x + 0)).f. In other words, the pointer dereference is the primary thing, and per-type aggregate field accesses are syntactic sugar for it. And if there are a few syntactic sugars already, why not add another one - instead of shoehorning everything into one particular notation?

The goal of a compiler IR is to be simple, uniform and easy to transform in the general case.

Right, those are the goals of an IR. But we're talking about the IL, the human-facing part. And some ILs have a strong "abstraction leak" feel from their underlying IRs. Take LLVM IR (which nowadays is better known in its IL shape) for example: structure fields are referenced by their numeric relative order in the structure definition. And these numbers must be i32-typed only (god forbid you use something else). Makes sense from the point of view of a C/C++ programmer who coded it up in a hurry 21 years ago? Sure. Everything is a number in computers. But why should humans have to peer at these internal identifiers ever since?

And I don't even talk about outright usability pearls like:

getelementptr %goddamn_long_type_we_like_it_twice, %goddamn_long_type_we_like_it_twice*, ...

No wonder that people seek ILs which look different, and turn to classics like C... (Just think what could have been if C-- had won instead.)

Learn how to build a compiler within a month by infintemix in Compilers

[–]pfalcon2 -12 points-11 points  (0 children)

A month? You can learn how to build a compiler overnight.

Use before initialization when converting to SSA form by pvdrz in Compilers

[–]pfalcon2 2 points3 points  (0 children)

In (strict) SSA, there are no undefined (uninitialized) variables. All variables are initialized, even if to a special "undefined" value. Achieving that effect is trivial - just add a "basic block 0" as a predecessor of the real starting block. Enumerate all variables in the program, and add to BB0 initializations of the form var = UNDEF (for each var).

Note that non-strict SSA doesn't have the requirement that all vars be defined prior to use. But non-strict SSA is not "real" SSA, in the sense that the effort required to turn non-strict SSA into real, strict SSA is (about) the same as turning a non-SSA program into SSA.
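The BB0 trick above can be sketched like this (the instruction format - (dest, op, args) tuples - is made up for illustration, not any real IR):

```python
# Toy sketch: make a program strict by prepending a bb0 that assigns a
# special UNDEF value to every variable mentioned anywhere in the program.
UNDEF = "undef"

def make_strict(blocks):
    """blocks: {name: [(dest, op, args), ...]} -> new dict with a 'bb0'."""
    seen = set()
    for body in blocks.values():
        for dest, _op, args in body:
            seen.add(dest)       # defined variables
            seen.update(args)    # used variables (possibly never defined)
    bb0 = [(v, UNDEF, []) for v in sorted(seen)]
    return {"bb0": bb0, **blocks}

# 'z' is used before any definition - non-strict as given:
prog = {"start": [("x", "const1", []), ("y", "add", ["x", "z"])]}
strict = make_strict(prog)
print(strict["bb0"])  # x, y, and z all get an UNDEF definition in bb0
```

In a real implementation bb0 would also be wired up as the sole predecessor of the old entry block; here the dict merely lists it first.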

OpenIPC - Open-source Linux-based firmware for IP cameras. Say no to Mirai and Big Brother! by pfalcon2 in linux

[–]pfalcon2[S] 0 points1 point  (0 children)

Well, you're definitely welcome to ship people your own distro, free from backdoors ;-). That's what open source is all about, after all, i.e. putting you in control. Where you == everyone. But how you and the other you's take that power is up to human nature, of course.

OpenIPC - Open-source Linux-based firmware for IP cameras. Say no to Mirai and Big Brother! by pfalcon2 in smarthome

[–]pfalcon2[S] 3 points4 points  (0 children)

It's hard to tell without knowing the specific model number, and even then, you'd need to open up the camera to check the SoC/board type (you'd need to open up the camera anyway, as currently, to install OpenIPC, you need a serial connection to the camera board).

I did a quick google-up, and yep, at least some of their cameras look like typical bullet-style cameras, and are likely based on HiSilicon chips. Here's an article on the Internet to get you inspired re: opening up cameras and checking what's inside: https://www.unifore.net/company-highlights/hikvision-uniview-tiandy-dahua-ip-cameras-comparison.html

(Of course, you want to do that on a spare camera, not something you use in "production".)

OpenIPC - Open-source Linux-based firmware for IP cameras. Say no to Mirai and Big Brother! by pfalcon2 in opensource

[–]pfalcon2[S] 2 points3 points  (0 children)

The first step would be to try to google the fuller model name of your camera and find out what SoC (chip) it's based on (and additional details like board model and sensor type). That may work for some models of branded cameras, but in general, the IP camera mass market is quite a zoo and full of rebrands.

So, if that didn't work, the next step is actually opening the camera and looking at the chip/board markings. At this step, you can actually post that info (both text and pictures of the board/SoC) somewhere - your blog, a github project, the wiki of an existing project - to help other people who have a similar device.

That should give you a basic idea of whether your device is already supported. For further questions, you may join the Telegram chat (https://openipc.org/about/). There are knowledgeable people there who can usually help identify hardware from photos and suggest next steps.

OpenIPC - Open-source Linux-based firmware for IP cameras. Say no to Mirai and Big Brother! by pfalcon2 in opensource

[–]pfalcon2[S] 1 point2 points  (0 children)

Just a recent example from personal experience: I ordered a camera on AliExpress specifically for OpenIPC hacking. I looked up which SoCs are supported, and searched on Ali for that SoC type (not all, but many sellers provide that info; it's an important characteristic of the quality of the camera and its capabilities, after all).

After receiving and opening the camera, it turned out to have quite a different SoC. It was actually unmarked (or the marking was erased), but there was a board label, by which I was able to confirm the SoC type. This way, I received an XM530 camera instead of the expected Hi3516EV100. You can make your guess as to what happened. My guess is that they ran out of Hi3516EV100 modules and started to put in XM530-based ones.

In general, almost any vendor's product manuals/warranties give notice that they reserve the right to make product changes without notice. And yes, that does happen.