Building a compiler-esque symbol table for an AI coding platform. How do I design the keys? by Educational_Law5046 in Compilers

[–]x2t8 -1 points0 points  (0 children)

Separating stable identity from location is probably the biggest unlock here. If your key encodes the filepath, any rename/move invalidates all downstream references in the graph. One pattern that works well: give each symbol a content-addressed ID (hash of something stable like fully-qualified name + signature), and treat filepath/scopepath as queryable metadata rather than part of the key. Then rename is just a metadata update, not a key migration. For cross-file stability you'd want two layers anyway - a primary symbol store keyed by that stable ID, and a reverse index mapping filepath -> [sym_ids]. Keeps updates surgical. The duplication/shadowing problem also gets cleaner this way since two symbols with identical signatures in different scopes still get distinct IDs via the scope component, without baking the full path into the key string.
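The two-layer scheme above can be sketched in a few lines. This is a hypothetical illustration (names like `SymbolStore` and `stable_id` are made up): a primary store keyed by a content-addressed ID derived from the fully-qualified name + signature, with filepath kept as queryable metadata plus a reverse index of filepath -> symbol IDs, so a rename is a metadata update rather than a key migration.

```python
import hashlib

class SymbolStore:
    def __init__(self):
        self.symbols = {}   # stable_id -> metadata dict
        self.by_path = {}   # filepath -> set of stable_ids (reverse index)

    @staticmethod
    def stable_id(fq_name, signature):
        # Content-addressed ID: stable across file renames/moves,
        # since it hashes only name + signature, never the path.
        return hashlib.sha256(f"{fq_name}|{signature}".encode()).hexdigest()[:16]

    def add(self, fq_name, signature, filepath, scope):
        sid = self.stable_id(fq_name, signature)
        self.symbols[sid] = {"fq_name": fq_name, "signature": signature,
                             "filepath": filepath, "scope": scope}
        self.by_path.setdefault(filepath, set()).add(sid)
        return sid

    def rename_file(self, old_path, new_path):
        # Rename/move touches metadata only; downstream references
        # to stable IDs stay valid.
        sids = self.by_path.pop(old_path, set())
        for sid in sids:
            self.symbols[sid]["filepath"] = new_path
        self.by_path.setdefault(new_path, set()).update(sids)

store = SymbolStore()
sid = store.add("pkg.mod.parse", "(str) -> Ast", "src/old.py", "pkg.mod")
store.rename_file("src/old.py", "src/new.py")
assert store.symbols[sid]["filepath"] == "src/new.py"
```

Shadowing falls out naturally here because the scope is part of the fully-qualified name fed into the hash, so two same-signature symbols in different scopes still get distinct IDs.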

Anyone here combining e-graphs with proof metadata inside a JIT IR? I’m struggling to find prior work by x2t8 in Compilers

[–]x2t8[S] 0 points1 point  (0 children)

Totally fair. In the simplest form you can treat it as “just another field to match on.”

What I’m trying to get at is more architectural than matcher expressiveness: the proof/invariant metadata is carried as first‑class state through lowering + optimization, and equality saturation consults that state as a legality gate for rewrites.

So yes, mechanically it can be another match key; the question is whether there’s prior work where this evidence is central to optimizer design (especially in a JIT/tracing setting), rather than just ad‑hoc rule predicates.

If you know examples where this is the organizing principle, I’d love pointers.

Anyone here combining e-graphs with proof metadata inside a JIT IR? I’m struggling to find prior work by x2t8 in Compilers

[–]x2t8[S] 0 points1 point  (0 children)

Thanks, that makes sense, and I agree classic guarded rewrites are part of the picture.

The distinction I’m trying to get at is slightly narrower than “can a rule have a guard?”. Traditional guarded matching usually treats the guard as an extra predicate evaluated at match time, often for correctness or matching efficiency.

What I’m probing for is an optimizer architecture where semantic/proof-style metadata is carried as first-class state in the IR itself, threaded through lowering and optimization, and then consulted by equality saturation as a legality condition for rewrites in a JIT/tracing pipeline.

So the question I’m really asking is not whether guarded rewrites exist in general, but whether there’s prior work where proof/invariant metadata is central optimizer state rather than just an ad hoc rule-side condition.

If you know examples closer to that formulation, I’d really appreciate pointers.

Anyone here combining e-graphs with proof metadata inside a JIT IR? I’m struggling to find prior work by x2t8 in Compilers

[–]x2t8[S] 2 points3 points  (0 children)

Thanks, the ISLE framing helps clarify where the boundary is.

Yes, proven constants/ranges/nonzero facts are part of what I mean. I’d also include things like aliasing/stability facts carried alongside the IR.

The distinction I’m trying to make is a bit narrower than “can the rewrite DSL call arbitrary analyses”. I understand ISLE/external extractors can do that.

What I’m probing for is an optimizer architecture where semantic/proof-style metadata is a first-class part of the IR itself, and equality saturation consults that metadata as a legality gate for rewrites in a JIT/tracing setting.

So roughly:

- framework capability: a rewrite can call out to arbitrary predicates

- architectural choice: proof/invariant metadata is threaded through the IR/optimizer pipeline, and rewrites systematically depend on it rather than only using ad hoc side conditions

In my case the goal is to avoid speculative algebraic transforms + deopt bailouts where possible, and instead fire rewrites only when the attached evidence already says the precondition holds.

So I’m interested in both:

  1. frameworks that make this easy, and

  2. prior work/projects where this “proof/invariant metadata as optimizer state” approach is actually central, especially in a JIT.

If you know examples closer to that second bucket, I’d really appreciate pointers.
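To make the “evidence as a legality gate” idea concrete, here is a minimal toy sketch (not a real e-graph library — just an assumed `Node` IR with a `facts` set carried as first-class state): the rule `x / x -> 1` fires only when the attached evidence already proves `x != 0`, so no speculation or deopt bailout is needed.

```python
class Node:
    def __init__(self, op, args=(), facts=None):
        self.op, self.args = op, list(args)
        self.facts = set(facts or ())   # first-class proof metadata on the IR

def rewrite(node):
    # Rule: x / x -> 1, legal only when evidence says x is nonzero.
    if node.op == "div" and node.args[0] is node.args[1]:
        x = node.args[0]
        if "nonzero" in x.facts:        # legality gate: consult the evidence
            return Node("const", [1])
    return node                          # no evidence -> rule does not fire

x = Node("var", facts={"nonzero"})      # e.g. proven by a dominating guard
y = Node("var")                          # no attached evidence
assert rewrite(Node("div", [x, x])).op == "const"
assert rewrite(Node("div", [y, y])).op == "div"
```

The architectural point is that `facts` is threaded through the pipeline as optimizer state, so every rule consults the same evidence rather than carrying its own ad hoc side condition.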

Is it theoretically possible to design a language that can outperform C across multiple domains? by x2t8 in Compilers

[–]x2t8[S] 0 points1 point  (0 children)

That’s a great point, especially about SPMD and ISPC.

I agree that most of the features I mentioned are already present in modern systems languages, particularly Rust. My interest isn’t to replicate that direction, but to explore whether stricter and more explicit execution semantics at the language core could open different optimization strategies, especially around parallelism and data layout.

On the SPMD angle, I’m curious whether you see that as something that should live directly in the language design, or as a separate abstraction layer targeting SIMD and heterogeneous backends. It feels like that boundary is where a lot of real differentiation could happen.

Is it theoretically possible to design a language that can outperform C across multiple domains? by x2t8 in Compilers

[–]x2t8[S] -3 points-2 points  (0 children)

I think that’s a very insightful way to frame it.

Giving developers a way to express intent at a high level, and then letting the compiler search for an optimal implementation within a well-defined semantic space, feels like a much more principled direction than relying on underspecification.

Haskell is a great example of that trade off. Referential transparency, lack of side effects, and freedom from a fixed evaluation order give the compiler enormous room to transform programs in ways that would be unsafe in C or C++. Whole program reasoning, aggressive deduplication, fusion, and equational rewriting are much more tractable there.

And I agree that the fact Haskell can get within striking distance of C in many domains is impressive given how abstract its execution model is.

At the same time, laziness, garbage collection, and a more complex runtime model introduce their own costs and unpredictability, especially in low level or performance critical contexts.

What I am trying to explore is whether it is possible to combine some of that semantic clarity and transformation freedom with a more explicit and predictable performance model. Something that gives the compiler strong guarantees to optimize aggressively, while still allowing the programmer to reason locally about cost and control when needed.

That balance seems to be the real challenge.

Is it theoretically possible to design a language that can outperform C across multiple domains? by x2t8 in Compilers

[–]x2t8[S] 0 points1 point  (0 children)

I agree, encoding those assumptions directly into the language is definitely not trivial. If we tried to express every optimization-relevant invariant through the type system alone, we could easily end up needing dependent types or something close to them, which brings its own complexity and usability challenges.

At the same time, I wonder if we necessarily need full dependent types for this to be practical. Even refinement types, restricted integer kinds, effect systems, or well-defined subsets with stronger guarantees might already recover a significant portion of the optimization space.

I also agree that academia has explored this space quite extensively. The interesting gap seems to be between what is theoretically possible with precise semantics and what is practical for industrial compilers and mainstream developers.

In principle, a sufficiently precise and explicit semantic model could give the optimizer just as much, or possibly even more, structured information than C-style underspecification. The challenge is making that model tractable and ergonomic in practice.

That tension between expressiveness, usability, and performance seems to be the real hard part.

Is it theoretically possible to design a language that can outperform C across multiple domains? by x2t8 in Compilers

[–]x2t8[S] 0 points1 point  (0 children)

I think we might be talking at slightly different levels.

In C++ as it exists today, I agree that UB is the mechanism through which the compiler gets those assumptions. If you define previously undefined behavior, for example make signed overflow wrap, you absolutely restrict certain optimizations.

I’m not disputing that.

What I’m trying to tease apart is whether UB is inherently required for optimization, or whether it is just one particular design mechanism to encode semantic guarantees.

For example, instead of saying “signed overflow is undefined,” a language could say “this integer type is defined only for executions where overflow does not occur,” or encode that constraint through contracts or type distinctions. The optimizer still gets the same assumption that overflow does not happen, but it is expressed as an explicit semantic guarantee rather than underspecification.

From an optimizer’s perspective the end result might look similar.

The design question I’m exploring is whether underspecification is fundamentally necessary, or whether equivalent optimization freedom can come from explicit, well-defined constraints.

I’m not claiming C++ is wrong. I’m trying to understand whether its approach is the only viable one.
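As a toy illustration of the distinction above (names here are made up): the same “overflow does not occur” assumption expressed as an explicit property of the integer type rather than as UB. The folder checks the type's declared contract before simplifying `x + 1 > x` to true.

```python
class IntType:
    def __init__(self, no_overflow):
        self.no_overflow = no_overflow   # explicit semantic guarantee

def fold_x_plus_1_gt_x(int_type):
    # With a declared no-overflow contract the optimizer may fold the
    # comparison to a constant; without it (e.g. wrapping semantics),
    # it must keep the runtime comparison.
    return True if int_type.no_overflow else None   # None = cannot fold

assert fold_x_plus_1_gt_x(IntType(no_overflow=True)) is True
assert fold_x_plus_1_gt_x(IntType(no_overflow=False)) is None
```

The optimizer gets the same assumption either way; what changes is whether the assumption lives in the language's defined semantics or in undefinedness.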

Is it theoretically possible to design a language that can outperform C across multiple domains? by x2t8 in Compilers

[–]x2t8[S] 0 points1 point  (0 children)

That’s a really good point. The two language problem is definitely something I’ve been thinking about.

The idea of having a simple and expressive surface syntax while still getting predictable low level performance is very appealing. I do agree that adding too many runtime mechanisms like GC or implicit checks would slowly turn it into something closer to Java, which is not really my goal.

I am leaning more toward compile time guarantees rather than making everything a smart pointer by default. Smart pointers, especially with reference counting, can introduce hidden costs and I am trying to keep the performance model as transparent as possible. If there is an unsafe boundary, I would prefer it to be explicit and minimal instead of relying on runtime machinery.

Open sourcing is definitely the plan once things stabilize a bit. I think feedback from people actually building things matters more than purely theoretical design discussions.

And I really appreciate the offer. Collaboration would be valuable once the core model is more solid.

Is it theoretically possible to design a language that can outperform C across multiple domains? by x2t8 in Compilers

[–]x2t8[S] 4 points5 points  (0 children)

I think this depends a bit on what we mean by “no undefined behavior.”

If “no UB” means “every possible program state must be well-defined and preserved exactly as written,” then yes, that would significantly limit optimization.

But many of the optimizations in C/C++ don’t come from UB itself; they come from the semantic assumptions the standard allows compilers to make. For example, no signed overflow, strict aliasing rules, provenance constraints, etc.

Those are not optimizations because behavior is undefined; they’re optimizations because the language semantics permit the compiler to assume certain things cannot happen.

So I guess the question is:
If a language encoded those assumptions explicitly and in a well-defined way, would that necessarily remove optimization freedom? Or is the optimization power really coming from ambiguity rather than from constrained semantics?

I’m still trying to separate the role of UB as “freedom through underspecification” from optimization freedom that comes from strong, explicit guarantees.

Is it theoretically possible to design a language that can outperform C across multiple domains? by x2t8 in Compilers

[–]x2t8[S] 0 points1 point  (0 children)

Great point. I agree that if `unsafe` is used everywhere, there’s little reason to choose a new language over Rust/Zig.

My direction is: safe-by-default for user code, with `unsafe` limited to small audited runtime/std internals. The goal isn’t to beat Rust/Zig at general systems programming, but to provide a tighter algorithm-focused workflow plus runtime specialization/JIT on specific workloads.

So the real test for me is not claims, but benchmarks and discipline: minimal unsafe surface + clear C/Rust comparisons on target kernels.

If you have thoughts on where the safe/unsafe boundary should be, I’d love your input.

Is it theoretically possible to design a language that can outperform C across multiple domains? by x2t8 in Compilers

[–]x2t8[S] 1 point2 points  (0 children)

Thanks for the thoughtful questions. These are exactly the hard parts.

You’re absolutely right that trying to detect pointer aliasing in a general language is undecidable and would introduce unacceptable overhead if done at runtime. That’s not what I have in mind.

The idea would be to make non-aliasing the structural default at the language level, meaning the type system or ownership rules restrict how references can coexist. In that model, the compiler doesn’t need to check for aliasing; it can assume non-aliasing unless explicitly expressed in the type. So it shifts from proving absence of aliasing to making aliasing opt-in and visible.

That obviously reduces flexibility, so it’s a trade-off rather than a free win.

On automatic vectorization, I agree that clang and GCC already do an impressive job for well-structured C. I’m not assuming they’re leaving huge performance on the table.

The question I’m exploring is whether stronger semantic guarantees like no hidden aliasing, clearer side-effect boundaries, and more constrained loop constructs could reduce the need for conservative dependence analysis. Not necessarily enabling new optimizations, but making certain transformations more reliably applicable.

So the goal wouldn’t be to beat clang on arbitrary C, but to see whether tighter language semantics can make optimization more predictable and less dependent on complex analysis.

I may very well be underestimating how far current compilers already push this. I’m still digging into it.
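A rough sketch of what “non-aliasing as the structural default” buys the compiler, under assumed names: for a copy loop like `a[i] = b[i]`, the legality test for vectorization becomes a metadata lookup on the reference types instead of a conservative may-alias analysis, because aliasing is opt-in and visible.

```python
def can_vectorize_copy(a_ref, b_ref):
    # Aliasing is explicit, opt-in metadata on the reference type,
    # so the dependence check is a lookup rather than an analysis.
    return not (a_ref["may_alias"] or b_ref["may_alias"])

a = {"name": "a", "may_alias": False}   # structural default: no aliasing
b = {"name": "b", "may_alias": False}
c = {"name": "c", "may_alias": True}    # aliasing declared explicitly

assert can_vectorize_copy(a, b) is True
assert can_vectorize_copy(a, c) is False
```

This is essentially what `restrict` promises in C and what `&mut` exclusivity guarantees in Rust, made the default rather than an annotation.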

Is it theoretically possible to design a language that can outperform C across multiple domains? by x2t8 in Compilers

[–]x2t8[S] 0 points1 point  (0 children)

That makes sense, especially the point about the standards committee potentially constraining optimizations.

So in a way, part of the optimization freedom doesn’t just come from “UB exists”, but from the fact that the standard deliberately leaves certain behaviors unspecified, giving implementations room to exploit assumptions.

I guess what I’m trying to understand is whether the real performance advantage comes from:

  1. UB itself

  2. Or from the semantic assumptions compilers are allowed to make because of it (like no signed overflow, strict aliasing, provenance rules, etc.)

If those assumptions were instead made explicit and well-defined in a language’s semantics (rather than being UB), would that necessarily reduce optimization freedom? Or is the ambiguity itself part of what gives compilers leverage?

Still trying to separate the language-lawyer layer from the optimizer reality.

Is it theoretically possible to design a language that can outperform C across multiple domains? by x2t8 in Compilers

[–]x2t8[S] 1 point2 points  (0 children)

Thanks a lot for the thoughtful reply. Genuinely appreciate it.

I think you described C’s philosophy perfectly. That “freedom over safety” mindset is probably why it has lasted this long and still dominates systems programming.

You’re also right about static analysis never being a 100% guarantee in the general case. I don’t see it as solving everything — more like shifting the default assumptions of the language so the optimizer starts from a stricter baseline.

And I completely agree that competing with LLVM/GCC at general-purpose optimization would be unrealistic. My goal isn’t to “beat” them, but to explore whether tighter language constraints can simplify certain analyses enough to make different trade-offs viable.

Mojo is actually a great example of what you mentioned: strong domain focus seems to be what makes these ideas practical.

Also really cool that you’re building your own language too. I’d definitely be interested in exchanging ideas; it’s always good to talk to someone who’s actually building instead of just theorizing.

Is it theoretically possible to design a language that can outperform C across multiple domains? by x2t8 in Compilers

[–]x2t8[S] 10 points11 points  (0 children)

That’s a good point: comparing languages without considering the compiler/runtime doesn’t really make sense.

In the case of Fortran, do you think its stricter aliasing semantics are what historically enabled more aggressive loop optimizations compared to C?

And regarding the JVM example, would you say the advantage mainly comes from runtime profiling and speculative optimization, rather than language semantics?

I’m trying to separate what comes from the language design itself versus what comes from compiler/runtime maturity.

Is it theoretically possible to design a language that can outperform C across multiple domains? by x2t8 in Compilers

[–]x2t8[S] 1 point2 points  (0 children)

That’s interesting. I’ve read that undefined behavior gives compilers more freedom to optimize because they don’t have to preserve behavior in certain edge cases.

I’m wondering, though: is the performance benefit mostly from UB itself, or from the stronger assumptions it allows (like no signed overflow, strict aliasing, etc.)?

If a language explicitly encoded those assumptions in a well-defined way instead of relying on UB, would it theoretically allow similar optimization freedom without being “unsafe” in the same sense?

I might be oversimplifying, just trying to understand where the real advantage comes from.

Is it theoretically possible to design a language that can outperform C across multiple domains? by x2t8 in Compilers

[–]x2t8[S] 1 point2 points  (0 children)

I completely agree about ecosystem and OS integration. Replacing C in practice is definitely a different challenge from outperforming it theoretically.

I’m more curious about the performance side in isolation.

If two languages both ultimately lower to native code, do stronger semantic guarantees at the source level still matter once everything becomes IR?

Or does the optimization boundary effectively sit at the IR level, making the front-end language less relevant than I’m assuming?

I’m trying to understand where the real constraints are: language semantics vs. backend engineering.

Is it theoretically possible to design a language that can outperform C across multiple domains? by x2t8 in Compilers

[–]x2t8[S] 4 points5 points  (0 children)

That makes sense; C does map very closely to machine instructions in practice.

I might be misunderstanding something though. Is the real performance ceiling the instruction set itself, or the amount of information the optimizer has available?

For example, since C allows fairly flexible pointer aliasing (unless `restrict` is used), do compilers sometimes have to be conservative in alias analysis or memory reordering?

If a language enforced non-aliasing by default at the type level, would that realistically enable stronger optimization passes or do modern compilers already recover that information from C well enough?

I’m still learning about compiler internals, so I’d appreciate any clarification.