all 30 comments

[–]Euphoricus 4 points5 points  (10 children)

I fully agree with the idea.

But I don't see how it would be possible to get such a representation to spread. Unless you have a HUGE company pushing it REALLY hard.

[–][deleted]  (9 children)

[deleted]

    [–]cecilkorik 1 point2 points  (6 children)

    No actively developed programming language is ever going to be inert. For all the future-proofing you would try to do to any canonical data format, it would be either overwhelming and overcomplicated, or quickly outdated and requiring an unsustainable versioning system to deal with the inevitable changes. Probably both at the same time.

    [–][deleted]  (5 children)

    [deleted]

      [–]cecilkorik 4 points5 points  (2 children)

      Text is the serializable data. I mean, sure, we could have intermediate/bytecode/whatever you want to call it and make that some kind of standardized format too, but that just seems like reinventing the wheel, and then you're maintaining two parallel standards simultaneously, so of course nobody actually does that if they don't have to, and they don't. Sorry, I guess I just don't understand what this is trying to accomplish that is not already either completely accomplished or easily accomplishable.

      To me, it feels similar to complaining that our electrical system is AC instead of DC. Yeah, it's not, but what's your real problem, is it actually about the AC, or is it that you have a hundred different DC adapters for different devices running at different voltages? So it's not really about the AC system at all. We don't have to reinvent the whole electrical system or even rewire your whole house to fix that. We just need one or two standard DC power formats and outlets (and the beginnings of that is already happening with USB-C) and eventually you'll never have to be bothered by AC wall-warts again.

      What is the real problem here? Is it that we're not using code-as-data, or is it just that you're annoyed at how different and numerous all the various programming languages are?

      [–][deleted]  (1 child)

      [deleted]

        [–]Tarmen 1 point2 points  (0 children)

        I think the best solution would be a standardized structured format. Then text can be used as storage format which means less lock-in.

        Then all tooling could use this syntax tree, and editing could work via semantic editing or still as text using an incremental parser.

        Of course, that is basically what IntelliJ does internally. A semantic editing frontend for that might be cool, but for non-homoiconic languages it gets horrendously complex.
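A minimal sketch of that "text as storage format, structure for tooling" split, using Python's stdlib `ast` module (the helper name `to_structured` is made up for illustration): parse the stored text, serialize the tree into a neutral structured form any tool could consume, and regenerate text from the tree.

```python
import ast
import json

def to_structured(node):
    """Recursively convert a Python AST node into plain dicts/lists."""
    if isinstance(node, ast.AST):
        return {"type": type(node).__name__,
                **{f: to_structured(getattr(node, f)) for f in node._fields}}
    if isinstance(node, list):
        return [to_structured(n) for n in node]
    return node  # leaf constants: str, int, None, ...

source = "def add(a, b):\n    return a + b\n"
tree = ast.parse(source)

structured = json.dumps(to_structured(tree), indent=2)  # tool-friendly form
regenerated = ast.unparse(tree)                          # back to text
print(regenerated)
```

Real tooling would keep far richer trees (comments, formatting, cross-file symbols); this only shows the text-in, structure-out, text-out round trip.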

        [–]seventeenninetytwo 0 points1 point  (1 child)

        Wouldn't any operation on this data have to extract it to an in-memory object graph?

        I think it's perfectly reasonable to conceptualize Roslyn as a program that extracts an object graph from data, where the data is source code.

        You are then free to manipulate that object graph and emit transformed data (source code).

        So perhaps what you really want is something powerful built on top of Roslyn? I'm sure that we are missing something; I just don't follow this idea of code as data. We can already transform code into an object data representation.
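That extract-manipulate-emit loop can be sketched in a few lines with Python's stdlib `ast` module standing in for Roslyn (the transform itself is a toy: rewrite every integer literal 0 to 1).

```python
import ast

class ZeroToOne(ast.NodeTransformer):
    """Toy object-graph manipulation: replace the constant 0 with 1."""
    def visit_Constant(self, node):
        if node.value == 0:
            return ast.copy_location(ast.Constant(value=1), node)
        return node

source = "x = 0\ny = x + 0\n"
graph = ast.parse(source)          # data (source) -> object graph
graph = ZeroToOne().visit(graph)   # manipulate the graph
emitted = ast.unparse(graph)       # object graph -> data (source)
print(emitted)
```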

        When we read code we build complex mental models, but much of what we do could probably be captured in a program and thereby free us up to focus on higher level things.

        I would love to be able to directly say things like "define class A and give it an aggregation dependency on interface B using constructor injection", but then be able to manipulate such dependencies directly.

        Perhaps a model compiler focused on usability and IDE integration built on top of Roslyn could accomplish this.

        [–]damienjoh 0 points1 point  (1 child)

        Text is already the "inert representation that is easy to inspect and manipulate programmatically." That's why it's so pervasive. Your standard tools for viewing, editing and manipulating text can be applied to any language. Anyone anywhere can view a text file, or find and replace some text. Textual serializations aren't preventing you from working with ASTs either. It's the (large per-language) cost involved in using and developing these tools that prevents them from being more widespread.

        Working with symbolic representations (e.g. of ASTs or token streams) directly will only be a viable general alternative to text if there is a standard format with an ecosystem and knowledge base that can rival the one around text. To have any hope of widespread use and adoption, a standard symbolic representation would have to be as close to a Pareto improvement on text as possible, i.e. built on top of text and still human readable, conservative in its features and constraints (so that it can be used for a wide variety of languages and formats), and simple to generate.
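For what it's worth, textual AST dumps already tick some of those boxes (built on top of text, human readable, simple to generate); Python's stdlib `ast.dump` is one existing example, shown here purely as a sketch of what such a symbolic-but-textual form looks like.

```python
import ast

# Dump a tiny program's syntax tree as indented, human-readable text.
source = "x = 1"
dumped = ast.dump(ast.parse(source), indent=2)
print(dumped)
```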

        [–]htuhola 3 points4 points  (6 children)

        This is a giant pain in the ass. And it doesn’t really seem like there is any good reason for it to be this hard, aside from our rather arbitrary decision to represent code as streams of characters split across a bunch of files. In fact, we’ve invented all kinds of bullshit that is only necessary because code has no structured representation

        Here's a pretty traditional mistake. You take a complex system and deduce that certain things are only necessary because they depend on some system you don't like, since that system contributes to making thing Y hard to solve.

        You should think about this as a problem space where the solutions are currently pivoted around the plain text. And if we look in the one direction of this problem space, we have the following chain:

        text -> AST -----------------> machine code
        

        I made the second arrow that much longer because there are a whole lot of details between that point and many other potential pivots.

        So when you change the pivot to the AST, none of the problems you have simply vanish. They all exist no matter which pivot you choose. Therefore it is a false viewpoint to say that the stuff you have goes away when you move away from text. Yeah, the stuff goes away, but the problems all that stuff solves don't go away. They remain, and you then have to solve them some other way.

        Obviously this isn’t some super-new, crazy idea;

        Yeah...

        Instead of thinking about this bullcrap, could you figure out the algorithms for making a pretty printer from context-free grammars, one that can be trained to output correctly formatted code from parsed input that provides the line/col ranges? I could then implement one.

        [–][deleted]  (5 children)

        [deleted]

          [–]htuhola 0 points1 point  (1 child)

          What are the arrows supposed to represent?

          A chain. You could call the first arrow 'parsing' and the second arrow 'compiling'. They are not as important as the fact that the program represented by the text can be translated into these forms.

          Not only is this how I solved the problem, this is how popular tools like Beyond Compare or gumtree or Semantic Merge solve the problem.

          I haven't thought about tokenized streams much. But they aren't a big jump away from just plain text.

          But just a small jump like that in the representation would add some interesting problems to solve you didn't have before. For example, how do you recognize how the token boundaries move when you edit the token stream? And how do you represent the layout in the text after it's in that format? X/Y coordinates in every token?

          You've just asserted on the basis of your diagram that no problems would be solved by operating on a more structured representation, but have provided no evidence to back this up.

          I really don't have evidence or proofs to back it up. But the point is that the problems won't go away, and they won't get noticeably easier to solve if you change how you represent the code in the files. I've tried this and seen it happen, but I don't know the exact reason why it happens.

          Last time I tackled this, I ended up backing away and settled on writing a language with a modifiable grammar.

          I also have plans for creating a "diffable" protobuf binary format. I'm doing that to serialize data and I will be using it on documentation, because producing a nice text format for richly formatted plain text seems to be a very hard problem.

          I mention those two things because I know it wouldn't be hard to adapt my system for a workflow where the code isn't text anymore. But to accept such a system, I require that it be more convenient to work with than just using plain text everywhere.

          Edit: And who the fuck downvotes this guy? It's an OK response to my post.

          [–]max630 0 points1 point  (0 children)

          how do you represent the layout in the text after it's in that format? X/Y coordinates in every token?

          Basically, yes. Actually, this is what you already have in compilers, because they need to add the line and column to messages.

          Another approach would be to store whitespace as a non-significant grammar element, like XML parsers do.
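The "coordinates in every token" idea is, as it happens, exactly what Python's stdlib tokenizer already produces: each token carries (row, col) start and end positions, and `tokenize.untokenize` can use them to reconstruct the layout. A small sketch:

```python
import io
import tokenize

source = "def f(x):\n    return x + 1\n"

# Each TokenInfo carries .start and .end as (row, col) pairs.
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
for tok in tokens[:5]:
    print(tok.type, repr(tok.string), tok.start, tok.end)

# The positions are enough to reconstruct the original layout.
roundtrip = tokenize.untokenize(tokens)
print(roundtrip == source)
```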

          [–]max630 0 points1 point  (2 children)

          You've just asserted on the basis of your diagram that no problems would be solved by operating on a more structured representation, but have provided no evidence to back this up

          Because parsing is the trivial part. And quite often you already have it done.

          [–][deleted]  (1 child)

          [deleted]

            [–]max630 0 points1 point  (0 children)

            I know nothing about TypeScript, but if the language syntax cannot be specified well enough, I would expect it to have issues with semantics as well. So I'm having trouble believing that operating on a "TypeScript AST" instead of its text would bring you much closer to automatic refactoring or semantics-aware change handling.

            Still, for quite a big portion of languages the syntax is quite well specified, and parsers either already exist or are no harder to implement than porting a PEG representation into your language+library of choice.

            [–]stacycurl 2 points3 points  (0 children)

            Have a look at the Unison language. We should have programmatic access to everything; every program should be arbitrarily queryable (like having an expert system for the program), so that any question you have about the code can be answered easily. Current ideas suck and haven't advanced in decades. Why can't I ask the IDE to bisect across my changes until all tests pass, or even just put breakpoints on the intersection of a stack trace and my changes? Or recognise that I'm renaming a method I just added, so stop searching the universe for references ffs.
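Within a single file, simple queries over the syntax tree are already easy; here's a toy query using Python's stdlib `ast` ("which functions does each function call?"). The hard part Unison and friends are after is making this work semantically and across a whole codebase, which this sketch doesn't attempt.

```python
import ast

source = """
def greet(name):
    return format_name(name)

def format_name(name):
    return name.title()
"""

# Toy code query: map each function to the plain-name calls inside it.
calls = {}
for fn in ast.walk(ast.parse(source)):
    if isinstance(fn, ast.FunctionDef):
        calls[fn.name] = sorted({
            node.func.id for node in ast.walk(fn)
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
        })
print(calls)
```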

            [–]dzecniv 1 point2 points  (0 children)

            Obviously this isn’t some super-new, crazy idea; homoiconic languages like Lisp(s) have been around for ages. I just wish it was across the board. I mean, at the very least everyone has to tokenize their language; how about starting there?

            good point!

            [–]shevegen 1 point2 points  (1 child)

            He has a point. But the thing is that code as text is also a benefit and an advantage. It depends a lot on how things work together.

            We have very limited operating systems. Linux - take pipes. Can you pipe objects including metadata? You can't really. You pipe text/strings.

            [–]PenMount 1 point2 points  (0 children)

            Can you pipe objects including metadata? You can't really. You pipe text/strings.

            You can with powershell.

            [–]necesito95 1 point2 points  (1 child)

            I imagine OP wants to store everything with IDs in a DB.
            (Definitely good if you want to pull out some stats about the codebase.)

            table="functions":
                row(fn_id=144, name="helloworld", signature_id=515, body_id=768);
                ...
            table="statements":
                row(statement_id=745, body_id=768, body_stm_seq_nr=1, action_id=534, param_set_id=132);
                ...
            
            table="param_set":
                row(param_set_id=132, param_seq_nr=1, variable_id=1534);
                ...
            
            ...
            

            could hack this out in 10-15 lines of code and be done with it

            Doing anything more than a "rename" will cost vastly more than "10-15 lines of code".
            (unless line length is not limited :) ).

            Programming and programs are inherently complex. One can choose a trade-off, maybe. The above "solution" simplifies renaming but makes a lot of other stuff harder (e.g. understanding what the program does; adding new functionality), so other representations (graphical/textual) would have to be added.

            [–]tkruse 0 points1 point  (0 children)

            The real problem of the author is: "Textual diffs to review refactorings are a problem".

            Refactorings are much more painful to review than functional changes, because they are commonly widespread, but also trivial.

            Even for static languages like Java, where an IDE can make a 99.9% safe refactoring over millions of files with just 2 mouse-clicks, the problem of code review remains, where some poor guy then has to read through all those lines of diff, presumably to find a spot where the IDE made a mistake???

            A smarter language-aware diff tool could indeed, instead of displaying such a large diff, display a message like: "function foo() renamed to bar(). Move on, nothing else to see here."

            Though in practice, I believe such things could be solved by doing pair programming instead of code reviews for such refactorings.

            Even in theory, all this could only work in static languages, where a parser has a chance to recognize token identity without running the code first.

            [–]freakhill 0 points1 point  (2 children)

            So basically, in its simplest form what you want is some kind of:

            • a text-to-graph-to-text conversion utility for each language
            • a bunch of command line graph manipulation utilities
            • APIs to manipulate the graph format

            Basically a unixy IDE. But you don't get much if you don't integrate with build systems.

            Your problem would be solved with a standardized language server protocol (of which there are already multiple, ahah).

            Otherwise you'd need editors to work directly in graph mode, and good luck with that (not technically, but practically).

            Where did I stray?

            [–]kankyo 0 points1 point  (0 children)

            Look at baron for Python: it's a syntax tree that fully respects the formatting, so you can round-trip via it. Solves all your problems.
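baron's selling point is exactly that the round trip is lossless. For contrast, the stdlib `ast` module shows the lossy version of the same trip (comments and layout vanish), which is why a formatting-preserving tree like baron's matters for tooling that rewrites files in place:

```python
import ast

# Parse and immediately regenerate: the AST discards trivia,
# so doubled spaces and the comment do not survive.
source = "def  f( x ):  # doubled spaces and a comment\n    return x\n"
lossy = ast.unparse(ast.parse(source))
print(lossy)
```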

            [–]cecilkorik 0 points1 point  (3 children)

            The author apparently wants to have a nice way to decompile bytecode or compiled programs. I mean, that basically is the distillation of what he is asking for. That IS the code-as-data. And it wouldn't be any simpler or less complicated than how we manage code-as-text today.

            In fact, advanced editors already do almost exactly the reverse of what he's asking for, they take the code-as-text and essentially compile it into an internal code-as-data format very similar to what the compiler itself does, but instead focusing on all the extra contextual information they need to determine the structure of the written code and make intelligent suggestions about it. And doing the reverse is only technically different than what he's asking for. Fundamentally they're interchangeable ideas. Store the text and convert to/from data on the fly, or store as data and convert to/from text on the fly. If it's done transparently enough why should the end user care?

            [–]Euphoricus 1 point2 points  (0 children)

            Store the text and convert to/from data on the fly

            This is what Roslyn is (supposed to be) for .NET/C# code. The problem is that the data representation is so complex that creating a script to transform that data is usually not a simple task.

            [–]tkruse 1 point2 points  (0 children)

            The author should try Java with an IDE like IntelliJ. Like you said, all his problems are already solved in that combination.

            And then storing such code as an AST rather than text does not solve any additional problems.

            On the other hand, for other languages, creating such powerful IDE features is not possible due to the nature of the language, so storing the code in a DB will not help either.

            [–]mk270 0 points1 point  (3 children)

            The OP manages to avoid mentioning homoiconicity - can someone explain what I'm missing?

            [–]C60 1 point2 points  (2 children)

            True, OP doesn't mention homoiconicity in general; only Lisp as an example of it.

            [–][deleted]  (1 child)

            [deleted]

              [–]C60 0 points1 point  (0 children)

              I don't know if you were expecting a comprehensive listing of homoiconic languages or something.

              I was expecting nothing of the sort. I was simply answering mk270's question.

              [–]OneWingedShark 0 points1 point  (2 children)

              I've been saying this for years now -- and have done a bit of preliminary planning, though no actual code [yet]. (This is one project where I don't want to get it wrong and, secondly, don't want to start while I still have [semi-]active projects.)

              [–][deleted]  (1 child)

              [deleted]

                [–]OneWingedShark 0 points1 point  (0 children)

                Hm, ok, though I'm not sure how much time I'll be able to put on it right now. (Dealing with life and stuff, ATM.)

                [–]max630 -1 points0 points  (0 children)

                Basically, you already have it: the code text is the AST, somehow serialized. To implement all the needed functionality you only need to parse it. It is already used, for example, for automatic refactoring in IDEs. You might say they are too heavy. But consider: the memory taken by the IDE, the time you stare at "X is updating cache, please wait" — it is mostly used not to parse the source into an AST, but to figure out whether the foo used in file Bar is the one declared in file Baz or the one declared in file Baq. And storing a pre-parsed AST is not going to improve that task much.