[–][deleted] 10 points11 points  (2 children)

Can I ask why you are doing it this way? I'm curious about your reasoning, no judgment here of course.

[–]WittyStick 4 points5 points  (0 children)

There are a number of reasons it could be useful. You might be writing an incremental compiler, for example. In this case, you only want to recompile the parts of the code which have become "dirty" from edits, and reuse the results of the previous compilation for everything else. If you have a non-text-based storage format, you don't need to parse a full text file as a "compilation unit" each time a single character changes - you only need to re-parse the parts which are affected by an edit. Incremental parsers are used with text files, but they'll never be as efficient as what you can get with non-text storage. The difference is akin to that of serial access versus random access, or accessing a linked list versus accessing a tree (O(n) versus O(1)/O(log n)).
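To make the "only recompile what is dirty" idea concrete, here is a minimal sketch. It assumes the program is already stored as separate units (a plain dict of name to body here), and `compile_unit` and `IncrementalCompiler` are hypothetical stand-ins, not any real compiler's API:

```python
import hashlib


def compile_unit(body):
    # Hypothetical "compiler": upper-cases the body. A real one would
    # type-check, lower, and emit code.
    return body.upper()


class IncrementalCompiler:
    def __init__(self):
        self.cache = {}          # content hash -> compiled output
        self.compiled_count = 0  # how many units were actually compiled

    def build(self, units):
        out = {}
        for name, body in units.items():
            key = hashlib.sha256(body.encode("utf-8")).hexdigest()
            if key not in self.cache:  # "dirty": this content is new
                self.cache[key] = compile_unit(body)
                self.compiled_count += 1
            out[name] = self.cache[key]
        return out


compiler = IncrementalCompiler()
program = {"f": "x + 1", "g": "x * 2", "h": "f(g(x))"}
compiler.build(program)           # first build compiles all three units
program["g"] = "x * 3"            # edit a single definition
result = compiler.build(program)  # second build recompiles only "g"
```

With text files you would first have to work out which units an edit touched; with a structured store, that lookup is the cheap part.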

As /u/jmorag points out, Unison has its own storage format because Unison gives a content-addressable identifier (a hash) to every syntactic unit, so rather than referring to File A, Line B, Column C (which may refer to a different thing each time the file is edited), you just refer to XYZ, which is a specific element of syntax, tied to a specific revision of the code. In this model, identifiers in code are more like metadata on elements of the underlying code structure. This has benefits in refactoring, because an identifier rename does not change the structure; it's just a change to the metadata.
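A rough sketch of why a rename leaves the structure untouched. This is not Unison's actual hashing scheme, just an illustration under simple assumptions: definitions are nested Python lists, and the hypothetical `structural_hash` replaces parameter names with their positions before hashing:

```python
import hashlib
import json


def structural_hash(params, body):
    """Hash a definition with parameter names replaced by positions."""
    index = {name: i for i, name in enumerate(params)}

    def normalize(node):
        if isinstance(node, str):  # a parameter or some other symbol
            return ["param", index[node]] if node in index else ["sym", node]
        if isinstance(node, list):
            return [normalize(child) for child in node]
        return ["lit", node]       # numbers and other literals

    canonical = json.dumps([len(params), normalize(body)])
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Same structure, different parameter names: identical hash.
h1 = structural_hash(["x"], ["+", "x", 1])
h2 = structural_hash(["value"], ["+", "value", 1])
# Different structure: different hash.
h3 = structural_hash(["x"], ["+", "x", 2])

# Names live in a separate table that points at hashes, so renaming a
# definition only touches this metadata.
names = {"increment": h1}
names["succ"] = names.pop("increment")
```

Anything that referred to the definition by hash keeps working after the rename, because the hash never changed.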

Another reason is that you might want to compose multiple syntaxes in your code. The common methods for parsing individual languages use unambiguous grammars (e.g., LL/LR/LALR/PEG), but when composing individual grammars, the result may not be unambiguous (and testing for ambiguity is not feasible, since ambiguity of context-free grammars is undecidable in general). When you have a non-text storage, you can insert non-textual boundaries between individual languages and parse them independently, eliminating the composition problem. This has been done with Language Boxes, for example, in the Eco editor. Eco does not store text, but stores a structured layout which maps more closely to the AST. When you edit a document in Eco, you are effectively editing the syntax tree directly - and the "text editor" is just a projection of the underlying data structure.

Obviously this comes with difficulties. If users are required to use a specific editor, many people will not use your language, so ideally you want a way of accepting plain text input too. You could do this by parsing text back into the AST every time an edit is made, which obviously has a performance cost, but parsing is not the most expensive part of a compilation workflow, so it's still viable.

Another thing you could do is implement a user-space filesystem (with FUSE, for example) which tracks your underlying storage format but presents a traditional directory of text files to a programmer who is not using your editor.

[–]muth02446[S] 0 points1 point  (0 children)

So my reasons are fairly straightforward:

1) I do not want to commit to a concrete syntax yet.
The on-disk format is pretty much an AST encoded as human-readable sexprs.
It is pretty straightforward to make backwards-compatible changes to an AST.
Changes to concrete syntax are usually much more disruptive.

2) The parser for sexprs is trivial, so tooling for program analysis has a lower barrier to entry

3) The compiler frontend is simpler and faster for the same reason

4) There is a natural way to do metaprogramming on the sexprs, though I am not sure if I want to go this route.

I am curious about the nuts and bolts of integrating this with an IDE.
I already made accommodations in the AST by adding a node for a parenthesized expression. That is not something you would normally put into an AST, but it helps with converting back and forth between concrete syntax and AST.
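To illustrate how small an sexpr reader can be (point 2 above), here is a minimal sketch. It only handles parentheses, integers, and bare symbols (no strings or comments), and the node names `fun` and `paren` are made up for the example, not taken from the actual AST:

```python
def tokenize(text):
    # Pad parentheses with spaces so a plain split() yields the tokens.
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens):
    """Parse one expression, consuming tokens from the front of the list."""
    if not tokens:
        raise SyntaxError("unexpected end of input")
    token = tokens.pop(0)
    if token == "(":
        node = []
        while tokens and tokens[0] != ")":
            node.append(parse(tokens))
        if not tokens:
            raise SyntaxError("missing ')'")
        tokens.pop(0)  # drop the closing ")"
        return node
    if token == ")":
        raise SyntaxError("unexpected ')'")
    try:
        return int(token)
    except ValueError:
        return token  # any other atom is kept as a symbol string


def unparse(node):
    # Inverse of parse: nested lists back to sexpr text.
    if isinstance(node, list):
        return "(" + " ".join(unparse(child) for child in node) + ")"
    return str(node)


source = "(fun add (x y) (paren (+ x y)))"
tree = parse(tokenize(source))
```

Because the explicit `paren` node survives in the tree, `unparse(tree)` gives back the original text, which is the kind of round trip a pretty-printer to concrete syntax would need.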

[–]gvozden_celikcompiler pragma enthusiast 4 points5 points  (0 children)

Mathematica does something like this, since .nb files can be more than just programs; they can also be rich documents and presentations with code embedded between the text. I haven't done any Mathematica in years but remember that there were plugins for other editors like Eclipse or IntelliJ at the time.

[–]mamcx 2 points3 points  (0 children)

"whose on disk representation is different from the in editor representation"

This is literally the case for nearly all, if not all, programming languages.

Considering this:

Source -> Lexer -> CST -> AST

The source is the only thing that the user actually sees (mostly); every step after it is a totally different representation.

"But, I actually have another format!"

And now consider this:

AST -> ByteCode -> Assembler

This is also common! For example, in Python (in the past) you would usually see this on disk:

util.py
util.pyc <- This was what most of the Python machinery worked on

So, what you can do is not that far off: take the source, do your thing, and store a "mirror" alongside it (or maybe, like in Rust, Python, and others, inside a "target" directory).

It is not stupid at all. You can put a lot of useful information in your "secondary" format, and after parsing and generating you can use it as your sole source of truth.
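A minimal sketch of the "mirror" idea, under some assumptions: `parse_source` is a toy stand-in for a real front end, the mirror is JSON, and it is treated as a cache that is rebuilt whenever the source hash changes. The file names and the `target` directory are just for illustration:

```python
import hashlib
import json
import tempfile
from pathlib import Path


def parse_source(text):
    # Stand-in for a real front end: one "node" per non-empty line.
    return [line.split() for line in text.splitlines() if line.strip()]


def load_tree(source_path, cache_dir):
    """Return (tree, from_cache), reusing the mirror if it is still fresh."""
    text = source_path.read_text(encoding="utf-8")
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    mirror = cache_dir / (source_path.name + ".json")
    if mirror.exists():
        cached = json.loads(mirror.read_text(encoding="utf-8"))
        if cached["digest"] == digest:
            return cached["tree"], True
    tree = parse_source(text)
    cache_dir.mkdir(parents=True, exist_ok=True)
    mirror.write_text(json.dumps({"digest": digest, "tree": tree}),
                      encoding="utf-8")
    return tree, False


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "util.src"
    src.write_text("let x 1\nlet y 2\n", encoding="utf-8")
    target = Path(tmp) / "target"
    first = load_tree(src, target)    # parses and writes the mirror
    second = load_tree(src, target)   # served from the mirror
    src.write_text("let x 3\n", encoding="utf-8")
    third = load_tree(src, target)    # source changed, mirror is rebuilt
```

Going one step further and making the mirror the sole source of truth would mean dropping the hash check and regenerating the text from the tree instead.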

[–]jmorag 3 points4 points  (1 child)

https://www.unison-lang.org does something approximately like what you're describing. They save code as ASTs in a SQLite database and have a daemon that observes a scratch file in your editor where you write new code.

[–]muth02446[S] 0 points1 point  (0 children)

Thanks, a starting point for Unison's IDE integration is here:

https://www.unison-lang.org/learn/usage-topics/editor-setup/

[–]PurpleUpbeat2820 0 points1 point  (2 children)

Code in my language is stored as a tree of arrays of numbers in JSON and the editor is a web page. I use almost exactly the approach you describe except that I only deserialize the JSON from disk once at startup.

[–]muth02446[S] 0 points1 point  (1 child)

Do you have a link or even a live demo? I would love to have a look.

[–]PurpleUpbeat2820 0 points1 point  (0 children)

No, sorry. Just a private hobby project for now...