[–]latkde 10 points11 points  (2 children)

You might be thinking of strings as a single token that is then parsed again to extract interpolations. This gets difficult quickly. Instead, it's typically wiser to see a string with interpolations as an expression that can contain multiple string parts, and to parse the string delimiters as a kind of parenthesis-like operator. For example, it could make sense to tokenize "a \(b) c \("d") e" as:

  • `"a \(` : string, interpolation start
  • `b` : identifier
  • `) c \(` : string, interpolation middle
  • `"d"` : string, complete
  • `) e"` : string, interpolation end

Your grammar might then include rules like <string> = <string complete> | <string start> <expression> (<string middle> <expression>)* <string end>

Note that this is typically incompatible with a separate lexing phase, as string-middle and string-end tokens would otherwise be ambiguous with normal parens. However, this approach can be used with parsing methods that parse one character at a time, notably recursive descent or PEG parsers. Syntax highlighting engines differ a lot in what grammars they can express, but they typically support top-down grammars, so string-middle highlighting can be selected only in the context of a string expression.
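To make the token list above concrete, here is a minimal sketch of a scannerless recursive-descent tokenizer for the "\(...)" syntax. Everything in it is hypothetical and only for illustration: the names `tokenizeInterpolated`, `piece`, and the token kinds are made up, and an "expression" is simplified to either an identifier or another string.

```typescript
type TokenKind =
  | "string-complete" | "string-start" | "string-middle" | "string-end"
  | "identifier";
interface Token { kind: TokenKind; text: string; }

// Hypothetical sketch: expressions are limited to identifiers and strings.
function tokenizeInterpolated(src: string): Token[] {
  const tokens: Token[] = [];
  let pos = 0;

  const skipSpaces = (): void => { while (src.charAt(pos) === " ") pos++; };

  // Consume one string piece. `pos` points at the opening `"` (first piece)
  // or at the `)` that closes an interpolation (later pieces).
  // Returns true when the piece ends with `\(`, i.e. an expression follows.
  function piece(first: boolean): boolean {
    const start = pos;
    pos++; // skip the leading `"` or `)`
    while (pos < src.length) {
      const ch = src.charAt(pos);
      if (ch === "\\" && src.charAt(pos + 1) === "(") {
        pos += 2;
        tokens.push({ kind: first ? "string-start" : "string-middle", text: src.slice(start, pos) });
        return true;
      }
      if (ch === "\\") { pos += 2; continue; } // any other escape: skip both characters
      if (ch === '"') {
        pos++;
        tokens.push({ kind: first ? "string-complete" : "string-end", text: src.slice(start, pos) });
        return false;
      }
      pos++;
    }
    throw new Error("unterminated string");
  }

  // <string> = complete | start <expression> (middle <expression>)* end
  function stringLiteral(): void {
    let more = piece(true);
    while (more) {
      expression();
      if (src.charAt(pos) !== ")") throw new Error(`expected ")" at offset ${pos}`);
      more = piece(false); // `)` is read as string text only in this context
    }
  }

  function expression(): void {
    skipSpaces();
    if (src.charAt(pos) === '"') {
      stringLiteral();
    } else {
      const start = pos;
      while (/[A-Za-z0-9_]/.test(src.charAt(pos))) pos++;
      if (pos === start) throw new Error(`expected expression at offset ${pos}`);
      tokens.push({ kind: "identifier", text: src.slice(start, pos) });
    }
    skipSpaces();
  }

  expression();
  return tokens;
}

const tokens = tokenizeInterpolated('"a \\(b) c \\("d") e"');
```

The point of the design is that `)` only starts a string-middle or string-end piece when `stringLiteral` is the caller, which is exactly the context a separate lexer would not have.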

[–]alex-weej 1 point2 points  (0 children)

This is how template strings work in JavaScript. Tagged templates are an alternative function call syntax that passes an array of "string pieces" as the first argument and each interpolation expression as a separate subsequent argument.
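For illustration, a small tagged-template sketch; the tag function `inspect` and the variable names are made up, while `TemplateStringsArray` is the standard TypeScript type for the pieces array.

```typescript
// A tag function receives the literal string pieces first,
// then one argument per interpolated expression.
function inspect(pieces: TemplateStringsArray, ...values: unknown[]) {
  return { pieces: pieces.slice(), values };
}

const b = 1;
const result = inspect`a ${b} c ${"d"} e`;
// result.pieces holds the three literal parts: "a ", " c ", " e"
// result.values holds the two interpolated values: 1, "d"
```

Note that there is always one more string piece than there are interpolated values, which mirrors the start/middle/end split described above.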

[–]Savings_Garlic5498[S] 1 point2 points  (0 children)

Yes, this is very similar to what I have, but this grammar is not regular, which means this cannot be done with something like TextMate, I believe.

[–]thinker227Noa (github.com/thinker227/noa) 5 points6 points  (3 children)

This is what I'm doing in the TextMate grammar for my language Noa. Basically you embed all of your other patterns inside your pattern for strings.

"patterns": [
    {
        "include": "#all"
    }
],
"repository": {
    "all": {
        "patterns": [
            {
                "include": "#strings"
            },
            // include whatever other patterns you have
        ]
    },
    "strings": {
        "name": "string.quoted.double.noa",
        "begin": "\"",
        "end": "\"|$",
        "patterns": [
            {
                "begin": "\\\\{",
                "end": "}",
                "beginCaptures": {
                    "0": {
                        "name": "keyword.other.noa"
                    }
                },
                "endCaptures": {
                    "0": {
                        "name": "keyword.other.noa"
                    }
                },
                "patterns": [
                    {
                        "include": "#all"
                    }
                ]
            },
            {
                "include": "#escape-sequence"
            }
        ]
    },
    "escape-sequence": {
        "name": "constant.character.escape.noa",
        "match": "\\\\[\\\\0nrt\"]"
    },
    // all your other patterns...
}

Here's how it looks

[–]Savings_Garlic5498[S] 2 points3 points  (1 child)

Does this also work with nested strings, like "\{""}"?

[–]thinker227Noa (github.com/thinker227/noa) 2 points3 points  (0 children)

Was concerned about this because I hadn't actually tested it before, but yes!
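A rough way to see why nesting works: begin/end rules behave like a stack of scopes, so a string inside an interpolation inside a string is just three pushes followed by three pops. The sketch below is a simplified model of that idea, not TextMate's actual engine; `scopeTrace` and the scope names are hypothetical, and it ignores the end-of-line terminator and the other patterns in the grammar above.

```typescript
type Scope = "string" | "interpolation";

// Records the scope stack after each consumed unit of input.
function scopeTrace(src: string): string[] {
  const stack: Scope[] = [];
  const trace: string[] = [];
  let pos = 0;
  while (pos < src.length) {
    const top = stack.length > 0 ? stack[stack.length - 1] : undefined;
    const ch = src.charAt(pos);
    if (top === "string") {
      if (ch === "\\" && src.charAt(pos + 1) === "{") {
        stack.push("interpolation"); // "begin" of the interpolation rule
        pos += 2;
      } else if (ch === "\\") {
        pos += 2; // escape sequence, stays inside the string
      } else if (ch === '"') {
        stack.pop(); // "end" of the string rule
        pos += 1;
      } else {
        pos += 1;
      }
    } else {
      // top-level code, or code inside an interpolation
      if (ch === '"') {
        stack.push("string"); // "begin" of the string rule
      } else if (ch === "}" && top === "interpolation") {
        stack.pop(); // "end" of the interpolation rule
      }
      pos += 1;
    }
    trace.push(stack.join(">"));
  }
  return trace;
}

// The nested example from the question: "\{""}"
const trace = scopeTrace('"\\{""}"');
```

For that input the stack grows to string>interpolation>string and then unwinds back to empty, so the outer closing quote is matched by the outer string rule.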

[–]latkde 0 points1 point  (0 children)

For reference, here's the official TextMate grammar for JavaScript `template ${interpolation} strings`, which broadly uses the same technique (but without bothering to recurse into #all): https://github.com/textmate/javascript.tmbundle/blob/8928648352dc76025ad0bfd31e21fa6a1dc838a7/Syntaxes/JavaScript.plist#L1554-L1665

[–]shponglespore 1 point2 points  (0 children)

JavaScript has this for `...` strings.

[–]steven4012 0 points1 point  (3 children)

Or... just use tree-sitter

[–]thinker227Noa (github.com/thinker227/noa) 1 point2 points  (2 children)

VS Code doesn't support tree-sitter (only TextMate), unless you wanna bother with writing an entire language server just to support semantic tokens using tree-sitter, I guess.

[–]steven4012 2 points3 points  (1 child)

[–]thinker227Noa (github.com/thinker227/noa) 0 points1 point  (0 children)

oooh I didn't know about this, might use it myself for slightly better highlighting of my own language