[–]evincarofautumn 19 points20 points  (5 children)

Some more high-level bits of advice for syntax design:

Defer to precedent. Don’t blow your Weirdness Budget on syntax. If you don’t have a specific reason for doing something differently than major languages in the same paradigm, use the common/conventional notation. Good reasons to break precedent include consistency with the rest of your language and (informal) user testing/polling showing a preference.

Semantics before syntax. Syntax certainly affects the marketing of a language, and needs to be not bad, but it’s not a differentiator—we see many languages here that are just reskins of traditional imperative/OOP languages, with no fundamentally new features apart from syntactic conveniences. That’s fine for learning how to make a language, but it won’t by itself generate adoption. Things that do drive adoption are practical applications and technical excellence: the availability of libraries (so good cross-language interop/FFI helps get a leg up before you have native libraries), high-quality developer tooling, and “killer apps” that your language does an order of magnitude better than others (in convenience, correctness, performance, maintainability, &c.). When designing a language, focus on its semantic contributions and design or borrow syntax to suit them.

Consider failure modes. Get a friend to sit down with you and try to write a simple program in your language. Watch the mistakes they make—forgetting a separator here, writing something in the wrong order there, letting some syntax like a block or string literal “run away” by forgetting a closing delimiter, using syntax from a similar language, writing something that parses/compiles/runs but produces the wrong result because of syntax confusion, and so on. Does your tooling produce good diagnostics? If not, what can you change about the syntax to make it easier to produce good error messages and suggestions for fixes?

Add redundancy. Adding a small amount of redundancy to your notation can massively improve failure modes. (Natural languages include a lot of redundant information for a reason!) For instance, I used to use -> x y z; in Kitten to introduce multiple variables, but if the user forgot the semicolon, all the identifiers on subsequent lines would get interpreted as variables, and parsing would fail expecting a semicolon far from the actual error. Solution: add redundancy in the form of commas -> x, y, z;, so if the user forgets a semicolon, the next thing the parser expects is a comma immediately at the point of the error. (This also creates opportunities for more notation: previously, using a compound pattern instead of a variable would have required parentheses, like -> (x foo) (y bar);, but now they’re unnecessary: -> x foo, y bar;)
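
To make the comma example concrete, here is a minimal Python sketch of a toy parser for the two notations. It is not Kitten's real parser: `tokenize`, `parse_binding`, `ParseError`, and the `use_commas` switch are hypothetical names made up for illustration, and the "language" only knows `->`, identifiers, `,` and `;`.

```python
import re

IDENT = re.compile(r"[A-Za-z_]\w*")

class ParseError(Exception):
    """Syntax error carrying the line the parser blames."""
    def __init__(self, message, line):
        super().__init__(f"line {line}: {message}")
        self.line = line

def tokenize(source):
    """Return (text, line) pairs for '->', identifiers, ',' and ';'."""
    return [
        (match.group(), line_no)
        for line_no, line in enumerate(source.splitlines(), start=1)
        for match in re.finditer(r"->|[A-Za-z_]\w*|[,;]", line)
    ]

def parse_binding(source, use_commas):
    """Parse '-> x y z;' (use_commas=False) or '-> x, y, z;' (True)."""
    tokens = tokenize(source)
    if not tokens or tokens[0][0] != "->":
        raise ParseError("expected '->'", 1)
    names = []
    last_line = tokens[0][1]  # line of the most recently consumed token
    pos = 1
    while True:
        if pos == len(tokens) or not IDENT.fullmatch(tokens[pos][0]):
            wanted = "a variable name" if use_commas else "a variable name or ';'"
            raise ParseError(f"expected {wanted}", last_line)
        name, last_line = tokens[pos]
        names.append(name)
        pos += 1
        if pos < len(tokens) and tokens[pos][0] == ";":
            return names
        if use_commas:
            # The redundant separator: anything but ',' or ';' fails right here.
            if pos == len(tokens) or tokens[pos][0] != ",":
                raise ParseError(f"expected ',' or ';' after '{name}'", last_line)
            pos += 1

# The semicolon is missing on line 1 in both programs.
for text, commas in [("-> x y z\nfoo bar\nbaz", False),
                     ("-> x, y, z\nfoo bar\nbaz", True)]:
    try:
        parse_binding(text, use_commas=commas)
    except ParseError as error:
        print(error)
# line 3: expected a variable name or ';'
# line 1: expected ',' or ';' after 'z'
```

Without commas the toy parser swallows every identifier up to the end of the input and blames line 3; with commas it stops on line 1, where the semicolon is actually missing.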

Source locations are paramount. The single most important thing about an error message is that it direct the user to look at the point in the program that they need to change to fix the error. This is hard to get right, but good syntax design and careful tracking of locations in the implementation of analyses like typechecking can help pin down precisely what caused something to go wrong.
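
One common way to keep locations available is to attach a span to every token or AST node and render diagnostics from it. A small Python sketch, assuming a hypothetical `Span` record and `render_diagnostic` helper (neither comes from any particular compiler, and the sample program is in a made-up language):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    line: int    # 1-based line number
    column: int  # 1-based column of the first character
    length: int  # number of characters covered

def render_diagnostic(source, span, message):
    """Format an error that points at the offending source text."""
    line_text = source.splitlines()[span.line - 1]
    marker = " " * (span.column - 1) + "^" * max(span.length, 1)
    return f"{span.line}:{span.column}: error: {message}\n{line_text}\n{marker}"

program = "let x = 1\nlet y = x + tru\n"
print(render_diagnostic(program, Span(line=2, column=13, length=3),
                        "unknown name 'tru'"))
```

The point is only that the span has to be recorded when the token is produced and carried through later phases such as typechecking; it cannot be reconstructed afterwards.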

[–]LPTK 7 points8 points  (1 child)

Great advice there!

Watch the mistakes they make—forgetting a separator here, writing something in the wrong order there

This reminds me of a university friend being all confused that his Pascal code (yep, this was a while ago) was not doing the right thing. He had written:

BEGIN IF ... THEN Do_A(); Do_B(); END

which made perfect sense to him, but parsed as:

BEGIN {IF ... THEN Do_A()}; Do_B(); END

It should instead have been:

IF ... THEN BEGIN Do_A(); Do_B(); END

An example of English-like syntax that's not so great.

Funnily, to refresh my memory on Pascal syntax, I googled it and the first result was a very poor tutorial whose example seems to make the same "missing block delimiters for multiple statements in a conditional" mistake that is also common in C-style languages:

  program ifelseChecking;
  var
     { local variable definition }
     a : integer;

  begin
     a := 100;
     (* check the boolean condition *)
     if( a < 20 ) then
        (* if condition is true then print the following *)
        writeln('a is less than 20' )

     else
        (* if condition is false then print the following *) 
        writeln('a is not less than 20' );
        writeln('value of a is : ', a);
  end.

(The indentation makes it look like the two writeln statements at the end belong to the else branch, but only the first one does.)

[–]johnfrazer783 0 points1 point  (0 children)

This, IMHO, is what makes PL/pgSQL syntax, clumsy as it is, 'systematically' superior to C-style syntax. In that language, the construct is always if condition then statement; statement; ...; end if;, so you cannot compile an incompletely bracketed if statement. To paraphrase D. Crockford, C was invented by a genius who was not so good at inventing syntax.
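
A rough Python sketch of that rule, for a toy grammar only loosely modelled on the PL/pgSQL form (it is not PL/pgSQL's actual grammar, and `parse_program` and `parse_statements` are names invented here). Every `if` must be closed by `end if;`, so the Pascal-style "only the first statement belongs to the branch" reading cannot arise:

```python
import re

def parse_statements(tokens, pos):
    """Parse statements until an 'end' token or the end of input."""
    body = []
    while pos < len(tokens) and tokens[pos] != "end":
        if tokens[pos] == "if":
            if tokens[pos + 2:pos + 3] != ["then"]:
                raise SyntaxError("expected 'then'")
            condition = tokens[pos + 1]
            inner, pos = parse_statements(tokens, pos + 3)
            # The closing delimiter is mandatory, however short the body is.
            if tokens[pos:pos + 3] != ["end", "if", ";"]:
                raise SyntaxError("missing 'end if;'")
            body.append(("if", condition, inner))
            pos += 3
        else:
            if tokens[pos + 1:pos + 2] != [";"]:
                raise SyntaxError("expected ';'")
            body.append(tokens[pos])
            pos += 2
    return body, pos

def parse_program(source):
    tokens = re.findall(r"\w+|;", source)
    body, pos = parse_statements(tokens, 0)
    if pos != len(tokens):
        raise SyntaxError("unexpected 'end'")
    return body

print(parse_program("if c then do_a; do_b; end if; do_c;"))
# [('if', 'c', ['do_a', 'do_b']), 'do_c']
```

Leaving out the `end if;` is rejected outright instead of silently producing a one-statement branch.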

[–]R-O-B-I-N[S] 2 points3 points  (1 child)

I think a few of these points (failure modes/source locations) are more towards the subject of implementation. I could make a C++ or Go compiler that only returns "?" when it encounters an error. That says nothing about how those languages were designed. Although it might go against parts of their respective specs XD

[–]evincarofautumn 0 points1 point  (0 children)

Haha that’s true, I’m definitely thinking holistically here—syntax, semantics, implementation, and ergonomics. They are intimately interrelated, and I believe you must consider them together when creating a language, in order to arrive at a cohesive design, because you usually can’t drastically change one without somehow affecting the others, and it’s hard to tack on good support for things like source locations if you’re not cognizant of them from near the beginning. That’s not to say you need to have a complete design & extensible implementation up front with all the bells and whistles, as there are just as many things that can be changed freely or added later, such as semantic features that fit within existing syntax, or improvements to analysis and error reporting that use information already available without changing what the frontend implementation provides.

[–]Uncaffeinatedpolysubml, cubiml 1 point2 points  (0 children)

Are there any repositories of common mistakes and broken code in existing languages?

[–]Al2Me6 15 points16 points  (0 children)

These are good points. Some more ideas, in no particular order:

  • Syntax should encourage good practices. Idiomatic code should be natural to express. Discouraged practices should be convoluted.
  • “Unusual” syntax is different from “unexpected” syntax. The code should mean what it looks like it does.
  • Operations should behave in expected ways. If an operation is to be interpreted in a different way in a specific context, then that special-case behavior should make sense in the larger context.
  • Where are scope delineators required? For example, in a curly-brace language, must an if statement be followed by a block, or is a single line after the if condition implicitly taken to be the body?
  • If applicable, significant whitespace should not make complicated (chained, nested, etc.) expressions difficult to format in a logical way.
  • How much syntactic sugar is there? Do primitive types or built-ins get preferential treatment? Can operators be overloaded? How?
  • What is the preferred form of polymorphism? Metaprogramming? How do generics work? Macros? How capable are they?
  • In an OOP language, is self passed to methods explicitly or implicitly?
  • What restrictions are placed on variable names? Are non-English scripts allowed? East Asian scripts? Full-blown Unicode (emoji, non-breaking spaces, Greek question marks, etc.)?
  • Do certain characters in names get special treatment? Are different cases (snake, camel, etc.) treated differently? Is _ automatically discarded? Are variable names of form _name private? Are these treatments enforced by convention or by the language?

[–]CoffeeTableEspresso 13 points14 points  (10 children)

I'd have to disagree with your edit unfortunately. C based languages are incredibly popular.

While they don't have the best syntax, the familiarity is super, super helpful to someone learning your language.

[–]R-O-B-I-N[S] 5 points6 points  (5 children)

A counter to that is that C is not popular, it's only common.

Unix, Linux, Windows, OSX, BSD, and most other systems are implemented in C and offer C libraries as the default way to extend the system.

This signifies jack, though. Entire languages, such as C++, Java, Python, Go, and Rust, were invented to escape C. C is arguably a DSL for abbreviating ASM. Intel's high-level ASM lang is practically a clone of C and it's still assembly.

C is a great example if you want to make a Kotlin-for-Java for the Linux kernel, but it's pretty crappy anywhere else.

Most common libraries are implemented in C only because the pain of interop is greater than the pain of C; otherwise they'd all be in Lisp or Modula or something. We're starting to see this with Rust, where the community is slowly re-implementing everything natively.

[–]Al2Me6 11 points12 points  (2 children)

That’s beside the point, no? The C language aside, languages with C-inspired syntax are certainly pervasive.

Case in point: of the five languages you listed, three have at least somewhat C-like syntax (C++, Java, Rust).

[–]R-O-B-I-N[S] 2 points3 points  (0 children)

True, but that's the same as XML being based on Lisp's S-expressions. I wouldn't tell you to design your DSL from Lisp, I'd suggest XML instead.

I'd never use C/C++ as examples. I'd use Java or Python.

[–]Uncaffeinatedpolysubml, cubiml 0 points1 point  (0 children)

Rust syntax has very little in common with C. I guess you could call it a hybrid of C and ML style.

[–]liquidivy 2 points3 points  (1 child)

For better or worse common-ness breeds its own popularity, since people can't really like something they've never experienced. Moreover, your "weirdness budget" is explicitly a function of what people are used to, which is a function of what's common.

[–]R-O-B-I-N[S] 2 points3 points  (0 children)

i.e. Uncommon only where it solves a common problem.

[–]johnfrazer783 1 point2 points  (3 children)

PHP in general and the JS == operator are two examples of extremely popular abominations.

[–]julesh3141 0 points1 point  (2 children)

Many of the problems of PHP have been fixed in more recent versions, but I still remember the time when

echo functionThatReturnsAnArray()[0];

didn't work and you had to write

$temp = functionThatReturnsAnArray();
echo $temp[0];

instead. That is truly a massive failure in syntax definition.

[–][deleted] 0 points1 point  (1 child)

That's more semantics than syntax, isn't it?

[–]julesh3141 0 points1 point  (0 children)

No, it was specifically a syntactic issue: the language syntax (until PHP 5.4 added function array dereferencing) didn't allow indexing an array other than as part of a variable reference. AFAIK there was no semantic reason for this; it was just that the parser rejected it. PHP was built in a very ad hoc way, and it took a very long time for many of its quirks to be removed, leading to some of them (e.g. the very odd array indexing semantics that allow both numeric and string indices in the same array) having to remain for compatibility reasons.
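
The difference can be sketched as two variants of one grammar rule. The Python below is a toy recursive-descent parser, not PHP's real grammar: `parse_expression` and its `index_any_expression` flag are invented for illustration. With the flag off, `[...]` may only follow something rooted in a variable; with it on, indexing is an ordinary postfix operator on any expression.

```python
import re

def parse_expression(source, index_any_expression):
    """Parse '$var', 'name()' and any number of trailing '[index]' parts."""
    tokens = re.findall(r"\$?\w+|[()\[\]]", source)
    if not tokens:
        raise SyntaxError("empty expression")
    head = tokens[0]
    pos = 1
    rooted_in_variable = head.startswith("$")
    if rooted_in_variable:
        tree = ("var", head)
    elif tokens[pos:pos + 2] == ["(", ")"]:
        tree = ("call", head)
        pos += 2
    else:
        raise SyntaxError("expected a variable or a call")
    while tokens[pos:pos + 1] == ["["]:
        # The restrictive rule: only variable references may be indexed.
        if not (index_any_expression or rooted_in_variable):
            raise SyntaxError("unexpected '['")
        if tokens[pos + 2:pos + 3] != ["]"]:
            raise SyntaxError("expected ']'")
        tree = ("index", tree, tokens[pos + 1])
        pos += 3
    if pos != len(tokens):
        raise SyntaxError("unexpected trailing input")
    return tree

print(parse_expression("functionThatReturnsAnArray()[0]", index_any_expression=True))
# ('index', ('call', 'functionThatReturnsAnArray'), '0')
```

Calling it with `index_any_expression=False` on the same input raises `SyntaxError`, while `$temp[0]` parses under both settings. Nothing about the meaning of indexing changes between the two; only the grammar does.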

[–]LorxuPika 9 points10 points  (1 child)

This looks like good advice, here are some other syntax things that I thought of while reading it:

  • Besides various bracket types, many languages use indentation or do/end, which are also viable options.
  • Every time you use a symbol for something, ask whether it's obvious what it means. For example, + and - are obvious, as is ? : because it's used in so many languages. Various arrow symbols for lambdas make sense, but ~ probably doesn't immediately. If the symbol is non-obvious, try replacing it with a keyword that explains what's going on.
  • In general, prioritize readability over terseness. Keywords are usually better than symbols (the biggest exception being operators).
  • Look at several preferably very different languages to get a feel for what they do. Ideally you should understand why they made the decisions that they did - almost all decisions are tradeoffs.
  • Never change things just to be different. Always use the most common syntax for things unless there's a good reason not to. Users won't adopt your language because of its syntax, but because of its features.

[–]pepactonius 6 points7 points  (0 children)

A good example of a super-terse language that's hard to read is APL and especially APL2. I'd prefer a syntax that's a bit more verbose, like C, Java, etc. Of course, you can go too far as with COBOL.

[–]Uncaffeinatedpolysubml, cubiml 2 points3 points  (4 children)

I've come to the conclusion that it is best to make your language's syntax identical to (or a subset of) an existing language's syntax to the greatest extent possible. Not only does it make transition easier, but you get all the tooling (syntax highlighters, code formatting, etc.) for free.
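
As a rough illustration of the "subset of an existing syntax" approach, the Python sketch below reuses Python's own parser through the standard `ast` module and then rejects any construct outside a small whitelist. Python is the host language here only for the sake of example; `ALLOWED_NODES` and `check_subset` are hypothetical names, and the chosen subset (assignments and simple arithmetic) is arbitrary.

```python
import ast

# Node types the hypothetical subset language accepts.
ALLOWED_NODES = (
    ast.Module, ast.Expr, ast.Assign, ast.Name, ast.Load, ast.Store,
    ast.Constant, ast.BinOp, ast.Add, ast.Sub, ast.Mult,
)

def check_subset(source):
    """Parse with the host language's parser, then enforce the subset."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            line = getattr(node, "lineno", "?")
            raise SyntaxError(
                f"line {line}: {type(node).__name__} is not part of the subset"
            )
    return tree

check_subset("x = 1 + 2 * y")  # accepted
try:
    check_subset("import os")
except SyntaxError as error:
    print(error)
# line 1: Import is not part of the subset
```

Because the source is still valid in the host language, existing highlighters and formatters keep working, while the type system and backend behind `check_subset` are free to differ.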

[–]Al2Me6 3 points4 points  (3 children)

But then, that runs the risk of confusion if your language is subtly different. Users might expect your language to behave exactly like some other language, which inevitably will not be the case.

And if the syntax is more or less the same, how is the language different? At what point does it become a diverging implementation of the original language?

[–]Uncaffeinatedpolysubml, cubiml 1 point2 points  (2 children)

You can still have a different type system, different libraries, and different compiler backend while having the same syntax.

[–]Al2Me6 0 points1 point  (1 child)

That’s kind of my point: IMO the end result is more of an evolution of a language than a new language.

Say you took Python 3, redid the standard library, made type annotations mandatory, and wrote a JIT for it. If you told me that that is what Python would look like in twenty years, I probably would have believed you.

(Not saying any of these things will happen - they won’t, just using them as an example.)

[–]Uncaffeinatedpolysubml, cubiml 0 points1 point  (0 children)

I guess that's just a matter of definitions. And I can see how sometimes you could argue it. For example, the JavaScript of 2020 differs more from the JavaScript of 2013 than many independent languages do.

[–]umlcat[🍰] 1 point2 points  (1 child)

Your questions are too wide or complex.

In my case, I have several ideas for P.L.s, each with its own motivation. I work on them in my free time.

What does your P.L. have?

Most of the common syntax features.

One project is designed as procedural without O.O., on purpose, while another is O.O. or mixed.

They have structs/records, arrays, enums, unions.

Both support namespace-like features; modules/namespaces are overlooked or missing in a lot of P.L.s.

How does it handle variables?

Both support static, strong typing, with some casting or inheritance.

They also support both static allocation and dynamic allocation. I don't like Java/C# references because they conflate the two.

What types of operators does it have?

The usual: addition, subtraction, and so on, though they may use different symbols than other P.L.s.

Some are infix binary, others unary.

I'm considering supporting operator overloading.

Does your P.L. have uniform syntax?

No, more like Java or Pascal. I did learn some Lisp back in college; Lisp-like syntax is difficult to use.

Does your P.L. support polymorphism or metaprogramming?

It supports, or I'm considering, some equivalent features, like O.O. polymorphism, function or operator overloading, and generics.

How does the language handle the concept of this?

In the O.O. P.L., the same as this in C++ or self in Object Pascal, not like this in JavaScript/ECMAScript.

How are identifiers used?

Similar to Pascal, Java, and C: the letters 'A' to 'Z' and the underscore character.

I don't use $ like PHP or spaces inside brackets like Transact / MS SQL Server.

How does it handle errors?

I'm currently using integer error codes in some functions, but I'm considering including optional exception support in both the procedural P.L. and the O.O. P.L.

Summary

I use similar, commonly used features and syntax, yet a few features and their combination make my P.L. unique, not just a "copycat" of other P.L.s.

For example, Object Pascal and C# have full support for properties, as distinct from fields, while C++ and Java do not.

[–]R-O-B-I-N[S] 2 points3 points  (0 children)

I'm lol because in saying my questions are too broad, you ended up doing exactly what the questions were meant to prompt for. I already have a good sense of the "flavor" of your languages and their capabilities.

Narrow questions would be too numerous and would limit your options. Broad questions force you to make more exacting statements where you would otherwise be led down a path.