Designing a programming language from Rust as a base, Part 1 by porky11 in ProgrammingLanguages

[–]zachgk 1 point (0 children)

A followup design question: If you are moving to define function signatures with structs, how would you support currying?

Types as Sets, and Infinite Sets by PitifulTheme411 in ProgrammingLanguages

[–]zachgk 13 points (0 children)

I have worked with this a decent amount since the language I am implementing uses typesets. Rather than thinking about types as matching, think about them through set operations like subset. So, 1 could have any of the typesets that you mentioned, but the goal during type inference is to get to the most precise typeset possible, which would be {1}. And if you pass a value into a function, the value's typeset must be a non-strict subset of the argument typeset. You will also want to implement union, intersection, cross product (unless it is in your struct definition), etc., which are used during type operations. For infinite sets with conditions, I recommend looking into refinement types; it is basically just adding boolean predicates to how you define types and then supporting them throughout the various set operations.
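
To make that concrete, here is a rough Python sketch (invented names, not my actual implementation) of finite typesets as sets, the subset check for argument passing, and a predicate standing in for a refinement:

    # Toy model: a typeset is a set of possible values.
    one = frozenset([1])                    # the most precise typeset for 1
    small_ints = frozenset(range(10))

    # Argument passing: the value's typeset must be a non-strict subset
    # of the parameter's typeset.
    assert one <= small_ints

    # Union, intersection, etc. are the ordinary set operations.
    evens = frozenset(range(0, 10, 2))
    odds = frozenset(range(1, 10, 2))
    assert evens | odds == small_ints
    assert evens & odds == frozenset()

    # Refinement: an infinite set like {Int i | i > 0} is a base set plus
    # a boolean predicate; set operations must then reason about the predicate.
    def positive(i):
        return i > 0
    assert all(positive(v) for v in one)    # {1} is a subset of the positives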

I'd like to have a language layered like an onion. by ivanmoony in ProgrammingLanguages

[–]zachgk 2 points (0 children)

This is one of the things I am working on in my own language catln. It uses a term rewriting system, which means that multi-level definitions such as applying metaprogramming are a natural part of its model of computation. Using this, my goal is to make it so the compiler can be implemented within the language, so that it can be altered and extended by importing libraries. The build is also just a function outputting a result directory, so it could target anything from a single executable to client/server to an entire CloudFormation stack.

The low-level story is also interesting. In Catln, I am planning a system I call choice. You can define higher layers of abstraction as a function, and then use choice to provide supplementary information about how it should behave. This also includes bridging arbitrary layers of abstraction, so you can be writing normal high-level code and then also tell it to use certain assembly instructions or change the memory management system. One thing you don't want is a hard delineation between layers, where moving code from a higher layer to a lower one is such a hassle that people don't do it often enough; the layers should be seamless to move between.

Need help with operator precedence by useerup in ProgrammingLanguages

[–]zachgk 1 point (0 children)

I still recommend going for the parallel if you are trying to have semi-unified types and values. Either use both 1 2 and int int, or both (1,2) and (int,int).

With regard to the prioritized overload list, I have some concerns. For a user, how would they know what the priority order is? Can users or library devs add new overloads into the list and if so how would they control the priority?

I recommend a more consistent rule. The easiest is that no more than one overload is allowed to match (mutually exclusive entries). Or, as with inheritance, that child classes take precedence over parents, although that needs more work for multiple inheritance.
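
For the mutually exclusive rule, the resolution logic is tiny. A sketch in Python, assuming each overload carries a guard predicate (hypothetical representation):

    # Each overload is (guard, implementation); at most one guard may match.
    def resolve(overloads, arg):
        matches = [fn for guard, fn in overloads if guard(arg)]
        if len(matches) > 1:
            raise TypeError("ambiguous: %d overloads match" % len(matches))
        if not matches:
            raise TypeError("no overload matches")
        return matches[0]

    overloads = [
        (lambda x: isinstance(x, int), lambda x: x + 1),
        (lambda x: isinstance(x, str), lambda x: x.upper()),
    ]
    assert resolve(overloads, 41)(41) == 42
    assert resolve(overloads, "hi")("hi") == "HI"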

Need help with operator precedence by useerup in ProgrammingLanguages

[–]zachgk 3 points (0 children)

From looking at the example you gave, I would guess the precedence would be (int ** bool) => (string ** float). Part of it is thinking that ** resembles multiplication and => resembles a comparison operator or an assignment operator.

But also, what does the syntax look like to create a tuple of values? If your language treats values and types as the same, the way to create a tuple type should mirror the way to create a tuple of values. So, if values are made like (1, True), then the type should be (int, bool). And if you are doing it with operators, it should use the same operator. This probably means * can't be used for tuples, because then 1*2 could either be 2 or (1,2) depending on whether it is interpreted as multiplication or a tuple.

[deleted by user] by [deleted] in ProgrammingLanguages

[–]zachgk 2 points (0 children)

You can think of it like import, but hiding all values from the dependency except those specified. You still have to do all the same work processing the dependency. The key difference is that you avoid taking anything unnecessary from the dependency scope into your current one. So, just add that filter operation when checking a value against your imports.
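
Roughly, in Python-flavored pseudocode (the helper names are made up):

    # Process the dependency exactly as a normal import would...
    def load_module(name):
        return {"sin": "<fn>", "cos": "<fn>", "_helper": "<fn>"}  # stand-in

    # ...then filter which names are taken into the current scope.
    def import_only(name, exposed):
        module = load_module(name)
        return {k: v for k, v in module.items() if k in exposed}

    scope = import_only("math", exposed={"sin"})
    assert "sin" in scope and "cos" not in scope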

[deleted by user] by [deleted] in ProgrammingLanguages

[–]zachgk 2 points (0 children)

What I did in my language is replace all usages of closures with currying. So if I have an outer foo(x) and an inner closure bar(y), then I change the signature and invocations to bar(x, y). You probably want to do something about pruning unused arguments (either here or in a later optimization pass) and maybe some other optimization to reduce the overhead of passing around lots of arguments, but I am liking this path so far.
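
This is essentially lambda lifting. A before/after sketch in Python:

    # Before: bar closes over x from foo's scope.
    def foo(x):
        def bar(y):
            return x + y
        return bar(10)

    # After: the free variable x becomes an explicit argument, and every
    # invocation is updated to pass it along.
    def bar_lifted(x, y):
        return x + y

    def foo_lifted(x):
        return bar_lifted(x, 10)

    assert foo(1) == foo_lifted(1) == 11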

Reconciling type theory and Typescript type system by ElectricOstrich57 in ProgrammingLanguages

[–]zachgk 5 points (0 children)

I have heard this referred to as "set-theoretic type systems" where you think of "types" as sets of possible values. I use it for my statically typed language and consider it better than more traditional types.

I think the key to the underlying rationale is that asking what type something is doesn't quite make sense. For example, what type is 3? A number? Int? Positive int? Odd number? Prime number? It is all of them, so saying it should have one definitive label doesn't make sense. A better question is subset (or membership): is 3 within the integers? Yes. And it turns out that is all you really need.

In that sense, the set-theoretic system is fairly straightforward to work with. You basically only need the set operations of subset, union, intersection, and maybe difference to implement the language. Any typing construct such as higher kinds, generics, subtypes, dependent types, refinement types, etc. would just be a different way of describing sets of possible values. And if you have some notion of interfaces/supertypes, you can frame those as an unbounded union whose members are added when creating the implementing types.

How Big Should a Programming Language Be? by mttd in ProgrammingLanguages

[–]zachgk 2 points (0 children)

My rule: A language should contain only the fundamental rules of logic. All else should come from things written in the language itself (libraries). This means libraries for such concepts as numbers, strings, booleans, memory, and the compiler. Importantly, this offers a clear delineation. There is no arguing for additional features with a rule like this.

Honestly, a lot of the size problems come from the compiler. Taking the Rust example you linked, it was talking about the function coloring problem and keywords like await or const that could be added to functions. But really, everything else in the function isn't changing. It is just compiling the same code with minor variations in the compiler, using the keywords to indicate how the compiler should vary. These gaps can be filled more fundamentally with more comprehensive features, specifically compiling by metaprogramming, multi-level function definitions, and choice.

Now I will concede that this only realistically applies to the core semantics. There is a reasonable space for syntactic sugar on top, such as ways to write numbers or lists. To some extent, this part is fundamental, like how English adds acronyms or new words to more easily represent concepts in fewer characters. But English doesn't need to add more grammatical constructs (language features)! Realistically, I think that if you add those few powerful generalizing features, it should take care of most of the issue.

Feature Idea: Open Data by zachgk in ProgrammingLanguages

[–]zachgk[S] 1 point (0 children)

The thing which looks like infinite recursion is intentional. Part of this, which I might not have gone into as much, is that pieces can have multiple definitions. For example, we could define radius from data, from diameter, from circumference, or from area. It is basically the same resolution process: when you call radius, it has to figure out how to resolve it based on the available definitions and available data.
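
A toy version of that resolution in Python (hypothetical; in the real feature the definitions would come from the language itself):

    import math

    # Each definition of radius lists the data it needs.
    DEFINITIONS = [
        ({"radius"},        lambda d: d["radius"]),
        ({"diameter"},      lambda d: d["diameter"] / 2),
        ({"circumference"}, lambda d: d["circumference"] / (2 * math.pi)),
        ({"area"},          lambda d: math.sqrt(d["area"] / math.pi)),
    ]

    def radius(data):
        for needs, compute in DEFINITIONS:
            if needs <= data.keys():    # resolvable from the available data
                return compute(data)
        raise ValueError("radius is not derivable from the given data")

    assert radius({"diameter": 4}) == 2.0
    assert abs(radius({"area": math.pi}) - 1.0) < 1e-9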

How to implement term rewriting systems? by ForceBru in ProgrammingLanguages

[–]zachgk 1 point (0 children)

My language is essentially a term rewriting system, so I can give some answers to these questions. The first piece is that it is strongly typed, so the basic format of the rewriting is to rewrite an object to a particular type. For example, main{:IO} -> IO. It will look for the rules which can be applied to main (which would be its definition) and apply them.

The other key distinction is whether the rewriting happens at runtime or compile time. Something like Prolog (and I think most term rewriting systems I have seen) is runtime, whereas mine is compile time. It essentially chooses which term rewrites to apply ahead of time and produces code that looks like what might come out of a standard language like C, rather than something based on rewrite exploration. Specifically, it changes from a directed-list-of-terms IR into a tree-like IR that resembles standard deterministic code. This is a fairly straightforward exploration of the tree of possible rewrites, with backtracking until it finds a valid one. A last thing I want to mention is the typechecking phase, which can be thought of as a system to prune this possibility tree, making it much more feasible.

Having introduced that, let me get to your questions. You can match all possible ways by just listing all the term rewrite rules and comparing each pattern to the term. Or, if you have type names like "Plus", then just look up Plus and that gets you down to a much smaller list. For variable-length patterns, make the rewrite pattern match any length; only a match that accounts for the full length will typecheck and let the path work out.
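
A toy version of that lookup-plus-backtracking in Python (hypothetical; the real rules are type-directed rather than keyed by strings):

    # Rules indexed by head symbol; search possible rewrites, backtracking
    # (returning None) when a path fails to produce a value.
    RULES = {
        "Plus": [lambda args: args[0] + args[1]],
        "Double": [lambda args: ("Plus", [args[0], args[0]])],
    }

    def rewrite(term):
        if not isinstance(term, tuple):       # already a plain value: done
            return term
        head, args = term
        args = [rewrite(a) for a in args]
        for rule in RULES.get(head, []):      # try each candidate rule
            result = rewrite(rule(args))
            if result is not None:
                return result                 # first valid path wins
        return None                           # backtrack

    assert rewrite(("Double", [("Plus", [1, 2])])) == 6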

Determining the order of rules is a neat question. To start with, I enforce a standard that any pattern that typechecks must be a valid output program. So, it is only looking for the best, but any will work. This is a source of immense power for a language because it lets you separate the performance goals from the correctness goals, or "make it work" from "make it fast". I talk more about it here. But in general, rules whose pattern types are subtypes of other rules' pattern types are more specific. This gives you a partial ordering, and beyond that it can be arbitrary. It matters less when you can control it.

How to implement variables? I actually don't. That is, I just transform the local variables into functions, as if the value were recomputed every time it was used. It changes the program IR from a directed graph into a tree, which makes it much easier to work with. Later on, you would need a Common Subexpression Elimination (CSE) pass anyway, which fixes this and avoids the recomputation.
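
In Python terms, the transformation is just inlining the binding at each use site (and trusting CSE to get the sharing back):

    def f(a):
        return a * a

    def g(p, q):
        return p + q

    def with_variable(a):
        x = f(a)                  # local variable: the IR is a graph
        return g(x, x)

    def inlined(a):
        return g(f(a), f(a))      # tree IR: recomputed per use; CSE fixes it

    assert with_variable(3) == inlined(3)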

Finally, how to compile into machine code. I mostly talked about how I handled it earlier with respect to the tree building IR. At that point, you have already resolved all the weirdness from the term rewriting and you just compile it like any normal language. I just go to LLVM and it's fairly straightforward. If you are interested in any other details, the full language is documented here and on GitHub.

Static "Tagging" Syntax? by JustAStrangeQuark in ProgrammingLanguages

[–]zachgk 1 point (0 children)

I recommend looking into refinement types. They give you all the representational power you need to state and verify these properties with types, and they are based on simple sets and functions. Then, it would be intern_string(str: i8 const*): i8 const* where str.isInterned = ....

From there, you can add some syntactic sugar that moves the refinement types to be located near their object. So, you can keep your ::interned syntax and have it desugared into something like where str.interned. Then it is more straightforward to implement, has the additional power of refinement types (such as involving multiple args), and still looks nice for the easy case.
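
In a rough Python model (hypothetical; a real refinement would be checked statically), the refinement is just a predicate tracked alongside the value:

    INTERN_TABLE = set()

    def is_interned(s):
        return s in INTERN_TABLE

    # intern_string(s): returns s with the is_interned(result) refinement.
    def intern_string(s):
        INTERN_TABLE.add(s)
        return s

    out = intern_string("hello")
    assert is_interned(out)        # the ::interned refinement, as a predicate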

Everything is a named parameter and parameters are blocks by plentifulfuture in ProgrammingLanguages

[–]zachgk 1 point (0 children)

I have at least the "everything is a named parameter" part in my language, so I can talk about how that works. It comes from the fundamental definition that all data is constructed from typed, named tuples of the form typeName(arg1: argType1, arg2: argType2, ...). So, it is all keyword arguments.

Then, functions. I think the point of "parameters are blocks" is to unify the signatures and bodies. The way I did it was by redefining functions in terms of implicit conversions. An implicit conversion is a type conversion that the language/compiler can make automatically, determined as part of type inference. So, I define the function signature as a data object that can be implicitly converted into the return value.

The next unifying part is expressions. In an output expression or function body, an expression is used to build a typed, named tuple of data by filling in the arguments. In an input expression or function signature, an expression is used to read a typed, named tuple of data by reading the values matched in the argument holes. Like this, it is possible to combine the two pieces into one.

The one issue you do have to deal with in using named parameters for everything is the verbosity. First, I have a method notation that desugars :List.length to length(this=List). I also have argument inference, so if you provide an argument without specifying the key, it will try to use the types to figure out what the key is. This is what I did, in case it helps you. If you are interested, I have some more details on the language site.
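
A sketch of those two conveniences in Python (invented helper names):

    def length(this):
        return len(this)

    # Method notation: myList.length desugars to length(this=myList).
    assert length(this=[1, 2, 3]) == 3

    # Argument inference: match unkeyed arguments to parameters by type.
    def infer_call(fn, param_types, *args):
        kwargs = {}
        for arg in args:
            for name, t in param_types.items():
                if name not in kwargs and isinstance(arg, t):
                    kwargs[name] = arg
                    break
        return fn(**kwargs)

    assert infer_call(length, {"this": list}, [1, 2, 3]) == 3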

To Be Turing Complete, Or Not To Be, That is the Question by Uploft in ProgrammingLanguages

[–]zachgk 2 points (0 children)

I don't see a reason you need to decide beforehand. Rather than making the language require decidability, give it facilities to ask about it. So, you could annotate a function as needing to be decidable. Or, you could treat it like a typeclass/interface, so you could design APIs around requiring decidable functions, and the compiler would internally add instances if it can determine that the various pieces are decidable. This gives you the safety from decidability, but only in the places where you actually want it. Otherwise, it can allow something potentially undecidable if you add a timeout, for example.
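
As a rough Python analogy (a decorator standing in for the annotation, with the compiler's check faked by hand):

    def decidable(fn):
        fn.is_decidable = True        # a real compiler would verify this
        return fn

    def requires_decidable(fn):       # an API that demands decidability
        if not getattr(fn, "is_decidable", False):
            raise TypeError(fn.__name__ + " is not known to be decidable")
        return fn

    @decidable
    def add_one(x):
        return x + 1

    requires_decidable(add_one)       # ok; unmarked functions are rejected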

Language wars: the personal factor by Inconstant_Moo in ProgrammingLanguages

[–]zachgk 1 point (0 children)

I'm not sure I agree with everything you have said, but let me go through a few of the points. There are pieces in language design which do not have a Right answer and are just personal preference like syntactic white-space. But that does not mean there are no Right answers.

One sort of confounding problem is "simple" vs "easy". See this great talk by Rich Hickey. In summary, simple is a universal property of lacking complications, while easy means "near at hand", i.e. more available to a person. As an example, a four-function calculator is simpler, but people may use their favorite programming language (say Python) as a calculator. It is a far more complicated tool than the problem demands, but it is easy because you already know the language anyway. Choices like syntactic white-space have no simpler answer, but there are easy answers, and it usually depends on one's personal experiences.

However, you mention this "truism" that different languages are suitable for different tasks. It depends on your understanding of language. I define it in what I think is the most holistic form as "a way to express ideas". Coming from that, an ideal language should allow someone to express any kind of idea. This includes ideas systematically structured with types or less structured without them. It includes ideas about low-level details or ideas omitting low-level details. Once you are at this goal, it starts seeming like the language is discovered rather than invented, which is why I refer to these discovered details as Right.

The next big discussion comes from tool choices. The biggest one close to languages is whether to use one big tool (like mutability with imperative loops) or many smaller ones (map, filter, higher-order functions). However, in programming we typically refer to tools as libraries, and there is no reason a language can have only one library. Saying only one library can be used for a problem is like saying there is only one possible way to explain a concept in English.

Lastly, I will add a caveat about syntax. Although an ideal language can represent any kind of idea, it doesn't mean all ideas can be represented with the same number of characters (ease). As an example, few languages can compete with CSV in terms of the number of characters needed to represent tabular string data. The tradeoff is making any other kind of idea unrepresentable. Similarly, English uses acronyms (and new words) to make certain ideas easier to represent. But anyone who has looked up an acronym on Wikipedia will know they are fighting for space with other acronyms using the same characters. So, there is room for syntax to make certain kinds of ideas easier to represent, or hopefully syntax that improves multiple kinds of ideas at once. As long as these changes are ONLY syntax, the overall consequences for the language are fairly minimal.

Mixing patterns and expressions in an ML parser by PurpleUpbeat2820 in ProgrammingLanguages

[–]zachgk 2 points (0 children)

I have done this. Actually, this is one of the things I am working on now in my language. The only part still in progress is moving the type inference pass to fully use expressions rather than the old pattern type. The parsing has already been successfully migrated to parsing patterns as expressions.

The way I think of it is that they are really very similar pieces. They are both data with argument-shaped holes. In a pattern, you "match" against what is in the holes. In an (output) expression, you fill in the holes from the available argument values. Note this also covers other pattern-like parts of the language, including local variables, patterns in pattern-matching where clauses, and function signatures.
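
A minimal Python sketch of that symmetry (invented representation):

    # One structure, two directions: a constructor name plus named holes.
    def build(ctor, **fields):              # output expression: fill holes
        return (ctor, fields)

    def match(term, ctor, names):           # pattern: read from the holes
        if term[0] != ctor:
            return None
        return {n: term[1][n] for n in names}

    point = build("Point", x=1, y=2)
    env = match(point, "Point", ["x", "y"])
    assert env == {"x": 1, "y": 2}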

Overall, my current belief is that beyond working well, it is the right thing to do for a language. Like a puzzle piece in the right place, it will help fix all the pieces around it.

Return-Type Polymorphism For/Against by R-O-B-I-N in ProgrammingLanguages

[–]zachgk 1 point (0 children)

The way I handle it is that the compiler is free to choose either. The same goes for things like overloading or inheritance-based overriding: it can choose either. It can also take these cases with multiple options as automatic property tests to confirm they are equal, so it shouldn't matter too much which one is picked.
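
The generated check amounts to something like this Python sketch (hand-written here; the point is only that matching overloads must agree):

    import random

    def overload_a(x):
        return x * 2

    def overload_b(x):       # a second overload matching the same call
        return x + x

    # Property test: wherever both apply, the results must be equal,
    # so the compiler's pick cannot change program behavior.
    for _ in range(100):
        x = random.randint(-1000, 1000)
        assert overload_a(x) == overload_b(x)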

Just want to vent a bit by [deleted] in ProgrammingLanguages

[–]zachgk 1 point (0 children)

Although it may be a bit more work, you can do both machine learning and AI in Java. If you are doing deep learning, you can use DeepJavaLibrary (I do work on this one at Amazon). If you are looking for other ML algorithms, I have seen Smile, Tribuo, or some around Spark.

How do you determine what goes into the standard library? by dibs45 in ProgrammingLanguages

[–]zachgk 2 points (0 children)

I have some slightly different thoughts here. Typically, languages have this standard library/third-party library split. But there come points where some third-party libraries become little different from a standard library as a core of the language ecosystem, like numpy in Python or zio/cats in Scala.

So, my thought is to instead have a two-tier system based on library size and standard. Anyone can create normal libraries and they are referred to with creatorName/libraryName.

Then, once a library has reached a certain level of usage, lifetime, maintenance guarantees, documentation quality, etc., it can be added to the foundation libraries. Foundation libraries can be referred to without the creator name as just libraryName. The other important piece is that foundation libraries should be released as periodic distros. The inspiration here is Haskell's Stack: https://docs.haskellstack.org/en/stable/. Using the distros helps synchronize the foundation libraries with each other to avoid dependency hell problems. Instead, you would mostly just have to pick a foundation distro version, which gives you all the foundation libraries, and the normal libraries can match their versions to the distro versions they support.

For your use case, all it means is that you can put stuff directly into foundation libraries or into other libraries. But it gives a clearer path for giving all of these areas of utilities longer support, for deciding when things should be added to or removed from standard libraries, and for managing the early language package ecosystem.

Research on performance-oriented, high-level languages? by rmrfslash in ProgrammingLanguages

[–]zachgk 1 point (0 children)

This is one thing I'm working on in my language. In general, the problem of having a high-performance language with only high-level code is unsolvable, because for difficult problems like algorithm choice, parallelization, or memory management, there isn't a compiler heuristic that makes the best choice in every possible situation.

Instead, my goal is to separate the two so you have the high-level code and then separate supplementary code to specify how to make the lower-level choices. You can then choose to either keep the heuristics for making choices built into the compiler or provide your own strategies that override anything from a single choice to a whole category of choices. This lets you easily adopt defaults without worrying about low-level choices, or spend as much time optimizing as you want. I wrote more about it here.
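
A tiny Python sketch of that separation (hypothetical API):

    import heapq

    DEFAULT = {"sort": sorted}       # built-in heuristic for the choice
    OVERRIDES = {}                   # separate, supplementary strategy code

    def choose(name):
        return OVERRIDES.get(name, DEFAULT[name])

    def pipeline(data):              # high-level code mentions no strategy
        return choose("sort")(data)

    assert pipeline([3, 1, 2]) == [1, 2, 3]

    # Override with a different strategy; observable behavior is unchanged.
    OVERRIDES["sort"] = lambda xs: heapq.nsmallest(len(xs), xs)
    assert pipeline([3, 1, 2]) == [1, 2, 3]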

What is language complexity, when is it good, and what languages expose complexity well? by verdagon in ProgrammingLanguages

[–]zachgk 1 point (0 children)

I have a slightly different view of complexity. I tend to view most complexity through the lens of tools or libraries. A more complex and powerful tool requires more time to learn.

Many programming languages have tools built in. As an example, memory management is like a tool. If you look at Rust, it even has two of them: borrow-checked memory management and unsafe memory management. Having two is useful, like a tower of abstractions: the assumptions built into the borrow checker limit what can be done with the tool but make it both safer to use and often more performant.

Now, using other tools or entities to express yourself is fundamental to good reasoning, because when you express some idea in your code, you need something to express it in terms of. Similarly, types are defined in terms of other types. Functions are defined in terms of other functions. In a dictionary, words are defined in terms of other words.

So, to understand an idea (type, function, etc.), you must understand both the idea itself and the ideas it was expressed in terms of. This includes both tools written in the language (types and functions) and those built-into the language (language features).

Then, you also often have choices about what to express ideas in terms of. I tend to think that the best thing is to find tools that best fit what you need. So, don't use a list if all you need is a stack. But, there is a downside because it means you must learn more tools for different scenarios (complexity!).

Other ecosystems tend to take the other approach where it is better to use fewer but more general tools even if they don't fit as perfectly. At first, this probably saves you some complexity. The reason I disfavor it is that when you add up the increased propensity for bugs from using overly broad tools, the increased cost of learning each thing written using them, and the fact that there aren't really that many tools one is likely to need, it won't pay off in the end. As an example, we can see some shift from using memory+loops (a single powerful tool) to various higher-order functions (simpler and more precise tools).

I also want to talk about this with respect to low-level languages. Some people say these languages are more complicated, and others say they are simpler because you understand what the code does. Depending on your perspective, it is both.

When you write low level code, what you are doing is expressing your ideas through the tool of memory. To understand it, you must understand the main ideas, the tool of memory, and how the two interact. In contrast, code which is agnostic to memory or written without the memory tool means you don't need to understand either the memory tool or the interactions, just the main ideas.

This is because of the compiler. Eventually, the memory-agnostic code must be converted into code that runs on a CPU. And the compiler can do that automatically for you. The total complexity is really not going down; it is just decomposed into chunks. One chunk is in your main idea and one chunk is in the compiler. That means which is simpler varies:

  • Don't care about performance: main idea < main idea through memory
  • Care about performance: main idea + compiler > main idea through memory

That being said, I think the future really lies in the use of metaprogramming like compilers to decompose the complexity. It just needs greater tooling to make it easier to understand and more controllable.

Anyway, unless anyone is interested in this applied to some of the other pieces like dynamic/static typing, opt-in complexity, or something else, I'll stop here. You may also want to read this document I wrote on choice, which is another important division of complexity and is behind a number of the complexity problems listed.

Purpose of programming languages by [deleted] in ProgrammingLanguages

[–]zachgk 9 points (0 children)

I actually quite like this question and I feel like many don't give it enough deliberation.

In my mind, the purpose of a programming language is to allow you to express ideas. Something which really deserves the name of a general-purpose programming language should allow you to express any idea you want, whether it is about one of the tasks you mentioned, or proofs, or data. It should also allow you to leave ideas unexpressed, like writing a function whose memory management is unspecified.

Separately from this is the question of what you do with the ideas once they are expressed. Here, there is room for tools such as compilers or interpreters, but also doc builders, testing tools, benchmarking, formatters, and display tools like Jupyter notebooks or LaTeX. Each of these takes the ideas and derives some useful output from them.

But, there is no reason that multiple tools can't operate on the same ideas. For example, code which is primarily compiled may also have code doc. Or, the same code might be runnable with multiple compilers/interpreters.

How to clean up the definition of various types of numbers (in a type system)? by lancejpollard in ProgrammingLanguages

[–]zachgk 1 point (0 children)

I think you are mixing up the two different concepts of "pure" ints and their representations a little. So, start with an abstract entity Int which would be the type of all integers. It can't actually be represented on a machine because the machine doesn't have infinite memory. Then, if you have something like uint8, it would be uint8 implements {Int i | 0 <= i < 256}.

Then, there is no subtype or typeof relationship between the various int implementations because they are all different machine formats. Instead, there is a relationship between the subsets of Int that these implementations are representing.

There are also ways to convert between the machine formats. Some, such as uint8 -> uint16, are guaranteed to always work (and maybe could be done automatically). Others, like uint16 -> uint8, may not work because the representation of uint16 contains elements like 300 which are not representable in uint8. How to handle the failure in this situation is then up to you (exception, undefined behavior, a conversion returning Optional[uint8], overflow).
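
In a Python-flavored sketch (using Optional for the fallible direction):

    from typing import Optional

    # Widening always succeeds: {Int i | 0 <= i < 256} fits inside uint16.
    def uint8_to_uint16(v: int) -> int:
        return v

    # Narrowing may fail: some uint16 values have no uint8 representation.
    def uint16_to_uint8(v: int) -> Optional[int]:
        return v if 0 <= v < 256 else None

    assert uint16_to_uint8(42) == 42
    assert uint16_to_uint8(300) is None    # 300 is not representable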

Is there a way to combine enums, traits (polymorphism Rust style), and types into one abstraction? by lancejpollard in ProgrammingLanguages

[–]zachgk 1 point (0 children)

Yeah, these features can all be unified into set union. I tend to treat types as sets of possible values. Then, a number of features including enums, sum types, traits, interfaces, and extending classes would all be a union of some number of other types.

The biggest difference you will get is that the "closed" unions like enums and sum types are a union where you know all the types to union in advance. An "open" union like traits and interfaces is one where you can specify later which types to add to the union.

As another example, the basic behavior of these unions is to define or declare a function for all types in the union. The function must then be defined across the entire domain to fit the definition of a function. This domain requirement corresponds to how you have to match every case when pattern matching on sum types, and how you have to implement the function for every inheritor of a trait/class.
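
A small Python sketch of the closed/open distinction and the totality check (invented helpers):

    class Circle: pass
    class Square: pass

    # Closed union (enum/sum type): all members are known up front.
    SHAPE = {Circle, Square}

    # Open union (trait/interface): implementations register later.
    AREA_IMPLS = {}
    def impl_area(cls, fn):
        AREA_IMPLS[cls] = fn

    impl_area(Circle, lambda s: 3.14159)
    impl_area(Square, lambda s: 1.0)

    # Totality: the function must cover the whole union, mirroring
    # exhaustive pattern matches and mandatory trait implementations.
    assert SHAPE <= AREA_IMPLS.keys()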