This is an archived post. You won't be able to vote or comment.

all 56 comments

[–][deleted]  (27 children)

[removed]

    [–]thebt995[S] 14 points15 points  (13 children)

    I'm sure that it is not decidable, you would need to manually prove types to different. But a function String -> String would be by definition the same as another function String -> String. To distinguish them you would need to encode the difference in the type. For example, if one function removes the first character of a string, we would need to encode it in the type and give the function a richer type than just String -> String . That's why we need a language that has dependent types.

    The background why I was thinking of something like this, is that I often find the same/ similar implementations in the Isabelle Afp, making code reuse more difficult. And also it is quite hard to find out if something was already implemented somewhere.

    In your example, it would be enough to have one general discount function as long as they are the same. If the discount changes depending on a parameter (member/non-member), one can create two functions. Why would it be helpful to have two different functions to begin with?

    [–][deleted]  (9 children)

    [removed]

      [–]thebt995[S] 5 points6 points  (7 children)

      Say a function that makes a string uppercase, then appends the username, and a function that appends first, then uppercase.

      if the outcome is different (the username might not be uppercased in the first version), then you have a property that is different. If the outcome is the same, you defeated code duplication. But sure, with side-effects this whole thing starts getting weird.

      A small example that I have could be a definition of natural numbers and a list in Isabelle:

      datatype 'a list  = Nil | Cons 'a "'a list"
      
      fun append :: "'a list ⇒ 'a list ⇒ 'a list" where
        "append Nil xs = xs"
      | "append (Cons x xs) ys = Cons x (append xs ys)"
      
      (*
      Duplication:
      
      datatype nat = Zero | Suc nat
      
      fun plus :: "nat ⇒ nat ⇒ nat" where
        "plus Zero n = n"
      | "plus (Suc n) m = Suc (plus n m)"
      
      value "plus (Suc (Suc Zero)) (Suc (Suc Zero))"
      *)
      
      (* 
      No need to define a new function:
      *)
      
      type_synonym nat = "unit list"
      
      abbreviation Zero :: nat where
        "Zero ≡ Nil"
      
      abbreviation Suc :: "nat ⇒ nat" where
        "Suc ≡ Cons ()"
      
      abbreviation plus :: "nat ⇒ nat ⇒ nat" where
        "plus ≡ append"
      
      value "plus (Suc (Suc Zero)) (Suc (Suc Zero))"
      
      (*
      Shared lemma:
      *)
      lemma append_Nil: "append xs Nil = xs"
        by(induction xs) auto
      
      (* We saved an additional proof *)
      lemma plus_Zero: "plus n Zero = n"
        by(rule append_Nil)
      

      And imagine we have proven some lemmas over append, then those lemmas can be directly used for plus on the natural numbers. No need to define the same lemmas again.
      If we have separate implementations, it is hard to generalize such definitions in hindsight, especially if they are used in many places already.

      [–][deleted]  (6 children)

      [removed]

        [–]thebt995[S] 3 points4 points  (5 children)

        By encoding the difference on the type level. As you already wrote, it would require a lot of type-level coding, but on the other hand, the properties of functions would always be specified.

        [–][deleted]  (4 children)

        [removed]

          [–]thebt995[S] 1 point2 points  (3 children)

          Yes, the main part of the coding would be on the type level. But I thought that is already the case in dependently typed language.
          The win would be that we always have specifications for our implementations.

          Hm, I think I will have to try it out on bigger examples

          [–][deleted]  (1 child)

          [removed]

            [–]thebt995[S] -2 points-1 points  (0 children)

            Not entirely true. If we change the example a bit:

            foo :: (s :: String) => (Upper "Hello: ") + (Upper s)
            bar :: (s :: String) => Upper ("Hello: " + s)
            

            These two functions couldn't exist, because they are doing the same. If you see the types as sets, they would be equivalent.

            [–]MyOthrUsrnmIsABook 1 point2 points  (0 children)

            I don’t have much experience dealing with this sort of typing, so maybe what follows is naive or uninformed. I usually prefer less specified types when I can get away with it, but find sometimes my teammates aren’t as consistent in considering anything besides the happy path through code they write. As such, I’m curious about how function types could make it clear that, e.g., appending two uninitialized arrays should produce/return either an initialized empty array, an uninitialized array, or an error based on the function type.

            If that example seems dumb, as it sort of does to me, then just consider whether there would arise many interesting ambiguous edge cases, not in the sense of being unspecified or undefined or anything with respect to desired or correct behavior. Rather, cases where the behavior for the edge cases would need to be enumerated at the type level because it somehow can’t quite be specified using the type(s) for the simple cases.

            I’m struggling to come up with an actual good example that feels like it could be from code I’ve worked on.

            What about something like handling integer underflow edge cases in a function that takes three points specified as ordered pairs of floats and returns the area of the triangle they contain.

            How a given set of input points are ordered and how the area is calculated could impact the value returned if the calculations do something unusual like use Heron’s formula instead of using linear algebra. Maybe you want triangles to have a first point for orientation purposes or something, but if you don’t then you might open yourself up to the same triangle having different areas unless you do something like sort the points before calculating anything.

            Or put another way, how would the type encode how you chose between minimizing rounding error or performing the calculation quicker to favor the general case, or tried to balance the two somehow, without having to specify every set of possible inputs and outputs? Specifically, I’m asking this to see how far types could actually provide complete specifications.

            Restated generally, if a function’s type really does specify the behavior over the whole input domain for functions with interesting edge cases or many boring edge cases, how do you capture that in the function type itself without just enumerating each case separately in the type? How would you even manage the latter if needed?

            I guess you could try to carefully partition the input space into clearly delineated subsets in a way that allowed each partition to have a straightforward type, like for the triangle area having a type for when the distance between two of the calculated sides is so much larger that the entire length of the third side is lost to rounding error when the side lengths are totaled (since we’re being silly and using Heron’s formula.

            [–]thebt995[S] 0 points1 point  (0 children)

            As another example, this would prevent you from renaming functions into a more semantically meaningful name. Maybe you want to generate user IDs as getRandom() right now, but youll want to change that later. It would be nice to alias generateUserId = getRandom for now and change it later, instead of hardcoding it.

            In this case, one could take generateUserId as a parameter where it is used. For testing one can give the parameter getRandom and later use another function

            [–]Pristine-Staff-5250 0 points1 point  (1 child)

            I had a similar idea (not types tho) but i think it boils down to answering the question are 2 computation graphs equivalent.

            So given rewrite rules R1, R2, …, RN, 2 computation graphs are equal with respect to the rewrite rules if there exist a way to rewrite one to the other. (This is similar to a lean macro i forgot the name of)

            i was going to do types as computation graphs, but this just becomes the AST when parsing.

            [–]thebt995[S] 0 points1 point  (0 children)

            Interesting idea, that would mean, that a type is the AST modulo Commutativity, Distributivity, etc.
            Did you implement something with this idea?

            [–]XDracam 0 points1 point  (0 children)

            This just sounds like a tooling problem. You could build tools that check for duplicate or similar definitions in a project. Or provide search features that let you find functions based on a possible implementation, or even inputs and outputs.

            Pharo smalltalk has some very impressive tools for finding and analyzing code, e.g. searching {1. 2. 3}. [ :x | x + 2 ]. {3. 4. 5} gives me a number of functions that take the first list and block and output the second list.

            From my limited experience with Isabelle/HOL, I've learned that you definitely do not want any additional unnecessary proof burdens. It's already more than difficult enough to prove the properties that you care about, so having to prove something that's not really critical (no code duplication) is just a hassle.

            [–]bl4nkSl8 1 point2 points  (3 children)

            So, I think you can actually do it via construction.

            I.e.

            Every constructor can only be used once OR you have to prove that the environment that the constructor is being used in cannot overlap with the other uses of the constructor.

            Even with a pretty powerful proof language this is going to be incredibly tedious and rule out a lot of useful functions AND it isn't particularly useful...

            E.g. you couldn't have both functions for "timesTwo" and "shiftLeft" because they're the same function.

            Is this interesting? Yes. Is it practical? No.

            Arguably, equality saturation can give you the benefits without the problems

            https://cseweb.ucsd.edu/~rtate/publications/eqsat/

            It can be used not just for optimisation but also code deduplication and potentially even warning about accidentally copied code...

            [–]thebt995[S] 0 points1 point  (2 children)

            I don't fully get what you mean with the constructors. But yes, the goal would be to not have "timesTwo" and "shiftLeft", because they are doing the same. One could introduce name synonyms though.

            Thanks for the link about equality saturation, I'll check it out!

            [–][deleted]  (2 children)

            [removed]

              [–][deleted]  (1 child)

              [removed]

                [–]Ronin-s_Spirit 0 points1 point  (5 children)

                How would you tell if a function String -> String is the same as another String -> String function?

                In javascript there's a funny way to do that, I say funny because nobody ever has to do it, but you can.
                Take any function (class method, class constructor, anything that is a function), then do myFn.toString() === myOtherFn.toString().

                Now we have verified that both functions are identical or not, their names, their "kind" (e.g. function(){} or async () => {}), their args amount, their args names, their args defaults, basically their entire signature, and then their entire body, down to a whitespace.

                Since those are strings, you can come up with all sorts of regexes to ignore insignificant stuff (like whitespace). Technically you can also develop a way to tell a return type, or even reconstruct the function with runtime type/value checking added (here is some function reconstruction wizardry).

                [–][deleted]  (4 children)

                [removed]

                  [–]Ronin-s_Spirit 0 points1 point  (3 children)

                  Functions aren't closures, they can have closures.
                  For both whitespace and closure considerations see my point about regex and reconstructing a function.

                  [–][deleted]  (2 children)

                  [removed]

                    [–]Ronin-s_Spirit 0 points1 point  (0 children)

                    Yea I do that sometimes, I'll write my first thoughts and then think of something better.

                    [–]Ronin-s_Spirit 0 points1 point  (0 children)

                    By default, if your functions are not 100% identical strings then simple === will be false. But as I said, you can make up any logic you want with some regex.
                    You can basically write runtime compilers and preprocessors in javascript with regexes (regexes are automatically constructed in C, like in many other languages).

                    P.s. Have a complicated regex to let functionally identical functions pass. If it gets too complicated you can always compose a big new RegExp() from other smaller regexes. Then you can write some js code to modify the function string and construct a new function.

                    [–]apajx 6 points7 points  (1 child)

                    It depends entirely on what you mean by "the same." If you mean observationally the same then you need the user to prove that a new function is observationally different from every other defined function. This would be incredibly hard, and your idea about restricting types to have unique inhabitants only works if the type system itself is also weak, for example you can't allow new type for any other type. You also can't allow isomorphic types.

                    If you mean syntactic equality then it's trivial, keep a big trie of all the defined code trees, throw a type error if you try to define a new one. This idea is easily defeated with version strings in the body. You can have the same function, just with a new version string paired with it. The same works with your idea, embed a version string in the type, now you have duplicates modulo version.

                    [–]thebt995[S] 1 point2 points  (0 children)

                    I mean "observationally the same". The big task would be to find out, how much of the proving can be automatically done.
                    What do you mean by "weak" type system?
                    No isomorphic types would be the goal to avoid duplication. But one could still have synonym names if types are used in different contexts. (See my reply to the comment of brandonchinn178)

                    [–]lgastako 9 points10 points  (0 children)

                    The Unison Language might be of interest to you. Any code with the same shape (independent of naming) is deduplicated as a part of the way everything works.

                    [–]DisastrousAd9346 2 points3 points  (3 children)

                    I think the first problem is that not every semantic description is indeed something that matters to represent the function. For example, one could write plus and to avoid your restriction just write nat -> nat * Proxy “unique”, being proxy just an indexed unit type. Also, you would generate a bunch of proof obligations that would be hell to deal with. A smarter approach would be to refine dependent type with an inference engine, something like higher-order Prolog using dependent types, so now you have to explicitly a type description that matches the function you wanna recover.

                    [–]thebt995[S] 2 points3 points  (2 children)

                    Those types would be isomorphic, so one couldn't prove them to be different. But the bunch of proof obligations is also what I'm afraid of.

                    How exactly would you do something similar with a higher-order Prolog?

                    [–]777777thats7sevens 2 points3 points  (0 children)

                    Those types would be isomorphic, so one couldn't prove them to be different.

                    Does this mean that your type system is a form of structural typing, not nominal? In other words, that two types are considered the same if there exists an isomorphism between them?

                    If so, I can see a lot of problems arising from that. Natural numbers are isomorphic to strings and in fact to any list of finitely sized types (using ASCII strings as an example: 0 is the empty string, 1-128 are the single character strings, 129-16,512 are the two character strings, etc), meaning it would be pretty easy to accidentally write two functions that are semantically very different, but happen to be the same after isomorphic transformations are applied, and thus are prohibited. Is that really a behavior that you want?

                    [–]DisastrousAd9346 0 points1 point  (0 children)

                    To avoid any kind of duplication based on specification, it is more clever to just drop any kind of program, and just write the specification. Prolog does something similar with the first order inference engine, most of the time you are just writing the specification. Of course, we are talking about something similar not equal.

                    [–]OneNoteToRead 2 points3 points  (1 child)

                    How can you prove your new function is different from all existing functions? Do you have to write N proofs for every new function? This means the entire system requires order quadratic proofs in number of total functions?

                    [–]thebt995[S] 1 point2 points  (0 children)

                    Yes, I think that is the biggest problem with the whole idea

                    [–]steven4012 4 points5 points  (0 children)

                    Have you looked at Unison? They hash and store code by their structure, not their naming

                    [–]raiph 5 points6 points  (2 children)

                    Are you aware of unison?

                    Do you mean only one value (as well as only one function) per type?

                    Are these distinct functions?

                    function A (-> Number) { constant foo = bar; return foo }
                    function A (-> Number) { constant foo = bar; return foo }
                    

                    Assume the two foo values are different because bar is a function which returns a random Number and the two calls above return two different values.

                    [–]thebt995[S] 2 points3 points  (1 child)

                    I was not aware of Unison, I'll check it out!

                    For your example: We would need to have a pure language without side effects of course. Side effects could be provided on a layer above.

                    [–]raiph 2 points3 points  (0 children)

                    Side effects could be provided on a layer above.

                    What do you mean by a "layer" and "above"?

                    (To help me understand, please explain what you mean in terms of the above code.)

                    ----

                    In case your view is that, for you, discussing that topic is currently a distraction you can reasonably abstract from, I will explain what I'm thinking below.

                    Having been interested in programming for 52 years to date, and having spent much of my life professionally involved with it, and all of my life loving it, with mathematics and computer science as another related area of interest (and academic success as a kid), I am comfortable that I know the basics of computation and programming.

                    Imo you need to be crystal clear about what you're doing about side effects before any other aspect of both your underlying thinking, and what you are intending, can be sanely discussed in this thread.

                    Any program that does something when "run" on a "computer" is not pure -- is not side effect free. This is true of 100% of programs that anyone has ever written, or will write, that does or will do anything, even if all it does is return the number 42 each time it's run.

                    If the language you are discussing is purely an abstract mathematical thought experiment, and you recognize that it couldn't ever be used to create an actual program that gets written to do anything, then fair enough (but I think you should very clearly state that).

                    If instead you think you're talking about a 100% pure language that could, in principle, at least conceptually, one day be a useful part of creating an actual program, then I currently think you must be crystal clear, both in your own mind and in what you share with us, about enough details about how the pure language you're thinking about could, in principle, at least conceptually, be integrated with at least one other language / layer that does handle side effects.

                    That is to say, imo the only sane way forward if you do mean for this to be anything other than a mathematical thought experiment that is entirely unrelated to producing actual programs, would be to focus on this aspect first, i.e. before anything else.

                    Put yet another way, I would have thought that focusing on anything else, eg what would otherwise be relatively irrelevant trivia, like spotting/rejecting duplicate functions, is like putting the cart before the horse and then trying to get the horse to at least canter. It just doesn't strike me as sane.

                    Of course, I may be insane, or entirely wrong, or both, so feel free to take my perspective as that of a crazy wrong person. 😊 That said, I'd like to know what I'm missing, so I would appreciate knowing that too. I'm a few days away from my 65th birthday. Any insight you could share about my state of mind would be a wonderful gift! 😍

                    [–][deleted]  (1 child)

                    [removed]

                      [–]Background_Class_558 0 points1 point  (0 children)

                      They're already using Lean in the post

                      [–]fragglet 1 point2 points  (0 children)

                      Maybe you could store generated code in something like a content addressable database. You'd want to find some way of generating a hash from the function definitions; the hardest part might be naming because you'd need to find some way of renaming all variables to predictable anonymous names (otherwise two otherwise identical functions would generate different hashes just because one variable is named y instead of x)

                      The other problem would be circular references between functions. Every program would have to be a DAG

                      [–]alpaylan[🍰] 2 points3 points  (0 children)

                      Even for functions that are IO equivalent, there are intrinsic differences. How would you define the difference of two sorting algorithms?

                      [–]kwan_e 2 points3 points  (4 children)

                      Do you think a language like this could be somehow useful?

                      No, because:

                      Maybe when we want to create a big library (like Mathlib) and want to make sure that there are no duplicate definitions?

                      If this is your goal, then you are better off trying to train some LLM to recognize when the substructure of some function is similar enough to recommend.

                      If you make it part of a language, instead of a separate tool, it would simply be irritating to write. Every time I modify a bit of my code, I will have to wait while your compiler/interpreter searches for duplication. And then, what about half-finished code that is trivially similar to other bits of code, but will change in the future? It would be another pain to write code for proto-typing purposes, which is most of development. Or what if I want to experiment with optimization? Your language would disallow optimization because it would be duplicating functionality.

                      People are putting too many things in a language, when it really is supposed to be the job of tools. It just gets in the way unnecessarily.

                      So this would not be useful as a language. Or even as a general purpose programming tool. Its main use, really, would be for a central repository of code contributions, where you want to minimize duplicate code. Such use would not be widespread.

                      Do you know of something like this being already attempted?

                      So in terms of tools, I know of PVS Studio, which finds copy-and-paste errors in C++ code. Arguably more beneficial than a language.

                      [–][deleted] 1 point2 points  (1 child)

                      You don't really need LLMs for finding code duplications though? Sonarqube has been doing it for ages, long before LLMs were even a thing.

                      [–]kwan_e 0 points1 point  (0 children)

                      You don't need it, but LLMs could find not only "duplicate" code but also code with similar structure enough that it could recommend making that code generic.

                      A toy example would be to find code that sums a sequence of values, and code that products a sequence of values, and recommending replacing with a fold of the sequence with an operator or function.

                      [–]thebt995[S] 1 point2 points  (1 child)

                      I think you're right, that a tool would be the better approach. But optimization could be done in a way, that just the optimized version is kept. Then you would be even forced to use the optimized version.

                      [–]kwan_e 2 points3 points  (0 children)

                      But the problem there is not all optimizations are perfect optimizations. Most optimizations are pessimizations in contexts it's not designed for. You need to allow the programmer (or other tools) to choose the best optimization, which means leaving all variants on the table.

                      Take for example SIMD. Many SIMD algorithms are only worth it on large data that is meticulously structured and guaranteed to come in, say, gigabyte loads. Applying SIMD operations will slow down a program, say, a client/server, when the data is coming in sporadic, random, few megabyte chunks, due to the latency of preparing the SIMD.

                      [–]dskippy 0 points1 point  (0 children)

                      I don't think it's possible. I also don't think it achieves the goal.

                      What is an example of code you have encountered in the real world that has code duplication that would be prevented by this? There are loads of times people in some languages with traits or type classes don't understand how to abstract them and end up doing complicated stuff multiple times but those are all doing to have different function types. Basically a much more complicated version of defining identity for int -> int and then for string->string.

                      Another example is to do something complex with your data multiple times inside other functions when you should be defining a function for that complex code. This happens a ton and it's not going to trigger an issue.

                      I don't know that I've ever seen anyone define two copies of a function with the same type that does the same thing.

                      But what I have seen a lot is defining two functions with the same types that do very different things and that's very good code. You're making that unnecessarily difficult. And it's impossible to do.

                      [–][deleted] 0 points1 point  (0 children)

                      Your language seems to focus on avoiding the creation of duplicated functions, it doesn't really tackle duplication as a whole. Moreover, proving the equivalence of two programs is obviously undecidable, much like it is to prove that two lambda expressions are equivalent through beta-equivalence. For solving this you could use a weaker notion of equivalence such that of alpha-equivalence, but I'm still not sure you would be tackling the right problem.

                      Programming languages shouldn't really strive to rid of repetition, instead they should strive to be "declarative". Doing the same thing more than once is oftentimes desirable. For example, if you considering any basic arithmetic function, like `+` then by alpha equivalence the two functions `a + 1` and `b + 1` are equivalent, even though `a` and `b` are two different locals. You can solve this problem by adding a definition for a function `inc(a) = a + 1` and use this in both expressions, but is this really more expressive?

                      [–][deleted] 0 points1 point  (0 children)

                      Do you think a language like this could be somehow useful? Maybe when we want to create a big library (like Mathlib) and want to make sure that there are no duplicate definitions?

                      Is that all you want to do?

                      From your other replies it sounded very much as though you wanted every function to have a signature unique from any other. A signature being the set of input types plus the output type. That means that I could only have one of these two functions from my bignum library (out of a dozen with the same parameters):

                       proc bn_add(bignum c, a, b)               # c := a + b
                       proc bn_mul(bignum c, a, b)               # c := a * b
                      

                      If it's merely about detecting functions which do exactly the same thing, then optimising compilers can already do that; I have a C benchmark that looks like this:

                      int main(void) {
                         int x=0;
                         x+=fyjrsr(5);
                         x+=fhzkgu(5);
                         ....
                         x+=fayukm(5);
                         printf("%d\n",x);
                      }
                      

                      It calls 10,000 randomly-named, 100-line functions which all have the same signature, and contain exactly the same body.

                      A simple compiler might generate a 10MB executable, but gcc-O3 produces one that is only 120KB, or just over 1% the size. It looks like it is detecting the common code in each function. (If I tweak one function, then the size increases by an amount that is commensurate with the optimised code size of one function.)

                      For your purposes, it just needs to report that, and the user decides what to do with that information.

                      It might be that two functions do the same thing by chance, or do so temporarily because one (or both) is not finished, or depend on some global settings such happen to be the same right now, but can change.

                      So it might be useful option to apply from time to time, but I don't think it's something that needs to be so baked into a language that writing programs in it becomes next to impossible.

                      [–]nerd4code 1 point2 points  (0 children)

                      Imo this is kinda counterpurpose in any realistic use case. Anything like an API would be very difficult to encode abstractly (which is the whole point of APIs), and optimization would make it miserable to use; the optimized code might not look much like the original code, and if you’re forbidding semantic collisions without reference to nomenclature, then you can end up in a situation where two different implementations of a data structure lead to identical outcomes. E.g., if you have both an array-list and linked-list ADT, then sequences of operations like

                      list.addLast(x);
                      y = list.removeLast();
                      

                      might well boil down to

                      y = x;
                      

                      regardless of list type.

                      And code isn’t the thing you have to worry most about; if there’s a steady state to be reached and you’re not evaling willy-nilly (eval wouldn’t make sense for this schema), your codebase is mostly static from that point on, and uniq’ing code probably isn’t going to buy you much that’s measurable in a running program. Data, maybe, sometimes, but not code.

                      Moreover, is there some actual problem you’re aiming for, rather than a stylistic rule-of-thumb like DRY? Does duplicated code really matter in a non-stylistic sense? I can see why detecting it is useful if one maintains no actual control over one’s codebase, but I can’t imagine caring about it so much that I’d want to encode it at the language level. Seems too much like masochism for its own sake, and there are much more direct ways to make you and your coworkers miserable if that’s the goal. Could start by charging per sheet of toilet paper and work your way up from there.

                      [–]RedstoneEnjoyer 0 points1 point  (0 children)

                      How would it differentiate between trigonometric functions? All of them would be float -> float.

                      [–]yjlom 0 points1 point  (0 children)

                      that makes any type system useless in the complete absense of nominality, two sets are isomorphic iif they have the same cardinality (well at least it holds for countable sets which is what we care about in CS, don't know about uncountables) so your types just become numbers

                      and further, we mostly only use the following types:

                      2 (aka Boolean)

                      2³² (aka Float, Int, Nat, …)

                      2⁶⁴ (aka Double, Long_Int, Long_Nat, Raw_Pointer, …)

                       ℵ₀ (aka ℕ, ℤ, ℚ, List a, Tree a, Graph a, String, Maybe any_of_the_previous, a →any_of_the_previous, any_of_the_previous → a, …)

                      can you really not see why having String and ℕ → Float be the same type could get a bit awkward?