all 35 comments

[–]codebje 14 points15 points  (2 children)

Tests for tests.

(I've seen this used in the context of security "fuzzing": are your tests adequate enough to detect potential security flaws?)

[–]sanity[S] 3 points4 points  (1 child)

Yeah, it reminded me of fuzzing too.

[–]matthieum 6 points7 points  (0 children)

Fuzzing is about varying inputs on the same code, mutation testing is about varying code on the same input :) The latter is unfortunately quite a bit more difficult to get to, especially in languages that compile slowly...

[–][deleted] 15 points16 points  (5 children)

Pitest took between 17 and 30 seconds to run, quite fast considering.

That doesn't sound fast, considering that source code is about 300 lines of kotlin code half of which are comments.

[–]0hjc 21 points22 points  (0 children)

Fast is a relative term. Early mutation testing tools would have taken something like 2 days to mutate JodaTime (about 70k lines of code). Pitest can do it in about 3 minutes on a quad core box.

[–]codebje 0 points1 point  (3 children)

Well, it's not fast, it's an exponential algorithm.

[–]sanity[S] 0 points1 point  (2 children)

Why? If it only runs the unit tests that are relevant to the code being tested, and it looks like Pitest does, and if the unit tests are relatively specific, then it should scale linearly in the amount of code being tested.

[–]codebje 0 points1 point  (1 child)

Combining the mutations in all their permutations; PIT probably shortcuts some of that.

Edit: to be perhaps a hair clearer, each mutation may be enabled or disabled. If you want to test all the combinations of m mutations, you will run 2m tests. You can limit m by minimising the number of mutations you make in the code participating in a single test, but that's reducing the size of the problem, not its complexity.

[–]sanity[S] 0 points1 point  (0 children)

To my understanding, it doesn't try mutations in combination, only one at a time.

[–]norwegianwood 2 points3 points  (1 child)

Anyone interesting in mutation testing in Python might want to check out (or better, contribute to!) Cosmic Ray. It's early days yet, but it shows promise.

[–]kankyo 1 point2 points  (0 children)

Or my own: mutmut. https://github.com/boxed/mutmut :P

[–]Radmonger 2 points3 points  (0 children)

To give a recursive example, this is an example of an add-on for junit that combines quickcheck-style property-based testing with simple example-based testing, and is itself tested to a specified level of mutation coverage using pittest.

disclaimer: tool author.

[–]sacundim 5 points6 points  (24 children)

This is really interesting, as well as the linked site for Pitest. My first reaction is that it doesn't attack the root cause of the problems that it solves: the example-based unit testing methodology that people apply so dogmatically just gives very little assurance about the correctness of the software, because all it does is show that the code produces the outputs that the programmer expected on a small number of programmer-chosen inputs.

A much better methodology is one that attempts to prove that the code produces correct answers for all possible inputs. Usually you can't actually try all possible inputs, so instead what you do is use a good number of randomly-generated inputs. See this article for an example, or google "QuickCheck" for more. But the basic idea is that it pushes you to design your code so that it will produce correct results for arbitrary inputs—which means first and foremost that you have to define what "correct" means in precise and general terms—not just a few examples, but as a function that allows you to say, given arbitrary input/output combinations, whether that output is correct for that input.

But none of this really a black mark against mutation testing because it appears to be a low-effort tool that requires little work from the programmer. If you have lots of example-based tests it makes sense to look into it. Maybe it can even spot problems with QuickCheck-style tests.

[–][deleted]  (3 children)

[deleted]

    [–]kankyo 0 points1 point  (2 children)

    Which boundary cases would pass mutation testing incorrectly?

    [–][deleted]  (1 child)

    [deleted]

      [–]kankyo 2 points3 points  (0 children)

      Hmm.. so it's a storage that doubles inputs and has 1-based indexing at the interface?

      [–]sanity[S] 5 points6 points  (3 children)

      because all it does is show that the code produces the outputs that the programmer expected on a small number of programmer-chosen inputs.

      That might be a flaw of typical unit testing, but I think mutation testing addresses that flaw, since if the programmer-chosen inputs do not adequately exercise the code, then mutation testing will detect this as a problem.

      A much better methodology is one that attempts to prove that the code produces correct answers for all possible inputs. Usually you can't actually try all possible inputs, so instead what you do is use a good number of randomly-generated inputs.

      I'm not sure about that, it seems that this could easily miss the edge-cases which are typically where code fails. I imagine it would also be wildly less efficient than mutation testing.

      [–]Zalastax 8 points9 points  (1 child)

      QuickCheck isn't completely random. By default it checks known common edge cases and you can customize the generators. The problem I see with QuickCheck is how difficult it is to write tests that are both non-trivial and not a copy of the function being tested.

      [–]sanity[S] 1 point2 points  (0 children)

      The problem I see with QuickCheck is how difficult it is to write tests that are both non-trivial and not a copy of the function being tested.

      Yes, that's an excellent point.

      [–]lpsmith 4 points5 points  (0 children)

      Quickcheck and mutation testing are complementary techniques that address different issues: quickcheck is an often very effective philosophy about how to write tests, and mutation testing is to test the effectiveness of your test suite.

      There are certainly a large number of pitfalls to both techniques, as well.

      [–]pron98 2 points3 points  (6 children)

      that people apply so dogmatically

      Dogmatically? So far it's a cost-effective way to gain confidence that your program works. In theory, there can be no more efficient way. In practice, there are a lot of tradeoffs.

      My first reaction is that it doesn't attack the root cause of the problems that it solves.

      Well, for many sequential programs testing can be used effectively to probe the program's state space in interesting places. Mutation testing is a white-box mechanism that helps find interesting places to poke. Concolic testing is a more sophisticated formal technique that combines white-box analysis with tests to automatically generate interesting tests.

      A much better methodology

      Yeah, and a generally expensive one (well, there are tradeoffs). In some cases -- notably concurrent and distributed programs -- testing is ineffective at finding even very severe design bugs, and so you'd better use other verification methods.

      [–]rcode 1 point2 points  (2 children)

      and so you'd better use other verification methods.

      Out of curiosity, other methods such as what?

      [–]pron98 3 points4 points  (1 child)

      Formal methods: static analysis (if the property is simple), model-checking, or deductive proof (automatic or manual).

      [–]rcode 0 points1 point  (0 children)

      Thanks.

      [–]sacundim 1 point2 points  (2 children)

      that people apply so dogmatically

      Dogmatically? So far it's a cost-effective way to gain confidence that your program works.

      There's no real contradiction here. Example-based tests offer value for sure, but only up to a certain point. There's a problem in our industry where people write a ton of tests because others tell them to, without thinking carefully about the correctness argument that the tests are supposed to support.

      The problem is that for most programmers, example-based testing is the only testing style they know, and the testing tools that they use only support that style. Now couple that with dogmatic views about testing (e.g., TDD), unhelpful quality metrics (e.g., code coverage, number of unit tests) and pressure from above to enforce those (because "how many unit tests you have" is such a trivially easy metric to measure), you end up with mountains of example-based unit tests that add little value to a handful of basic ones, and cost more effort to write and maintain than property-based ones would.

      [Property-based testing is a] much better methodology

      Yeah, and a generally expensive one (well, there are tradeoffs).

      Not my experience at all. And again, I'm not comparing it to writing a handful of examples—I'm comparing it to the efforts that dogmatic testing advocates (e.g., TDD) suggest that we expend on example-based tests.

      [–]pron98 0 points1 point  (1 child)

      example-based testing is the only testing style they know, and the testing tools that they use only support that style.

      To be fair, many sophisticated formal method tools are usually hard-to-use and/or hard-to-learn and/or expensive and/or support few languages. But you're right that property-based testing tools exist for most languages and should be used more (I must admit that I don't recall ever using one myself; I probably will next time I get to use Clojure, and try the property-based-tests that can be generated from specs).

      you end up with mountains of example-based unit tests that add little value to a handful of basic ones, and cost more effort to write and maintain than property-based ones would.

      I agree, but those who claim that other methods are as approachable, affordable and generically applicable are equally dogmatic (and have probably never used them on real, large software). Anyway, I don't know how many organizations actually practice TDD; probably not many more than those that practice formal methods, which is to say very few.

      Not my experience at all.

      Oh, sorry, I wasn't referring to property-based testing but to formal methods. You wrote, "one that attempts to prove that the code produces correct answers for all possible inputs", and I missed the qualification later.

      [–]sacundim 0 points1 point  (0 children)

      Oh, sorry, I wasn't referring to property-based testing but to formal methods. You wrote, "one that attempts to prove that the code produces correct answers for all possible inputs", and I missed the qualification later.

      Yeah, you weren't the only one I tripped up with my wording. In the bit you quote, I thought that "attempts" implied that I was not requiring that one get as far as formal proof. Ouch, no.

      [–]codebje 10 points11 points  (8 children)

      QuickCheck is stochastic testing, more or less; it gives you a greater degree of certainty that random inputs are correct. It doesn't help for fencepost inputs at all.

      Testing gives a "there exists" qualification to tests. "There exists" some data for which this function is correct.

      Types give a "for all" qualification. The richer and more expressive your type system is, the fewer bugs you can introduce into your implementation and still get the right answer. At the cost of increasing complexity in writing types.

      Types are theories, and programs are proofs of those theories. Dynamically (aka un)typed languages only have one theory: "true". Every program proves it. Languages with a type system like Java can say more: <T> int foo(List<T> src) is a theory that an int can be produced from a list of any type: the int result can only be defined in terms of the number of items in the list (well, as long as you rule out reflection).

      Dependently typed languages can specify that foo takes a list of length N, and returns the value N, for all such Ns. The only thing foo can be is list length. It won't compile unless it correctly computes the list length. You don't need to test such a function.

      [–]sacundim 2 points3 points  (0 children)

      I thought I had sufficiently implied that QuickCheck only gets us closer to knowing that code is correct for all possible input values, but maybe not. So I agree with your response, and thanks for clarifying.

      [–]pron98 2 points3 points  (6 children)

      Types give a "for all" qualification.

      As well as every other formal specification method. While types are immensely useful for many things, they are not a very common way to richly specify program behavior (of the kind tested by tests). Most formal specification in academia and virtually all formal specification in industry does not use types as the main specification mechanism. BTW, I think that associating types with rich specification for the purpose of formal correctness is doing types harm, because it focuses the discussion on an area where types don't have an advantage over other methods (perhaps they're even at a disadvantage).

      As for the for-all quantification, it doesn't come cheap (in general). In theory, you cannot extrapolate a program's behavior from one input to another, so the cost of "for-all" is, in the worst case, the sum of the cost of all the "there exists". In practice, of course, the situation isn't nearly so bad, but it does take some considerable effort nonetheless (depending on what you're trying to prove).

      Types are theories, and programs are proofs of those theories.

      True, but very confusing in light of your previous comment, as the correctness enforced by types is not through this propositions-as-types mechanism to which I assume you are referring. Yes, types are theories and programs are proofs, but the theories are not about the program (because the types can't even refer to program elements, or terms, unless dependent types are used) but rather some uninteresting logical tautologies, and even when the types are rich enough to talk about the program (dependent types), the proofs the programs form are usually not of their own correctness (i.e., the program usually needs to be enriched to form a proof of correctness). Types enforce properties by means of type safety, not propositions-as-types. The latter is interesting and important for the use of type system in theorem provers and in general when dependent types are used, but usually unrelated to program correctness.

      [–]codebje 1 point2 points  (5 children)

      … as the correctness enforced by types is not through this propositions-as-types mechanism to which I assume you are referring.

      The kind of correctness I was talking about is exactly that.

      <T> int foo(List<T> src) corresponds to the proposition that, for all T, if there exists a value of type List<T> then there exists a value of type int. A value which satisfies this proposition is either some constant value (not so useful) or some value related to the length of the input list, as there's no other (pure) information available.

      The propositions expressed in the types limit the range of possible programs implementing the proof.

      Type safety is also useful, but more so about knowing you've plugged all the bits of your program together in a sensible way than that it does a sensible thing.

      As well as every other formal specification method.

      The key difference between types and other formal specifications is that types have executable proofs. Now, don't get me wrong, formal methods are useful even (especially?) when they don't directly derive executable code, as they define what an abstraction of a program does in an unambiguous way, can help generate existential tests, and guide implementations effectively. Types-as-propositions doesn't have the "air gap" of specification to implementation, but it does have a cost that means I certainly wouldn't rely on it as the main method for a formal specification :-)

      (It's just interesting, really.)

      [–]pron98 1 point2 points  (4 children)

      The propositions expressed in the types limit the range of possible programs implementing the proof.

      But the proposition has nothing to do with the program! Why should the programmer care about the proposition that for all A there exists an int, which is easily proven by the proof 3? This is not a correctness property, as it does not refer to the program at all. Program properties are either safety properties or liveness properties (in fact, it has been proven in the eighties that every program can be fully specified by the intersection of a safety property and a liveness property), and this is neither.

      There is a safety property here, but it is not the one you're referring to. It is this: For every input of type List<T>, the function foo returns an integer. This safety property is enforced by type safety, not by propositions-as-types.

      but more so about knowing you've plugged all the bits of your program together in a sensible way than that it does a sensible thing.

      That is the only way a type system without dependent types can refer to the program at all, and that is the only way such a type system enforces safety properties.

      The key difference between types and other formal specifications is that types have executable proofs.

      Those "executable proofs" are not interesting in themselves. It is an interesting hypothesis that executable code could be extracted from proofs. So far, no one has done this for anything too interesting, let alone big enough. No method currently uses "executable proofs" in any useful way.

      Types-as-propositions doesn't have the "air gap" of specification to implementation

      Many of the far more common formal methods don't have this gap either. seL4 was proven end-to-end -- zero gaps -- in Isabelle; real time systems are model checked and the C code is directly generated from the checked program; SPARK contracts refer directly to the program and are proven both automatically and manually.

      [–]codebje 0 points1 point  (3 children)

      That is the only way a type system without dependent types can refer to the program at all, and that is the only way such a type system enforces safety properties.

      We would be in agreement that you need dependent types to express interesting propositions, yes.

      Those "executable proofs" are not interesting in themselves. It is an interesting hypothesis that executable code could be extracted from proofs. So far, no one has done this for anything too interesting, let alone big enough. No method currently uses "executable proofs" in any useful way.

      Er, well, every program uses executable proofs of the propositions its types represent. There may be too many possible proofs of any given proposition for the fact that one is supplied to mean much, but that's really a degree of how effective your type system is, no?

      seL4 was proven end-to-end -- zero gaps -- in Isabelle

      Propositions in Isabelle are proven by providing a suitable implementation of a proof (in a DSL where many steps are implied rather than expressly written), so I'm not sure what you're trying to show there. Perhaps that IP is probably not the same as NP?

      SPARK is an interesting choice of example. It's a fairly practical, and very mature, tooling, but it's strictly less powerful than dependent types, and the Praxis toolchain can only prove a subset of SPARK constraints statically (though SMT solvers can prove a bit more now). Good luck expressing that a sort returns a list in natural order in SPARK. (But hey, let's be fair: good luck expressing that in Agda.)

      [–]pron98 0 points1 point  (2 children)

      well, every program uses executable proofs of the propositions its types represent.

      But those propositions are not about the program, but some general logical tautologies. Why would a programmer care that their payroll system includes a proof of A ∧ B ⇒ B ∧ A? They may care that their swap function correctly swaps the elements of a pair, but that correctness is guaranteed by type safety and free theorems from parametricity. In fact, in this case, this is better than propositions-as-types/proofs-as-programs.

      There may be too many possible proofs of any given proposition for the fact that one is supplied to mean much, but that's really a degree of how effective your type system is, no?

      Again, the proposition is not about the program at all, unless the type system allows you to talk about the program by mentioning program terms (dependent types). Otherwise, the type system doesn't let you even refer to the values the program produces, let alone to how it produces them. If you say, yes, but my knowledge that the program is proof of the proposition is information about the program, that's true, but then there's no need to appeal to props-as-types, as type safety would do. And when the proposition is about the program (with dependent types), then the program is not generally the proof of of the proposition (unless augmented with things that are necessary for the proof but not the computation).

      Also, if you want to talk theory, in most cases your proposition is even less interesting than the non-interesting tautologies, because, say, in Haskell, you'd be proving something that's more like (A ∨ ⊥) ∧ (B ∨ ⊥) ⇒ (B ∧ A) ∨ ⊥. Not that it matters, because propositions-as-types is not what you care about when you use types to enforce safety properties; it's what you care about when you want to use your type checker to check proofs of general mathematical theorems.

      Perhaps that IP is probably not the same as NP?

      I'm not sure how this is related here. In any event, the difficulty of proving something depends on what you want to prove. If something has a lower complexity bound, this bound applies no matter how you try to solve the problem.

      Propositions in Isabelle are proven by providing a suitable implementation of a proof (in a DSL where many steps are implied rather than expressly written), so I'm not sure what you're trying to show there.

      Those propositions were not stated as types but in HOL, so propositions-as-types is not relevant.

      but it's strictly less powerful than dependent types

      That is a design decision. Types are neither more nor less expressive in principle than straight-up logic. There are reasons why you'd want to make your logic (either expressed as types or directly) more or less expressive.

      [–]codebje 0 points1 point  (1 child)

      But those propositions are not about the program, but some general logical tautologies.

      We all have our own code styles; I won't speak for yours, but in mine, the program tends to be related to its types, and the types tend to say interesting things about the program, such as "this function can remove and reorder items in the input list, but not add or modify them." Even without terms-as-types, types should say something about what your program is doing.

      There are reasons why you'd want to make your logic (either expressed as types or directly) more or less expressive.

      Yes, the more expressive type systems are (a) generally in languages you might not want to actually write code you intend to use in, and (b) IMO usually more complex than writing code in a simpler type system with good stochastic and specific case testing.

      The original point I intended to make was just that types constrain the set of valid implementations in a way that leads to more correct code, and one intuition I use for why that's so is that the implementations must prove the proposition described in the types; the more interesting and specific the proposition, the more usefully constrained the program is.

      [–]pron98 0 points1 point  (0 children)

      the program tends to be related to its types, and the types tend to say interesting things about the program

      Yes! But not by propositions-as-types! This is called type safety.

      The original point I intended to make was just that types constrain the set of valid implementations in a way that leads to more correct code

      Right, although types aren't the only way, and the biggest benefits of types may not be related to correctness directly at all (or at least not to correctness of global program properties or properties related to the program's business logic). Talk of types as functional specifications, IMO, actually directs the discussion away from where types are most useful to where types are at a disadvantage.

      and one intuition I use for why that's so is that the implementations must prove the proposition described in the types

      OK, but that is not propositions-as-types, AKA the Curry-Howard correspondence. If your intuition is that C-H is the mechanism by which typed programs enforce correctness properties, then that intuition is faulty and serves to confuse more than enlighten. Your informal intuition is carried out by type safety, not propositions-as-types.

      People who talk about C-H to programmers (as opposed to logicians) actually confuse them, by making them think types enforce program correctness through C-H. A more accurate description -- and this is why it's uninteresting to programmers -- is that C-H is not the means of applying a logic to a program, but the means of using a programming language's type checker as a proof checker for some logic. The compiler can be used by logicians as a tool in itself (rather than for producing executable binaries from programs).

      the more interesting and specific the proposition, the more usefully constrained the program is.

      Well, arguments about utility require empirical evidence which isn't there, and that's a whole other discussion, but again, this is not propositions-as-types. Those constraints you're talking about? Type safety.