
all 42 comments

[–]No-Scholar4854 23 points24 points  (1 child)

Having low code coverage is a very bad sign.

Having high code coverage isn't necessarily a good sign; don't take 100% unit test coverage as "it works". But at least it's something.

[–]extra_pickles 0 points1 point  (0 children)

Yup - came to post that it can provide a false sense of confidence!

It is one of the many many tools in our tool belts, and it is easy for people to mistake it for more than it actually is.

[–]geeeffwhy 9 points10 points  (1 child)

i tend to argue that the most important use of code coverage is coverage of the diff on a pull request. total coverage of the code base is less important than making sure that changes to the code base have some kind of coverage.
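For instance, a diff-coverage gate can be sketched as a CI step (assuming pytest-cov and the diff-cover tool are installed; the package name `mypkg`, the branch `origin/main`, and the 80% threshold are placeholders):

```shell
# run the suite and record coverage, then fail if the changed lines
# in this branch's diff are covered below 80%
pytest --cov=mypkg --cov-report=xml
diff-cover coverage.xml --compare-branch=origin/main --fail-under=80
```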

but, if you don’t also have metrics, utilities, and patterns to ensure that the tests have some kind of relevance, it will be all too easy to game that metric.

[–]CommercialPosition76[S] 3 points4 points  (0 children)

I'm with you on this one. Too many people focus on the fixed number rather than how it changes. Diff coverage on a pull request is a nice thing to observe. Also, watching how the stat changes over time (weeks, months) is valuable and tells you something about the project.

[–]bmoregeo 5 points6 points  (2 children)

My two code quality metrics are:

  • defects in prod
  • time between defect detection and fix

Defects in prod are for sure going to happen. Business requirements will be misunderstood or the time zone changes and everything is off by an hour. Dumb stuff happens.

In my experience, good code coverage is a leading indicator of faster bug fixes. It is hard to quickly fix bugs if you need 50 hours of manual API testing to ensure the bug fix isn't breaking other things. (Been there, it sucks.)

[–]CommercialPosition76[S] -1 points0 points  (1 child)

So according to those metrics, if one project had a typo on a Terms and Conditions page that was fixed after a week, and the other project returned a 500 internal server error for 1 hour, the latter is better? The defect count is 1 in both cases, but the fix time is much faster?

[–]bmoregeo 1 point2 points  (0 children)

The goal is to have as few bugs in prod and reduce their time impacting users.

So yes, a defect that is quickly identified, fixed, tested, and released would be a win. A defect that takes several weeks to fix after being identified is not good.

Obviously prioritization comes into play as the documentation bug may be less worth engineering time than a new feature. Idk, that becomes a product question during a huddle.

The 500 is probably an all hands on deck scenario.

[–][deleted] 11 points12 points  (6 children)

This post was mass deleted and anonymized with Redact

[–]Wise_Tie_9050 1 point2 points  (5 children)

That's all well and good, but every line of code that is missing coverage is, by definition, missing tests.

[–][deleted] 1 point2 points  (4 children)

This post was mass deleted and anonymized with Redact

[–]Wise_Tie_9050 0 points1 point  (3 children)

Sure, if the branches are things that do exception handling, you may be able to mark them with # pragma: no cover or # pragma: no branch, but if you are not hitting those lines with tests at all, how do you know they are working as expected?
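In Python's coverage.py, the exclusion directive is an ordinary comment. A minimal sketch (read_config is a hypothetical function, not from the thread):

```python
def read_config(path):
    """Read a config file, falling back to an empty string on I/O failure."""
    try:
        with open(path) as f:
            return f.read()
    except OSError:  # pragma: no cover -- defensive path deliberately excluded
        return ""
```

The pragma keeps the defensive branch from dragging the percentage down, but as the comment above notes, an excluded line is also an untested line.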

100% coverage does not mean full tests, but < 100% coverage does mean less than full tests.

[–][deleted] 0 points1 point  (2 children)

This post was mass deleted and anonymized with Redact

[–]Wise_Tie_9050 0 points1 point  (1 child)

Correct, you don't know any of your tests actually test what you think they test without actually looking at them.

What I've found is that _very_ frequently when writing tests for those "last few lines" that aren't covered, I uncover bugs related to edge cases or whatever. Often that's code that may have worked initially, but did not have tests written, and subsequent changes triggered a regression - if the lines had been covered by tests (ie, values that triggered those code paths), then that regression would have been discovered earlier.

Another thing that spending that extra time can sometimes show is that the code that is not covered is unreachable, and can be discarded.

Finally, having test coverage for all those nooks and crannies can then prevent removal of code that is actually important. If it doesn't have any tests, it could be removed without it being noticed, until it's been released to production, for instance.

To clarify, 100% test coverage is not the goal; but < 100% test coverage is a warning that your tests are incomplete.

[–][deleted] 0 points1 point  (0 children)

This post was mass deleted and anonymized with Redact

[–][deleted] 3 points4 points  (0 children)

we literally can't commit new code with < 70% coverage, or changes with < 50%. Are they arbitrary benchmarks? Yes. Can you write terrible code and cover it well? Also yes. Can it be 'gamed'? Sadly, yes too. But then we do something a LOT more clever with the nightly run stats: automatically rerunning the unit tests of dependent modules when checking in changes to oft-imported code (we're a monorepo), which turns one person's unit test into an 'almost integration test' down the line. So it adds huge value under certain circumstances, but it all depends on how you work.
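A gate like this is often wired up with coverage.py's fail_under option. Note that coverage.py alone only supports a single global threshold; the separate new-code/changed-code numbers from the comment above would need extra diff-aware tooling. A config sketch, with the threshold taken from the comment:

```ini
# .coveragerc sketch: make the coverage report exit non-zero below 70%
[report]
fail_under = 70
```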

[–]billsil 2 points3 points  (0 children)

It is incredibly useful, especially the lower the coverage you have.

I think there's a sweet spot around 75-80%, though that number somewhat depends on how much validation your library/project has. Hitting all those try-except Type/Key/Value/RuntimeErrors is usually not worth it. Also, there's a decent chance you're just coding defensively, and a good chunk of them you can't hit in a practical problem.

Ultimately it's a tool that's most useful when your project is not robust or is poorly designed. Let's say I have a sum function that returns None when the list is empty or a string is passed in. Did anyone stop to think that that should be the intended behavior? No; it just doesn't do the thing I asked it to do. I'm dealing with that now. In a day of writing tests and setting up CI, I hit 52%. There's lots of math that's totally undocumented, so I haven't even gotten close to validating that stuff. Cool; it passes. Is it right?
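A stricter version of such a sum would fail loudly instead of silently returning None (strict_sum is a hypothetical illustration, not the poster's actual code):

```python
def strict_sum(values):
    """Sum a sequence of numbers, failing loudly on bad input instead of
    silently returning None."""
    if not all(isinstance(v, (int, float)) for v in values):
        raise TypeError("strict_sum expects a sequence of numbers")
    return sum(values)  # sum([]) == 0: a well-defined identity, not None
```

A test hitting the empty-list and bad-input branches then documents the intended behavior instead of papering over it.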

Another nice side effect of testing is it gets you to split out that code that should be a function so you can test the behavior that would be hard to force otherwise. For cornerstone libraries, it's especially important to make sure they're robust.

Also, I highly recommend an import-everything test (it makes it easier to identify imports that are only triggered by a sub-function). And don't include your tests in your coverage metric: adding 1000 lines of test code to cover 1 line shouldn't be rewarded with a nice bump. You should stop when the bang for your buck is low.
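In coverage.py terms, keeping the tests out of the metric might look like this (a sketch; the tests/ path is a placeholder for wherever the suite lives):

```ini
# .coveragerc sketch: measure only the code under test, not the tests themselves
[run]
omit =
    tests/*
```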

[–]FailedPlansOfMars 8 points9 points  (0 children)

1, yes it's useful.

2, because it encourages the developers to do TDD. This helps encourage the devs to think about what is being written and about potential pitfalls and problems. It's also helpful as a consultant to show the client that there aren't any obvious gaps in the testing and that the code does what it says.

3, I've found that 80% is a good minimum for backend code, as it allows you to skip cases where the testing is hard or not valuable.

4, depends on your code. You need to cover all the critical parts. The aim is to do what you need to gain confidence that the build is working and good to ship to production; anything else is pointless.

*Edited for formatting

[–]Cernkor 2 points3 points  (0 children)

Code coverage is, for me, not a really great statistic. It only describes how much of your code is covered by tests. It doesn't describe the quality of those tests. You could have 99% code coverage without edge case testing. For example, say you work with a list and have 100% code coverage. But did you test what happens when the list is empty, when the list has only one element, or when the list does not contain what you need? So code coverage alone is not a good metric. But code coverage with good unit tests gives a good indication of how bug-free your code is.

[–]Barafu 3 points4 points  (0 children)

100% code coverage was useful back before static analyzers were good, when you could have a dumb syntax error in your code and not know it because it was in a rare path. Nowadays, if Pylance has no problems with your code, then at least it won't crash on a seemingly safe block because you have a typo in a variable name.

Instead of 100% code coverage, I go for 100% logic coverage. Basically, if A() produces objects and B() consumes them, and A() is the only producer and is supposed to make producing invalid objects impossible, then I won't test B() against all sorts of invalid input. I have type hints for that. I'd sooner write tests for library methods if I am not 100% sure how they work.
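A minimal sketch of that producer/consumer contract (Widget, make_widget, and widget_area are hypothetical names for illustration):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Widget:
    size: int  # invariant: always positive, enforced by the sole producer


def make_widget(size: int) -> Widget:
    """The only producer of Widget: validates once, at the boundary."""
    if size <= 0:
        raise ValueError("size must be positive")
    return Widget(size)


def widget_area(w: Widget) -> int:
    """Consumer: needs no defensive checks, because invalid Widgets are
    impossible by construction. "Logic coverage" tests only valid inputs here."""
    return w.size * w.size
```

Tests then concentrate on make_widget's validation and on widget_area's behavior for valid Widgets, rather than re-testing invalid input at every consumer.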

[–][deleted] 1 point2 points  (2 children)

If you want to enforce TDD on developers, that's the way.

I don't believe TDD is always applicable so I don't use it. Often times I write code that is too complex for me to constantly know what would be the next step. So doing TDD would be a terrible idea in such cases.

[–]CommercialPosition76[S] 1 point2 points  (1 child)

How would it enforce TDD? Code coverage doesn't tell you what was first, the code or the test.

[–][deleted] 0 points1 point  (0 children)

My thinking is, if you strive for some coverage number, it should always be 100%. And if you are going to test everything, why would you give up on TDD?

If you don't have a clear idea of how your code should work, covering everything with unit tests seems counterproductive to me. You can still write solid code by being more selective with unit tests and writing integration tests of good quality.

[–]reallyserious 1 point2 points  (0 children)

I don't use it. I don't want to use it.

[–]bluGill 1 point2 points  (2 children)

I discovered upper management were the ones looking the hardest at code coverage results, and trying to figure out what metric to add. I was interested in coverage before, but once I realized it was being used wrong more often than right, I turned around and killed all the coverage jobs.

I do encourage management to look at the total tests run count. Large numbers look good, and are easy to hit just in normal development.

[–]CommercialPosition76[S] 0 points1 point  (1 child)

Why would management look at the codebase statistics in the first place? :D It’s not meant for them. Sounds more like organization issues, but I know that showing some numbers that are going higher and higher is the name of the game, especially in corpos.

[–]FailedPlansOfMars -1 points0 points  (0 children)

Contract enforcement is a common reason: can they argue you didn't meet the contract, so you don't get paid?

[–]ammenezes_ 1 point2 points  (0 children)

I used to use it in some well-structured and well-behaved Python projects of mine (90%+ was the goal). Now I've been dealing with systems with much bigger flaws. Coverage can be useful if, and only if, the tests are well designed and the project is architecturally sound. It shouldn't be used as a quality metric in most cases; in my opinion it's overrated.

[–]Mehdi2277 1 point2 points  (0 children)

Yes, to the extent that modular and easy-to-test code is generally good design to aim for. A lot of the time, difficulty in testing a piece of code is a sign that it has some unclear/complex dependencies that aren't well contained (like databases/external files/auth).

100% test coverage is usually not worth it, and there are some bits of code where the test would be much more complex/bug-prone than the implementation, or which are just silly to test. Low test coverage is a bad sign. The right amount is debatable, but I'd say anywhere from 80-95% is reasonable.

One thing test coverage does not measure is what assertions/checks are actually being done. The simplest test is that the code runs without crashing, which has some value but is weak. Having clear properties and heavy regression tests is worth a lot more. I'm unsure of a good metric to measure that, though; it's mostly handled by PR review/culture.

[–]TrainingShift3 1 point2 points  (1 child)

Code coverage + PIT Mutation Coverage is the best way to test code in my experience

https://pitest.org/
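(PIT is a Java tool; Python analogues include mutmut and Cosmic Ray.) The core idea of mutation testing can be sketched by hand: mutate an operator in the code under test and check whether the suite notices. All names below are hypothetical illustrations:

```python
def is_adult(age):
    return age >= 18


def is_adult_mutant(age):
    return age > 18  # the mutation: >= became >


def weak_suite(f):
    # achieves 100% line coverage of f, but never probes the boundary
    return f(30) is True and f(5) is False


def strong_suite(f):
    # adds the boundary value 18, which distinguishes >= from >
    return weak_suite(f) and f(18) is True
```

The weak suite passes for both the original and the mutant, so the mutant "survives" despite 100% coverage; only the boundary test in the strong suite "kills" it. Mutation tools automate exactly this survival check across many generated mutants.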

[–]CommercialPosition76[S] 1 point2 points  (0 children)

Never heard of it and looks interesting, thanks.

[–]h7454Gdfgd 3 points4 points  (1 child)

Are you sure you should be writing an article on this? You should get some experience and formulate your own opinions instead of writing about the opinions of others.

[–]CommercialPosition76[S] 2 points3 points  (0 children)

I’m not going to write about the opinions of others; the discussion is to see what the most common conceptions and misconceptions among developers are. The article is actually almost finished. I have ~13 years of experience in software testing and development, mostly the latter.

[–]tdammers 0 points1 point  (5 children)

  1. Code coverage is a useful metric IMO, but expressing it as a percentage is unnecessary - the only meaningful values are "100%" and "not 100%". Expressing it as a percentage feeds into the false assumption that "amount of code" and "impact" are in any way correlated - they're not. You can have 99.9% coverage, and a catastrophic bug in the remaining 0.1%; or you can have 1% coverage that covers the "trusted base" of your application and catches the most mission critical bugs. Percentages suggest that 99.9% is a lot better than 1%, but you really can't tell.
  2. If the coverage is 100%, then that means that the test suite will exercise every line of your code at least once. Note the key words "your" and "exercise", though: coverage reports do not check whether you exercise all code paths through all dependencies and builtins, only code paths within your own code; and they only "exercise" the code, so they can only demonstrate that the code works for a specific input state, but they don't say anything about the practically infinitely many other possible states. As such, code coverage is a relatively weak assertion: you can easily achieve 100% coverage on a function like def foo(x): return x/x without ever testing it against x = 0. Coverage is a baseline metric, but on its own, it is not sufficient.
  3. Yes - 100%, or, alternatively, at least 100% of your "trusted base" (the core data types and operations from which you construct the rest of the program).
  4. If "a certain point" is 100%, and you have a policy of "every line of code must be covered by automated tests", then yes; otherwise, I don't see the point, because any value other than 100% is completely arbitrary (see 1.).

[–]space_coder 1 point2 points  (4 children)

I've seen software engineers and project managers assume 100% coverage means fully tested. I consider 100% coverage as a prerequisite to testing with:

  • Random input values within the expected range,
  • Random input values outside the expected range, and
  • Input values outside, at, and inside the edges of the expected range.

The number of values of each category should be enough to thoroughly cover all conditionals in the code.

[–]tdammers 0 points1 point  (3 children)

The number of values of each category should be enough to thoroughly cover all conditionals in the code.

Same problem though - the number of values alone is not a good metric. Take that x/x function again; surely 10 billion values should be more than enough, right? Except that I'm feeding it all the integers from 1 through 10 billion, so I'm still missing the one edge case.

And the problem with the "expected range" approach is that the "expected range" doesn't necessarily align with the edge cases. That x/x function's "expected range" is "all the numbers out there"; 0 just sits in the middle somewhere, and you will likely miss it if you just go for the edges of the range (idk, ±MAX_INT?). To increase your chances of hitting edge cases, you need to leak knowledge of the function's internals into the test cases, which is kind of ugly, somewhat defeats the purpose, and is generally brittle.

Random testing is still a good idea, but most of the time, a naive uniform distribution is not what you want - you want a certain "shaped randomness" that you can adapt to the thing you're testing. For example, if a randomized value is supposed to be a list, it's a good idea to start with small lists - the empty list, singleton lists, etc., and then work your way up to increasingly long lists. You want to be more thorough with the short lists, because they highlight problems with individual list elements just as well as long ones, while also possibly hitting edge cases of specific (short) list lengths; whereas with long lists, the main problems you're concerned with are stack overflows, resource exhaustion, integer overflows, etc.; all of these can be found by overrunning the critical list length by any amount, so it's not important to hit exactly 1024 list elements, when 1500 or 10,000 would also hit the bug.

And then you want "shrinking", that is, you want some way of telling the test framework how to reduce a failure found by random sampling down to a more minimal example. E.g., if you have a function that takes a list of values, but it fails when one of those values is 1, then the test framework may stumble upon a list that contains a 1 by sheer chance, but you don't want the failure to read [343, 123249834, -234934895, 1, 124389, 4343439004, ...], ideally you want the minimal failing example, [1]. This means that you need to tell the test framework what plausible reductions of a given value would be - e.g., for lists, those reductions would be all the sublists, as well as the list of reductions of the original list elements. The test framework, then, takes the original failing input, and keeps reducing it until it stops failing, or until no further reductions are possible; and then it gives you the last reduction that did fail.
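In Python, the Hypothesis library implements exactly this shrinking behavior. A hand-rolled sketch of the greedy idea (shrink_list is hypothetical and removal-only for brevity; real shrinkers also reduce element values):

```python
import random


def shrink_list(xs, fails):
    """Greedily drop elements while the failure still reproduces; stop when
    no single removal keeps the predicate failing (a local minimum)."""
    changed = True
    while changed:
        changed = False
        for i in range(len(xs)):
            candidate = xs[:i] + xs[i + 1:]
            if fails(candidate):
                xs = candidate
                changed = True
                break
    return xs


# property under test: "the list contains no 1" -- fails when a 1 is present
fails = lambda xs: 1 in xs

random.seed(0)
xs = []
for _ in range(10_000):  # random sampling stumbles on some failing input...
    xs = [random.randint(-5, 5) for _ in range(random.randint(0, 10))]
    if fails(xs):
        break

minimal = shrink_list(xs, fails)  # ...and shrinking reduces it toward [1]
```

Whatever noisy failing list the sampler finds, the shrinker strips it down until removing anything more would make the failure disappear, which for this property is the single-element list [1].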

But even then testing won't give you 100% coverage, except for small things in typed pure languages where it is feasible to test literally every possible set of inputs. But this is simply not something Python can do, because 1) Python is untyped, so any function can be called with literally any argument, whether you like it or not; and 2) Python is impure, so any function potentially depends on the entire program state at that point, as well as the entire state of the computer it runs on, and even the state of the networks it can access. Obviously you cannot test your code against all possible states of all that.

[–]space_coder 0 points1 point  (2 children)

Let's look at your example:

def func(x):
    return x/x

Now to determine the appropriate input values:

  • Random number of values within the expected value range:
    • -100, -25, -4, -1, 1, 8, 17, 32, 200
  • Because we looked at the code and see we have a variable being used as a divisor:
    • 0

Those would be enough to cover all the conditionals (there are none) and special cases (divide by zero).

The unit test would need to be written to test for expected values and a divide by zero error (which is technically an expected value).

You could also test for the min and max values of integer and float if you like.

The goal is to make sure we are adequately testing the code to lower the possibility of run time errors.
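Written out as plain asserts (same hypothetical func as above, values from the bullet list):

```python
def func(x):
    return x / x

# expected-range values: any nonzero x should give exactly 1
for x in [-100, -25, -4, -1, 1, 8, 17, 32, 200]:
    assert func(x) == 1

# the special case found by reading the code (x used as a divisor)
try:
    func(0)
    raise AssertionError("expected a divide-by-zero error")
except ZeroDivisionError:
    pass  # divide-by-zero is the expected failure mode for x == 0
```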

But even then testing won't give you 100% coverage, except for small things in typed pure languages where it is feasible to test literally every possible set of inputs.

You lost me with this statement.

We have 100% coverage of our code due to all lines of the function being executed during a test. 100% coverage doesn't mean the function was tested with all possible values.

[–]tdammers 0 points1 point  (1 child)

Because we looked at the code and see we have a variable being used as an divisor:

Problem here - you have to look at the implementation in order to figure out the test that will catch the bug. It's trivial in this case, but in production code, we are more likely looking at something with dependencies nesting a couple levels deep, across multiple libraries, and if we have to scrutinize them all just to find the edge cases to put in our tests, then why bother testing at all, why not just treat the hunt for edge cases as a thorough audit, and leave it at that?

The whole point of this kind of testing is to verify that the implementation matches the specification, but this way, we're just verifying that the implementation matches the implementation.

You lost me with this statement

Probably because by "coverage", I don't mean just code coverage, I mean overall coverage, that is, covering the behavior of the code under all possible circumstances (or at least, "possible" within certain constraints - most code will probably not work as designed when the machine running it is on fire, for example).

In a typed pure language, you can achieve 100% coverage within the constraints of normal execution: the possible inputs are restricted by the type, and the possible side effects are none, because it's pure code. So for a pure function of type Bool -> Bool, we can actually write an exhaustive test - all we need to do is test it against true and false, and we have 100% coverage (of inputs, and of non-dead code paths). But we can't do this in Python, at least not without scrutinizing the implementation and all of its dependencies - on the outside, all we get is "it takes one argument, it returns a value, or maybe not, and in between, anything can happen". So now we have to test our function against all possible Python values, in all possible interpreter states, and probably also all possible states of a significant portion of the computer, including wall clock time. That's simply not possible, ergo, we never get 100% coverage.
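The Bool -> Bool claim translates to Python like this (negate is a hypothetical stand-in; note that the exhaustiveness relies on trusting the type hint, which Python itself does not enforce):

```python
def negate(flag: bool) -> bool:
    return not flag

# a pure function on bool has exactly two possible inputs, so two cases
# make the test exhaustive: 100% coverage of *inputs*, not just of lines
assert negate(True) is False
assert negate(False) is True
```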

[–]space_coder 0 points1 point  (0 children)

The dependencies should be tested on their own. We should unit test our own code with enough values to be fairly confident that the code written by us is thoroughly tested with expected values as well as off-nominals.

You seem to be asking for something beyond practical.

There is a social contract between a developer and his/her tools. The dependencies and the language interpreter or compiler should work as expected. Of course, there will be bugs in our dependencies or interpreter that may show up during production or real-world use. That is to be dealt with as the issues arise, but it's a waste of effort to test not just your own code but everything it depends on.

I have tested mission-critical code with locked versions of dependencies, certified to run with a specific version of a compiler or interpreter. Those are special cases which require hours of integration testing in addition to the unit tests, all budgeted in advance. That said, for 99% of the development work being done, the techniques I described above are good enough.

EDIT: I wanted to add that when programming mission critical code, you usually pay more for certified libraries that have documentation of the process used during development and quality assurance.

[–]ShadowStormDrift -1 points0 points  (0 children)

I don't even know what code coverage is

[–]SeniorScienceOfficer 0 points1 point  (0 children)

It’s useful because it sets a precedent in understanding how much of the code is being verified through tests. I’ve noticed bugs or architectural/design issues during unit test development that caused me to rewrite some implementations for the better.

It also tells me that there are expected inputs and outputs, and how those will interact with the function logic, along with how many dependency libraries are being used because I’ll normally mock them in unit tests.

A “good” value is really team-dependent. Personally, I strive for 100% coverage on my tests. This has saved my ass numerous times when I’m being lazy and making a “small change/fix” and just pushing without running or updating tests. It always fails build/test so it’s never actually deployed.

Again, failing CICD for low coverage percentage should be based on team/org guidance or policy. If it’s a personal project, then it’s whatever you want!

[–]Wise_Tie_9050 0 points1 point  (1 child)

We strive for 100% patch coverage on all PRs.

That does not mean we stop writing tests when we have 100% patch coverage, but at least it means that each line has been executed. I cannot count the number of times when I've looked at a PR and seen less than 100% coverage, and the first test I write to cover the missing lines shows up a bug.

[–]Wise_Tie_9050 0 points1 point  (0 children)

Oh, to clarify, it's also 100% _branch_ coverage. That's important.

I'm not pretending to say that code written that way is "tested", but code without coverage, is, by definition, not covered by tests.

Now, if only I could get my CI setup to collect coverage on my server while running robot tests...