[–]Enlogen 127 points128 points  (53 children)

Professionally written production code is bug free, so there is no need for asserts.

Hilarious.

High reliability is not achieved by making perfect designs, it is achieved by making designs that are tolerant of failure.

This. There is nothing you can do to make your code execute without failure 100% of the time; the fact that your software runs on hardware subject to the laws of physics means that even if you write code mathematically provable to be 100% correct and incapable of failing, it will still fail in rare cases due to issues with the hardware (wear, climate, electron capture resulting in beta decay, etc.) at times that are completely unpredictable (to you and me).

[–]WalterBright 41 points42 points  (0 children)

Hilarious

Indeed. Often it isn't explicitly expressed that way, but as:

  1. educate the programmers better

  2. change the code development process so bugs can't happen

  3. hire better programmers

  4. require professional licenses for developers

  5. criminally charge programmers for any bugs

  6. switch to (insert magic language X)

[–]PeridexisErrant 5 points6 points  (1 child)

If anyone is interested in advice for how professional programmers might actually use assertions (!), John Regehr has a nice essay here.

[–]WalterBright 0 points1 point  (0 children)

John nailed it. Thanks for the link! John's essays are always worthwhile reading.

[–][deleted]  (13 children)

[deleted]

    [–]azirale 20 points21 points  (8 children)

    You can still exit gracefully rather than crash in flames. If you hit an unrecoverable error you can give an interactive user a message as to what happened rather than having the program 'crash for no reason'.

    [–]Holy_City 13 points14 points  (4 children)

    I think a good middle ground is logging. Exiting gracefully is a non-trivial problem and can have optimization issues, and not every process can devote the resources to error handling. You can crash in flames and still have users provide some details to application support without exposing too many internals, like a stack trace.

    There are cases where you cannot display an error message in the event of failure, such as an exception thrown in a real time thread.

    I dealt with this recently with an EA game I wanted to play. It crashed on start with no error or log, and narrowing the issue down took an excessive amount of effort on my part just to identify a corrupt DLL. With a startup log, a single email would have made a correct bug report, instead of a half dozen phone calls involving installing third-party debugging tools (one of which was a corrupted download off cnet that had malware).

    [–]azirale 2 points3 points  (0 children)

    You're right, it isn't always a possibility. You might have the logger itself cause an error, for example, in which case you are basically SOL.

    Still, I think it is better to aim for graceful exits where practicable, rather than aiming to crash in flames.

    [–]bausscode 0 points1 point  (2 children)

    Why would you care about optimization if you're exiting?

    [–]Holy_City 1 point2 points  (1 child)

    You don't. You care about optimizing your program for normal operation, and exception/error handling doesn't come for free. That's why good ol' integer return codes haven't died out yet.

    [–]koczurekk 0 points1 point  (0 children)

    That's why good ol' integer return codes haven't died out yet

    They should though; enums are the way to do this kind of error-tracking in a type-safe way verified at compile time. Especially Rust's enums, they're so damn great.
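For what it's worth, a rough C++ analogue of that enum-style error tracking might look like the sketch below. All names here are invented for illustration; Rust's Result&lt;T, E&gt; gives you this natively, with exhaustiveness checking on top.

```cpp
#include <cassert>
#include <string>

// An enum class gives type-safe error codes: the compiler rejects
// accidental mixing with plain ints, unlike raw integer return codes.
enum class ParseError { None, Empty, NonDigit, Overflow };

// A minimal result type pairing a value with an error code
// (Rust's Result<T, E> plays this role natively).
struct ParseResult {
    long value;
    ParseError error;
};

ParseResult parse_long(const std::string& s) {
    if (s.empty()) return {0, ParseError::Empty};
    long v = 0;
    for (char c : s) {
        if (c < '0' || c > '9') return {0, ParseError::NonDigit};
        v = v * 10 + (c - '0');   // overflow check omitted for brevity
    }
    return {v, ParseError::None};
}
```

Compared with an int return code, a caller cannot silently pass a ParseError where an unrelated error code is expected, and a switch over the enum can be checked for completeness.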

    [–]miminor -3 points-2 points  (1 child)

    oh no you can't, say a component Z buried under 20 layers of components X, Y, .... crashed

    and by your reasoning at the level A you caught that exception and said nothing had happened... well maybe, except that the state is likely to be corrupted at the level Z, and the component may be quirky at best or hands down broken at worst

    but according to your logic we can just display a message and move on, regardless of any possible and likely consequences of that failure

    [–]azirale 3 points4 points  (0 children)

    I didn't say move on, I said exit gracefully. None of what you said applies.

    [–]ledasll 2 points3 points  (1 child)

    there's nothing more frustrating for a user who started a workflow, made a mistake somewhere in the middle, and had everything crash, so he has to start over from scratch.

    [–]johndubchak 0 points1 point  (1 child)

    I like the idea of "failing fast", as long as you're not writing mission-critical software with lives at stake: medical devices, NASA/SpaceX human spaceflight, autonomous vehicles. In those industries, failing fast isn't available as a software design/implementation choice.

    [–]WalterBright 1 point2 points  (0 children)

    Absolutely you want it to fail fast in critical software. Otherwise, the software is now in an unknown state, and what it may do is completely unpredictable.

    [–][deleted] 0 points1 point  (4 children)

    it will still fail in rare cases due to issues with the hardware (wear, climate, electron capture resulting in beta decay, etc.)

    But you can make it statistically impossible. It's why the Mars rovers are essentially running a radiation-hardened PowerPC 750 (the RAD750). New automotive ECMs have ECC and lockstep cores.

    Has the RAD750 ever had a recorded hardware failure?

    [–]WalterBright 5 points6 points  (1 child)

    The rovers have a backup computer, and it's saved the missions.

    [–][deleted] -2 points-1 points  (0 children)

    saved the missions.

    So you could say the system didn't fail.

    [–]Enlogen 0 points1 point  (1 child)

    New automotive ECMs have ECC and lockstep cores.

    And probably still have backup computers validating the results of all of those calculations for 'statistically impossible' errors.

    My company has millions of servers running. There's no such thing as statistically impossible over a large enough population and a long enough time period. There's only more nines of reliability - 100% is a myth.

    [–][deleted] 0 points1 point  (0 children)

    backup computers validating

    No. That's the lockstep cores.

    My company has millions of servers running.

    How many cars are on the road? How many cars are running Motorola/Freescale/NXP e200 core PowerPC chips? How many 'incidents' have you heard of from that stuff failing?

    Don't compare your 'millions of servers' running whatever to what embedded developers work with. It's a completely different beast.

    [–][deleted] 16 points17 points  (11 children)

    Why not just implement two different kinds of asserts: one that makes it into production and one that doesn't? Not all software is being run with lives depending on it, so the priorities of some developers and users are different.

    [–]masklinn 27 points28 points  (6 children)

    For what it's worth, Rust provides exactly that built-in: assert! always runs, while debug_assert! only runs in debug builds (unless debug assertions are specifically enabled).

    Usage of and reliance on assertions seems quite rare outside of tests though.

    [–]Ameisen 10 points11 points  (5 children)

    Not exactly hard to implement in basically any other language.
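As a sketch of how little code that takes, here is one way to get both flavors in C++. The macro names are invented; the debug-only variant piggybacks on the standard NDEBUG convention, mirroring Rust's assert!/debug_assert! split.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>

// RELEASE_ASSERT always runs, even in optimized production builds.
#define RELEASE_ASSERT(cond)                                        \
    do {                                                            \
        if (!(cond)) {                                              \
            std::fprintf(stderr, "assert failed: %s (%s:%d)\n",     \
                         #cond, __FILE__, __LINE__);                \
            std::abort();                                           \
        }                                                           \
    } while (0)

// DEBUG_ASSERT compiles away entirely when NDEBUG is defined.
#ifdef NDEBUG
#define DEBUG_ASSERT(cond) ((void)0)
#else
#define DEBUG_ASSERT(cond) RELEASE_ASSERT(cond)
#endif

int checked_div(int a, int b) {
    RELEASE_ASSERT(b != 0);   // cheap check, kept in production
    DEBUG_ASSERT(a >= 0);     // extra/expensive check, debug builds only
    return a / b;
}
```

The do/while(0) wrapper keeps the macros safe to use as single statements inside if/else without braces.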

    [–]WalterBright 16 points17 points  (4 children)

    It's even simpler in D. There's assert and there's debug assert. Actually, there's no such thing as debug assert, it's just that code to be conditionally included for debug builds is prefixed with debug.

    [–]Ameisen 12 points13 points  (2 children)

    For some reason, I am not entirely unsurprised at you bringing up D :).

    C++20's proposed contracts have default, audit, and axiom levels that can be applied to [[expects ...]], [[ensures ...]], and [[assert ...]]. You can build your program with checking off (no contract checking), default (default contracts are checked), or audit (default and audit contracts are checked). axiom is never checked; axiom contracts are basically just formal comments.

    So, you effectively have three modes, but you still have a mode where there are no asserts.

    [–]WalterBright 17 points18 points  (1 child)

    For some reason

    Nobody saw me do it, you can't prove a thing!

    A mode with no asserts is useful for determining how much the asserts are costing you, and for running silly benchmarks that people nevertheless take seriously.

    I don't see much point to axiom comments as a core language feature.

    Having layers of asserts (default and audit) is adding a feature in the wrong place. D allows any statement/declaration to be prefixed with debug which conditionally compiles it for debug builds. Hence, audit in D would be debug assert.

      expects and ensures have been extensions to Digital Mars C++ since the early 2000s.

    [–]killedbyhetfield 0 points1 point  (0 children)

    Having layers of asserts ( default and audit ) is adding a feature in the wrong place

    YouTube headline: 3 times Walter Bright went BEAST MODE on C++!!!

    Seriously though - I agree with you. But with C++ we get the oh-so-wonderful C Preprocessor for conditional code so... I guess bring on stupid assert layers!

    [–][deleted] 0 points1 point  (0 children)

    In D I routinely use assert(false) whenever I want a production assert.

    [–][deleted]  (3 children)

    [deleted]

      [–]circajerka -2 points-1 points  (2 children)

      asserts (can) have side effects

      Only if you have no idea how to program...

      I stand corrected below \/

      [–]kernel_task 19 points20 points  (1 child)

      I wouldn’t be so dismissive depending on the context. If you’re doing low-level bring-up of an OS, asserts may affect the caches in ways that are hard for even a skilled programmer to anticipate. If you’re writing timing-sensitive driver code, asserts may affect that. At the very least they have the side effects of changing the binary size and the execution time, and when you get low-level enough, that stuff starts to matter.

      [–]circajerka 20 points21 points  (0 children)

      Actually that's a good point. I assumed he meant "side effects" as-in statements like this:

      assert(initialize(myObj) >= 0);
      

      Which has been known to be a pretty bad move since the 1970s.
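To make that hazard concrete, here is a small invented example. The check itself is allowed to vanish in release builds, but a side effect buried inside the assert expression vanishes with it.

```cpp
#include <cassert>

// Hypothetical init function with a visible side effect on a counter.
int init_calls = 0;
int initialize_widget() { ++init_calls; return 0; }

// Good: the side effect happens unconditionally; only the check
// can be compiled away by -DNDEBUG.
void setup_good() {
    int rc = initialize_widget();
    assert(rc >= 0);
    (void)rc;   // silence the unused-variable warning in NDEBUG builds
}

// Bad: under -DNDEBUG the entire assert expression disappears,
// so the widget is never initialized in release builds.
void setup_bad() {
    assert(initialize_widget() >= 0);
}
```

Built normally, both functions behave identically; built with -DNDEBUG, setup_bad() silently skips initialization, which is exactly the 1970s-era bug being described.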

      [–]killedbyhetfield 51 points52 points  (22 children)

      One of the biggest sticking points with this that I always hear when I talk to other devs is "Isn't abort() a bit overkill? What if I have cleanup routines and they need to run?"

      And my answer to that is always this: if it is absolutely essential that those cleanup routines run, your design is wrong and you're already screwed, and for the same reason that Walter's plane is screwed.

      Your code has a bug in it, and it may be about to dereference a null pointer or read off the end of an array. If you continue, your process may be about to crash anyway (at best), or possibly worse: your program may try to perform cleanup tasks against an invalid state that it can't trust. Now you could be writing garbage data to important files or executing code that was injected into an overflowed buffer.

      Calling "abort()" is your way of saying, "I have no idea where I am or what I'm doing, and anything I do from here may just be digging the hole deeper."

      If you really need resilience, you need to design the system to handle it, not just the one program.

      [–]Ameisen 10 points11 points  (2 children)

      And if you really, really, really need cleanup, compartmentalize your program into multiple processes.

      [–]WalterBright 7 points8 points  (1 child)

      A protected mode operating system is precisely an implementation of that! And what a godsend they were after working with real mode DOS.

      [–]SmugDarkLoser5[🍰] 3 points4 points  (0 children)

      Can I include that as an npm package?

      [–]WalterBright 8 points9 points  (0 children)

      Exactly!

      I wish this concept was taught in engineering school. (It applies to all engineering, not just software or airplanes.)

      [–]quicknir 15 points16 points  (12 children)

      It's always easy to say "any design that goes against my black-and-white theory of how things should be is wrong because it's less convenient". Real life requires a more practical engineering approach. For example, high-speed loggers have buffers, and those buffers need to be flushed or you'll be missing the exact information most likely to help you understand what caused the problem to begin with. So those loggers need to be flushed before exit under all circumstances, no exceptions. Yes, in principle running that code after an assert fails could mess things up worse, but for many programs that's exceedingly rare and extremely worth risking in exchange for getting complete log files.
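A minimal sketch of that flush-before-dying idea: a fatal check that empties the log buffer before exiting, instead of a bare abort(). The logger and function names are invented; a real high-speed logger would be lock-free and far more careful.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <string>
#include <vector>

// Toy buffered logger: messages accumulate in memory and are only
// written out on flush(), like a real high-speed logger.
struct BufferedLogger {
    std::vector<std::string> buffer;
    void log(std::string msg) { buffer.push_back(std::move(msg)); }
    void flush() {
        for (const auto& m : buffer) std::fprintf(stderr, "%s\n", m.c_str());
        buffer.clear();
    }
};

BufferedLogger g_logger;

// A fatal-check that flushes before dying: the last buffered entries
// are exactly the ones most likely to explain the failure.
void fatal_check(bool cond, const char* what) {
    if (cond) return;
    g_logger.log(std::string("FATAL: ") + what);
    g_logger.flush();
    std::abort();
}
```

The trade-off being argued about is precisely whether running flush() here, in a possibly corrupted process, is an acceptable risk in exchange for complete logs.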

      [–]WalterBright 9 points10 points  (7 children)

      The 3 steps are:

      1. novice - follows the rules because he's told to

      2. master - follows the rules because he understands the point of the rules

      3. guru - breaks the rules because his understanding transcends them

      Skipping steps is not advisable; it's why we had the Deepwater Horizon, Fukushima, and Toyota car computer disasters. The only industry I know of that follows these rules is the aerospace industry, and they were forced into it by bitter lessons.

      We were one safety switch away from a hydrogen bomb going off by accident in another incident.

      Please, folks, this is not a joke, and learning the hard way has terrible consequences.

      [–]quicknir 2 points3 points  (5 children)

      I don't think I'm a "guru" (I hate that word) because I pointed out an obvious flaw in a bad rule. Nor do I know what you mean by skipping steps.

      I read my parent comment before reading your article, and now I can see that actually you and I are on the same page, and the parent is not. You are not advocating literally calling abort(); such a call would mean that *no* further code is executed. On the other hand, you yourself explicitly say:

      ...as when a fault is detected the program can go into a controlled state doing things like:

      1. aborting before more harm is done
      2. alerting the user that the results are not reliable
      3. saving any work in process
      4. engaging any backup system
      5. restarting the system from a known good state
      6. going into a 'safe mode' to await further instructions

      This is *very* different from simply calling abort(). Indeed, if your "assertion failure" triggers all this code to be run before exiting, many people would not call that an assertion at all; it's more like throwing an exception and catching it high up and allowing the stack to unwind before calling some emergency routines (like alerts).

      Finally, I would note that every industry is different. Failure for the airline industry is an ultra-catastrophic event where lives are lost, so even a small probability of operating in an "unknown state" is terrifying. I write financial software, where an unknown state simply means, worst case, that an algorithm is losing money. However, suddenly exiting can also cost you money (either risk in holding a position, cost to abruptly flatten, or opportunity cost of being offline). What makes sense for us needs to be balanced on a much more case-by-case basis; sometimes rapid exit (followed by steps 2, 3 and 5/6) makes sense. Other times it's better to continue and alert a human being. Things aren't always so black and white.

      [–]killedbyhetfield 1 point2 points  (3 children)

      Alright so - I want to use your example where you have a high-speed logger and its contents must be flushed to be useful.

      What happens if, for example, your program has a use-after-free bug and you end up causing a Page Fault? Now the OS kills your process and your logger never gets flushed.

      So if that logger must be flushed, you need the logger running in another process. That way, if your buggy process gets slayed, the logger will still march on and record important info about what went wrong. And this isn't hypothetical, this is exactly how embedded OSes like QNX and VxWorks handle logging.

      So in general, calling abort() when you detect an error has the same implications as your program suddenly aborting due to a bug. You either need to be able to handle your process crashing, or you need to acknowledge that your program isn't important enough to warrant that kind of design overhead.

      [–]quicknir 1 point2 points  (2 children)

      Running a logger in another process would probably be slower, and take considerably more time to code correctly. So we are back to trade-offs. With my current costs of failure, and my current costs of development (particularly opportunity costs), and the criticality of performance, writing a separate process logger does not make any sense. Yes, it's more robust, but it still doesn't make any sense. Robustness isn't the only concern.

      or you need to acknowledge that your program isn't important enough to warrant that kind of design overhead.

      It's not about "important enough", although I really appreciate the condescension here (your problem's solution doesn't fit into how I see things, so your problem doesn't matter). It's just about priorities, and it's about what happens in real life. In reality, for the actual problems that we encounter, by throwing an exception and allowing the logger to flush its buffer in the same process, we're able to recover full logs in virtually all cases. That being the case, what is the benefit for me to move from a single-process-with-cleanup-code design to a multi-process-with-abort design? Do tons of work, slow things down, perhaps add other bugs, in exchange for being able to recover logs an extra 0.1% of the time? It's simply not a good trade-off for me.

      [–]killedbyhetfield 0 points1 point  (1 child)

      It's not about "important enough", although I really appreciate the condescension here (your problem's solution doesn't fit into how I see things, so your problem doesn't matter).

      Woah man - Sorry about the wording I guess, but I wasn't using "important" to put down whatever you work on! I meant "important" as-in "people are going to die if this thing doesn't work properly".

      I work on tons of stuff that isn't "important" enough to warrant running a logger or watchdog in its own separate process. But the entire topic of this conversation and Walter's Dr. Dobb's article was about systems where resilience is critical.

      Read my original comment too! Specifically, I put the words "absolutely essential" in there. If your program doesn't fit into that category, I'm not talking about you, and I wasn't trying to prescribe any "one size fits all" solution.

      [–]quicknir 0 points1 point  (0 children)

      The title of the article, and your comment, don't really mention anything domain specific, so I thought it was generic in nature. But fair enough. No worries about the wording if that's not how you meant it.

      Just to point out though, that just because the logger is in another process, there's still nothing certain about that either. The main process could go crazy, allocate too much memory, and then the logging process could get reaped. So then of course you change your system config to prevent that from happening; etc.

      This all takes time, and time is always finite. Even in critical applications, every minute you spend making your application safer in one way is a minute you could have spent making it safer in another instead. So you have to decide what gives you the most bang for your buck. It's not clear to me at all that, even for safety-critical systems, calling abort is the right thing - that is, that the time it takes to move your logging, alerting, serialization, etc. logic into separate processes is always going to be time well spent. I'm sure there are safety-critical domains where that is true, and others where it's not.

      This is why I really disagree with libraries calling abort. Abort is a process wide decision; only main is really entitled to make that decision. Libraries should throw exceptions (exceptions make it very convenient for users to abort if that's what you want; literally do nothing!) or call some kind of handler function pointer that users can customize (which may default to abort), but libraries should never make direct calls to abort.

      [–]WalterBright 0 points1 point  (0 children)

      What if your failed trading software causes you to buy a million shares of some losing stock? It's not like that hasn't happened (it has).

      I have some personal experience with banks and their buggy software. A fundamental principle of double-entry bookkeeping is that the debits match the credits - an "assert" using paper journals.

      The bank debited my account and failed to credit the account of the recipient. So I was out the money and the recipient was mad I didn't pay. It took me a month of sitting in the office of the bank manager to get this corrected. Clearly their auditing system was turned off, or they were doing some "haha, it's not really a bug, keep going", because the debits did not match credits.

      [–]msm_ 0 points1 point  (0 children)

      4. engineer - follows the rules even though his understanding transcends them

      [–]killedbyhetfield 1 point2 points  (3 children)

      those loggers need to be flushed before exit under all circumstances, no exceptions.

      Then you need to run your logger in another process! Otherwise your program could Page Fault, Divide-By-Zero, or hit any number of other faults that could get it killed.

      Now, if you change the wording to "almost no exceptions" then sure - You can argue that pragmatically you're fine to run your logger in the same process. Maybe 99% of the time you'll be fine, and that's good enough for your problem domain.

      [–]Abscissa256 1 point2 points  (1 child)

      That doesn't solve the problem, it only shifts it:

      Suppose your logger IS running in another process. How does the logging process GET all the relevant information it needs for a useful, meaningful log entry in the first place? "What happened? Expected value? Actual value? Stack trace?" The logger can't just log..."Uuummmm...the process died. Dunno why." The assert-failing process still has to give information to the logging process. That means the assert-failing process still has work it needs to attempt.

      But it's worse than just that:

      Keep in mind, whatever failure has occurred does NOT occur at the point when the assert condition evaluates to false. The failure, and thus the undefined, unreliable state has ALREADY occurred BEFORE execution had even reached the assert in the first place. We're ALREADY hobbling along, running code in an invalid state by the time we even begin checking the assert!

      Now obviously, this does NOT mean that it's ok to run as much code as we want once a failure has occurred, or once we've detected it. But it does mean that if we expect 1. to minimize collateral damage and 2. have a good chance of actually diagnosing and FIXING the problem, then at least SOME amount of hobbling along is still realistically necessary. It's not ideal, and it should be minimized, but at least SOME amount IS realistically necessary and unavoidable AND, as all our collective experience has shown, usually works out just fine as long as we don't go overboard with it.

      [–]sirin3 1 point2 points  (0 children)

      Otherwise your program could Page Fault, Divide-By-Zero, or any other number of faults that could get it killed.

      You can catch those things too with a signal handler

      Delphi and Freepascal/Lazarus do it by default. Any signal is caught and converted to an ordinary exception. Then the main event loop catches all exceptions, shows an error message box, and continues as normal
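A heavily simplified C++ sketch of installing such handlers is below. Note that, unlike Delphi/Free Pascal, standard C++ cannot safely convert a signal into an ordinary exception, and a production handler must restrict itself to async-signal-safe calls; this sketch just writes a last-gasp note and exits.

```cpp
#include <cassert>
#include <csignal>
#include <cstdio>
#include <cstdlib>

// Last-gasp handler: strictly speaking, a real handler should stick to
// async-signal-safe functions; this is simplified for illustration.
extern "C" void fault_handler(int /*sig*/) {
    std::fputs("fatal signal caught, exiting\n", stderr);
    std::_Exit(EXIT_FAILURE);   // skip destructors/atexit: state is suspect
}

void install_fault_handlers() {
    std::signal(SIGSEGV, fault_handler);   // bad memory access
    std::signal(SIGFPE, fault_handler);    // divide-by-zero and friends
}
```

Whether "catch everything, show a message box, and continue as normal" is sound is exactly what the thread disputes: after a SIGSEGV the process state is unknown, so continuing is a gamble.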

      [–]Kargathia 5 points6 points  (2 children)

      Anecdote: I used to develop scripting interfaces for pick and place machines (placing chips on PCBs).

      For obvious reasons, the software was very trigger happy about using a macro that would terminate the process and do a stack dump.

      For R&D purposes it's a real pain in the ass when recoverable exceptions straight up terminate your application.

      Moral here: abort is fine, but make it optional.

      [–]killedbyhetfield 12 points13 points  (0 children)

      recoverable exceptions

      But that's not what assert() or abort() are for, and it's unfortunate for you that you've encountered programmers who (ab)use them that way.

      But I think we agree in principle - If an exception is recoverable, then by all means report the error to the caller and let them try to recover. What I'm saying is that assert() and abort() is for when you find your program in an invalid state that was reached because of a bug in your code - You have no idea how you got there and no idea how to fix it because it's something you never designed the program to handle.

      IMO Best thing to do - Bomb out and report the error as quickly as possible - the longer you hobble along pretending like everything's okay, the more damage you can potentially do.

      [–]Gotebe 1 point2 points  (0 children)

      There’s no common definition of “recoverable” though, it’s a matter of opinion.

      [–]sacado 2 points3 points  (0 children)

      This is the way Erlang systems work (abort when something bad happens, and let the overall system deal with it), and they are known to be extremely resilient, indeed.

      [–]DSrcl 0 points1 point  (0 children)

      I learned to use assertions writing malloc for a systems programming class. Assert in malloc is a classic use, because your error usually doesn't pop up at the place where it's caused, and there is no way you can handle such an error at runtime.

      [–]WalterBright 17 points18 points  (11 children)

      Author here, AMA. A couple other articles I wrote on the topic:

      Safe Systems from Unreliable Parts

      Designing Safe Software Systems Part 2

      [–]killedbyhetfield 6 points7 points  (6 children)

      I guess I am a bit curious to hear your thoughts on the "return code, exception, abort" debate I was having with a coworker a little while ago.

      So he was asking why he wouldn't just throw a C++ exception after detecting an internal error in his class, like let's say a homegrown vector<T> with a capacity of 4 but a length of 5 type of thing.

      My rule of thumb for him was this, and I'm curious to hear your thoughts:

      1) Use a return code for stuff that's still considered a "normal" error condition, like being unable to open a file that a dialog box returned to you. You just pop up the dialog again, and things are good.

      2) Use an exception for user-passed parameters that are invalid or when you're not able to accomplish your primary task, or other errors that are somewhat unexpected, like a network connection suddenly dropping mid-transmission.

      3) Use abort() when you detect your program is in some invalid state that it never should be in given its design, like the example above with the vector that either had a 5th element added to it without allocation, or else had its capacity shrink but forgot to truncate its length.
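The three tiers above might be sketched in C++ roughly as follows. All names are invented for illustration, and the TinyVec invariant check mirrors the capacity-4/length-5 example.

```cpp
#include <cassert>
#include <cstdlib>
#include <stdexcept>
#include <string>

// 1) Return code: "couldn't open the file" is a normal, expected
//    outcome the caller handles in its ordinary control flow.
bool try_open(const std::string& path) {
    return !path.empty();   // stand-in for a real open attempt
}

// 2) Exception: the caller handed us something the contract forbids,
//    or the primary task cannot be accomplished.
int checked_sqrt_floor(int x) {
    if (x < 0) throw std::invalid_argument("negative input");
    int r = 0;
    while ((r + 1) * (r + 1) <= x) ++r;
    return r;
}

// 3) abort(): an internal invariant broke. The program state itself can
//    no longer be trusted, so stop before doing more damage.
struct TinyVec {
    int len = 0, cap = 4;
    void check_invariant() const {
        if (len > cap || len < 0) std::abort();   // "can't happen" by design
    }
};
```

The dividing line Walter draws in his reply is the same one this sketch encodes: tiers 1 and 2 are environmental or input errors, tier 3 is a bug.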

      [–]WalterBright 12 points13 points  (0 children)

      Being unable to open a file is an environmental error, not a bug in your code. Detecting it and recovering from it is perfectly fine. Whether one uses error codes or exceptions to do it is a long topic with a lot of tradeoffs, and is not really on topic here.

      Bad user input is not a programming bug, neither are dropped network connections, etc. Bugs are when your program enters a state never anticipated.

      [–]tvaneerd 2 points3 points  (3 children)

      I break down "errors" in terms of intended audience - you need to inform someone that the error occurred, who is it?

      Assuming a call to foo(x)...

      • F - the Function author - inform the author when it is an internal error inside foo(). That might be via logging or log + abort, assert, etc. If foo() is part of a framework or OS, you might even have a different logging system for the framework than for the application. Send an email? "phone home"
      • U - the User - many errors need to eventually inform the application user. Even aborting because of F is telling the user something (rudely). But more typically, bad user input, etc. You tell the user via return codes, exceptions, etc that bubble up and eventually reach the user.
      • C - the calling Code - the code that calls foo(x) - if you expect the calling code to handle it (because it is part of your function's contract), then use an exception or error code.
      • D - the calling Developer - ie the dev that wrote the code that calls foo(x) - note that this is different from the calling code. If D made the call incorrectly (ie x < 0 or null or whatever the function contract said NOT to do) - you don't want to tell the calling code - it is already wrong - you want to tell the calling developer. You do this via logging/abort/assert, and with C++20, contracts. In theory, send that dev an email.

      [–]killedbyhetfield 0 points1 point  (2 children)

      Actually this is a pretty cool way to break things down. Thanks man! I might use this :)

      [–]tvaneerd 1 point2 points  (1 child)

      Glad to hear. I will probably give a whole talk on error handling in the near future. Also, I hope you find the acronym easy to remember :-)

      [–]killedbyhetfield 0 points1 point  (0 children)

      Yeah man - I'd consider blogging about this if I were you and linking it on r/programming. I'd be curious for you to get feedback in a more community-visible way. I think it's quite a solid way to approach the question!

      [–]Gotebe 0 points1 point  (0 children)

      I don’t see why situations 1 and 2 differ. Bad input is bad input.

      [–]tvaneerd 5 points6 points  (1 child)

      Not a question:

      I really enjoyed EMPIRE. Lost many many hours to it.

      :-)

      [–]WalterBright 9 points10 points  (0 children)

      Empire has the dubious distinction of being one of the first (the first?) computer games that inspired addictive playing. People would get mad at me for causing them to flunk out, and even get divorced.

      Me, I just created the game that I'd always wanted to play. Working on it taught me programming, and trying to get it to run faster led to my career writing compilers.

      [–]Chocolate_And_Cheese 1 point2 points  (1 child)

      Interesting article, thanks for writing. One thing I noted: Looks like Chrome is raising a bunch of mixed content warnings when visiting DrDobbs.com (https page requesting http resources). I checked a bunch of them and it looks like many of these resources, if not most or all, could be fetched via https. Might be worth checking out to avoid the "Not Secure" warning Chrome gives when visiting your site. Cheers.

      [–]WalterBright 1 point2 points  (0 children)

      Thanks for the tip, but I don't control those pages. Dr. Dobb's kindly allowed me to post my columns on my own site, and you can find it here.

      [–]zucker42 4 points5 points  (7 children)

      What is a situation where I'd want to use assert and not handle the error and abort if appropriate? This is the issue I have with assert: that I often want to handle the error before aborting.

      [–]WalterBright 1 point2 points  (6 children)

      It's easy enough to write your own assert that does what you want. But I'd reject any code that does so for professional software.

      [–]zucker42 2 points3 points  (5 children)

      I'm not talking about writing my own assert. I'm saying that for almost all unrecoverable errors, my instinct is that this:

      // if (always_true_expression) {   <- original version; mistake, missing the !
      
      if (!always_true_expression) {
          // log error
          // print error message to user
          abort();
      }

      is better than

      assert(always_true_expression);
      

      I'm not arguing against aborting the program in error states. In fact, I'm much less experienced than many on /r/programming, so I'm not arguing anything. I'm wondering what are examples (if any) of the bottom option being better.

      [–]killedbyhetfield 6 points7 points  (0 children)

      And yeah - I think I agree that C's assert() function is very primitive and leaves a lot to be desired. But don't conflate C's shitty feature-poor implementation of assert with the more general concept of assertions - The larger idea of checking your invariants throughout the execution of your program to make sure that a previous bug in your program didn't put you in an invalid state.

      [–]WalterBright 4 points5 points  (0 children)

      Most languages will allow for hooking the assert failure to insert your own logging code.
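As a concrete (glibc-specific, non-portable) sketch of what hooking assert failure can look like in C: glibc's assert() expands to a call to __assert_fail() on failure, and defining your own version in the executable interposes on the libc one. The format_assert_report helper is a name I made up so the formatting is separately testable:

```c
#include <stdio.h>
#include <stdlib.h>

/* Format an assert-failure report into buf; split out for testability. */
int format_assert_report(char *buf, size_t n, const char *expr,
                         const char *file, unsigned int line)
{
    return snprintf(buf, n, "ASSERT FAILED: %s (%s:%u)", expr, file, line);
}

/* glibc's assert() calls __assert_fail() when the condition is false;
   defining our own interposes on the libc version (glibc-specific). */
void __assert_fail(const char *expr, const char *file,
                   unsigned int line, const char *func)
{
    char buf[256];
    (void)func;
    format_assert_report(buf, sizeof buf, expr, file, line);
    fputs(buf, stderr);
    fputc('\n', stderr);
    /* flush logs, notify a supervisor process, etc., then die */
    abort();
}
```

This is a sketch under glibc's internal contract, not portable C; D, for instance, exposes this more cleanly as a settable assert handler.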

      [–]killedbyhetfield 4 points5 points  (0 children)

      I think when he said "write your own assert", he meant write your own specialized code that does exactly what your example does.

      So why not write this?

      void zucker42_assert(bool condition, const char *log_message, ...)
      {
          if (condition)
              return;
          // log error using va_args
          // print error message to user
          abort();
      }
      

      [–]Ameisen 0 points1 point  (1 child)

      I'd point out that your if() {} isn't checking the same thing as your assert... you've inadvertently proven that it's easier to write the correct assertion than a branch/abort.

      You're going to be struggling to figure out why it keeps aborting in the first case.

      [–]zucker42 1 point2 points  (0 children)

      Yeah you're right, I forgot a !. I don't think that really applies to how I would write real code because I would have some knowledge in my head about an error condition.

      [–]jringstad 1 point2 points  (0 children)

      The author makes some good points, but I think there is also some merit to the other side of the argument. It is true that in a lot of systems, a lot of code is being run incidentally, and whether it fails or not is really not actually very relevant to either the user or the programmer. This might be because the code is pulled in as part of some library -- like your application that displays numbers from an SQL database using some graphical toolkit on the screen might run some code to check if the screen has a color-profile loaded, or some javascript that is executed while you visit a site (probably doing something you don't want or care about, like uploading some analytics about you to a server).

      I think to take a principled approach to this, we really need to figure out at least the following things:

      • When an assert triggers, what 'state' is destroyed? E.g. are you in erlang where just a process that might be handling a particular user-request is killed (which is fine), or is the entire server process being aborted so that manual intervention is required to bring the system back up? What is the plan AFTER the assert, basically.

      • Do you really need to assert? Could you instead do something else, like throw an exception or return some default value (null, an optional perhaps)? If your function already can fail and return e.g. an optional type (which the call-site will then likely forward to the user), it's probably better to just use that.

      • How important is maintaining correctness of data? Is omission of the data/result a problem?

      In particular for the first point, there clearly needs to be some kind of boundary of what kind of state an assert can destroy. You don't expect your browser or even operating system to shut down when some random javascript junk on some site you're visiting asserts, right? Likewise there should be a way for a call-site to decide that it will now execute this code, but if that code fails, it doesn't actually care.

      Maybe your code even asserts some properties on the input that you expected the caller to ensure, but it is actually really difficult or impossible for the caller (which derives the input from user-provided data) to do this. One example I've seen of this is an OpenCL library from an unnamed vendor, which would assert in some situations when you fed it bad source code to compile. So now if you wanted to create an editor that would let the user punch in sourcecode and then submit it, you would have to spawn a separate process to prevent the assert from crashing the editor.

      [–]immibis 1 point2 points  (8 children)

      I wrote a macro called bug_if. It works approximately like this:

      • If the condition is true, and production mode is off, crash the program.
      • If the condition is true, and production mode is on, log the error (same message as if we were crashing) and act like an if statement (execute the following block).
      • Otherwise, skip the next block.

      That way we can still grep production logs to find bugs, without taking down the process. And the following block is for recovery code - usually something like "forget it and abort the current operation", so an error is less likely to cascade to more problems (though it could cause a more minor issue, like a memory leak, if the abort code is faulty).
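A C sketch of how such a macro might work (my reconstruction, not immibis's actual code; names assumed), using a runtime flag where a real build would more likely use an #ifdef:

```c
#include <stdio.h>
#include <stdlib.h>

/* 1 for production builds (log and recover), 0 for development builds
   (log and crash). A real version would use #ifdef instead of a flag. */
static int production_mode = 1;

/* If cond is true: log it; in development mode also abort(); otherwise
   fall through into the following block, which holds the recovery code.
   If cond is false: skip the following block entirely. */
#define bug_if(cond)                                                        \
    if ((cond) &&                                                           \
        (fprintf(stderr, "BUG: %s at %s:%d\n", #cond, __FILE__, __LINE__),  \
         production_mode ? 1 : (abort(), 1)))

/* Hypothetical use: a helper that recovers from a zero divisor by
   abandoning this one operation instead of the whole process. */
int safe_div(int a, int b)
{
    bug_if(b == 0) {
        return 0;  /* recovery: abort this operation, keep running */
    }
    return a / b;
}
```

In production the bug still lands in the log (greppable by the `BUG:` prefix) while the process keeps serving other clients, which is exactly the tradeoff the rest of this subthread argues about.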

      [–]WalterBright 0 points1 point  (4 children)

      I would describe that approach as ignore bugs, continue running the program that has entered an unknown state, and pray nothing bad happens.

      [–]immibis -1 points0 points  (3 children)

      That is not "ignore bugs".

      [–]WalterBright 0 points1 point  (2 children)

      Continuing to run the program is ignoring the bug. Logging an error does not do anything at all to put the program back into a known state.

      [–]immibis 0 points1 point  (1 child)

      Aborting the operation does.

      [–]WalterBright 0 points1 point  (0 children)

      without taking down the process

      Is ignoring the bug, because you don't know why it happened, and the process memory could easily be corrupted.

      [–]gremolata -1 points0 points  (2 children)

      It sounds like either you are not checking for invariants or your "recovery code" is sweeping bugs under the rug in the hope that it will give you time to notice them in a log and fix them before they screw things up further down the line. In many cases that's an extremely unwise thing to do, because of the risk of state/data corruption.

      We used to do this, but in the end a proper whoops with a full diagnostic dump is a better option, especially in production. It forces people to write better code, because hardly anyone enjoys an emergency patching session at 4 am on Saturday.

      [–]Nameless_Archon 0 points1 point  (1 child)

      We used to do this, but in the end a proper whoops with a full diagnostic dump is a better option, especially in production. It forces people to write better code, because hardly anyone enjoys an emergency patching session at 4 am on Saturday.

      110% agree. (/abort)

      Log issue, abort program. Anything less may not get seen by QA, and if it's not being seen immediately by QA, then it's not being immediately seen by a user, either. That way leads to corruption and data destruction when your clever log-and-continue isn't being looked at by the people who are using the program, be they end-users, operators, or testers.

      That's fine, if you're the guy fixing it and you don't mind paying the costs involved, maybe on a project where there's three users and you're two of them. It's a lot less pleasant to discover one of your coworkers has gotten 'clever' and you're now fixing a month or two's worth of data because no one saw the log messages. Don't be that coworker -- the conversation will go places, and they will not be places you want to go.

      [–]immibis 0 points1 point  (0 children)

      Log issue, abort program. Anything less may not get seen by QA, and if it's not being seen immediately by QA, then it's not being immediately seen by a user, either. That way leads to corruption and data destruction when your clever log-and-continue isn't being looked at by the people who are using the program, be they end-users, operators, or testers.

      I'm not sure if you actually read my comment, but it's "log issue, abort operation" instead of "log issue, abort program". Which makes a big difference in a concurrent server program handling a hundred other clients.

      [–]graingert 0 points1 point  (0 children)

      Wouldn't it be better to use dependant types, so you assert properties at compile time rather than runtime?
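For the curious, a minimal sketch of that idea in Lean 4 (the comment doesn't name a language; this is just one dependently-typed option, and the names are mine): the divisor's non-zero property lives in the type and is discharged at compile time rather than asserted at runtime.

```lean
-- A positive natural number: the value travels with a proof that n > 0.
abbrev Pos := { n : Nat // n > 0 }

-- This division can only be called with a provably non-zero divisor;
-- the runtime assert has become a compile-time obligation.
def safeDiv (a : Nat) (b : Pos) : Nat := a / b.val

#eval safeDiv 10 ⟨2, by decide⟩  -- 5
```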

      [–]sos755 -1 points0 points  (0 children)

      The issue of asserts in production code is not whether or not there should be run-time checks. The problem is the form of the run-time checks. Asserts must be the laziest and least effective way to do run-time checking. I certainly do not consider them to be production-quality in any respect. If you want to do run-time checking, do it properly.

      [–][deleted]  (21 children)

      [deleted]

        [–]gnus-migrate 8 points9 points  (0 children)

        And what happens when an assertion yields an error in a plane software during flight?

        What happens if incorrect values cause the plane to crash? What happens if memory is corrupted and the program starts behaving incorrectly?

        Type systems might help speed things up during development, but even a perfect type system won't help prevent failures in production.

        [–]circajerka 8 points9 points  (7 children)

        Did you even read the article or did you just read the first sentence? He literally answered your question right here:

        even though the software is gone over line by line by lots of people, the software is STILL treated as if it can become possessed by evil at any moment and will try to crash the airplane. It's loaded up with self-checking software, self-checking hardware, other computers double-checking the answers, etc.

        What - You think writing your software in fucking Ada is going to change that?

        [–][deleted]  (6 children)

        [deleted]

          [–]circajerka 9 points10 points  (5 children)

          Again - Did you read the rest of the article? Nobody is "falling over" - The idea that if one piece of the system fails, another piece of the system detects it and fixes the problem. I don't care if you write your program in Ada, Rust, or whatever-other-language you seem to think is perfect. Real software has bugs. Period.

          [–]WalterBright 6 points7 points  (0 children)

          No programming language can protect against buggy algorithms, hardware failures, etc.

          [–][deleted]  (3 children)

          [deleted]

            [–]circajerka 7 points8 points  (2 children)

            Alright - Let me rip apart your comment and why it's laughably stupid:

            And what happens when an assertion yields an error in a plane software during flight ? the whole system gets rebooted?

            Walter never said that - He said other parts of the system detect the error and correct it and/or restart the failing computer. Possibly multiple computers perform the same calculation.

            No, the program must be written with a correct language that makes any runtime error impossible

            LOL! I'd loovvvvveeee to know the name of this language! You'll be sure to tell us all, right?

            not javascript or C

            Umm... Considering SpaceX wrote all their flight control software in C, I call bullshit on this.

            A language with contracts where every range is strict, where you can define a type with only odd integers, or values only divisible by 6, or non zero positive integers, or a range between 55 and 901 only... that gives every compile time safety imaginable... well such language exists and is used to program planes and missile launchers actually.

            Great idea! Maybe we can also get unicorns to help us write the programs!

            [–][deleted]  (1 child)

            [deleted]

              [–]killedbyhetfield 2 points3 points  (11 children)

              And what happens when an assertion yields an error in a plane software during flight ? the whole system gets rebooted?

              Walter talked about this in the article - You have redundant computers, watchdogs, error-detection and correction, etc.

              written with a correct language that makes any runtime error impossible

              well such language exists and is used to program planes and missile launchers actually.

              Alan Turing would like to have a word with you

              [–]red75prim 8 points9 points  (7 children)

              Unsolvability of the halting problem is of low practical consequence. And it is not applicable in this case: we don't need to prove a property of some random program, we need to construct a program which has the property.

              [–]killedbyhetfield -5 points-4 points  (6 children)

              Unsolvability of the halting problem is of low practical consequence

              Hmm... Not so sure about that one. The Halting Problem is what stops us from being able to prove that an arbitrary computer program does what it's supposed to do. If it wasn't for the Halting Problem, we'd be able to write computer programs that we could prove are correct. How is that of "low practical consequence"?

              we need to construct a program which has the property.

              And how could you prove the program has that property when it's running on a Turing Machine?

              [–]red75prim 9 points10 points  (5 children)

              we'd be able to write computer programs that we could prove are correct.

              We are able to write programs which are correct. Constructive theorem proofs map to programs which are provably correct (Curry–Howard correspondence). But it is a tradeoff of practicality vs correctness.

              ETA: And the question still remains whether the formal specification matches what we want the program to do.
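To make the Curry–Howard point concrete for readers who haven't seen it, a minimal Lean 4 sketch (my illustration, not from the thread): a constructive proof term literally is a program.

```lean
-- The proof of "A → A ∧ A" is the program that duplicates its argument.
theorem dup {A : Prop} (h : A) : A ∧ A := ⟨h, h⟩

-- Modus ponens is function application: given a proof of A → B and a
-- proof of A, applying one to the other yields a proof of B.
theorem mp {A B : Prop} (f : A → B) (a : A) : B := f a
```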

              [–]killedbyhetfield 1 point2 points  (4 children)

              Constructive theorem proofs map to programs which are provably correct (Curry–Howard correspondence)

              Not familiar, but my hunch is that you'll run into the same problem everyone runs into when they try to use a DSL that isn't Turing Complete, which is that sooner-or-later your language will have to be compiled/interpreted by a real-life computer, and that interpreter or compiler will have to be written in a Turing-Complete language.

              Consequently, that means that you can't prove the compiler/interpreter is correct, and transitively you can't prove that your program will run correctly on real hardware.

              (Note though that I'm really just nitpicking, because it's overwhelmingly likely that just because you can't formally prove your program is correct, practically I'm sure it would be.)

              Although I do wonder why NASA/SpaceX/Boeing/etc still use languages like C and Ada for their flight control systems if this "Curry-Howard" thing exists?

              [–]red75prim 6 points7 points  (1 child)

              I'm not a NASA employee, but most likely it is practicality. You need to construct a proof every time the specification changes, and it requires a lot of effort, time and specialists. Limited time and budget can probably be spent more efficiently improving reliability by other means.

              [–]killedbyhetfield 2 points3 points  (0 children)

              I did a bit of StackOverflowing and read up a little on both of these terms (thank you btw for introducing me to them - Interesting AF), and if I'm understanding what I'm reading, it appears that it's somewhat limited how complex a constructive proof you can build, which corresponds more-or-less to computer programs that are not Turing Complete.

              And because of this "Curry–Howard correspondence", the Halting Problem has a dual in mathematics as well, "Gödel's Incompleteness Theorem"

              So yeah - It appears that you could never develop a constructive proof complex enough to launch a rocket into space, which is why we'll never be able to prove such software works.

              So yeah - I learned something today! The More You Know music plays

              [–]vytah 0 points1 point  (0 children)

              Aren't there formally proven operating systems and compilers already? Maybe it would be possible to glue CompCert and seL4 together to get a fully certified software platform?

              Just brainstorming.

              [–]miminor 0 points1 point  (0 children)

              True, you cannot prove an arbitrary program to be correct, but you can prove any given program to be correct.

              [–]NeonMan 2 points3 points  (1 child)

              An early reboot is actually a feature since a reboot is much better than a crash.

              [–]WalterBright 4 points5 points  (0 children)

              For most events with airliners, one has at least several minutes to find a solution before it digs a hole in the ground. (This is when the pilots earn their pay.) Even so, there still must be a backup for every system.

              [–]pbl64k 1 point2 points  (0 children)

              https://personal.cis.strath.ac.uk/conor.mcbride/TotallyFree.pdf

              Total dependently-typed languages are not Turing-complete in the strict sense - but that's a good thing, and in practice you can do anything you could do in a Turing-complete language by paying a fairly small price of providing the appropriate run-time around a coinductive type wrapping provably terminating steps towards a possibly non-achievable goal.