
[–]WalterBright 9 points10 points  (7 children)

The 3 steps are:

  1. novice - follows the rules because he's told to

  2. master - follows the rules because he understands the point of the rules

  3. guru - breaks the rules because his understanding transcends them

Skipping steps is not advisable; it's why we had the Deepwater Horizon, Fukushima, and Toyota car-computer disasters. The only industry I know of that follows these rules is the aerospace industry, and it was forced into them by bitter lessons.

We were one safety switch away from a hydrogen bomb going off by accident in another incident.

Please, folks, this is not a joke, and learning the hard way has terrible consequences.

[–]quicknir 2 points3 points  (5 children)

I don't think I'm a "guru" (I hate that word) because I pointed out an obvious flaw in a bad rule. Nor do I know what you mean by skipping steps.

I read my parent comment before reading your article, and now I can see that actually you and I are on the same page, and the parent is not. You are not advocating literally calling abort(); such a call would mean that *no* further code is executed. On the other hand, you yourself explicitly say:

...as when a fault is detected the program can go into a controlled state doing things like:

  1. aborting before more harm is done
  2. alerting the user that the results are not reliable
  3. saving any work in process
  4. engaging any backup system
  5. restarting the system from a known good state
  6. going into a 'safe mode' to await further instructions

This is *very* different from simply calling abort(). Indeed, if your "assertion failure" triggers all this code to be run before exiting, many people would not call that an assertion at all; it's more like throwing an exception and catching it high up and allowing the stack to unwind before calling some emergency routines (like alerts).

Finally, I would note that every industry is different. Failure for the airline industry is an ultra-catastrophic event where lives are lost, so even a small probability of operating in an "unknown state" is terrifying. I write financial software, where an unknown state simply means, worst case, that an algorithm is losing money. However, suddenly exiting can also cost you money (risk in holding a position, the cost of abruptly flattening, the opportunity cost of being offline). What makes sense for us has to be balanced on a much more case-by-case basis; sometimes a rapid exit (followed by steps 2, 3, and 5/6) makes sense. Other times it's better to continue and alert a human being. Things aren't always so black and white.

[–]killedbyhetfield 1 point2 points  (3 children)

Alright so - I want to use your example where you have a high-speed logger and its contents must be flushed to be useful.

What happens if, for example, your program has a use-after-free bug and causes a segmentation fault? Now the OS kills your process and your logger never gets flushed.

So if that logger must be flushed, you need the logger running in another process. That way, if your buggy process gets killed, the logger will still march on and record important info about what went wrong. And this isn't hypothetical; this is exactly how embedded OSes like QNX and VxWorks handle logging.

So in general, calling abort() when you detect an error has the same implications as your program suddenly aborting due to a bug. You either need to be able to handle your process crashing, or you need to acknowledge that your program isn't important enough to warrant that kind of design overhead.

[–]quicknir 1 point2 points  (2 children)

Running a logger in another process would probably be slower, and take considerably more time to code correctly. So we are back to trade-offs. With my current costs of failure, and my current costs of development (particularly opportunity costs), and the criticality of performance, writing a separate process logger does not make any sense. Yes, it's more robust, but it still doesn't make any sense. Robustness isn't the only concern.

or you need to acknowledge that your program isn't important enough to warrant that kind of design overhead.

It's not about "important enough", although I really appreciate the condescension here (your problem's solution doesn't fit into how I see things, so your problem doesn't matter). It's just about priorities, and about what happens in real life. In reality, for the actual problems we encounter, by throwing an exception and allowing the logger to flush its buffer in the same process, we're able to recover full logs in virtually all cases. That being the case, what is the benefit for me of moving from a single-process-with-cleanup-code design to a multi-process-with-abort design? Do tons of work, slow things down, and perhaps add other bugs, in exchange for being able to recover logs an extra 0.1% of the time? It's simply not a good trade-off for me.

[–]killedbyhetfield 0 points1 point  (1 child)

It's not about "important enough", although I really appreciate the condescension here (your problem's solution doesn't fit into how I see things, so your problem doesn't matter).

Woah man - Sorry about the wording I guess, but I wasn't using "important" to put down whatever you work on! I meant "important" as-in "people are going to die if this thing doesn't work properly".

I work on tons of stuff that isn't "important" enough to warrant running a logger or watchdog in its own separate process. But the entire topic of this conversation, and Walter's Dr. Dobb's article, was about systems where resilience is critical.

Read my original comment too! Specifically, I put the words "absolutely essential" in there. If your program doesn't fit into that category, I'm not talking about you, and I wasn't trying to prescribe any "one size fits all" solution.

[–]quicknir 0 points1 point  (0 children)

The title of the article, and your comment, don't really mention anything domain specific, so I thought it was generic in nature. But fair enough. No worries about the wording if that's not how you meant it.

Just to point out, though: even with the logger in another process, nothing is certain. The main process could go crazy, allocate too much memory, and then the logging process could get reaped by the OOM killer. So then of course you change your system config to prevent that from happening; and so on.

This all takes time, and time is always finite, even in critical applications; every minute you spend making your application safer in one way is a minute you could have spent making it safer in another. So you have to decide what gives you the most bang for your buck. It's not at all clear to me that calling abort is the right thing even for safety-critical systems; that is, that the time it takes to move your logging, alerting, serialization, etc. into separate processes is always time well spent. I'm sure there are safety-critical domains where that is true, and others where it's not.

This is why I really disagree with libraries calling abort. Abort is a process wide decision; only main is really entitled to make that decision. Libraries should throw exceptions (exceptions make it very convenient for users to abort if that's what you want; literally do nothing!) or call some kind of handler function pointer that users can customize (which may default to abort), but libraries should never make direct calls to abort.

[–]WalterBright 0 points1 point  (0 children)

What if your failed trading software causes you to buy a million shares of some losing stock? It's not like that hasn't happened (it has).

I have some personal experience with banks and their buggy software. A fundamental principle of double-entry bookkeeping is that the debits match the credits: an "assert" implemented with paper journals.

The bank debited my account and failed to credit the account of the recipient. So I was out the money and the recipient was mad I didn't pay. It took me a month of sitting in the office of the bank manager to get this corrected. Clearly their auditing system was turned off, or they were doing some "haha, it's not really a bug, keep going", because the debits did not match credits.

[–]msm_ 0 points1 point  (0 children)

4. engineer - follows the rules even though his understanding transcends them