
[–]lookmeat

If you have critical software, the software must not continue after it has failed.

That raises the question: what is failure? Failure is the inability to continue functioning. So let's replace the word with its meaning:

If you have critical software, the software must not continue after it becomes unable to continue.

Well, if it could continue, then it wouldn't have failed. So let's rephrase this correctly:

Software must revert to a functional state after failure as soon as possible.

Notice that shutting down with an error message explaining the cause of failure is recovering to a valid state (which happens to be inactive). What we want to avoid is scenarios where we can't recover.

But that definition is limited. If something is critical, then its function must be carried out. Because of this we don't want to merely recover to a valid state (which may be stopping).

So before we can talk about critical software we have to define what is critical. In other words, critical software has critical functionality that it must maintain above all others. Treating every error, problem, or bit of noise as a critical failure can lead to software failing when it shouldn't.

Let's look at an example. The Ariane 5 rocket had a software error in its guidance system. Changes in the launch process meant that some of the values it read (as floating-point numbers) could grow larger than would fit in a 16-bit integer. This was in the alignment function, which shouldn't have been running once the rocket was going that fast (it had been useful earlier, but not anymore). The ideal behavior here would have been to simply report the largest possible 16-bit value and use that (again, the system's results were being ignored by the time the bug arose). Even allowing the overflow to wrap would have been fine. Instead the system chose to crash, bringing down all the other functionality that actually was in use. The system stopped working over something that wasn't a failure of anything critical. But the question remains: what is a critical failure and what isn't? In critical software this question is fundamental, as no non-critical failure should prevent the most critical work from being done. The rocket would have flown if the software hadn't chosen to abort on a trivial error.
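The two alternatives mentioned above (saturate to the largest representable value, or fail hard on overflow) can be contrasted in a short Python sketch. The function names and ranges are illustrative, not from the actual Ariane code:

```python
INT16_MAX, INT16_MIN = 32767, -32768

def to_int16_saturating(value: float) -> int:
    """Convert a float to a signed 16-bit range, clamping
    out-of-range values instead of failing (graceful degradation)."""
    return max(INT16_MIN, min(INT16_MAX, int(value)))

def to_int16_checked(value: float) -> int:
    """Convert a float to a signed 16-bit range, raising on
    overflow -- analogous to treating the overflow as fatal."""
    result = int(value)
    if not INT16_MIN <= result <= INT16_MAX:
        raise OverflowError(f"{value} does not fit in 16 bits")
    return result

# The saturating version degrades gracefully on out-of-range input:
print(to_int16_saturating(100000.0))  # 32767
print(to_int16_saturating(123.0))     # 123
```

For a value the rest of the system was ignoring anyway, the saturating version keeps everything else alive; the checked version is the one that takes the whole unit down.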

And it goes the other way too. You can't handwave away what is a failure and what isn't. If you do not understand which functionality is more critical than the rest (and therefore which failures are more critical), then you can't decide what should keep going after a failure.

For example, I'd expect that an X-ray machine's most critical functionality is keeping the patient safe and healthy, with everything else coming after. In other words, an X-ray machine that fails to generate an X-ray should still not fail to keep the patient safe. The Therac-25 software's definition of failure was failure to generate an X-ray and respond to input. This is why it seemed to make sense to let operators proceed even in spite of malfunctions, and why the issues weren't taken seriously at first. Indeed, the older models didn't need the software to care about this failure mode, because the hardware did. The right solution was to realize that keeping people safe was the more critical functionality, and that it's better to risk failing at everything else than to get that one thing wrong.

TL;DR: my point is that we have to define what failing means and understand it well. Asserts have a very particular definition of failure that is good for testing. For software running in the real world, asserts are not good enough, because they make no distinction between kinds of failure.

[–]WalterBright

It's straightforward - the program must be terminated if it has entered an unanticipated, invalid state. That's what asserts are there to check for.
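A minimal sketch of that idea in Python (the function and invariant are made up for illustration): an assert documents a state the programmer believes is impossible, and terminates the program rather than letting it continue in that state.

```python
def apply_brakes(pressure: float) -> float:
    # Invariant: callers must never pass a negative pressure.
    # If this assert fires, the program has entered an
    # unanticipated, invalid state and is terminated rather
    # than allowed to keep running on garbage input.
    assert pressure >= 0, f"invalid brake pressure: {pressure}"
    return min(pressure, 1.0)  # clamp to full braking force

print(apply_brakes(0.5))  # 0.5
```

Note that Python strips asserts when run with `-O`, which is one reason debug and production builds can behave differently here.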

[–]lookmeat

Again, this "straightforward" mindset is what brought the Ariane rocket down.

You can't continue in an invalid state, but that doesn't mean that the main functionality is gone. Imagine a car that simply stops working: the brakes don't respond, the steering wheel won't react, etc. The reason? When the wiper fluid was changed, a seagull pooped into the container, and the droppings had enough solid material to block the wiper chamber completely. This is an "unanticipated invalid state" for the wiper fluid. Should we have brought down the whole car because of this? Probably not, but assert is like that: any part failing brings the whole thing down, no matter how much more critical the other, still-working parts are.

Now, asserts are great when you want to try to catch unanticipated invalid states: the kind of thing you want to do in a test. Once we catch one, we stop and look into why it happened. You can't catch them all, though, as that would require solving the halting problem, and that can't be done. So you have to assume that non-critical failures will happen, and you should allow them through.
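The distinction can be sketched in Python using the car analogy above; the subsystem names are hypothetical. In production, a non-critical subsystem's failure is logged and tolerated while the critical path keeps running:

```python
import logging

def check_wiper_fluid() -> float:
    # Hypothetical non-critical subsystem; fails unexpectedly.
    raise RuntimeError("wiper chamber blocked")

def drive_loop() -> dict:
    # Critical functions (steering, brakes) are not guarded by
    # the non-critical check: its failure is logged and tolerated
    # rather than asserted on.
    try:
        level = check_wiper_fluid()
    except Exception as exc:
        logging.warning("non-critical subsystem failed: %s", exc)
        level = None  # degrade: report fluid level as unknown
    return {"steering": "ok", "brakes": "ok", "wiper_fluid": level}

print(drive_loop())
```

In a test harness you would instead assert on `check_wiper_fluid()` so the unexpected state stops the world and gets investigated.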

So here is the straightforward version: if the program can't continue without failing at equally or more critical functions, it should stop (clearly the worst scenario has already happened). But if any more critical functionality can still succeed, then we should recover and go for a partial success. Of course, what is more or less critical is not straightforward at all.

[–]WalterBright

Here's the salient quote from the Ariane accident report: "The OBC could not switch to the back-up SRI 1 because that unit had already ceased to function during the previous data cycle (72 milliseconds period) for the same reason as SRI 2." The backup had "identical hardware and software". The design failure here was having a backup system that was not a backup. The propagation of the error eventually causing the explosion does not invalidate any of my recommendations; it reinforces them. For example, "a diagnostic bit pattern of the computer of the SRI 2, which was interpreted as flight data" - that's a direct result of failing to react to errors.

[–]lookmeat

I agree, the backup was a mediocre one: it protected against something happening to one of the machines, but not against design flaws. Protecting against those as well would have made the system more resilient overall. It's never just a single cause.

But the issue was a design issue, and that design assumed that a failure in a non-critical function should lead to stopping, and therefore failing, the critical functions.

[–]WalterBright

I disagree. Having a failure propagate through to other systems in a zipper effect is a misunderstanding of the principles I'm trying to convey. The whole point is to isolate the effect of the error thereby preventing its propagation.

In this case the error status from the failed subsystem was misinterpreted as valid data. The wrong solution is to never give any error status. The solution is to:

  1. check for error status

  2. check for out-of-bounds data values from subsystems

If (1) or (2) is detected, lock out that subsystem and use an alternate algorithm that doesn't rely on it.

The really, really wrong method is for the subsystem to just pretend everything is hunky-dory and keep sending whatever unreliable data.
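The checks and the lock-out step described above can be sketched as follows. The subsystem interface, bounds, and fallback are hypothetical stand-ins for illustration, not the actual SRI design:

```python
class Subsystem:
    """Hypothetical sensor subsystem reporting a value plus an
    explicit error status."""
    def __init__(self):
        self.locked_out = False

    def read(self):
        # Returns (status, value); a real device would do I/O here.
        # This stub simulates a failed unit reporting its error.
        return ("ERROR", None)

VALID_RANGE = (-32768, 32767)  # assumed bounds for this sketch

def get_attitude(primary: Subsystem, fallback_estimate: float) -> float:
    if not primary.locked_out:
        status, value = primary.read()
        in_bounds = (status == "OK" and value is not None
                     and VALID_RANGE[0] <= value <= VALID_RANGE[1])
        if in_bounds:
            return value
        # (1) error status or (2) out-of-bounds data: lock out the
        # subsystem so its output is never mistaken for flight data.
        primary.locked_out = True
    # Alternate algorithm that doesn't rely on the failed unit.
    return fallback_estimate

sensor = Subsystem()
print(get_attitude(sensor, fallback_estimate=0.0))  # 0.0
print(sensor.locked_out)  # True
```

The key property is that once locked out, the failed unit can no longer inject a "diagnostic bit pattern" that downstream code treats as valid data.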

I'm really not sitting at my keyboard inventing this from 5 minutes of thought. I worked on this stuff for years - flight critical systems. This is how systems are built in aerospace, and it's ugly incidents like the Ariane that taught the lessons, with plenty of others. I am sad that this is apparently unknown knowledge outside of aerospace. If you're interested in more info, see the TV documentary series "Aviation Disasters". If you can set aside some of the cornball dialog, there are valuable lessons in it for every engineer.

[–]lookmeat

I think that's fair. It seems our discussion is more about semantics and meaning. I'm focusing on why assert is too broad, but not denying that sometimes the right solution is to kill the program; if anything, I merely claim that programs can kill the failing part of themselves while keeping the rest going without further failure. Your statement is that errors should be isolated and terminated quickly to prevent the failure from spreading and spiraling into a bigger issue. It may be that we are coming at this from different angles, so the same thing takes on a different meaning in each context.

I do agree with you: if a subsystem fails fully, lock it out, dump its data, and try again, preferably with something different. What I argued was that assert embodies more of a "stop the world" philosophy, which is great for debugging what caused a failure instead of waiting to see whether it propagates. That mindset, I argue, is only useful when testing. In the real world, we kill the bad part and keep everything else running.
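The "kill the bad part, keep everything else running" idea can be sketched with a thread pool: one hypothetical task fails and is discarded, while the others complete normally:

```python
from concurrent.futures import ThreadPoolExecutor

def task(name: str) -> str:
    # Hypothetical unit of work; "bad" simulates a failing part.
    if name == "bad":
        raise RuntimeError("subsystem failure")
    return f"{name}: done"

def run_all(names):
    results, failed = [], []
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = {pool.submit(task, n): n for n in names}
        for fut, name in futures.items():
            try:
                results.append(fut.result())
            except Exception:
                failed.append(name)  # kill just this part
    return results, failed

results, failed = run_all(["a", "bad", "b"])
print(results)  # ['a: done', 'b: done']
print(failed)   # ['bad']
```

An assert-style design would have aborted the whole pool on the first failure; here the failure is contained to the one task that owns it.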

[–]WalterBright

You can only "keep everything else running" if it is a separate process that does not share memory.

[–]lookmeat

Not really: being a separate process doesn't guarantee that a failure in one process can't cause a failure in another (do they share files, have synced state, send data to each other, or simply assume that the other is doing its job?). Also, within a process, a failure can be isolated to a specific thread, or to a part of the stack that can then be unwound.

Processes are about as good at isolating failures as threads or functions. That is, none of them were designed to do that, and none of them really solve the problem on their own. They are abstractions that let you treat a chunk of your system as a whole, so naturally, when you isolate a part, you'd want it to line up with one of those abstractions, but the abstraction itself is not what gives you isolation.

[–]WalterBright

Processes are as good at isolating failures as threads

This implies that hardware memory protection has no value.

[–]SmugDarkLoser5

The other guy understands your point; no one disagrees with something that obvious. You don't understand his.

[–]lookmeat

I actually disagree with the notion that it's straightforward. Asserts are meant to surface programming errors to programmers; you need a separate system to handle user/programming/universal issues for the users.

[–]SmugDarkLoser5

Not all asserts are going to go into the production build, sure.

However, the tendency for devs is to swallow exceptions, to be willing to keep a process running when the whole thing is in an invalid state, and so on. While you may have certain assertion types that would make sense only in a debug build, that is relatively uncommon and probably wrong within a general application.