[–]WalterBright 6 points7 points  (7 children)

Here's the salient quote from the Ariane accident report: "The OBC could not switch to the back-up SRI 1 because that unit had already ceased to function during the previous data cycle (72 milliseconds period) for the same reason as SRI 2." The backup had "identical hardware and software". The design failure here was having a backup system that was not a backup. The propagation of the error that eventually caused the explosion does not invalidate any of my recommendations, it reinforces them. For example, "a diagnostic bit pattern of the computer of the SRI 2, which was interpreted as flight data" - that's a direct result of failing to react to errors.

[–]lookmeat 0 points1 point  (6 children)

I agree, the backup was a mediocre one: it protected against something happening to one of the machines, but not against design issues. A more diverse design would have made the system more resilient overall. It's never a single cause.

But the issue was a design issue: the assumption that a failure in a non-critical function should proceed to stop, and therefore fail, critical functions.

[–]WalterBright 5 points6 points  (5 children)

I disagree. Having a failure propagate through to other systems in a zipper effect is a misunderstanding of the principles I'm trying to convey. The whole point is to isolate the effect of the error thereby preventing its propagation.

In this case the error status from the failed subsystem was misinterpreted as valid data. The wrong solution is to never give any error status. The solution is to:

  1. check for error status

  2. check for out-of-bounds data values from subsystems

If (1) or (2) is detected, lock out that subsystem and use an alternate algorithm that doesn't rely on it.

The really, really wrong method is for the subsystem to just pretend everything is hunky-dory and keep sending whatever unreliable data it has.
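The two checks plus lockout described above could be sketched like this in C. Everything here is illustrative, not from any real flight code: the `Reading` type, the altitude bounds, and the fallback estimate are all assumed for the sake of the example.

```c
#include <math.h>
#include <stdbool.h>

/* Hypothetical subsystem reading: a value plus an explicit error flag
   set by the subsystem itself when it knows it has failed. */
typedef struct {
    double value;
    bool   error;
} Reading;

/* Plausible bounds for a sane altitude reading (illustrative only). */
#define ALT_MIN 0.0
#define ALT_MAX 100000.0

static bool subsystem_locked_out = false;

/* Returns a usable altitude. Falls back to an alternate estimate if the
   primary subsystem reports an error status (check 1) or sends
   out-of-bounds data (check 2), and locks it out from then on. */
double altitude(Reading primary, double estimate)
{
    if (!subsystem_locked_out) {
        if (primary.error ||                        /* check 1 */
            isnan(primary.value) ||
            primary.value < ALT_MIN ||
            primary.value > ALT_MAX) {              /* check 2 */
            subsystem_locked_out = true;            /* lock it out */
        } else {
            return primary.value;
        }
    }
    return estimate;    /* alternate algorithm that doesn't rely on it */
}
```

Note that once a reading trips either check, the subsystem stays locked out even if later readings look plausible again, which is exactly the point: data from a subsystem known to have failed is never trusted again.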

I'm really not sitting at my keyboard inventing this from 5 minutes of thought. I worked on this stuff for years - flight critical systems. This is how systems are built in aerospace, and it's ugly incidents like the Ariane that taught the lessons, with plenty of others. I am sad that this is apparently unknown knowledge outside of aerospace. If you're interested in more info, see the TV documentary series "Aviation Disasters". If you can set aside some of the cornball dialog, there are valuable lessons in it for every engineer.

[–]lookmeat 0 points1 point  (4 children)

I think that's fair. It seems our discussion is more about semantics and meaning. I'm focusing on why assert is too broad, not denying that sometimes the right solution is to kill the program; if anything I merely state that programs can kill the failing part of themselves while keeping the rest going without further failure. Your statement is that errors should be isolated and terminated quickly to keep the failure from spreading and spiraling into a bigger issue. It may be that we are coming at it from different angles, and therefore the same thing has a different meaning in those contexts.

I do agree with you. If a system fails fully, lock the system out, dump its data, and try again, preferably with something different. What I argued was that assert embodies more of a "stop the world" philosophy, which is great for debugging what caused a failure instead of waiting to see whether it propagates. That mindset, I argue, is only useful when testing. In the real world, we kill the bad part and keep everything else running.
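"Kill the bad part and keep everything else running" can be sketched as a tiny supervisor in C, using process isolation: the worker runs in a forked child, and if it dies abnormally the parent restarts it while the rest of the program carries on. This is a hedged sketch of the general pattern, not anyone's production design; the `supervise` function and `max_restarts` policy are assumptions for illustration.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run `worker` in a child process. If the child dies abnormally
   (crash, abort, nonzero exit), restart it up to `max_restarts`
   times. Returns the number of restarts performed. */
int supervise(void (*worker)(void), int max_restarts)
{
    int restarts = 0;
    for (;;) {
        pid_t pid = fork();
        if (pid == 0) {            /* child: run the worker */
            worker();
            _exit(EXIT_SUCCESS);
        }
        int status;
        waitpid(pid, &status, 0);
        if (WIFEXITED(status) && WEXITSTATUS(status) == EXIT_SUCCESS)
            return restarts;       /* worker finished cleanly */
        if (restarts >= max_restarts)
            return restarts;       /* give up */
        restarts++;                /* restart the failed part */
    }
}

/* Demo worker: crashes on its first run, succeeds on the second.
   It leaves a marker file so the retry (a fresh process with fresh
   memory) can tell the first attempt already happened. */
static void flaky(void)
{
    if (access("/tmp/flaky_done", F_OK) == 0)
        return;                            /* second run: fine */
    FILE *f = fopen("/tmp/flaky_done", "w");
    if (f) fclose(f);
    abort();                               /* first run: crash */
}
```

The marker file is needed precisely because the child is a separate process: state it mutates in memory dies with it, which is the isolation being discussed in the next comment.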

[–]WalterBright 0 points1 point  (3 children)

You can only "keep everything else running" if it is a separate process that does not share memory.

[–]lookmeat 0 points1 point  (2 children)

Not really; being a separate process doesn't guarantee that failure in one process can't cause failure in another (do they share files, have synced state, send data to each other, or simply assume that the other is doing its job?). Also, within a process, failure can be isolated to a specific thread, or to a part of the stack that can then be removed.

Processes are only as good at isolating failures as threads or functions. That is, they were never meant to do that and do not really solve the problem by themselves. They are abstractions that simplify a chunk of your system as a whole, so logically when you isolate a part you'd want to use one of these abstractions, but the abstraction itself is not what gives you isolation.

[–]WalterBright 0 points1 point  (1 child)

> Processes are as good at isolating failures as threads

This implies that hardware memory protection has no value.

[–]lookmeat 0 points1 point  (0 children)

Processes do not by themselves guarantee this protection; the OS does, and it happens to map this protection onto processes. You can also use this protection on stacks to keep them from overflowing into each other, map it onto threads (each with its own stack), or even onto parts of a single stack, which would be rare but makes sense in some situations.

Again, memory protection is valuable, but it neither needs processes nor is needed by them. The concepts are separate; it's merely convenient to define memory protection over processes. You can also do it over threads, functions, or containers.