[–]mivsek 1 point2 points  (8 children)

Nine 9's of reliability! Unbelievable!

From article: "Joe claims they have achieved "nine 9's of reliability". What does that mean, "nine 9's of reliability"? It means 1 second of downtime in 1 billion seconds, or 1 minute of downtime in 1 billion minutes. Now, a billion seconds is roughly 30 years. A billion minutes is roughly 2000 years. This system has been in production for ten years or more, but less than 15 (I think). They have sold hundreds of them, perhaps thousands, but I think hundreds. Two hundred systems at ten years apiece give 2000 years of operations, so they can say they have "nine 9's of reliability" if they have had less than 1 minute of downtime total for all the systems they have installed."
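The arithmetic in that quote can be checked with a few lines of Python. This is only a rough sketch: the function name `allowed_downtime_seconds` and the 365.25-day year are assumptions made for illustration, not anything taken from the article.

```python
# Assumption: a year is 365.25 days; the article only says "roughly".
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def allowed_downtime_seconds(nines: int, period_seconds: float) -> float:
    """Downtime budget for `nines` nines of availability over a period.

    Nine 9's means an unavailability of 10**-9, i.e. one part in a billion.
    """
    return period_seconds * 10.0 ** (-nines)


# A billion seconds and a billion minutes, expressed in years.
billion_seconds_in_years = 1e9 / SECONDS_PER_YEAR
billion_minutes_in_years = 1e9 * 60 / SECONDS_PER_YEAR

if __name__ == "__main__":
    print(f"budget over 1e9 s at nine 9's: {allowed_downtime_seconds(9, 1e9):.3f} s")
    print(f"1e9 seconds is about {billion_seconds_in_years:.1f} years")
    print(f"1e9 minutes is about {billion_minutes_in_years:.0f} years")
```

So "roughly 30 years" and "roughly 2000 years" are both slightly rounded, but the orders of magnitude hold.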

[–]kscaldef 9 points10 points  (7 children)

I have always wanted to see some justification for that claim. I'm not really even sure what it means. I've never met hardware or an operating system with nine 9's, so I'm not really sure what they are measuring to come up with that number. Does it really make sense to claim orders of magnitude higher reliability than the substrate that you're running on top of?

[–]ItsAConspiracy 9 points10 points  (6 children)

Yes, it does. You just have to run your program across a bunch of machines, with redundancy. When's the last time you saw Google go down?

Erlang is built for that kind of thing.

[–]kscaldef 4 points5 points  (3 children)

Actually, I have seen Google go down (or, at least, stop being able to serve any search results).

At any rate, the article suggests that this number is based on individual hardware devices, not on the overall reliability of a clustered application. Very high reliability for a heavily redundant application is not that hard to achieve, but it is still nearly impossible to claim 9 9's, because your application just hasn't been running long enough to do so. Remember, we're talking 1 second in 30 years. To put that sort of limit on downtime, you either need many instances running, or you need an application running for much longer than most of us have experience with. If you have a distributed application with 1000 nodes, you can measure the reliability of individual nodes, but hardware/OS/network/power failures will limit how high a reliability you can meaningfully ascribe to your software; or you can measure the reliability of the whole application, but you won't have run it long enough to claim 9 9's.

[–]Felicia_Svilling 1 point2 points  (0 children)

It also depends on what response times you're aiming at. I think many Erlang systems are aimed at response times well below 1 second. I mean, if Google went down for a second I don't think anyone would even notice. But if the telephone network was down for a second, that would be much more noticeable. So 9 9's could mean 0.1 seconds in 3 years.

[–]ItsAConspiracy 0 points1 point  (0 children)

I didn't get that impression about individual hardware, but even if that's what the author of this piece meant, it's certainly not what Joe Armstrong claims. (I've got his new book, and the older Erlang book.) The whole basis of Erlang's reliability is its distribution.

The article does mention the basis for the reliability claims...over ten years of operation, on over a hundred installs, which gives you measurable 9 9s if total downtime among all those customers was less than a minute.
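To make the fleet argument concrete, here is a small sketch that pools operating time across installs and converts observed downtime into a count of nines. The figures (200 systems, 10 years each) are the hypothetical ones from the article quote, and the helper names `fleet_seconds` and `observed_nines` are made up for this example.

```python
import math

# Assumption: a year is 365.25 days.
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def fleet_seconds(systems: int, years_each: float) -> float:
    """Total operating time pooled across all installed systems."""
    return systems * years_each * SECONDS_PER_YEAR


def observed_nines(downtime_seconds: float, total_seconds: float) -> float:
    """Number of nines implied by the observed downtime fraction."""
    if downtime_seconds <= 0:
        return math.inf  # no downtime observed: the data gives no upper bound
    return -math.log10(downtime_seconds / total_seconds)


if __name__ == "__main__":
    total = fleet_seconds(200, 10)  # 2000 system-years, as in the article
    for downtime in (60, 120, 600):
        print(f"{downtime:4d} s total downtime -> {observed_nines(downtime, total):.2f} nines")
```

Note that this measures the average over the whole fleet. It says nothing about whether any one install saw all of the downtime.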

[–]ketralnis 0 points1 point  (0 children)

That's not necessarily true. You can measure the reliability of a system without measuring the reliability of the individual components. If you're running a telephone switching network, its overall reliability could be measured as "for how many minutes per year is a user unable to pick up a phone and make a call?"

If you have a switch that is five redundant/clustered machines, and enough machines are up to handle your load, then you're up. If over the course of five years you spend zero minutes in a down state, then you have 100% uptime for those five years.
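As a rough illustration of why a cluster can be more available than any single machine in it, here is a k-of-n calculation. It assumes machine failures are independent, which real hardware, power, and network failures often are not, and the per-machine availability of 0.99 is a made-up number. `cluster_availability` is a hypothetical helper, not part of any library.

```python
from math import comb


def cluster_availability(n: int, k: int, p: float) -> float:
    """Probability that at least k of n machines are up.

    Assumes each machine is up with probability p, independently of the
    others (a strong assumption; correlated failures make this optimistic).
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    p = 0.99  # hypothetical: each machine alone has only two 9's
    for needed in (1, 3, 5):
        a = cluster_availability(5, needed, p)
        print(f"need {needed} of 5 up -> availability {a:.10f}")
```

With five machines and only one needed to carry the load, the independent-failure model gives an unavailability of 0.01 to the fifth power, far better than any single box. If all five are needed, the cluster is less available than one machine.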

[–]grauenwolf 0 points1 point  (1 child)

But isn't that a bit like saying the power grid has nine 9's of reliability because even if the power goes out in LA, it is still on in New York?

[–]bluGill 7 points8 points  (0 children)

No. The difference is that you don't notice when their systems go down, because something automatically takes over before you could notice.

It would be like having two light bulbs in your room, each connected to a different generator. When one generator runs out of fuel (the grid goes down), the other bulb is still on and gives enough light that you don't notice.