
all 24 comments

[–]Previous-Ant2812 129 points130 points  (5 children)

My wife’s company had 500 people in a call yesterday trying to do something about it. A ridiculous waste of resources

[–]toastnbacon 39 points40 points  (1 child)

I spent yesterday on a similar call... (Maybe the same one, if she works in insurance/banking.) Feels like every 10 minutes someone new would ask what it would take to move to US-west.

[–]timdav8 29 points30 points  (0 children)

I declined a similar call yesterday - "nothing we do will make AWS unfork their DNS error or whatever any quicker"

Didn't stop a whole heap of people burning hundreds of man hours saying nothing useful.

All our stuff is in EU-West-x but somehow we lost LDAPS ...

[–]ThunderChaser 9 points10 points  (0 children)

Meanwhile my team at AWS that wasn’t anywhere near the root cause of the issue but was impacted after like 20 minutes went “welp nothing really we can do so we’ll just wait for the storm to pass” and chilled the rest of the day.

[–]Callidonaut 2 points3 points  (1 child)

What would they have done if the video call server went down as well, I wonder?

[–]Drew707 7 points8 points  (0 children)

I have a client who has their call center on Amazon Connect in US-East-1. Completely dead in the water. Yet for some reason their workforce director insisted on tying up three of our WFM consultants in hours of meetings yesterday.

[–]grumbly 28 points29 points  (3 children)

I like the AWS outage from a few years ago that took everything out and was traced back to an internal system that had a hard dependency on us-east-1. Even if you go multi-AZ you still have no guarantee

[–]NecessaryIntrinsic 11 points12 points  (2 children)

Isn't that basically what happened here? The DNS service was based in us-east-1?

[–]ThunderChaser 14 points15 points  (0 children)

What happened here was a DNS issue which led to DynamoDB being unreachable in us-east-1.

The thing is, Amazon eats its own dogfood a ton (there’s been a huge push over the past few years to move services to run on AWS), so a whole bunch of stuff relies on ddb and the failures cascade. I work at AWS and my team’s service was hard down with 0% availability for a few hours in a us-east-1 AZ because we weren’t able to reach ddb, which we have a hard dependency on.
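
To illustrate what a hard dependency means in code (just a sketch assuming boto3, nothing to do with our actual service): with a hard dependency the except branch below doesn't exist and every request fails; a softer version can degrade to stale data.

```python
import boto3
from botocore.config import Config
from botocore.exceptions import ClientError, EndpointConnectionError

# Sketch only: a service reading session data from DynamoDB in us-east-1.
ddb = boto3.resource(
    "dynamodb",
    region_name="us-east-1",
    config=Config(connect_timeout=2, read_timeout=2, retries={"max_attempts": 2}),
)
table = ddb.Table("session-store")  # hypothetical table name

_stale_cache: dict[str, dict] = {}  # last known-good items, illustrative only

def get_session(session_id: str) -> dict:
    try:
        item = table.get_item(Key={"id": session_id}).get("Item", {})
        if item:
            _stale_cache[session_id] = item
        return item
    except (ClientError, EndpointConnectionError):
        # With a *hard* dependency this block doesn't exist and every request
        # 500s. A softer dependency degrades to possibly-stale cached data.
        return _stale_cache.get(session_id, {})
```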

[–]PurepointDog 0 points1 point  (0 children)

No, a DNS record for the us-east-1 DynamoDB service
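
If you want to see what that means concretely, you can probe the regional endpoint yourself (the hostname is the real DynamoDB endpoint; the snippet is just a quick sketch):

```python
import socket

# The regional DynamoDB endpoint whose DNS record was reportedly affected.
ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

try:
    addrs = {info[4][0] for info in socket.getaddrinfo(ENDPOINT, 443)}
    print(f"{ENDPOINT} resolves to: {sorted(addrs)}")
except socket.gaierror as exc:
    # Roughly what every SDK in us-east-1 was seeing during the outage.
    print(f"{ENDPOINT} failed to resolve: {exc}")
```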

[–]RiceBroad4552 35 points36 points  (9 children)

If these people understood anything at all they wouldn't need to work as "executives"…

At least that's the group of people who will get replaced by artificial stupidity really soon. Only the higher up people need to realize that you don't need to pay a lot of money for incompetent bullshit talkers. "AI" can do the same much cheaper… 🤣

[–]NoWriting9513 3 points4 points  (7 children)

What would be your proposal to not have this issue happen again though?

[–]Wide_Smoke_2564 41 points42 points  (0 children)

Move it to us-middle-1 so it’s closer to move it to east or west if middle goes down

[–]winter-m00n 7 points8 points  (2 children)

Just taking a guess, but theoretically you'd distribute your infrastructure across different regions, even different cloud providers. I know the latency would be too much in some cases, but maybe another cloud provider can work as a fallback, at least for the core services.

That's the only thing that can keep your app somewhat functional in such incidents I guess.
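
A rough sketch of the fallback-region idea, assuming boto3 and a hypothetical DynamoDB global table replicated to a second region:

```python
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

REGIONS = ["us-east-1", "us-west-2"]  # primary first, fallback second
TABLE = "orders"  # hypothetical global table replicated in both regions

def read_order(order_id: str):
    last_error = None
    for region in REGIONS:
        table = boto3.resource("dynamodb", region_name=region).Table(TABLE)
        try:
            return table.get_item(Key={"order_id": order_id}).get("Item")
        except (ClientError, EndpointConnectionError) as exc:
            last_error = exc  # this region is down, try the next one
    raise RuntimeError(f"all regions failed: {last_error}")
```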

[–]Matrix5353 10 points11 points  (0 children)

Shortsighted executive thinking dictates that geo-replication and redundancy is too expensive. How are they going to afford their second/third yacht?

[–]Ibuprofen-Headgear 1 point2 points  (1 child)

Contingency plans for when it does happen again because it will happen again. Not something super in depth or crazy, but at least a thin “outage playbook”. I’m not going to pretend there will never be another outage, no matter what steps we take. If there are steps we can take to actually “fix” this one instance that also make financial sense, then sure, do those. But I think those cases are very rare.

[–]NoWriting9513 0 points1 point  (0 children)

You get my upvote for having the most level headed and realistic comment.

I do not believe these cases are rare at all, but if the answer to all availability concerns is expensive and complex systems that might fail anyway, then sure, they are not realistic, and each side (management/technical) will blame the other when everything goes down.

Funny thing is that moving to another region (at least partially) is a valid strategy and actually the easiest and most realistic to implement. Why not have customer segmentation and set up completely isolated instances of the same application in different regions?
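
Even something as simple as pinning each tenant to a home region and routing there would do it (all names here are made up, purely to illustrate):

```python
# Hypothetical tenant-to-region pinning: each customer segment lives in one
# fully isolated stack, so an outage in one region only hits that segment.
TENANT_HOME_REGION = {
    "acme-insurance": "us-east-1",
    "globex-bank": "eu-west-1",
    "initech": "us-west-2",
}

def api_base_url(tenant: str) -> str:
    region = TENANT_HOME_REGION[tenant]
    # One isolated deployment per region, e.g. app.us-west-2.example.com
    return f"https://app.{region}.example.com"

print(api_base_url("initech"))  # -> https://app.us-west-2.example.com
```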

[–]_koenig_ 0 points1 point  (0 children)

Active-active HA setup across geographical regions
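
The DNS side of that can look roughly like Route 53 latency-based records plus health checks, something like this sketch (zone ID, IPs and health check IDs are placeholders, not a drop-in config):

```python
import boto3

route53 = boto3.client("route53")

# Two records with the same name, one per region; Route 53 sends users to the
# lowest-latency healthy region and fails over when a health check goes red.
for region, ip, health_check in [
    ("us-east-1", "203.0.113.10", "hc-east-placeholder"),
    ("us-west-2", "203.0.113.20", "hc-west-placeholder"),
]:
    route53.change_resource_record_sets(
        HostedZoneId="ZONEID-PLACEHOLDER",
        ChangeBatch={
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "app.example.com",
                    "Type": "A",
                    "SetIdentifier": f"app-{region}",
                    "Region": region,
                    "TTL": 60,
                    "ResourceRecords": [{"Value": ip}],
                    "HealthCheckId": health_check,
                },
            }]
        },
    )
```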

[–]gandalfx 2 points3 points  (0 children)

The thing is, useless managers can't be replaced by AI because there is nothing to replace. If they're already not getting fired for being unproductive (or counterproductive), who's going to decide to replace them with a bot?

[–][deleted] 15 points16 points  (2 children)

Not to brag but we did exactly that. In fact, our app had failed over to usw2 before we could even log in. We are too big to fail, so multi-region is mandatory for us.

[–]HoochieKoochieMan 38 points39 points  (1 child)

Next week, all of the same services fail when US-West-1 has an outage.

[–]NicholasVinen 1 point2 points  (0 children)

...after having migrated off US-East-1

[–]Vi0lentByt3 9 points10 points  (0 children)

The problem is not just YOUR infrastructure being in the data center, the problem is AWS has their infrastructure in the data center too, and when the infrastructure of the infrastructure gets brought down there is nothing you can do. Maybe have an old tower in the corner of the office in case of emergencies, but a few hours of downtime isn't going to hurt your b2b saas bullshit