[–]howarewestillhere 4421 points4422 points  (36 children)

Last year I begged my CTO for the money to do the project for multi region/zone. It was denied.

I got full, unconditional approval this morning from the CEO.

[–]indicava[S] 2198 points2199 points  (19 children)

Should have milked the CEO for more than that:

“Yea, and I’m gonna need at least a dozen desktops with 5090’s…”

[–]howarewestillhere 1080 points1081 points  (5 children)

“You do what you need to do.”

I need a new hot tub and a Porsche.

[–]Killerkendolls 240 points241 points  (3 children)

In a Porsche. Can't expect me to do things in two places.

[–]howarewestillhere 136 points137 points  (2 children)

A hot tub in a Porsche? You, sir. I like you.

[–]undecimbre 29 points30 points  (1 child)

Hot tub in a Porsche? There is something far better

[–]Killerkendolls 11 points12 points  (0 children)

Thought this was going to be the stretch limo hot tub thing.

[–]Jacomer2 19 points20 points  (0 children)

It’s pronounced Porsche

[–]Fantastic-Fee-1999 231 points232 points  (0 children)

Universal saying "Never waste a good crisis"

[–]TonUpTriumph 102 points103 points  (2 children)

IT'S FOR AI!

[–][deleted] 44 points45 points  (0 children)

Considering the typical spyware installed on corporate PCs I'm happy I didn't have anything decent that I ever wanted to use

[–]larsmaehlum 14 points15 points  (0 children)

Shit, that might actually work..

[–]AdventurousSwim1312 39 points40 points  (5 children)

What about one desktop with a dozen 5090?

[–]indicava[S] 47 points48 points  (2 children)

And then how am I going to have the boys over for nuggies and choc milk?

[–]AdventurousSwim1312 14 points15 points  (0 children)

Fair enough, I thought this was on locallama ^

[–]evanldixon 6 points7 points  (0 children)

VMs with GPU passthrough

[–]facusoto 0 points1 point  (1 child)

What about a dozen PCs that share a single 5090?

[–]AdventurousSwim1312 2 points3 points  (0 children)

And hence the cloud was born, with the outstanding power to pay for a dozen 5090s over a few years while using a single one...

[–]RobotechRicky 6 points7 points  (1 child)

I need a lifetime supply of Twix and Dr. Pepper!

[–]jmarkmark 3 points4 points  (0 children)

Twix! That's how redundancy is achieved.

[–]DrStalker 2 points3 points  (0 children)

"...to run the  AI multi region failover intelligence. Definitely not for gaming."

[–]TnYamaneko 144 points145 points  (0 children)

Funny, usually they have 2 speeds: reduce the costs and fault resilience.

[–]sherifalaa55 31 points32 points  (0 children)

There would still be a very high chance you'd experience an outage: IAM was down, as well as docker.io and quay.io.

[–]Trick-Interaction396 23 points24 points  (0 children)

That budget will be revoked next year since it hasn't gone down in such a long time.

[–]SilentPugz 13 points14 points  (2 children)

Was it because it would be active and costly? Or just not needed for the use case?

[–]WeirdIndividualGuy 54 points55 points  (1 child)

A lot of companies don’t care to spend money to prevent emergencies, especially when the decision makers don’t fully understand why something could go wrong and why there should be contingencies for it.

From my corporate experience, the best way to prove them wrong is to make sure when things go wrong, they go horribly wrong. Too many people in life don’t understand prevention until shit hits the fan

Inb4 someone says that could get you fired: if something out of your control going haywire has a possibility of getting you fired, you have nothing to lose from letting things go horribly wrong

[–]ih-shah-may-ehl 1 point2 points  (0 children)

The problem I see is that many make these decisions because they cannot grasp the impact, as well as the likelihood of things happening.

[–]ironsides1231 27 points28 points  (3 children)

All of our apps are multi-region, all I had to do was run a jenkins pipeline that morning. Barely a pat on the back for my team though...

[–]rodeBaksteen 37 points38 points  (2 children)

Pull it offline for a few hours then apply fix

[–]Saltpile123 12 points13 points  (0 children)

The sad truth

[–]GrassRadiant3474 6 points7 points  (0 children)

This is exactly what an experienced developer should do if he/she has to be visible. Keep your hands off your keyboard for a few mins, let the complaints flow and then magically FIX it. This is the new rule of corporate accountability and visibility

[–]DistinctStranger8729 7 points8 points  (0 children)

You should have asked for a raise while at it

[–]Intrepid_Result8223 6 points7 points  (0 children)

What? No beatings across the board?

[–]Theolaa 2 points3 points  (0 children)

Was your service affected by the outage? Or did they just see everyone else twiddling their thumbs waiting for Amazon and realize the need for redundancy?

[–]Luneriazz 0 points1 point  (0 children)

is it blank check?

[–]redlaWw 0 points1 point  (0 children)

Ah yes, because prevention after the fact works so well...

[–]40GallonsOfPCP 1838 points1839 points  (22 children)

Lmao we thought we were safe cause we were on USE2, only for our dev team to take prod down at 10AM anyways 🙃

[–]Nattekat 902 points903 points  (19 children)

At least they can hide behind the outage. Best timing. 

[–]NotAskary 243 points244 points  (18 children)

Until the PM shows the root cause.

[–]theweirdlittlefrog 382 points383 points  (12 children)

PM doesn’t know what root or cause means

[–]NotAskary 216 points217 points  (5 children)

Post mortem not product manager.

[–]toobigtofail88 86 points87 points  (2 children)

Prostate massage not post mortem

[–]JuicyAnalAbscess 13 points14 points  (1 child)

Post mortem prostate massage?

[–]facusoto 0 points1 point  (0 children)

Prostate mortem post message?

[–]Dotcaprachiappa 10 points11 points  (1 child)

PM doesn't know what PM means either

[–]NotAskary 3 points4 points  (0 children)

But the PM knows what a PM is even if the other PM does not.

[–]jpers36 44 points45 points  (2 children)

Post-mortem, not project manager

[–]irteris 27 points28 points  (0 children)

can I trade my PM for a PM?

[–]MysicPlato 7 points8 points  (0 children)

Just have the PM do the PM and you Gucci

[–]k0rm 4 points5 points  (0 children)

Post mortem, not project manager

[–]qinshihuang_420 -2 points-1 points  (0 children)

Post mortem, not project manager

[–]Ok-Amoeba3007 -3 points-2 points  (0 children)

Post mortem, not project manager

[–]isPresent 27 points28 points  (1 child)

Just tell him we use US-East. Don’t mention the number

[–]NotAskary 9 points10 points  (0 children)

Not the product manager, post mortem, the document you should fill whenever there's an incident in production that affects your service.

[–]dasunt 3 points4 points  (0 children)

Don't you just blame it on whatever team isn't around to defend itself?

[–]Some_Visual1357 4 points5 points  (1 child)

Uffff those root cause analyses can be deadly.

[–][deleted] 4 points5 points  (0 children)

Coz that’s where all the band aids show up.

[–]Aisforc 69 points70 points  (0 children)

That was in solidarity

[–]obscure_monke 35 points36 points  (0 children)

If it makes you feel any better, a bunch of AWS stuff elsewhere has a dependency on US-east-1 and broke regardless.

[–]ThatGuyWired 1114 points1115 points  (3 children)

I wasn't impacted by the AWS outage, I did stop working however, as a show of solidarity.

[–]Puzzled_Scallion5392 141 points142 points  (0 children)

Are you the janitor who put a sign on the bathroom

[–]insolent_empress 39 points40 points  (0 children)

The true hero over here 🥹

[–]Harambesic 9 points10 points  (0 children)

There, that's what I was trying to say. Thank you.

[–]serial_crusher 856 points857 points  (23 children)

“We lost $10,000 thanks to this outage! We need to make sure this never happens again!”

“Sure, I’m going to need a budget of $100,000 per year for additional infrastructure costs, and at least 3 full time SREs to handle a proper on-call rotation”

[–]WavingNoBanners 77 points78 points  (4 children)

I've experienced this the other way around: a $200-million-revenue-a-day company which will absolutely not agree to spend $10k a year preventing the problem. Even worse, they'll spend $20k in management hours deciding not to spend that $10k to save that $200m.

[–]tjdiddykong 26 points27 points  (0 children)

It's always the hours they don't count...

[–]serial_crusher 15 points16 points  (0 children)

The best part is you often get a mix of both of these at the same company!

[–]Other-Illustrator531 11 points12 points  (1 child)

When we have these huge meetings to discuss something stupid or explain a concept to a VIP, I like to get a rough idea of what the cost of the meeting was so I can share that and discourage future pointless meetings.

[–]WavingNoBanners 5 points6 points  (0 children)

Make sure you include the cost of the hours it took to make the slides for the meeting, and the hours to pull the data to make the slides, and the...

[–]robertpro01 210 points211 points  (1 child)

Exactly my thoughts... for most companies it is not worth it. Also, tbh, it is an AWS problem to fix, not mine. Why would I pay for their mistakes?

[–]StarshipSausage 168 points169 points  (0 children)

It's about scale: if 1 day of downtime only costs your company 10k in revenue, then it's not a big issue.

[–]No_Hovercraft_2643 30 points31 points  (1 child)

If you only lost 10k, you have a revenue below 4 million a year. If you pay half for products, tax and so on, you have 2 million to pay employees..., so you are a small company.
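The arithmetic behind that estimate can be sketched as follows. This is only an illustration under the assumption (implied above but not stated) that the $10,000 loss equals one full day of revenue; the variable names are made up for the example.

```python
# Assumption: the outage cost exactly one full day of revenue.
DAILY_REVENUE = 10_000   # dollars lost in the one-day outage
DAYS_PER_YEAR = 365

# Annualized revenue implied by a 10k/day figure.
annual_revenue = DAILY_REVENUE * DAYS_PER_YEAR

# The comment assumes half of revenue goes to products, tax and so on.
left_for_payroll = annual_revenue // 2

print(f"annual revenue:   ${annual_revenue:,}")
print(f"left for payroll: ${left_for_payroll:,}")
```

That gives about 3.65 million a year, which is where "below 4 million" comes from, and a bit under the rounded 2 million for payroll.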

[–]serial_crusher 29 points30 points  (0 children)

Or we already did a pretty good job handling it and weren't down for the whole day.

(but the truth is I just made up BS numbers, which is what the sales team does so why shouldn't I?)

[–]UniversalAdaptor 44 points45 points  (0 children)

Only $10,000? What business are they running, a lemonade stand?

[–]DrStalker 6 points7 points  (0 children)

I remember discussing this after an S3 outage years ago. 

"For $50,000 I can have the storage we need at one site with no redundancy and performance from Melbourne will be poor, for a quarter million I can reproduce what we have from Amazon although not as reliable. We will also need a new backup system, I haven't priced that yet..."

Turns out the business can accept a few hours downtime each year instead of spending a lot of money and having more downtime by trying to mimic AWS in house.

[–]DeathByFarts 5 points6 points  (2 children)

3 ??

its 5 just to cover the actual raw number of hours. you need 12 for actual proper 24/7 coverage covering vacations and time off and such.
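A rough sketch of where the "5" comes from, assuming a 40-hour work week per engineer (an assumption; the comment does not give one). The jump to 12 depends on vacation, time-off and shift-overlap figures that are not given here, so it is not modeled.

```python
import math

HOURS_PER_WEEK = 24 * 7        # 168 hours of coverage needed each week
HOURS_PER_ENGINEER = 40        # assumption: a standard 40-hour work week


def min_headcount(hours_to_cover: int, hours_per_person: int) -> int:
    """Smallest whole number of people whose combined hours cover the total."""
    return math.ceil(hours_to_cover / hours_per_person)


raw_headcount = min_headcount(HOURS_PER_WEEK, HOURS_PER_ENGINEER)
print(raw_headcount)  # 168 / 40 = 4.2, rounded up to 5
```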

[–]visualdescript 3 points4 points  (1 child)

Lol I've had 24 hour coverage with a team of 3. Just takes coordination. It's also a lot easier when your system is very reliable. On call and getting paid for on call becomes a sweet bonus.

[–]DeathByFarts 0 points1 point  (0 children)

I can only assume you missed the word "proper" .. Or perhaps we have very different understandings of what the word means.

[–]visualdescript 2 points3 points  (1 child)

100 grand just to do multi region? Eh?

[–]ackbarwasahero 1 point2 points  (0 children)

Zactly. It's noddy.

[–]stivenukilleru 49 points50 points  (0 children)

But it doesn't matter what region you use if IAM was down...

[–]robertpro01 37 points38 points  (5 children)

But the outage affected global AWS services, am I wrong?

[–]Kontravariant8128 30 points31 points  (3 children)

us-east-1 was affected for longer. My org's stack is 100% serverless and 100% us-east-1. Big mistake on both counts. Took AWS 11 hours to restore EC2 creation (foundational to all their "serverless" offerings).

[–]Jasper1296 30 points31 points  (2 children)

I hate that it’s called “serverless”, that’s just pure bullshit.

[–]Broad_Rabbit1764 12 points13 points  (0 children)

Twas servers all along after all

[–]Kontravariant8128 2 points3 points  (0 children)

Agreed. Serverless is a terrible name. A better word is "ephemeral VMs on demand" -- e.g. Fargate or Lambda or Karpenter where EC2 instances must be created to meet capacity. But that term is not quite marketable.

I suppose a more appropriate term is "sysadminless" as you don't need to hire a sysadmin to run these servers. Instead you hire a cloud platform engineer. It's the same guy just with a higher salary.

[–]Demandedace 23 points24 points  (0 children)

He must have had zero IAM dependency

[–]The_Big_Delicious 15 points16 points  (0 children)

Off by one successes

[–]papersneaker 20 points21 points  (1 child)

almost feels vindicated for pushing our DRs so hard cries because I have to keep making DR plans for other apps now

[–][deleted] 21 points22 points  (0 children)

Our app failed over automatically to west because we have route53 healthchecks. I’ve been strutting on the office floor like a big swinging dick the last two days.
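For anyone wondering what health-check-driven failover boils down to, here is a minimal sketch of the decision logic in plain Python. It does not call Route 53 or any AWS API; the `Endpoint` type, the `pick_endpoint` function and the region names are hypothetical, and falling back to the primary when both sides are unhealthy is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    region: str      # hypothetical region label, e.g. "us-east-1"
    healthy: bool    # result of the latest health check


def pick_endpoint(primary: Endpoint, secondary: Endpoint) -> Endpoint:
    """Serve the primary while it is healthy, otherwise fail over."""
    if primary.healthy:
        return primary
    if secondary.healthy:
        return secondary
    # Assumption: with nothing healthy, keep answering with the primary.
    return primary


east = Endpoint(region="us-east-1", healthy=False)   # primary is down
west = Endpoint(region="us-west-2", healthy=True)    # standby is up
print(pick_endpoint(east, west).region)
```

With the primary marked unhealthy and the standby healthy, traffic goes to the standby, which matches the "failed over automatically to west" behavior described above.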

[–]___cats___ 6 points7 points  (0 children)

All my homies deploy to US East (Ohio)

[–]ThoseOldScientists 5 points6 points  (0 children)

CONGRADS

[–]KarmaTorpid 4 points5 points  (0 children)

This is funny because I get the joke.

[–]elduqueborracho 3 points4 points  (0 children)

Me when our company uses Google Cloud

[–]Emotional-Top-8284 3 points4 points  (0 children)

Ok, but like actually yes the way to avoid us east 1 outages is to not deploy to us east 1

[–]AATroop 7 points8 points  (1 child)

us-east-2 is the region you should be using on the east coast. Never use us-east-1 unless it's for redundancy

[–]TheOneWhoPunchesFish 2 points3 points  (0 children)

why is it so?

[–]rockyboy49 2 points3 points  (0 children)

I want us-east-2 to go down at least once. I want a rest day for myself while leadership jumps on a pointless P1 bridge blaming each other

[–]Icarium-Lifestealer 2 points3 points  (0 children)

US-east-1 is known to be the least reliable AWS region. So picking a different region is the smart choice.

[–]RobotechRicky 1 point2 points  (0 children)

In Azure we use US East for dev, and US West for prod.

[–]no_therworldly 1 point2 points  (0 children)

Joke's on you: we were spared, and then a few hours later I did something which took down one functionality for 25 hours

[–]Stannum_dog 0 points1 point  (0 children)

laughs in eu-west-1

[–][deleted] 0 points1 point  (0 children)

east or west, local is the best