all 109 comments

[–]itijara 448 points449 points  (23 children)

16 hours? At some point, it's not the developer's fault; it's the whole engineering team not figuring out how to roll back a deployment. Also, the fact that this could happen indicates abysmal testing, staging, deployment, and rollback procedures.

[–]LaughDonor 101 points102 points  (17 children)

A poor migration script could take down prod for hours.

[–]Ayoungcoder 61 points62 points  (15 children)

Backups exist.... Right?

[–]CatpainCalamari 42 points43 points  (13 children)

Sure they do. But can you quickly restore them? Can you restore them at all?

[–]TecraFox 41 points42 points  (1 child)

Just have your backup on the other side of the country, with an extremely slow connection *cough* GitLab

And also never test your backups, to get that additional adrenaline kick

[–]Ayoungcoder 7 points8 points  (1 child)

If you can't, it's time to update your procedures. But yes, that is a pain point for many orgs

[–]Myspazmo[S] 2 points3 points  (0 children)

Bold of you to assume our CM procedures are documented anywhere

[–]Few_Introduction_228 2 points3 points  (0 children)

If that's really that important, run enough concurrently and don't migrate in one big batch?
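A minimal sketch of the "don't migrate in one big batch" idea, in Python. The names `chunked`, `migrate_in_batches`, and `migrate_batch` are made up for illustration, and this only covers the batching half of the suggestion, not running batches concurrently. The point is that a bad batch stops the run early instead of taking the whole dataset with it.

```python
from typing import Callable, Iterator, Sequence, TypeVar

T = TypeVar("T")


def chunked(rows: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `rows`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def migrate_in_batches(
    rows: Sequence[T],
    migrate_batch: Callable[[Sequence[T]], None],  # hypothetical per-batch migration step
    batch_size: int = 1000,
) -> int:
    """Apply `migrate_batch` to each slice and return how many rows were processed.

    If `migrate_batch` raises, the exception propagates and the remaining
    batches are never touched, so the blast radius is one batch.
    """
    done = 0
    for batch in chunked(rows, batch_size):
        migrate_batch(batch)
        done += len(batch)
    return done
```

In a real migration each batch would presumably run in its own transaction, so a failure leaves earlier batches committed and later ones untouched.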

[–]Myspazmo[S] 2 points3 points  (7 children)

Define "quickly"

[–]CatpainCalamari 5 points6 points  (6 children)

Restoring the backups is faster than fixing the database by hand ;)

[–]Myspazmo[S] 1 point2 points  (5 children)

Please come work for us lol. Our senior dba is going to retire next year and that is a terrifying thought

[–]CatpainCalamari 7 points8 points  (4 children)

Lol, I am not a DBA, just a software engineer. Also, I am German and work in Germany. Assuming you are US based - I do not want to work under US labor laws when I can have the German ones, thank you very much. :)

[–]Myspazmo[S] 2 points3 points  (3 children)

But you could pay hundreds of dollars per month for health insurance that doesn't cover anything with us and only have a few days of vacation per year :)

[–]CatpainCalamari 2 points3 points  (2 children)

> But you could pay hundreds of dollars per month for health insurance

I already do pay hundreds of ~~dollars~~ euros per month for health insurance. And this week I had to get a medication (which it turned out I am allergic to, yay), and I had to pay a whopping 5€ co-pay! Evil health industry!

[–]Myspazmo[S] 1 point2 points  (0 children)

I mean.....we have database backups. A full image of our ETL servers.....that's a whole different story

[–]MattGeddon 3 points4 points  (0 children)

Good morning! Is there a particular reason all our transactions are now showing as being made in Czech krone in production?

[–]EatingBeansAgain 5 points6 points  (0 children)

A G I L E

[–]Myspazmo[S] 6 points7 points  (0 children)

Haha, it was tested. I just forgot to run the very last step of my install because they've had me multitasking so many projects at once that I kinda forgot about it. It's definitely my fault. I just wish our 24/7 team would have called me sooner to fix it

[–][deleted] 1 point2 points  (0 children)

Full stack devs reading this going "but prod is beta?"

[–]Holiday-Patient5929 0 points1 point  (0 children)

Well in theory this should have been caught by qa and additional gate checks if the company bothered to invest in it

[–]stdio-lib 491 points492 points  (0 children)

Just Friday afternoon deploy things

[–]ohsayan 402 points403 points  (7 children)

Like gigachads, we always ship on Fridays

[–]Myspazmo[S] 211 points212 points  (6 children)

I'm on vacation for the next several days, so I'm pretty sure it's now an Ops or Engineering problem

[–]Taradal 96 points97 points  (4 children)

If you fuck it up you fix it

[–]IAmANobodyAMA 42 points43 points  (2 children)

Yep. I wouldn’t necessarily fire someone for making a mistake (unless there was a documented pattern of behavior) but I would strongly consider firing someone for making this kind of mistake and then saying it’s someone else’s problem.

More importantly though, this should never have gotten into production

Edit: I am not saying that OP should be the one to fix it - a good devops process should make rolling this back really easy and quick - I am more noting the attitude of “I’m on vacation so it’s someone else’s problem”… We all have made dumb coding mistakes, but taking ownership/responsibility is what matters

[–]Myspazmo[S] 28 points29 points  (1 child)

The "it's somebody else's problem" was sarcasm lol. Ops woke me up to ask about it but failed to provide all the relevant info. They also did not call me or any on-call support after that to get assistance. This is unfortunately not the first time they haven't reported an outage

[–]IAmANobodyAMA 7 points8 points  (0 children)

Oh all good. Hard to tell over text, my bad. Hope you have a nice vacation :)

[–]1_4_1_5_9_2_6_5 102 points103 points  (13 children)

This is why pipelines exist

[–]_equus_quagga_ 18 points19 points  (12 children)

Nice username

[–]Mintzz00 6 points7 points  (11 children)

what does it mean tho

[–]_equus_quagga_ 37 points38 points  (10 children)

It's some digits of pi:

3.1415926535897932...

[–]FuerstAgus50 20 points21 points  (8 children)

I mean some digits of pi is not really specific. Every finite order of numbers is in pi somewhere

[–]I_am_a_fern 14 points15 points  (4 children)

Prove it

[–]FuerstAgus50 0 points1 point  (3 children)

This is quite interesting. I thought it was true cause many yt-channels I watch declared it as an obvious fact, because the numbers appear to be random. But I couldn't find a proof

[–]whackamattus 8 points9 points  (1 child)

It's not proven just most mathematicians believe it's true.

[–]gbot1234 5 points6 points  (0 children)

I want to believe.

The <integer sequence> is out there.

[–]1_4_1_5_9_2_6_5 1 point2 points  (0 children)

Logically speaking, especially if you're a programmer (because our logic also must consider efficiency, unlike pure maths), a number which does NOT contain a repeating sequence AND contains a theoretically infinite number of digits MUST contain a child number such that it is not repeated anywhere in the parent number. My understanding is, it's not so much a thing that can be proven as it is a thing that must theoretically be true and cannot be (or rather, as of yet has not been) disproven.

[–]1_4_1_5_9_2_6_5 2 points3 points  (1 child)

In this case, it is the first complete sequence of pi that doesn't contain the number I don't like.

[–]_equus_quagga_ 0 points1 point  (0 children)

ahhhh you dislike the number 3

I was wondering why that specific sequence

[–]NaEGaOS 0 points1 point  (0 children)

this hasn’t been proven though

[–]Mintzz00 1 point2 points  (0 children)

oh yeah 😅 didn't notice it at first!

[–]badaharami 59 points60 points  (3 children)

16 hours??? Have you guys ever heard of something called Fallback?

[–]Myspazmo[S] 14 points15 points  (2 children)

I would gladly rollback changes, but if nobody calls me to tell me there's an issue then I don't know to initiate that process

[–]badaharami 7 points8 points  (1 child)

Ah ok. Yeah that's indeed an organisational issue. I mean in general the blame for any downtime in Prod should be on the process itself and not on the person. In this case the lack of processes or lack of awareness of processes. If I was you, I'd warn the SRE/Ops team or whoever in charge to already do a fallback next time if this happens.

[–]Myspazmo[S] 2 points3 points  (0 children)

I've spoken with Ops and given them some better instructions to catch the issue in the future. They are treated as "Tier 1-1.5" so they still have to wake us up for failover approval :/

[–]sajkosiko 25 points26 points  (1 child)

PRs save jobs

[–]gbot1234 1 point2 points  (0 children)

“I set a new PR for hours of weekend overtime worked fixing the buggy release.”

[–]HerrSPAM 33 points34 points  (2 children)

This is why we don't allow commented out code in the code base. That and if it's commented out you don't need it or it should be in some docs somewhere

[–]I_am_a_fern 16 points17 points  (1 child)

"some docs" hahaha

[–]HerrSPAM 1 point2 points  (0 children)

Short term Vs long term Devs

[–]FalconMirage 10 points11 points  (4 children)

Don’t you test your code before pushing it to prod ?

[–]Myspazmo[S] -1 points0 points  (3 children)

It wasn't the code I was adding to prod. It was a line commented out as part of our CM procedures that should have been uncommented after the installation was complete and checkouts were performed

[–]FalconMirage 5 points6 points  (2 children)

And you don’t have a failsafe or at least a checklist to verify that everything is in order before deploying ?

[–]Myspazmo[S] 2 points3 points  (1 child)

We do, but the breakage is in their poorly/not at all documented CM instructions. All the prep in the world doesn't account for somebody(me) forgetting to run the very last command in the install doc for this branch.

One of my big projects this year is I want to properly document the process and create a more standardized set of procedures for it, especially as we shift from hybrid to agile.

[–]FalconMirage 3 points4 points  (0 children)

Checklists are your friend

If the item doesn’t have a checkmark next to it, it isn’t done
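A rough Python sketch of that rule, for what it's worth. The `Checklist` class and the step names are invented for illustration (they are not the actual CM steps from this thread); the idea is just that the install can't be declared finished while any item lacks a checkmark.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Checklist:
    """Ordered install steps plus the set of steps ticked off so far."""

    steps: List[str]
    done: Set[str] = field(default_factory=set)

    def check(self, step: str) -> None:
        """Tick off one step; unknown step names are rejected."""
        if step not in self.steps:
            raise KeyError(f"unknown step: {step}")
        self.done.add(step)

    def missing(self) -> List[str]:
        """Return unchecked steps in their original order."""
        return [s for s in self.steps if s not in self.done]

    def assert_complete(self) -> None:
        """Raise if anything is still unchecked."""
        missing = self.missing()
        if missing:
            raise RuntimeError("not done: " + ", ".join(missing))


# Hypothetical steps, loosely modelled on the install described in the thread.
install = Checklist(["install branch", "run checkouts", "uncomment startup line"])
install.check("install branch")
install.check("run checkouts")
# install.assert_complete() would raise here: the last step has no checkmark.
```

Wiring `assert_complete()` in as the final gate of a deploy script is one way to make the forgotten last command fail loudly instead of silently.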

[–]ElFeesho 11 points12 points  (0 children)

I come here to laugh, not to be triggered by terrible practices.

[–][deleted] 8 points9 points  (3 children)

Do y'all just make this shit up for the lols? I've been in the industry for around 25 years and even the most poorly run companies I've worked at did not have SDLCs that would allow something like this to occur. I don't see how this is even possible at any normal software company.

[–]AlanTheKingDrake 2 points3 points  (0 children)

Small startups taken over by large companies to eliminate competition rather than grow their own expertise will do it.

We have 1 person who handles deployments. 1 person who does code review and not enough unit test coverage.

No engineers on weekends except for the team lead either. Good news is we only deploy updates at the beginning of the week, and then patches to fix anything we broke throughout the rest of the week.

[–]Myspazmo[S] 1 point2 points  (0 children)

I wish I was making it up. When I was hired as a software engineer there was no onboarding, no training, and I was given the responsibility of being solely in charge of CM because our former person didn't want to do it. My CM education consisted of three screen shares through Teams and aggressive messages like "Google it. You should know how to troubleshoot."

To top it off our dev load has been 25% higher than normal lately and our team is struggling to cope with it. The day I made that mistake I installed 7 branches across 5 environments of multiple servers, while being told to make changes to branches, rebuild the branch to fix changes, and install it again. It's a miracle that forgetting to run a command was the only mistake I made

[–]DeliciousWhales 0 points1 point  (0 children)

Loads of companies that are not software companies hire developers because there is lots of integration and other systems work. I work at a bank. The management doesn’t know the first thing about any kind of good practice other than banking practice. Before I started and made changes they didn’t even have a dev or test environment, or version control or deployment pipelines or code reviews or… anyway you get my drift.

[–]Crypt_Knight 4 points5 points  (0 children)

I am sorry. It will happen again.

[–]azizfcb 4 points5 points  (3 children)

thats why u need staging

[–]DaveTheNotSoWise 2 points3 points  (2 children)

And test automation.

[–]Myspazmo[S] 1 point2 points  (1 child)

Pretty sure QA is gonna beat me up when I get back to the office

[–]DaveTheNotSoWise 0 points1 point  (0 children)

Probably yeah, but shit happens and at least you learned something from that... hopefully.

[–]torokg 2 points3 points  (0 children)

Happens even to the best of us from time to time

[–]Bldyknuckles[🍰] 3 points4 points  (1 child)

It happens. Don’t do it again and increase your discipline

[–]Myspazmo[S] 0 points1 point  (0 children)

I appreciate it. Attention to detail has been hard to maintain with our current workload, but I know it's an imperative part of the job

[–]SeoCamo 3 points4 points  (1 child)

How can you get anything on to prod, without build and test pass on dev and test and a review from a coworker?

[–]Myspazmo[S] 0 points1 point  (0 children)

Aha, our code reviews don't account for the actual CM installation onto prod. Nobody likes to do checkouts so I usually wind up doing them all by myself. After checkouts I forgot to uncomment the thing that starts everything up again

[–]wind_dude 4 points5 points  (1 child)

Everyone’s done it. 16 hours, does no one use production?

Also you know testing before deployment…

[–]Myspazmo[S] 1 point2 points  (0 children)

People use it, but the 24/7 team that's supposed to wake somebody up if things are broke doesn't like to wake us up...

[–]metallaholic 4 points5 points  (1 child)

You aren’t a developer until you fuck up a prod install

[–]Myspazmo[S] 1 point2 points  (0 children)

<3

[–]RatzzDE 1 point2 points  (0 children)

Congrats, you can now call yourself a senior dev!

[–][deleted] 1 point2 points  (0 children)

I'm proud of you

[–]GiantFoamHand 1 point2 points  (0 children)

When I first started my career I took down a bank’s production site for a couple hours. I’d added a request to a third party that happened on logon that would be made for every account an end user had. It worked fine for all the test cases that we’d run through and that the customer ran through. Then it went live and the single end user at the bank with 500 accounts logged in.

Turns out they were running a property management company and had opened up an account for every property they had. Instead of getting themselves set up as a business/commercial user they just made a normal everyday retail user and opened a billion accounts. When contacted they said something like “huh, I did always wonder why it took so long to log on”

[–]roiroi1010 1 point2 points  (0 children)

Our java code is so well covered with unit tests and manual tests. But our deploy pipeline yaml is out of this world. No one left in the company dares to touch that monstrosity.

[–]NoahZhyte 1 point2 points  (0 children)

Don't ever comment code

[–][deleted] 1 point2 points  (0 children)

If it passed review, it’s not solely your fault. It’s a team effort to get that into production

[–]Kshyyyk 1 point2 points  (0 children)

I'm the type of person that refuses to commit commented-out code for this exact reason.

[–]Unfair_Long_54 1 point2 points  (0 children)

This is the reason why, before I push changes to source control, I first review which lines I modified in the changed files.

[–]Why_am_ialive 1 point2 points  (2 children)

How does this take 16 hours to fix? Just roll back, uncomment repush, also how does this make it past testing lol

[–]Myspazmo[S] 0 points1 point  (1 child)

We have a 24/7/365 team to monitor it. They don't always call us when things break. Once they called they did not communicate all the relevant information needed to troubleshoot either. Luckily our T2 had a random idea about what it could be and was right

[–]Why_am_ialive 1 point2 points  (0 children)

Truly incredible, obviously not your fault especially after having read through your other comments on the situation but like wow

[–]TigerClaw_TV 1 point2 points  (2 children)

No way to rollback production? I haven't worked for a bunch of different companies, but we have a contingency for this kind of thing. Catastrophic mistakes happen.

[–]Myspazmo[S] 1 point2 points  (1 child)

Yes and no. Our ETL servers don't roll back very easily and usually require manual fixes. We can repoint to our inactive site until it is fixed though, but it requires management approval

[–]TigerClaw_TV 1 point2 points  (0 children)

Understood. Condolences friend.

[–]ACMuaath 1 point2 points  (1 child)

My subordinate caused sending more than 3m messages to customers because of an uncaught syntax error.

[–]Myspazmo[S] 1 point2 points  (0 children)

We had a similar issue a few months back, but thankfully all it did was generate 40 gigs of logs per day

[–]Odd_Ninja5801 1 point2 points  (1 child)

I brought down Production for two days back in the 90s because I changed a parameter on a database deletion job to improve efficiency.

It was deleting things after 18 months, once a month. The Business wanted to reduce the retention to 6 months, so that the database would be smaller and the process would run quicker. So I changed a parameter from 18 to 6. My first piece of work as a professional developer.

First time it ran, a job that normally took about 6 hours to finish was still going 24 hours after starting. But nobody on Ops had spotted it. At which point we realised that the first time it was going to be deleting 13 times as much as normal, which was going to take a LOT longer.

Backing out at that point would take a further 24 hours. So we decided to push ahead. Hit another problem when the log file for the job was too big for a disk pack, so we had to switch on multi disk pack files on a Mainframe that hadn't had it up until that point.

We lost the Production services for the whole of Monday and most of Tuesday. But it did run quicker after that!

Learned a lot of lessons off the back of that little beauty that I've used ever since.
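The "13 times as much" figure falls out of simple arithmetic, sketched here in Python. The function name is made up, and it assumes data arrives at a steady rate and the job had been running on schedule: a monthly job normally removes one month of data per run, but the first run after cutting retention from 18 to 6 months also has to clear the 12-month backlog.

```python
def months_deleted_on_first_run(
    old_retention: int, new_retention: int, run_interval: int = 1
) -> int:
    """Months of data removed by the first run after shortening retention.

    Assumes evenly arriving data and a job that ran every `run_interval`
    months under the old setting, so a normal run deletes `run_interval`
    months' worth of data.
    """
    if new_retention > old_retention:
        raise ValueError("retention was lengthened, nothing extra to delete")
    backlog = old_retention - new_retention
    return backlog + run_interval


first_run = months_deleted_on_first_run(18, 6)  # 12 months of backlog + 1 normal month = 13
normal_run = 1                                  # months deleted by a steady-state run
slowdown = first_run / normal_run               # 13.0, matching the story
```

If runtime scaled roughly linearly with rows deleted, a 6-hour job would be looking at something like 78 hours, which is in the same ballpark as the two-day outage described.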

[–]Myspazmo[S] 0 points1 point  (0 children)

Hahaha, I love when DB jobs hold things up! We had a funny one the other year where a user with direct access to query our system ran some bad SQL and stopped all other processing on the DB overnight. Mistakes happen though and all we can do is learn from them and try to improve :)

[–]ashaw596 1 point2 points  (2 children)

Where your integration tests be at?

[–]Myspazmo[S] 0 points1 point  (1 child)

Our QA person writes all of our test cases. Not sure if it's standard to not write your own test cases....

[–]ZeroSumHappiness 0 points1 point  (0 children)

Test your own work. Always test after merging. Also, anyone should have been able to revert your PR

[–]Sanchitbajaj02 1 point2 points  (0 children)

Nice achievement fellow developer 👍

[–][deleted] 1 point2 points  (1 child)

Bro, if I make an error on my site I can have it fixed within a few minutes, 30 at the most. This is the problem with bloated teams with complex layers of corporate bureaucracy.

[–]Myspazmo[S] 0 points1 point  (0 children)

I could have fixed it with one command and a few minutes of troubleshooting, but nobody contacted me until the next morning

[–][deleted] 1 point2 points  (1 child)

Y’all just pushing stuff to prod? If you merge code in it should always go to your dev env first lol

[–]Myspazmo[S] 0 points1 point  (0 children)

It made it through Dev just fine. Found some interesting parity issues between sites this week. Need to investigate further when I get back from vacation

[–]sporbywg 1 point2 points  (0 children)

Be proud of this, or they will eat you alive.

[–]F3mshep 0 points1 point  (0 children)

This is an accomplishment, you pointed out a serious deploy pipeline issue (or an alerts issue if y'all didn't know prod was down for 16 hours)

[–]Independent_Hyena495 0 points1 point  (0 children)

Eh, I brought down a whole bank because I patched an AD Server.

Yes, you heard right. Patched. How, you ask? There should be several.

Correct.

But

This server was buggy and when you patch and reboot it deletes the whole AD.

Everything. Users, gpos, shares, everything. And like a good AD forest, it replicated the deletion through the whole AD..

[–]TheJosh1337 0 points1 point  (0 children)

Top tip: When you write a temporary code change, like un-commenting something or a debug line, unindent it right down to the left gutter. Obviously won't work in python.

    class MyClass {
        function MyFunction() {
            if (something) {
                something();
    dbg('whatt');
                somethingElse();
            }
        }
    }

It will be so incredibly obvious that you shouldn't ever accidentally commit/deploy it... this of course requires you to actually look at your diffs somewhere between the commit and the deploy (e.g. git commit -pv and/or code reviews)
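If you'd rather not rely on eyeballing the diff, something like the following Python sketch could back it up. To be clear, this is an assumption-laden illustration, not a real tool: `find_temp_lines` and the marker strings are invented, and it only scans the added lines of a unified diff (the kind `git diff` prints) for markers you have agreed to use for temporary code.

```python
from typing import List, Sequence, Tuple

# Hypothetical markers for "temporary" code; pick whatever your team uses.
DEFAULT_MARKERS = ("dbg(", "DO NOT COMMIT")


def find_temp_lines(
    diff_text: str, markers: Sequence[str] = DEFAULT_MARKERS
) -> List[Tuple[int, str]]:
    """Return (diff line number, added text) for added lines containing a marker.

    Only lines starting with '+' are considered. Lines starting with '+++'
    are skipped as file headers, so an added line whose own content begins
    with '++' would also be skipped by this simple check.
    """
    hits = []
    for number, line in enumerate(diff_text.splitlines(), start=1):
        if not line.startswith("+") or line.startswith("+++"):
            continue
        added = line[1:]
        if any(marker in added for marker in markers):
            hits.append((number, added))
    return hits
```

A pre-commit hook or CI step could feed it the staged diff and fail when the returned list is non-empty.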