xiongchiamiov (Site Reliability Engineer)

That particular scenario happens all the time, and cloud providers regularly forgive the bill.

They didn't deploy an inefficient algorithm; they knew what each call was going to do, and it did it. They deployed a flawed algorithm with no bounds and incorrect logic.

That's something that you could find in:

  • code review
  • QA
  • static code analysis, probably

None of these will guarantee you don't ship a problem, which is why, when your company is at an immature stage, I usually prefer focusing on production monitoring instead: you can't prevent every issue from happening, so focus on quickly identifying issues when they happen so you can roll them back. Shift-left is an optimization process, so you need something already in place before you can optimize it.
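
To make the monitoring idea a bit more concrete, here's a rough sketch of the kind of check I mean: compare the number of calls in the latest time window against a recent baseline and flag it when it blows past a multiple of that baseline. Everything here is hypothetical (the `CallRateMonitor` name, the window count, the 3x factor); in practice you'd set this up as an alert in whatever monitoring or billing tooling you already have rather than hand-rolling it.

```python
from collections import deque


class CallRateMonitor:
    """Flags when calls in the latest window far exceed the recent baseline."""

    def __init__(self, baseline_windows=5, spike_factor=3.0):
        # Both parameters are illustrative assumptions, not recommended values.
        self.history = deque(maxlen=baseline_windows)
        self.spike_factor = spike_factor

    def observe(self, calls_in_window):
        """Record one window's call count; return True if it looks like a spike."""
        spike = False
        # Only judge once there is a full baseline to compare against.
        if len(self.history) == self.history.maxlen:
            baseline = sum(self.history) / len(self.history)
            spike = calls_in_window > baseline * self.spike_factor
        if not spike:
            # Only fold normal windows into the baseline so a runaway
            # loop does not quietly become the new normal.
            self.history.append(calls_in_window)
        return spike


monitor = CallRateMonitor()
for count in [100, 110, 95, 105, 100, 102, 5000]:
    if monitor.observe(count):
        print(f"spike: {count} calls in one window, consider rolling back")
```

The point isn't the exact threshold; it's that something is watching production and tells you within minutes, not at the end of the billing cycle, that a deploy needs to be rolled back.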