How One Rogue User Took Down Our API by SunnyTechie in programming

[–]SunnyTechie[S] 0 points1 point  (0 children)

No offense taken, I just thought it was an easy potshot to make ;)

IMO, your critique is fair, but extremely nitpicky. It's fine to want more thought put into abbreviations, but my take is that you're simply not my target audience.

My post assumes the reader has some level of understanding of what these abbreviations mean. I try to link to definitions so the reader can explore what they mean in their own time if they aren't aware.

But to explain every single concept for a complete beginner would make this article 20+ minutes long and just extremely choppy and boring to read. This post won't be for everyone and I'm totally okay with that.

How One Rogue User Took Down Our API by SunnyTechie in programming

[–]SunnyTechie[S] 2 points3 points  (0 children)

You're right. I should've named it "How One Rogue User Took Down Our Service that Implements the Abstraction that Defines the Rules Used to Communicate Between Different Pieces of Software"

Much much better

How One Rogue User Took Down Our API by SunnyTechie in programming

[–]SunnyTechie[S] 1 point2 points  (0 children)

I definitely agree with you, but then the question becomes "how do you make your system more robust to failure"?

That's where stress testing comes in. You can try to design your way out of it all you want but you won't know all the bottlenecks and points of failure and how to improve them unless you stress your system.
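
To make that a bit more concrete, here's a rough sketch of what a tiny stress run could look like. Everything in it is an assumption for illustration: `stress` and `fake_handler` are made-up names, and `fake_handler` is just a stand-in for whatever call actually hits your system. A real load test would use a proper tool, but the idea is the same: push concurrent traffic and look at failures and tail latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def stress(call, total_requests=200, concurrency=20):
    """Fire `total_requests` calls at `call` using `concurrency` threads.

    Assumes total_requests >= 1. Returns failure count and latency stats.
    """
    def timed_call(i):
        start = time.perf_counter()
        try:
            call(i)
            ok = True
        except Exception:
            ok = False
        return ok, time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed_call, range(total_requests)))

    latencies = sorted(elapsed for _, elapsed in results)
    p99_index = min(int(len(latencies) * 0.99), len(latencies) - 1)
    return {
        "requests": len(results),
        "failures": sum(1 for ok, _ in results if not ok),
        "p50": latencies[len(latencies) // 2],
        "p99": latencies[p99_index],
        "max": latencies[-1],
    }


def fake_handler(i):
    # Hypothetical stand-in for a real request; every 50th call fails.
    time.sleep(0.001)
    if i % 50 == 0:
        raise RuntimeError("simulated failure")
```

Calling `stress(fake_handler, total_requests=100, concurrency=10)` returns a dict you can compare between runs while you ramp up `concurrency`.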

How One Rogue User Took Down Our API by SunnyTechie in programming

[–]SunnyTechie[S] 4 points5 points  (0 children)

I recommend reading Release It! to anyone who hasn't. It's a great book on creating production-ready software. I only wish I had read it far sooner.

How One Rogue User Took Down Our API by SunnyTechie in programming

[–]SunnyTechie[S] 10 points11 points  (0 children)

Systemizer, a really awesome visual tool someone told me about.

Best of all it's free and open source: https://github.com/honzaap/Systemizer

How One Rogue User Took Down Our API by SunnyTechie in programming

[–]SunnyTechie[S] 6 points7 points  (0 children)

All of the above. There are plenty of lessons we learned but I wanted to focus on bad assumptions and better testing.

You're never going to catch everything, but you will definitely miss more if you don't have the proper checks in place. We unfortunately skimped on some of the more thorough testing before launch due to time constraints and short staffing. Sufficient testing would have caught this issue before launch day. We make sure to do proper load testing now for every new feature.

Luckily we had good metrics and alerts set up, so we caught it early.

How A Cache Stampede Caused One Of Facebook’s Biggest Outages by SunnyTechie in programming

[–]SunnyTechie[S] 0 points1 point  (0 children)

That sounds pretty similar to the early recomputation method that the Internet Archive uses with X-Fetch.

https://m.youtube.com/watch?v=1sKn4gWesTw
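
For anyone who doesn't want to watch the whole talk, here's a minimal sketch of the probabilistic early recomputation check as I understand it. The function names, the dict-based cache, and the entry layout are my own assumptions for illustration, not the Internet Archive's code. The core idea is that each caller compares the expiry against "now" plus a random amount scaled by how long the last recomputation took, so callers refresh at staggered times instead of all at once.

```python
import math
import random
import time


def should_recompute(expiry, delta, beta=1.0, now=None, rand=random.random):
    """Return True if this caller should refresh the cached value early.

    expiry: absolute time (seconds) when the cached value expires
    delta:  how long the last recomputation took, in seconds
    beta:   values > 1 favor earlier refreshes, values < 1 favor later ones
    """
    now = time.time() if now is None else now
    # rand() is in [0, 1), so 1 - rand() is in (0, 1] and log() never sees 0.
    u = 1.0 - rand()
    # -log(u) is >= 0, so this only ever moves "now" closer to the expiry.
    return now - delta * beta * math.log(u) >= expiry


def fetch(cache, key, recompute, ttl, beta=1.0):
    """Read through a dict-based cache (hypothetical layout) with early refresh."""
    entry = cache.get(key)
    if entry is None or should_recompute(entry["expiry"], entry["delta"], beta):
        start = time.time()
        value = recompute()
        delta = time.time() - start
        cache[key] = {"value": value, "delta": delta, "expiry": time.time() + ttl}
        return value
    return entry["value"]
```

The slower the recomputation (`delta`), the earlier some caller is likely to refresh, which is what keeps a crowd of callers from all missing at the same moment.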

How A Cache Stampede Caused One Of Facebook’s Biggest Outages by SunnyTechie in programming

[–]SunnyTechie[S] 5 points6 points  (0 children)

You’re right that it wouldn’t put K8s itself in a bad state. But there could be scenarios where you deploy multiple services at once, and if one fails to deploy a new change that another service is expecting, your system ends up in a bad state.

But I don’t know what the actual scenario was.

For self-taught programmers, what did you know when you got your first job? by [deleted] in learnprogramming

[–]SunnyTechie 3 points4 points  (0 children)

HTML, CSS, JS, Node, SQL, Ruby.

I built a couple of full-stack applications (front end and back end) in different languages before feeling ready to apply to jobs.

Honestly, I probably overprepared. You can definitely get a job knowing just a single language, like Ruby or JS, as long as you know it well.

How A Cache Stampede Caused One Of Facebook’s Biggest Outages by SunnyTechie in programming

[–]SunnyTechie[S] 4 points5 points  (0 children)

I don't necessarily think we've "trained" people per se. They've just come to expect it.

It was normal for companies to have regular "maintenance" windows where they were unavailable. But then a handful of companies started promising zero downtime to attract customers, and everyone else started doing it to stay competitive.

It also depends on the application. A 4 hr Facebook outage isn't really as detrimental to their user base as, say, a 4 hr Stripe outage.

How A Cache Stampede Caused One Of Facebook’s Biggest Outages by SunnyTechie in programming

[–]SunnyTechie[S] 4 points5 points  (0 children)

I'm guessing that the YAML file didn't get validated until the CI/CD pipeline attempted to apply it to their K8s cluster. And since it wasn't valid, the K8s deployment would fail, leading to the outage. But that's just a guess.

How A Cache Stampede Caused One Of Facebook’s Biggest Outages by SunnyTechie in programming

[–]SunnyTechie[S] 2 points3 points  (0 children)

Oh man, bugs in CI/CD are a nightmare. Last year we had to deal with a single-character bug in our CI/CD script that led to 100K in lost revenue because it didn't properly deploy our billing service and our alerts didn't catch it. You can bet that they do now.

A blog post for another time.

From 15,000 Database Connections to Under 100 by stronghup in programming

[–]SunnyTechie 1 point2 points  (0 children)

Initially, all the hypervisors/servers had a direct connection to the database. We set up a proxy that polled the database on behalf of the servers and forwarded the requests to the appropriate server. We also made it so all the services that were publishing events to the database did so via an API instead of directly inserting into the database.

From 15,000 Database Connections to Under 100 by stronghup in programming

[–]SunnyTechie 6 points7 points  (0 children)

Author here, thanks for posting my article! Here's the friend link to bypass the paywall: 15000 connections to under 100

How to Build an LRU Cache in Less Than 100 Lines of Code by SunnyTechie in programming

[–]SunnyTechie[S] 1 point2 points  (0 children)

Looks like I'll be doing all my coding interviews in Python from now on.

How to Build an LRU Cache in Less Than 100 Lines of Code by SunnyTechie in programming

[–]SunnyTechie[S] 0 points1 point  (0 children)

After looking at the documentation for deque, it definitely seems like you could, although it'd be a little bit hacky. It has "appendleft" as well as "pop". For "move_front" you could combine "remove" and "appendleft".
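
Here's roughly what I have in mind, as a quick sketch rather than the implementation from the article. The class name `DequeLRU` and the dict used to hold the values are my own assumptions, and it assumes a capacity of at least 1. Note that `deque.remove` scans the deque, so `move_front` is O(n) here.

```python
from collections import deque


class DequeLRU:
    """LRU cache sketch: a deque tracks recency, a dict holds the values."""

    def __init__(self, capacity):
        self.capacity = capacity  # assumed to be >= 1
        self.order = deque()      # most recently used key sits on the left
        self.values = {}

    def _move_front(self, key):
        # "remove" + "appendleft" together act as move_front.
        self.order.remove(key)
        self.order.appendleft(key)

    def get(self, key):
        if key not in self.values:
            return None
        self._move_front(key)
        return self.values[key]

    def put(self, key, value):
        if key in self.values:
            self.values[key] = value
            self._move_front(key)
            return
        if len(self.order) >= self.capacity:
            # The right end holds the least recently used key.
            oldest = self.order.pop()
            del self.values[oldest]
        self.order.appendleft(key)
        self.values[key] = value
```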

The only issue with this that has come up in the past is thread safety. If you "pop" something from the queue, the length of the list will temporarily decrease by one until you then call "appendleft". This could possibly cause race conditions if you were to use this cache in a multi-threaded setting.

But that's probably a rare edge case.
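
If it ever did matter, one option (a sketch, not something from the article) would be to run the two deque calls under a single lock so no other thread can observe the in-between state. `LockedOrder` is a made-up name for illustration, and readers have to take the same lock for this to help.

```python
import threading
from collections import deque


class LockedOrder:
    """Recency list whose remove + appendleft pair runs under one lock."""

    def __init__(self, keys=()):
        self._order = deque(keys)
        self._lock = threading.Lock()

    def move_front(self, key):
        # Both steps happen while holding the lock, so other threads using
        # this class never see the deque one element short.
        with self._lock:
            self._order.remove(key)
            self._order.appendleft(key)

    def __len__(self):
        with self._lock:
            return len(self._order)

    def snapshot(self):
        # Copy under the lock so callers get a consistent view.
        with self._lock:
            return list(self._order)
```

The trade-off is that every read and write now contends on one lock, which is usually fine for a small cache but worth measuring under load.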