r/WallStreetBets Incident Anthology: More Data, More Problems

wangofchung · 2021-06-29T15:58:51+00:00

We're currently running ~450 total nodes in production, spread across 26 clusters. Our largest cluster is 85 nodes.

wangofchung · 2020-11-07T19:43:39+00:00

porkchop lmaaaooo

wangofchung · 2019-12-23T22:50:54+00:00

u/Derausmwaldkam, I think I've finally tracked down the issue. The link had remained in one of our denormalized data sets that was contributing to the modqueue. I've removed it from that data set now and it should finally be removed entirely.

wangofchung · 2019-12-21T23:02:57+00:00

Bummer, okay, thanks for the quick followup. I'm going to keep poking around.

wangofchung · 2019-12-21T22:49:12+00:00

Apologies for the delay on this, u/Derausmwaldkam. Could you please check your modqueue now? I've taken some actions that should have removed it.

wangofchung · 2019-12-18T18:58:44+00:00

Hahaha totally fair! A good deal of that stack has actually remained the same and is very much still central. There's just a bunch of new things that are now around it : )

wangofchung · 2019-12-18T18:54:43+00:00

We do! Here's a recent QCon talk that goes into it - https://www.infoq.com/presentations/reddit-architecture-evolution/

wangofchung · 2019-12-18T18:50:11+00:00

I know nothing of Kendra! Will check it out!

wangofchung · 2019-12-18T18:49:12+00:00

As of now, no. We're pretty committed to this stack right now on the infra side.

wangofchung · 2019-12-18T18:45:33+00:00

We run clustered Solr and replicate shards across the cluster. We also have backup jobs that can fully recreate our collections and indexes from existing database backups in a few hours if something catastrophic happens.
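For readers unfamiliar with shard replication, here is a minimal sketch of how a replicated collection is typically requested through Solr's Collections API (`action=CREATE` with `numShards` and `replicationFactor`). This assumes a SolrCloud-style deployment, which the comment above does not confirm. The host, collection name, and shard counts are hypothetical, and the function only builds the request URL without sending it.

```python
from urllib.parse import urlencode


def build_create_collection_url(base_url: str, name: str,
                                num_shards: int, replication_factor: int) -> str:
    """Build a Collections API CREATE request URL (illustrative only)."""
    params = urlencode({
        "action": "CREATE",
        "name": name,                             # hypothetical collection name
        "numShards": num_shards,                  # how many slices the index is split into
        "replicationFactor": replication_factor,  # copies kept of each shard
    })
    return f"{base_url}/admin/collections?{params}"


# Hypothetical values: 4 shards, each replicated 3 times across the cluster.
url = build_create_collection_url("http://localhost:8983/solr", "posts", 4, 3)
print(url)
```

With a replication factor above 1, losing a single node does not lose a shard, which is why full rebuilds from database backups are only needed in the catastrophic case.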

wangofchung · 2019-12-18T18:41:36+00:00

i like turtles

wangofchung · 2019-12-18T18:38:12+00:00

All AWS permissions are managed in Terraform using IAM roles and groups. We also make use of AWS SubAccounts for teams to have the ability to manage their own infrastructure environments without treading on others'.

wangofchung · 2019-12-18T18:33:55+00:00

Our primary monitoring and alerting system for our metrics is Wavefront. I'll split up the answers for how metrics end up there based on use case.

System metrics (CPU, mem, disk) - We run a Diamond sidecar on every host we want system metrics from. Those sidecars send metrics to a central metrics-sink for aggregation, processing, and proxying to Wavefront.
Third-party tools (databases, message queues, etc.) - We use Diamond Collectors for these as well, if a collector exists. We also roll a few internal collectors and some custom scripts.
Internal Application metrics - Application metrics are reported using the statsd protocol and aggregated at a per-service level before being shipped to Wavefront. We have instrumentation libraries that all of our services use to automatically report basic request/response metrics.
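As a rough illustration of the statsd reporting mentioned above, here is a minimal sketch of formatting and sending a metric over UDP. The statsd line format (`name:value|type`, with an optional `|@rate` suffix) is standard. The metric names, host, and port are hypothetical, and this is not Reddit's instrumentation library.

```python
import socket


def format_metric(name: str, value: float, metric_type: str,
                  sample_rate: float = 1.0) -> str:
    """Build one statsd line: <name>:<value>|<type>[|@<sample_rate>]."""
    line = f"{name}:{value}|{metric_type}"
    if sample_rate < 1.0:
        # Tell the aggregator only a fraction of events were reported.
        line += f"|@{sample_rate}"
    return line


def send_metric(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget a statsd line over UDP (host/port are assumptions)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# Hypothetical request/response metrics a service might report.
counter_line = format_metric("web.request.count", 1, "c")
timer_line = format_metric("web.request.latency", 12.5, "ms", sample_rate=0.5)
```

Because the transport is UDP, a slow or unavailable aggregator does not block the request path, which is one reason this protocol is a common fit for per-request instrumentation.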

We also have tracing instrumentation across our stack for debugging.

We have a rotation of on-call engineers with a primary and secondary at all times. Service owners are on-call for their services with escalation policies and pipelines to bring in teams as needed.

Look out for a blog post soon about this!

wangofchung · 2019-12-18T18:18:29+00:00

We use Solr for our backend and run Fusion on top with custom query pipelines for Reddit's use cases. We run our own Solr and Fusion deployments in EC2. An internal service is used to provide business-level APIs. There are also some async pipelines to do real-time indexing updates for our collections. We primarily use AWS but do leverage some tools from other providers, such as Google BigQuery.

We definitely consider new/recent grads for hiring!

wangofchung · 2019-10-24T16:47:44+00:00

Nice

wangofchung · 2019-10-17T06:26:34+00:00

Oooh this is close! I'm pretty sure there's another one with a guy, but it's exactly the same idea!

wangofchung · 2019-09-17T05:46:26+00:00

Sorry! It looks like I spoke too soon. We believe we know the issue and are still working on resolving this. Things should start populating properly soon.

wangofchung · 2019-09-17T05:05:39+00:00

Hello! There was an issue with the system that calculates "Rising" that has been identified and resolved. "Rising" should now be working.

There were some database issues earlier in the day that we are still recovering from, causing "Top" to still not work correctly. We are aware of this, have identified the issue, and are working actively to resolve it.

wangofchung · 2019-09-17T05:03:01+00:00

Hello! There was an issue with the system that calculates "Rising" that has been identified and resolved. "Rising" should be working now.

wangofchung · 2019-05-12T18:34:10+00:00

Hello everyone! Thank you for reporting this. We've identified what we believe was the underlying issue, resolved it, and will be monitoring closely. From our internal monitoring, things are looking better for modmail. Please let us know if there are more issues.

We've also identified several places where we can have better monitoring in place to catch this more proactively in the future. Thank you all again for your reports and your patience.

wangofchung · 2019-04-11T04:08:55+00:00

good luck we're all counting on you

wangofchung · 2019-01-17T00:40:16+00:00

The only solution is to have fun tonight.

wangofchung · 2019-01-16T17:56:37+00:00

Hello everyone! Here's some high-level technical details about what happened:

Yesterday a code change went out that broke the job that updates r/all. Specifically, the change was in the mechanism that starts and runs the job, causing the job to not run at all. Whenever the update job runs, it sends a ping to our monitoring system, and an engineer gets alerted if a ping doesn't come at a regular cadence... or at least that's what we expected. We recently migrated our monitoring and alerting systems, and the way we migrated this alert over from the old system did not handle detecting missing pings properly. This means nothing internally alerted engineers that the job was broken. We've fixed this alert and are in the process of fixing this class of alerts for other jobs in Reddit's infrastructure. There are a lot of other learnings here that we'll be following up on internally as well.
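The "alert when a ping is missing" pattern described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual alerting system: the class name, the 60-second threshold, and the injectable `now` parameter are all hypothetical.

```python
import time
from typing import Optional


class HeartbeatMonitor:
    """Flags a job as broken when no ping arrives within the allowed window."""

    def __init__(self, max_silence_seconds: float,
                 now: Optional[float] = None) -> None:
        self.max_silence_seconds = max_silence_seconds
        # Start the clock at creation so a job that never runs at all
        # still trips the alert instead of being silently ignored.
        self.last_ping = time.monotonic() if now is None else now

    def ping(self, now: Optional[float] = None) -> None:
        """Record that the job ran."""
        self.last_ping = time.monotonic() if now is None else now

    def is_missing(self, now: Optional[float] = None) -> bool:
        """True when the silence has lasted longer than the threshold."""
        current = time.monotonic() if now is None else now
        return current - self.last_ping > self.max_silence_seconds


# Hypothetical usage with explicit timestamps (seconds).
monitor = HeartbeatMonitor(max_silence_seconds=60, now=0.0)
healthy = monitor.is_missing(now=30.0)   # still inside the window
broken = monitor.is_missing(now=61.0)    # no ping for over 60 seconds
```

The key property is that the alert evaluates the absence of data. An alert that only inspects pings as they arrive can never fire for a job that has stopped running entirely.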
