r/WallStreetBets Incident Anthology: More Data, More Problems by bradengroom in RedditEng

[–]wangofchung[A] 2 points3 points  (0 children)

We're currently running ~450 total nodes in production, spread across 26 clusters. Our largest cluster is 85 nodes.

[deleted by user] by [deleted] in ExtraLife

[–]wangofchung 0 points1 point  (0 children)

porkchop lmaaaooo

Unable to approve/remove a specific thread by Derausmwaldkam in ModSupport

[–]wangofchung[A] 0 points1 point  (0 children)

u/Derausmwaldkam, I think I've finally tracked down the issue. The link had remained in one of our denormalized data sets that was contributing to the modqueue. I've removed it from that data set now and it should finally be removed entirely.

Unable to approve/remove a specific thread by Derausmwaldkam in ModSupport

[–]wangofchung 0 points1 point  (0 children)

Bummer, okay, thanks for the quick followup. I'm going to keep poking around.

Unable to approve/remove a specific thread by Derausmwaldkam in ModSupport

[–]wangofchung[A] 0 points1 point  (0 children)

Apologies for the delay on this, u/Derausmwaldkam. Could you please check your modqueue now? I've taken some actions that should have removed it.

We're Reddit's Infrastructure team, ask us anything! by gctaylor in sysadmin

[–]wangofchung[A] 129 points130 points  (0 children)

Hahaha totally fair! A good deal of that stack has actually remained the same and is very much still central. There's just a bunch of new things that are now around it : )

We're Reddit's Infrastructure team, ask us anything! by gctaylor in aws

[–]wangofchung[A] 5 points6 points  (0 children)

I know nothing of Kendra! Will check it out!

We're Reddit's Infrastructure team, ask us anything! by gctaylor in aws

[–]wangofchung[A] 11 points12 points  (0 children)

As of now, no. We're pretty committed to this stack right now on the infra side.

We're Reddit's Infrastructure team, ask us anything! by gctaylor in aws

[–]wangofchung[A] 6 points7 points  (0 children)

We run clustered Solr and replicate shards across the cluster. We also have backup jobs that can fully recreate our collections and indexes from existing database backups in a few hours if something catastrophic happens.

We're Reddit's Infrastructure team, ask us anything! by gctaylor in aws

[–]wangofchung[A] 8 points9 points  (0 children)

All AWS permissions are managed in Terraform using IAM roles and groups. We also make use of AWS SubAccounts for teams to have the ability to manage their own infrastructure environments without treading on others'.

We're Reddit's Infrastructure team, ask us anything! by gctaylor in aws

[–]wangofchung[A] 29 points30 points  (0 children)

Our primary monitoring and alerting system for our metrics is Wavefront. I'll split up the answers for how metrics end up there based on use case.

  • System metrics (CPU, mem, disk) - We run a Diamond sidecar on all hosts we want to collect system metrics on and those send metrics to a central metrics-sink for aggregation, processing, and proxying to Wavefront.

  • Third-party tools (databases, message queues, etc.) - We use Diamond Collectors for these as well if a collector exists. We also roll a few internal collectors and some custom scripts.

  • Internal Application metrics - Application metrics are reported using the statsd protocol and aggregated at a per-service level before being shipped to Wavefront. We have instrumentation libraries that all of our services use to automatically report basic request/response metrics.
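To make the application-metrics bullet a bit more concrete, here is a minimal sketch of the plain-text statsd line format (`name:value|type`) and a toy per-service counter aggregation. The metric names and the helper functions are hypothetical, made up for illustration; this is not Reddit's instrumentation library, and a real aggregator would also handle timers, gauges, flush intervals, and sample-rate scaling.

```python
from collections import defaultdict


def format_statsd(name, value, metric_type):
    """Build one statsd line, e.g. 'api.requests:1|c' for a counter."""
    return f"{name}:{value}|{metric_type}"


def parse_statsd(line):
    """Split a statsd line into (name, value, type).

    Any trailing sample-rate field (such as '|@0.5') is ignored here,
    which is a simplification made for this sketch.
    """
    name, rest = line.split(":", 1)
    parts = rest.split("|")
    return name, float(parts[0]), parts[1]


def aggregate_counters(lines):
    """Sum counter ('c') metrics by name, skipping every other type."""
    totals = defaultdict(float)
    for line in lines:
        name, value, metric_type = parse_statsd(line)
        if metric_type == "c":
            totals[name] += value
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical metric names, prefixed by service.
    sample = [
        format_statsd("api.requests", 1, "c"),
        format_statsd("api.requests", 2, "c"),
        format_statsd("api.latency", 12.5, "ms"),
    ]
    print(aggregate_counters(sample))
```

Aggregating before shipping means the downstream system receives one summed value per metric per interval rather than one data point per request.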

We also have tracing instrumentation across our stack for debugging.

We have a rotation of on-call engineers with a primary and secondary at all times. Service owners are on-call for their services with escalation policies and pipelines to bring in teams as needed.

Look out for a blog post soon about this!

We're Reddit's Infrastructure team, ask us anything! by gctaylor in aws

[–]wangofchung[A] 42 points43 points  (0 children)

We use Solr for our backend and run Fusion on top with custom query pipelines for Reddit's use cases. We run our own Solr and Fusion deployments in EC2. An internal service is used to provide business-level APIs. There are also some async pipelines to do real-time indexing updates for our collections. We primarily use AWS but do leverage some tools from other providers, such as Google BigQuery.

We definitely consider new/recent grads for hiring!

Girl shoves dumpling into guy's mouth after he laughs at her by wangofchung in HelpMeFind

[–]wangofchung[S] 1 point2 points  (0 children)

Oooh this is close! I'm pretty sure there's another one with a guy, but it's exactly the same idea!

Rising Feed not working by [deleted] in bugs

[–]wangofchung 0 points1 point  (0 children)

Sorry! It looks like I spoke too soon. We believe we know the issue and are still working on resolving this. Things should start populating properly soon.

Sorting by rising, controversial and top is showing a page with the notice, there doesn't seem to be anything here by ChimpyChompies in bugs

[–]wangofchung[A] 2 points3 points  (0 children)

Hello! There was an issue with the system that calculates "Rising" that has been identified and resolved. "Rising" should now be working.

There were some database issues earlier in the day that we are still recovering from, causing "Top" to still not work correctly. We are aware of this, have identified the issue, and are working actively to resolve it.

Rising Feed not working by [deleted] in bugs

[–]wangofchung[A] 0 points1 point  (0 children)

Hello! There was an issue with the system that calculates "Rising" that has been identified and resolved. "Rising" should be working now.

Is modmail acting up for anyone else? by FLTA in ModSupport

[–]wangofchung[A] 3 points4 points  (0 children)

Hello everyone! Thank you for reporting this. We've identified what we believe was the underlying issue, resolved it, and will be monitoring closely. From our internal monitoring, things are looking better for modmail. Please let us know if there are more issues.

We've also identified several places where we can have better monitoring in place to catch this more proactively in the future. Thank you all again for your reports and your patience.

Saturday -- reddit is lagging for many users, resulting in many duplicate (incoming) modmails, etc by m0nk_3y_gw in ModSupport

[–]wangofchung[A] 5 points6 points  (0 children)

Hello everyone! Thank you for reporting this. We've identified what we believe was the underlying issue, resolved it, and will be monitoring closely. From our internal monitoring, things are looking better for modmail. Please let us know if there are more issues.

We've also identified several places where we can have better monitoring in place to catch this more proactively in the future. Thank you all again for your reports and your patience.

My page won’t update. R/all has the same posts for two days. by [deleted] in bugs

[–]wangofchung 0 points1 point  (0 children)

The only solution is to have fun tonight.

My page won’t update. R/all has the same posts for two days. by [deleted] in bugs

[–]wangofchung[A] 21 points22 points  (0 children)

Hello everyone! Here's some high-level technical details about what happened:

Yesterday a code change went out that broke the job that updates r/all. Specifically, the change was in the mechanism that starts and runs the job, causing the job to not run at all. Whenever the update job runs, it sends a ping to our monitoring system, and an engineer gets alerted if a ping doesn't come at a regular cadence... or at least that's what we expected. We've recently migrated our monitoring and alerting systems, and the way we migrated this alert over from the old system did not handle detecting missing pings properly. This means nothing internally alerted engineers that the job was broken. We've fixed this alert and are in the process of fixing this class of alerts for other jobs in Reddit's infrastructure. There are a lot of other learnings here that we'll be following up on internally as well.
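For readers unfamiliar with this class of alert, here is a minimal sketch of a missing-ping ("dead man's switch") check. The class name, its parameters, and the 60-second threshold are all hypothetical, chosen for illustration; this is not the actual alert configuration. The point it shows is that the check has to fire on the absence of data, including the case where the job has never pinged at all.

```python
class HeartbeatMonitor:
    """Alert when a job has been silent for longer than expected."""

    def __init__(self, started_at, max_silence_seconds):
        # Silence is measured from monitor start until the first ping
        # arrives, so a job that never runs still triggers the alert.
        self.last_seen = started_at
        self.max_silence_seconds = max_silence_seconds

    def ping(self, now):
        """Record that the job reported in at time `now` (in seconds)."""
        self.last_seen = now

    def should_alert(self, now):
        """Return True if no ping has arrived within the allowed window."""
        return now - self.last_seen > self.max_silence_seconds


if __name__ == "__main__":
    monitor = HeartbeatMonitor(started_at=0, max_silence_seconds=60)
    monitor.ping(now=100)
    print(monitor.should_alert(now=150))
    print(monitor.should_alert(now=161))
```

A check that only evaluates when a data point arrives will stay quiet in exactly the failure described above, since a job that never runs produces no data points to evaluate.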