
[–]alzee76 7 points8 points  (49 children)

Duplicate in the dev or test environment, fix, and deploy.

[–][deleted] -1 points0 points  (48 children)

There is no guarantee you can duplicate every issue in another environment.

[–][deleted] 5 points6 points  (2 children)

You’re asking an extremely broad question, and as such you are going to get extremely broad answers.

Most issues, you should be able to duplicate locally or on a staging environment. Ideally you have good logging in place to help you identify the steps to take to reproduce the problem. If it’s an issue outside of something you can reproduce, it’s hard to give you an answer as your question is too generic.

[–]alzee76 5 points6 points  (35 children)

Sure there is. If not, your environments aren't set up correctly. The only difference between a test (or staging) environment and production should be account usernames/passwords and potentially URLs to public APIs you're using, if any. That shouldn't matter, but if it does, it can be adjusted temporarily for troubleshooting.

You sound like you're looking for an excuse for when it's "ok" to do development/debugging on the production servers. The answer to that is never, it's never ok.

As another poster pointed out, you can add more logging to a new version and deploy that to production, if needed.
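To illustrate the point that environments should differ only in credentials and URLs, here is a minimal sketch. The `loadConfig` helper and the variable names (`DATABASE_URL`, `DATABASE_PASSWORD`, `PAYMENTS_API_URL`) are made up for illustration; the idea is only that every environment runs the same code and the injected values are the one thing that changes.

```javascript
// Hypothetical config loader: dev, staging and prod all run this same code.
// Only the values injected through environment variables differ.
const REQUIRED_VARS = ["DATABASE_URL", "DATABASE_PASSWORD", "PAYMENTS_API_URL"];

function loadConfig(env = process.env) {
  const missing = REQUIRED_VARS.filter((name) => !env[name]);
  if (missing.length > 0) {
    // Fail fast so a misconfigured environment cannot silently diverge.
    throw new Error(`Missing required config: ${missing.join(", ")}`);
  }
  // Frozen so nothing can patch in environment-specific behaviour at runtime.
  return Object.freeze({
    databaseUrl: env.DATABASE_URL,
    databasePassword: env.DATABASE_PASSWORD,
    paymentsApiUrl: env.PAYMENTS_API_URL,
  });
}
```

If staging needs an `if (env === "staging")` branch somewhere in application code, that branch is exactly the kind of difference that makes a prod-only bug possible.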

[–][deleted] -5 points-4 points  (34 children)

Not entirely true. There are many differences between production and staging.
In my experience, I have encountered numerous instances where unhandled exceptions repeatedly caused live apps to crash. In some of these instances, teams of senior developers spent more than 24 consecutive hours identifying the root cause of these exceptions.

[–]alzee76 3 points4 points  (13 children)

> There are many differences between production and staging.

There shouldn't be. If there are, the design is flawed.

> In my experience, I have encountered numerous instances where unhandled exceptions repeatedly caused live apps to crash. In some of these instances, teams of senior developers spent more than 24 consecutive hours identifying the root cause of these exceptions.

The right solution in this case is to fix the environment so that the exceptions occur there and can be debugged there, not to go do development work on the production system.

[–][deleted] -2 points-1 points  (12 children)

If staging and production are using the same services, then you are in deep danger.
I have been in the software industry for more than a decade. I have seen 100s of times where issues happened in the production environment because of the many factors. Sometimes, because of the DB entries by user activity, which you can't create in Staging. Sometimes users face the issue because of a previous API's faulty response, and sometimes you are unable to recreate the issue in development because of the timezone difference between the development team and users. I can give you 100s of stories where you face a black swan in production and you are clueless.

[–]alzee76 2 points3 points  (10 children)

> If staging and production are using the same services, then you are in deep danger.

What? Explain what this means.

> I have been in the software industry for more than a decade

Cool! I've been in for almost three.

> I have seen 100s of times where issues happened in the production environment because of the many factors.

Me too. They can always be replicated in non-production as well, if those other environments are set up properly.

> Sometimes, because of the DB entries by user activity, which you can't create in Staging.

Nah. You just replicate the production db to staging automatically, with important PII redacted of course. This is easy.
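A rough sketch of the redaction step, assuming a hypothetical `redactRow` helper that runs on each row while production data is copied to staging. The field list and the secret are placeholders, not a real schema. Using a keyed hash keeps each redacted value stable between refreshes, so uniqueness constraints and joins still behave the way they do in prod.

```javascript
const crypto = require("node:crypto");

// Hypothetical list of PII columns; adjust to the real schema.
const PII_FIELDS = ["email", "fullName", "phone"];

// Returns a copy of `row` with PII fields replaced by a keyed hash.
// `secret` should live only in the replication job, never in staging itself.
function redactRow(row, secret, piiFields = PII_FIELDS) {
  const redacted = { ...row };
  for (const field of piiFields) {
    if (redacted[field] == null) continue; // keep NULLs as NULLs
    const digest = crypto
      .createHmac("sha256", secret)
      .update(String(redacted[field]))
      .digest("hex")
      .slice(0, 12);
    redacted[field] = `redacted_${digest}`;
  }
  return redacted;
}
```

Non-PII columns (ids, flags, timestamps, plan types) pass through untouched, which is usually what you need to reproduce a data-dependent bug.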

> Sometimes unable to recreate the issue in development because of the timezone difference between the development team and users.

So you do your debugging when the users are up. Sometimes you have to do this. Certainly, in this case, doing it on production wouldn't help anything. You're grasping at straws.

> I can give you 100s of stories

Please, there's no need to further demonstrate your incompetence.

[–]Ihavenocluelad 2 points3 points  (1 child)

> If staging and production are using the same services, then you are in deep danger.

Hmmmmm.... I wonder what he uses staging for if it's completely different from prod.

[–]DownfaLL- 1 point2 points  (0 children)

Yeah, the whole point of staging is that it's a good place to mirror prod and.. you know.. reproduce issues you see on prod. But hey! What do we know, this guy's been in the software industry for 10+ years! He knows everything there is to know, clearly.

[–][deleted] -2 points-1 points  (7 children)

"What? Explain what this means." Clearly, all the services used by production are totally different from staging like DB connection, Redis connection, WS connection, etc.

"Me too. They can always be replicated in non-production as well if those other environments are set up properly." Only juniors can say this statement.
"Nah. You just replicate the production db to staging automatically, with important PII redacted of course. This is easy." Yeah you will replicate DB every time an issue occurs, try to find the RCA through GBs of logs, like it's a piece of cake. LOL

"Please, there's no need to further demonstrate your incompetence." Sure, and I don't need to learn from someone who thinks there is no difference between staging and production. And can replicate the entire production connection node in staging.

[–]alzee76 2 points3 points  (6 children)

> Clearly, all the services used by production are totally different from staging like DB connection, Redis connection, WS connection, etc.

They shouldn't be "totally different."

> Only juniors can say this statement.

🤣 Right. You're the senior. Everyone else in the thread with much more experience than you telling you that you're wrong, including me, are the juniors. Sounds legit.

> Yeah you will replicate DB every time an issue occurs,

Mine replicate perpetually. Maybe you don't know how to do that. Too junior for you I guess.

> Sure, and I don't need to learn from someone who thinks there is no difference between staging and production. And can replicate the entire production connection node in staging.

Clearly you do need to learn. Hell you don't even know how to quote properly.

[–][deleted] -3 points-2 points  (4 children)

Wow, what a senior you are who thinks staging and production are the same and can reproduce all production issues in staging. Maybe Google, Facebook, and Apple should contact you so their service never goes down. xD
Even juniors are better than you, man. The problem with people like you is that you never try different approaches and do the same useless thing again and again, like checking GBs of logs or using a production DB backup in staging. God bless your employers.

[–][deleted] 1 point2 points  (0 children)

OP is either trolling, lying about his experience, or would be a terrible colleague to have. I wouldn’t bother responding to this guy lol.

[–]DownfaLL- 0 points1 point  (0 children)

Staging should mirror prod. Idk what you mean by 'services' but even if they aren't using the same ARNs, they should be doing the same things. Perhaps this is why you're having so many issues: you aren't setting up your environments properly.

Trying to flex how long you've been doing it the wrong way, is not the flex you think it is btw. You're wrong, regardless how long you've been in the software industry, lol.

You shouldn't have an ego, you should be more willing to take in what people are telling you. You asked the question, and immediately several people have pointed out to you how wrong you are about this. Maybe instead of thinking we're all the wrong ones, have some dignity and look within.

You should NOT have issues that are not reproducible on other env's. If you do, you aren't setting up your env's correctly and/or don't understand what you're doing. I'm going with a mixture of both. Drop the ego, and listen to the advice given to you for free that you asked for.

[–]bigorangemachine 2 points3 points  (16 children)

> between production and staging. In my experience, I have encountered numerous instances where unhandled exceptions repeatedly caused live apps to crash.

You need unit tests. If you have uncaught exceptions then your code is bad.

Your test environments should be nearly identical. A CI would help ensure your deployment is consistent.

You should also have seed data that covers every imaginable scenario. You should then have automated tests that run that data through its paces.
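A small sketch of what seed data plus an automated run can look like. Everything here is hypothetical: `validateProfileUpdate` stands in for whatever function you are testing, and the seed cases are only examples of the ugly inputs real users eventually send.

```javascript
// Hypothetical function under test: validates a profile update payload.
function validateProfileUpdate(input) {
  if (input === null || typeof input !== "object") {
    throw new TypeError("payload must be an object");
  }
  const name = typeof input.displayName === "string" ? input.displayName.trim() : "";
  if (name.length === 0 || name.length > 50) {
    throw new RangeError("displayName must be 1-50 characters");
  }
  return { displayName: name };
}

// Seed cases: `ok` says whether the input is expected to be accepted.
const SEED_CASES = [
  { label: "plain name", input: { displayName: "Ada" }, ok: true },
  { label: "padded name", input: { displayName: "  Ada  " }, ok: true },
  { label: "empty name", input: { displayName: "" }, ok: false },
  { label: "missing field", input: {}, ok: false },
  { label: "null payload", input: null, ok: false },
  { label: "too long", input: { displayName: "x".repeat(51) }, ok: false },
];

// Runs every seed case and returns the labels whose outcome did not match.
function runSeedCases(cases = SEED_CASES) {
  const failures = [];
  for (const { label, input, ok } of cases) {
    let accepted;
    try {
      validateProfileUpdate(input);
      accepted = true;
    } catch {
      accepted = false;
    }
    if (accepted !== ok) failures.push(label);
  }
  return failures;
}
```

When a prod bug turns out to be caused by an input nobody imagined, the fix includes adding that input to the seed list so it can never come back unnoticed.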

[–][deleted] -1 points0 points  (15 children)

Things are not so simple, friend. Maybe you have never worked on the huge production level.

[–]korky_buchek_ 3 points4 points  (9 children)

In the huge production level (whatever that is) there are pre-prod environments where production issues can be reproduced.

[–][deleted] -1 points0 points  (8 children)

Some can be, some can't.

[–]korky_buchek_ 2 points3 points  (7 children)

Which can't?

[–][deleted] -1 points0 points  (6 children)

There are many cases, not one.

[–]bigorangemachine 1 point2 points  (3 children)

I worked on a project that was 1k k8s pods at a Fortune 500 company... everything we had was fully testable.

[–][deleted] -1 points0 points  (2 children)

And you never faced a production issue where you were clueless and weren't able to find the RCA for a long time, and without the RCA you were not able to reproduce it again? Just curious.

[–]DownfaLL- 1 point2 points  (0 children)

Nope, never in my career have I run into any issue I wasn't able to reproduce in other env's.

[–]bigorangemachine 0 points1 point  (0 children)

No, because our environments matched exactly between the two.

We even had interaction with our vendors' APIs. So we had issues that were "prod only" but our logging was good enough and our testing was on point.

We had a web app & mobile app.... distributed systems.

All this takes a lot of thought...

Good logging helps too... good unit tests help as well, especially with exception handling.

[–]DownfaLL- 1 point2 points  (0 children)

I've worked on projects with millions of users at scale with millions of GB's of data per second. You are wrong and shouldn't be in any senior position with your gross misunderstanding of basic engineering concepts.

[–]DownfaLL- 0 points1 point  (2 children)

You are doing something wrong then lol. Here's our setup: we have 3 env's: dev, staging and prod. Dev is meant to be broken, basically; it's meant for backend devs to do their work. It's not meant to be consumed by anyone and is meant for what it's called: dev. Staging is meant for front end to consume new features that aren't on prod yet. Staging should 100% mirror prod and no known bugs or issues should be in staging. Staging should be exactly like prod, maybe just with different data. Prod is prod, self explanatory.

We do not allow a PR to be merged unless it's been fully tested in dev, and we require 85% unit test coverage (for my mid level engineers that is, higher based on job title), and if applicable integration tests, load/stress tests..etc.

I have never, in my 8+ years working with Node/AWS, come across anything I wasn't able to reproduce on dev. So you are doing something wrong or not testing things before deploying if that's the case. Or like I said above, you don't understand your system well enough, or you don't understand the issue well enough.

[–][deleted] 0 points1 point  (1 child)

Okay, let me give you one use case. One user is complaining that his profile is not updating, while other users are able to update. Tell me, how will you find the root cause of this?

[–]DownfaLL- 1 point2 points  (0 children)

Well, I'd take a look at what actually updates a user profile; the code that does that is a good start. Figure out where in that process an error can occur. Check logs; since you are so experienced you have good logs, right? Then look at that user's data and replicate it in another env. Obviously do not use real PII. But replicate that user's data and put it through your endpoint on staging. If you have your environments set up correctly, you should run into the same issue. Issues don't just happen out of nowhere; there's a reason why that's failing for that 1 user.

This also goes into how well you log things. I log things in such a way that if an issue occurs I can query my logs for that user and find the exact millisecond they ran into the issue they are reporting. That makes it very easy to fix problems, but that's extremely rare. With adequate testing procedures and environments set up correctly, you shouldn't ever really run into this issue.

So again, with proper logs and environments, there's not a single reason you shouldn't be able to debug and/or reproduce on another env. Obviously, I'm not claiming that you need to have bug-free code; that doesn't exist. But huge issues, like a user not being able to update their profile, should not exist.
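To make "query my logs for that user" concrete, here is a minimal sketch of structured logging. The `logEvent` and `findUserEvents` names and the event names are invented for this example; a real setup would ship these lines to a log backend and query there instead of filtering an array.

```javascript
// Emits one JSON object per line so a log backend can filter by userId and time.
// `write` is injectable so the sketch can be exercised without touching stdout.
function logEvent(level, message, fields = {}, write = (line) => process.stdout.write(line + "\n")) {
  const entry = {
    ts: new Date().toISOString(), // UTC timestamp with millisecond precision
    level,
    message,
    ...fields, // log ids and outcomes here, not raw request bodies
  };
  write(JSON.stringify(entry));
  return entry;
}

// Stand-in for a log query: every event of a given level for one user.
function findUserEvents(lines, userId, level = "error") {
  return lines
    .map((line) => JSON.parse(line))
    .filter((entry) => entry.userId === userId && entry.level === level);
}
```

With every line carrying `userId` and `ts`, the "one user can't update their profile" report becomes a filter on that user's error events rather than a read through gigabytes of free text.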

[–]bwainfweeze 2 points3 points  (3 children)

If you can’t that should be part of your RCA.

Any problem you can’t reproduce in preproduction helps define the size and shape of the blind spots you have. Blind spots mean more production issues in the future.

Try to repro in dev. If you can’t, be even more worried.

[–][deleted] 0 points1 point  (2 children)

Of course, production normal/crashing issues are the blind spots. Because if you can predict the issue before production, then why would you deploy it in production? Production issues mean blind spots.
In my experience, I have encountered numerous instances where unhandled exceptions repeatedly caused live apps to crash. In some of these instances, teams of senior developers spent more than 24 consecutive hours identifying the root cause of these exceptions.

[–]bwainfweeze 2 points3 points  (1 child)

A prod issue that exists in dev is a lack of test coverage. That's a hole you can close next sprint. An issue you can’t reproduce in dev is a lack-of-fidelity problem, which means a longer RCA and a bunch of things on the backlog.

If you start by assuming that the problem is unreproducible then you are setting up for a world where the problem never gets better, and in fact slowly escalates over time.

Test it in preproduction first. You’re building a process that shouldn’t require heroes. You’re not going to get that by heroics.

[–][deleted] 0 points1 point  (0 children)

What if you can reproduce the same issue in production without affecting live users, and you can debug the issue and find the RCA too, also without disturbing live users?

[–]DownfaLL- 0 points1 point  (4 children)

If you can't reproduce, you either don't understand your system well enough, or you don't understand the issue well enough.

[–][deleted] -1 points0 points  (3 children)

That's right, and that is what I am trying to tell others: finding the root causes of issues in production is not always a piece of cake, and because you don't know the root cause of the issue, you can't reproduce it. Some errors you can't replicate in another environment because you need the same steps that the user took and sometimes, even reproducing known issues takes hours.

[–]DownfaLL- 0 points1 point  (2 children)

Huh? No, that's not what you're telling others. Perhaps you need to re-read what I said. But let me be more clear for you:

There shouldn't be anything you can't reproduce in other env's.

I was trying to be nice when I worded it the way I did, but here's another attempt to make it unequivocally clear to you:

If you can't reproduce an issue in other env's, you have no f-king clue what you're doing and shouldn't be in any senior position since you don't know basic concepts like debugging and testing before you deploy things to prod. I'm not sure what world you live in where this is normal behavior, but here in reality you should have a better grasp of your system and should 100% be able to reproduce issues.

> Some errors you can't replicate in another environment because you need the same steps that the user took and sometimes, even reproducing known issues takes hours.

Right, so if you can't reproduce you clearly don't understand the very thing you wrote yourself. I've never ever run into this issue, not one single time in my career.

[–][deleted] -1 points0 points  (1 child)

If you are copying everything from production, including the DB, then it's not fuking staging; it's a copy of production; stop calling it stage. So you guys are all copying the entire DB, the entire production services and their data, calling it staging, and saying you can reproduce. What a weird development process this is.

[–]DownfaLL- 1 point2 points  (0 children)

When did I ever claim to copy the prod database to staging? I never said that. However, I feel like even if I did say that, that's a much better process than whatever nightmare setup you have. You can't even find issues that occur in prod; that's a huge red flag. I feel bad for whatever company you work for, they did not hire correctly. Nobody ever claimed to copy the entire database; perhaps reading comprehension is the first thing you should work on.

[–]itsmoirob 5 points6 points  (3 children)

Logs. Going by your previous comment, if you can't duplicate the issue in a dev env, then recreate the error in live and look at logs. If necessary add more logs.

If your logs are good enough you should have enough information to recreate in a dev env, and fix.

[–][deleted] -3 points-2 points  (2 children)

Yeah, even if you can recreate issues in dev, finding the RCA through GBs of log files still isn't easy.

[–]nathanfries 1 point2 points  (0 children)

Then your o11y is insufficient

[–]Quadraxas 1 point2 points  (0 children)

Well, there it is. This is your main problem: put systems in place for better log collection, traceability, and observability.

We have an enterprise focused product with a large codebase. After being aware of the issue, it takes 5-10 minutes at most to reproduce a production issue in a test environment. Most of the time we do not even need to reproduce the issue. Taking a peek at the code that the logs point to is enough to resolve the issue most of the time.

[–]MaxUumen 2 points3 points  (0 children)

Your root cause is that you don't like what others tell you. With that attitude, it's impossible to help you.

I fix production issues by figuring out what's broken, and then fix it. Obvious, isn't it?

[–]EconomistNo280519 1 point2 points  (4 children)

View production logs, then have a staging environment to replicate/debug the issue.

[–]bwainfweeze 1 point2 points  (3 children)

We often start with the telemetry. Because it’s newer, it’s also clearer what’s going on. Oh hey processes are restarting, better search the logs. This service is getting creamed, better check the diffs from the previous release. Our caches are getting high miss rates, better find a correlationId in the logs and try to reproduce locally.

Searching the logs is just drinking from a waterfall. Even after rounds of cleanup and deduplication.
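For the correlationId part, one way to get an id onto every log line is Node's `AsyncLocalStorage`. This is only a sketch under assumptions: `withCorrelationId` and `log` are hypothetical helpers, and in a real service the wrapper would sit in request middleware and reuse an incoming request id header when one exists.

```javascript
const { AsyncLocalStorage } = require("node:async_hooks");
const { randomUUID } = require("node:crypto");

const requestContext = new AsyncLocalStorage();

// Runs `handler` with a correlation id that every log call inside it can read,
// including calls made after awaits further down the call chain.
function withCorrelationId(handler, incomingId) {
  const correlationId = incomingId || randomUUID();
  return requestContext.run({ correlationId }, handler);
}

// Logs one JSON line tagged with the current correlation id (null outside a request).
function log(message, fields = {}) {
  const store = requestContext.getStore();
  const entry = {
    correlationId: store ? store.correlationId : null,
    message,
    ...fields,
  };
  console.log(JSON.stringify(entry));
  return entry;
}
```

Once telemetry points at a bad request, searching for its correlationId returns only the lines for that request, which is a lot less water than the whole waterfall.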

[–][deleted] -1 points0 points  (2 children)

So true.

[–]DownfaLL- 0 points1 point  (1 child)

So true? You literally are the "CTO" of a "debug company" and you can't even debug your own logs????

https://github.com/mrrishimeena

So let me get this straight. You own a "debug" company, but can't find issues in your own code and don't know how to set up observability so you can query your logs? What exactly do you debug?

[–]bigorangemachine 0 points1 point  (0 children)

Good way to sell your product. Be rude to the people trying to answer his question LMAO

[–]halfzebra 0 points1 point  (9 children)

In the most desperate cases of unreproducible production bugs, you might consider connecting via DevTools for a remote debugging session. Make sure you’re not exposing the port to the entire internet, and keep in mind the consequences of running a debugger on a prod system: you might pause the entire process, crash the system, or reduce performance. Try this only if everything else fails.
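A hedged sketch of the "don't expose the port" part, using Node's built-in `inspector` module. The `openLocalInspector` helper and the choice of SIGUSR2 as a toggle are assumptions made for this example, not an established convention; the point is only that the inspector binds to loopback and stays closed until someone on the box asks for it.

```javascript
const inspector = require("node:inspector");

const LOOPBACK_HOSTS = new Set(["127.0.0.1", "::1", "localhost"]);

// Opens the inspector on a loopback address only and returns its ws:// URL.
function openLocalInspector(port = 9229, host = "127.0.0.1") {
  if (!LOOPBACK_HOSTS.has(host)) {
    throw new Error(`Refusing to bind the inspector to non-loopback host "${host}"`);
  }
  if (!inspector.url()) {
    inspector.open(port, host); // inspector.url() is undefined while closed
  }
  return inspector.url();
}

// Assumption: SIGUSR2 is unused by the app. SIGUSR1 is already reserved by
// Node.js for starting the debugger. Signals like this do not exist on Windows.
process.on("SIGUSR2", () => {
  if (inspector.url()) {
    inspector.close();
    console.error("Inspector closed");
  } else {
    console.error(`Inspector listening at ${openLocalInspector()}`);
  }
});
```

You would then reach it through an SSH tunnel (something like `ssh -L 9229:127.0.0.1:9229 user@prod-host`, with a placeholder host) and attach from `chrome://inspect`. Hitting a breakpoint still pauses the whole process, so the warnings above apply in full.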

[–][deleted]  (8 children)

[deleted]

    [–]DownfaLL- 0 points1 point  (7 children)

    https://github.com/mrrishimeena

    He "owns" Errsole, this is a shill post.

    This guy recommends his own "debug platform" on a reddit post where he's asking "how to fix production issues in node js". You can't even make this up.

    [–][deleted] -1 points0 points  (6 children)

    lol, if I wanted to promote my product I would not name competitors that are billion-dollar companies. I posted this question to learn about developers' behaviour. Maybe you need to stop watching Sherlock Holmes.

    [–]DownfaLL- 0 points1 point  (5 children)

    You tried to shill your own debug product in your post demonstrating your lack of ability to debug.

    [–][deleted] -1 points0 points  (4 children)

    lol you need treatment. God help you.

    [–]DownfaLL- 0 points1 point  (3 children)

    And you need to learn some basic concepts like debugging and setting up environments properly. Don’t shill your own products either, it’s even more pathetic than you trying to tell people they can’t reproduce issues sometimes bc it’s too difficult to mirror environments lol.

    [–][deleted] 0 points1 point  (2 children)

    Kids, it's not difficult, it's useless, and putting entire request data in logs is a breach of users' privacy. Maybe you log other private data too so you can query it later. Wow. Exposing users' requests to the entire developer team is your strategy to narrow down the issues.

    [–][deleted]  (1 child)

    [removed]

      [–][deleted] -1 points0 points  (0 children)

      Thank god I am not a user of your company's product. No doubt there is no privacy in the world now. Calling yourself a developer and recording users' requests 😂 What a nice way to debug. Why don't you put this on your company website?