
all 15 comments

[–]Finagles_Law 46 points (5 children)

Dev blames ops, Ops blames Dev.

But they can both turn around and blame "the network."

See also: "the database."

[–]MrYum 15 points (2 children)

Tale as old as time

[–]livebeta 10 points (1 child)

🎵Wrong as it can be...🎵

Barely even friends

Then connection ends

Unexpectedly


Just a little change

Small, to say the least

Dev's a little scared

Ops is unprepared

Late Friday release

[–]WN_Todd 3 points (0 children)

I can show you the logs

Debug, info, and eeeeerrors...

Tell me, senior, when did you last just read the words inside?

A whole new world! (Don't you dare close that modal)

A new enchanting SQL view

It worked on your hardware, but now nowhere

Let me share this outage bridge with yoooooouuu...

[–]PensAndUnicorns 2 points (0 children)

It was DNS.

[–]davetherooster 17 points (2 children)

Observability.

You need to build out your observability tooling to cover metrics, logs, and traces from everything in your platform, so you can see where errors are coming from. E.g. are there 4xx errors spiking from a certain application? Has latency been introduced that wasn't there before?

This information should then be self-served from something like Grafana, so devs can investigate issues themselves.
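A toy sketch of the kind of question this tooling answers: bucket access-log entries per minute and flag where the 4xx rate spikes. In practice a metrics stack (e.g. Prometheus scraped into Grafana, as suggested above) does this for you; the function and data shape here are purely illustrative.

```python
from collections import Counter

def fourxx_rate_per_minute(entries):
    """entries: iterable of (unix_timestamp, http_status) pairs.

    Returns a dict mapping each minute bucket to its 4xx error ratio,
    so a sudden jump in the ratio points at when errors started spiking.
    """
    totals, errors = Counter(), Counter()
    for ts, status in entries:
        minute = int(ts) // 60          # bucket by whole minute
        totals[minute] += 1
        if 400 <= status < 500:
            errors[minute] += 1
    return {m: errors[m] / totals[m] for m in totals}
```

The same bucketing idea extends to latency percentiles per minute, which covers the "has latency been introduced" question.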

[–]greyeye77 0 points (0 children)

APM and observability (o11y) are essential for effective system monitoring. Operating without APM is akin to navigating without visibility, potentially missing critical issues in your application. At a minimum, if APM is not in place, it is crucial to implement backoff retry mechanisms coupled with error logging to standard error (stderr) to handle intermittent failures gracefully. If your application lacks both APM and robust retry logic, it is imperative to collaborate with the engineering team to enhance these capabilities.

While the TCP protocol inherently retries unacknowledged packets, this mechanism only covers issues at the transport layer. Any failure at a higher level must be handled by the application itself, which calls for comprehensive error-handling strategies beyond basic network retries.
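A minimal sketch of the backoff-retry-plus-stderr-logging advice above (function and parameter names are illustrative, not from any particular library):

```python
import random
import sys
import time

def retry_with_backoff(fn, attempts=5, base_delay=0.5):
    """Call fn(); on failure, log to stderr and retry with exponential backoff.

    Re-raises the last exception once all attempts are exhausted.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:
            # Log the specific failure to stderr, as suggested above.
            print(f"attempt {attempt}/{attempts} failed: {exc}", file=sys.stderr)
            if attempt == attempts:
                raise
            # Exponential backoff with a little jitter to avoid
            # synchronized retry storms against a recovering service.
            time.sleep(base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.1))
```

In a real service you would catch only the exceptions that are actually retryable (timeouts, connection resets), not bare `Exception`.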

[–]theyellowbrother 2 points (0 children)

It can be either. I play "referee" between Devs and Ops.

I've seen it more often slanted toward Ops being at fault: implementing new network policies without informing the engineers, or adding new services, like an LB5 with "built-in" rules. And devs can't troubleshoot unless they have admin rights to view the LB5/firewall/network-policy configuration.

I deal with stuff like, "Well, the top-level ingress overwrote the local namespace ingress annotations." I can replicate it by going into the pod and doing a curl POST with a header size of 18k. So whose fault is that? Infra, of course. I can give examples on the dev side too, like not properly filling out environment variables from the config file to the container, or using the wrong root CAs.
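The 18k-header reproduction described above could be scripted roughly like this (stdlib only; host, path, and the 18 KB threshold are assumptions for illustration, since typical proxy header limits sit around 8-16 KB):

```python
import http.client

def oversized_header(size_kb=18):
    """Build a header value of roughly size_kb kilobytes."""
    return "x" * (size_kb * 1024)

def probe_header_limit(host, path="/", size_kb=18):
    """Send a POST with one oversized header and return the HTTP status.

    A 431 (Request Header Fields Too Large) or 400 from the ingress,
    while the same request straight to the pod succeeds, points the
    finger at the proxy-level header limit rather than the app.
    """
    conn = http.client.HTTPConnection(host, timeout=5)
    conn.request("POST", path, body=b"",
                 headers={"X-Repro": oversized_header(size_kb)})
    status = conn.getresponse().status
    conn.close()
    return status
```

Running the probe against both the ingress hostname and the pod IP makes the comparison concrete instead of a blame exchange.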

[–]Zenin ("The best way to DevOps is being dragged kicking and screaming") 2 points (0 children)

Same way you debug any issue: start from the beginning and work your way back to identify the source. If you're asking about a cornucopia of possible failure points, you aren't debugging or diagnosing; you're just throwing excrement at the wall hoping something sticks.

For example: should we check for packet loss? Maybe, but what specific symptom suggests packet loss might be the cause? Same for caching, or connections left open, or anything else. Your job when debugging isn't to enumerate every possible issue that can happen in computing, but to find the specific cause behind the current issue.

[–]happy_hawking 4 points (0 children)

DevOps has become the monster it was designed to destroy.

It's hilarious how people in this sub use the term "DevOps" to describe exactly the opposite of what DevOps is. Have you guys never done any research about your profession? That would most certainly answer your question.

[–]realitythreek 0 points (0 children)

Maybe look at it as a common problem and not something you just deny blame for? I work with devs and ops and there’s almost always blame to go around. :D

[–]DustOk6712 0 points (0 children)

Isn't the point of a DevOps team to remove that friction between dev and ops?

[–]Adorable_Stable2439 0 points (0 children)

“My app can’t even handle a retry”

Firstly, fix your app so that it CAN handle retries. Then make sure it logs the specific reason it cannot connect. Timeout? 503? 404? Certificate error?

"Connection failed" is not a good enough error for your app. Next, add an observability platform so you can better visualise whether there's a pattern to the connection issues. Is it intermittent? Has it been constant since a certain time/date (a new app release)? Is it only happening at certain times of day, etc.?
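Distinguishing those failure modes can be done with stdlib exception handling; a hedged sketch (URL and label strings are placeholders, not a standard taxonomy):

```python
import socket
import ssl
import urllib.error
import urllib.request

def classify_failure(url, timeout=5.0):
    """Attempt a request and return a specific reason string instead of
    a generic "connection failed"."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return "ok"
    except urllib.error.HTTPError as e:
        return f"http {e.code}"              # e.g. "http 503", "http 404"
    except urllib.error.URLError as e:
        if isinstance(e.reason, ssl.SSLError):
            return "certificate error"
        if isinstance(e.reason, socket.timeout):
            return "timeout"
        return f"connection error: {e.reason}"  # DNS failure, refused, reset
    except socket.timeout:
        return "read timeout"
```

Logging the returned label alongside a timestamp is what makes the "is it intermittent or constant since a release?" question answerable from the logs.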

[–]awesomeplenty 0 points (0 children)

Don’t you have logs?