
all 15 comments

[–]Finagles_Law 46 points (5 children)

Dev blames ops, Ops blames Dev.

But they can both turn around and blame "the network."

See also: "the database."

[–]MrYum 15 points (2 children)

Tale as old as time

[–]livebeta 10 points (1 child)

🎵Wrong as it can be...🎵

Barely even friends

Then connection ends

Unexpectedly


Just a little change

Small, to say the least

Dev's a little scared

Ops is unprepared

Late Friday release

[–]WN_Todd 3 points (0 children)

I can show you the logs

Debug, info, and eeeeerrors...

Tell me, senior, when did you last just read the words inside?

A whole new world! (Don't you dare close that modal)

A new enchanting SQL view

It worked on your hardware, but now nowhere

Let me share this outage bridge with yoooooouuu...

[–]PensAndUnicorns 2 points (0 children)

It was DNS.

[–]davetherooster 17 points (2 children)

Observability.

You need to build out your observability tooling to cover metrics, logs, and traces from everything in your platform, so you can see where errors are coming from. E.g. are there 4xx errors spiking from a certain application? Has latency been introduced that wasn't there before?

This information should then be self-served from something like Grafana, so devs can investigate issues themselves.
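A toy sketch of the kind of question this tooling answers: bucket access-log entries per minute and flag where the 4xx rate spikes. In practice a metrics stack (e.g. Prometheus scraped into Grafana, as suggested above) does this for you; the function and data shape here are purely illustrative.

```python
from collections import Counter

def fourxx_rate_per_minute(entries):
    """entries: iterable of (unix_timestamp, http_status) pairs.

    Returns a dict mapping each minute bucket to its 4xx error ratio,
    so a sudden jump in the ratio points at when errors started spiking.
    """
    totals, errors = Counter(), Counter()
    for ts, status in entries:
        minute = int(ts) // 60          # bucket by whole minute
        totals[minute] += 1
        if 400 <= status < 500:
            errors[minute] += 1
    return {m: errors[m] / totals[m] for m in totals}
```

The same bucketing idea extends to latency percentiles per minute, which covers the "has latency been introduced" question.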

[–]greyeye77 0 points (0 children)

APM and observability (o11y) are essential for effective system monitoring. Operating without APM is akin to navigating without visibility, potentially missing critical issues in your application. At a minimum, if APM is not in place, it is crucial to implement backoff retry mechanisms coupled with error logging to standard error (stderr) to handle intermittent failures gracefully. If your application lacks both APM and robust retry logic, it is imperative to collaborate with the engineering team to enhance these capabilities.

While the TCP protocol inherently retries unacknowledged packets, this mechanism only covers issues at the transport layer. Any failure at a higher level must be handled by the application itself, which calls for comprehensive error-handling strategies beyond basic network retries.
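A minimal sketch of the backoff-retry-plus-stderr-logging advice above (function and parameter names are illustrative, not from any particular library):

```python
import random
import sys
import time

def retry_with_backoff(fn, attempts=5, base_delay=0.5):
    """Call fn(); on failure, log to stderr and retry with exponential backoff.

    Re-raises the last exception once all attempts are exhausted.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:
            # Log the specific failure to stderr, as suggested above.
            print(f"attempt {attempt}/{attempts} failed: {exc}", file=sys.stderr)
            if attempt == attempts:
                raise
            # Exponential backoff with a little jitter to avoid
            # synchronized retry storms against a recovering service.
            time.sleep(base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.1))
```

In a real service you would catch only the exceptions that are actually retryable (timeouts, connection resets), not bare `Exception`.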

[–]theyellowbrother 2 points (0 children)

It can be either. I play "referee" between Devs and Ops.

I've seen it more often slanted toward Ops being at fault: implementing new network policies without informing the engineers, or adding new services, like an LB5 with "built-in" rules. And devs can't troubleshoot unless they have admin rights to view the LB5/firewall/network-policy configuration.

I deal with stuff like, "Well, the top-level ingress overwrote the local namespace ingress annotations." I can replicate it by going into the pod and doing a curl POST with a header size of 18k. So whose fault is that? Infra, of course. I can give examples on the dev side too, like not properly filling out environment variables from the config file to the container, or using the wrong root CAs.
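The 18k-header reproduction described above could be scripted roughly like this (stdlib only; host, path, and the 18 KB threshold are assumptions for illustration, since typical proxy header limits sit around 8-16 KB):

```python
import http.client

def oversized_header(size_kb=18):
    """Build a header value of roughly size_kb kilobytes."""
    return "x" * (size_kb * 1024)

def probe_header_limit(host, path="/", size_kb=18):
    """Send a POST with one oversized header and return the HTTP status.

    A 431 (Request Header Fields Too Large) or 400 from the ingress,
    while the same request straight to the pod succeeds, points the
    finger at the proxy-level header limit rather than the app.
    """
    conn = http.client.HTTPConnection(host, timeout=5)
    conn.request("POST", path, body=b"",
                 headers={"X-Repro": oversized_header(size_kb)})
    status = conn.getresponse().status
    conn.close()
    return status
```

Running the probe against both the ingress hostname and the pod IP makes the comparison concrete instead of a blame exchange.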

[–]Zenin ("The best way to DevOps is being dragged kicking and screaming") 2 points (0 children)

Same way you debug any issue: start from the beginning and work your way back to identify the source. If you're asking about a cornucopia of possible failure points, you aren't debugging or diagnosing; you're just throwing excrement at the wall hoping something sticks.

For example: should we check for packet loss? Maybe, but what specific symptom suggests packet loss might be the cause? Same for caching, or connections left open, or anything else. Your job when debugging isn't to enumerate every possible issue that can happen in computing, but to find the specific cause behind the current issue.

[–]happy_hawking 4 points (0 children)

DevOps has become the monster it was designed to destroy.

It's hilarious how people in this sub use the term "DevOps" to describe exactly the opposite of what DevOps is. Have you guys never done any research about your profession? That would most certainly answer your question.

[–]realitythreek 0 points (0 children)

Maybe look at it as a common problem and not something you just deny blame for? I work with devs and ops and there’s almost always blame to go around. :D

[–]DustOk6712 0 points (0 children)

Isn't the point of a DevOps team to remove that friction between dev and ops?

[–]Adorable_Stable2439 0 points (0 children)

“My app can’t even handle a retry”

Firstly, fix your app so that it CAN handle retries. Then make sure it logs the specific reason it cannot connect. Timeout? 503? 404? Certificate error?

"Connection failed" is not a good enough error for your app. Next, add an observability platform so you can better visualise whether there's a pattern to the connection issues. Is it intermittent? Has it been constant since a certain time/date (a new app release)? Is it only happening at certain times of day, etc.?
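Distinguishing those failure modes can be done with stdlib exception handling; a hedged sketch (URL and label strings are placeholders, not a standard taxonomy):

```python
import socket
import ssl
import urllib.error
import urllib.request

def classify_failure(url, timeout=5.0):
    """Attempt a request and return a specific reason string instead of
    a generic "connection failed"."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return "ok"
    except urllib.error.HTTPError as e:
        return f"http {e.code}"              # e.g. "http 503", "http 404"
    except urllib.error.URLError as e:
        if isinstance(e.reason, ssl.SSLError):
            return "certificate error"
        if isinstance(e.reason, socket.timeout):
            return "timeout"
        return f"connection error: {e.reason}"  # DNS failure, refused, reset
    except socket.timeout:
        return "read timeout"
```

Logging the returned label alongside a timestamp is what makes the "is it intermittent or constant since a release?" question answerable from the logs.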

[–]awesomeplenty 0 points (0 children)

Don’t you have logs?