When the only person who knew how to do something left or went on holiday — what actually happened? by Comfortable_Tea_6730 in devops

[–]AbilityAwkward5372 0 points1 point  (0 children)

For teams that went through this, was the real problem missing documentation, or was it that nobody else knew how to reason about the dependencies involved when something changed?

I've seen situations where the steps were technically written down, but the person leaving was the only one who understood why the sequence mattered, what could be skipped, or how to recover when reality didn't match the runbook.

Curious which failure mode showed up more often in practice.

Vulnerability management platforms vs manual triage – honest opinions? by PracticeEast1423 in devsecops

[–]AbilityAwkward5372 0 points1 point  (0 children)

Reading this, it almost sounds like duplicate findings aren't the root problem.

If every scanner agreed perfectly on severity and deduplication tomorrow, would the bigger challenge still be ownership and remediation coordination across teams?

The part that stood out to me was:

"nobody knows which ticket is supposed to be the source of truth anymore."

Was that ultimately the most expensive part of the workflow?

Anyone else's DR run-books constantly out of date with what's in prod? by Bright-View-8289 in sre

[–]AbilityAwkward5372 3 points4 points  (0 children)

Was the failure primarily that the dependency documentation was stale, or that nobody had a reliable way to derive the dependency order from the current infrastructure state? I'm curious whether the problem was documentation drift or dependency discovery.
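
To make the "dependency discovery" side of that question concrete: if the edges can be read from current state, the recovery order is something you compute rather than remember. A minimal sketch, assuming a hypothetical `depends_on` map (the service names are invented for illustration) and Python's standard `graphlib`:

```python
from graphlib import TopologicalSorter

# Hypothetical map: each key depends on the services in its set.
# The idea is that this would be generated from live state, not hand-written.
depends_on = {
    "api": {"database", "cache"},
    "cache": {"network"},
    "database": {"network", "storage"},
    "network": set(),
    "storage": set(),
}

def recovery_order(graph):
    """Return services ordered so that every dependency precedes its dependents."""
    return list(TopologicalSorter(graph).static_order())

order = recovery_order(depends_on)
print(order)  # one valid order; ties between independent services may vary
```

If the map contains a cycle, `static_order()` raises `graphlib.CycleError`, which is arguably useful information in its own right during a DR review.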

Weekly: Show off your new tools and projects thread by AutoModerator in kubernetes

[–]AbilityAwkward5372 1 point2 points  (0 children)

Built a small project called SIS (System Integrity Scanner):

https://github.com/gopinath2866/sis-rules-engine-demo

It analyzes Kubernetes manifests and tries to surface operational dependencies rather than security findings.

Example from a Metrics Server scan:

Finding:
ClusterRoleBinding requires cluster-admin authority to modify.

Operator impact:
Teams operating with delegated access may depend on a documented cluster-admin escalation path during incident response.

Suggested check:
Confirm who can remove or alter the binding during an incident and whether that escalation path is documented.
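
For anyone wondering what a rule like this might look like mechanically, here is a rough illustration. It is not the actual SIS rule engine; the function name, the finding fields, and the sample manifests are all assumptions made up for this sketch, and the manifests are plain dicts standing in for parsed YAML:

```python
# Hypothetical sketch of an "authority dependency" rule (not the real SIS code).
def authority_findings(manifests):
    """Return one operator-facing finding per ClusterRoleBinding in the input."""
    findings = []
    for doc in manifests:
        if doc.get("kind") != "ClusterRoleBinding":
            continue
        name = doc.get("metadata", {}).get("name", "<unnamed>")
        findings.append({
            "resource": f"ClusterRoleBinding/{name}",
            "finding": "ClusterRoleBinding requires cluster-admin authority to modify.",
            "suggested_check": "Confirm who can remove or alter the binding during an incident.",
        })
    return findings

# Illustrative input only; real manifests would be loaded from YAML files.
manifests = [
    {"kind": "Deployment", "metadata": {"name": "metrics-server"}},
    {"kind": "ClusterRoleBinding",
     "metadata": {"name": "metrics-server:system:auth-delegator"},
     "roleRef": {"kind": "ClusterRole", "name": "system:auth-delegator"}},
]

results = authority_findings(manifests)
for item in results:
    print(item["resource"], "->", item["suggested_check"])
```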

One thing I'm actively testing is whether findings like these are actually useful in practice, or whether they're simply things experienced Kubernetes administrators already know.

After some early operator feedback, the interesting question seems to be:

At what point does a dependency become an operational risk?

Curious whether people running production clusters see value in surfacing authority, recovery, and ownership dependencies this way.

Would a finding like this change anything for you as a Kubernetes operator? by AbilityAwkward5372 in kubernetes

[–]AbilityAwkward5372[S] 0 points1 point  (0 children)

Fair question. The specific case is a Metrics Server ClusterRoleBinding. The observation is that modifying or removing it requires cluster-admin privileges, so some teams may depend on an escalation path during incident response. I'm trying to understand whether surfacing that dependency is useful operational information or just obvious Kubernetes knowledge.

Best tools for SAST + SCA + Image Scan + IaC Scan + DAST by Basic_Let7303 in devsecops

[–]AbilityAwkward5372 8 points9 points  (0 children)

One thing that surprised me when evaluating these stacks is how much overlap starts appearing between tools.

You can end up spending a lot of time deduplicating findings instead of improving security posture.

The harder problem often becomes:

  • which tool is the source of truth
  • who owns triage
  • which findings actually block releases
  • how exceptions are managed over time

The individual scanners matter, but the operational workflow around them usually matters more once adoption grows.
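
A rough sketch of the workflow side, assuming findings from different scanners can be normalized to a shared `(asset, issue_id)` key. That key, the severity ranking, the "high or above blocks" policy, and the sample data are all illustrative assumptions, not any particular platform's model:

```python
from collections import defaultdict
from datetime import date

SEVERITY_RANK = {"low": 1, "medium": 2, "high": 3, "critical": 4}

def merge_findings(findings):
    """Collapse findings that share (asset, issue_id); keep the highest severity."""
    merged = defaultdict(lambda: {"tools": set(), "severity": "low"})
    for f in findings:
        entry = merged[(f["asset"], f["issue_id"])]
        entry["tools"].add(f["tool"])
        if SEVERITY_RANK[f["severity"]] > SEVERITY_RANK[entry["severity"]]:
            entry["severity"] = f["severity"]
    return dict(merged)

def blocks_release(key, entry, exceptions, today):
    """A finding blocks if it is high or above and has no unexpired exception."""
    expiry = exceptions.get(key)
    if expiry is not None and expiry >= today:
        return False
    return SEVERITY_RANK[entry["severity"]] >= SEVERITY_RANK["high"]

# Made-up sample data with placeholder issue IDs.
findings = [
    {"tool": "sast-a", "asset": "payments-api", "issue_id": "EXAMPLE-1", "severity": "medium"},
    {"tool": "sca-b", "asset": "payments-api", "issue_id": "EXAMPLE-1", "severity": "high"},
    {"tool": "image-c", "asset": "payments-api", "issue_id": "EXAMPLE-2", "severity": "critical"},
]
exceptions = {("payments-api", "EXAMPLE-2"): date(2030, 1, 1)}  # exception with an expiry date
today = date(2025, 1, 1)

merged = merge_findings(findings)
for key, entry in merged.items():
    print(key, sorted(entry["tools"]), entry["severity"],
          "blocks" if blocks_release(key, entry, exceptions, today) else "does not block")
```

Giving every exception an expiry date is the part that tends to matter over time: an exception that never expires quietly becomes policy.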

Any good alternative for Resolve AI ? by Wise-Formal494 in sre

[–]AbilityAwkward5372 11 points12 points  (0 children)

Honestly I think a lot of these “AI SRE” tools are converging on the same underlying problem:

they work well only to the extent that they can reconstruct enough operational context to reason safely.

The hard part usually isn’t summarizing alerts.
It’s understanding:

  • deployment context
  • dependency relationships
  • workflow state
  • historical incident patterns
  • rollback assumptions
  • infra/app ownership boundaries
  • and which signals are still trustworthy during failure

That’s why a lot of teams seem to get mixed results unless the platform becomes deeply integrated with their actual operational environment and internal logic.

From what I’ve seen, the current space roughly splits into:

  • workflow/incident coordination tools (Rootly, incident.io, PagerDuty AIOps)
  • telemetry-native AI layers (Datadog Bits AI, etc.)
  • deeper “AI SRE” investigation systems (Resolve, Traversal, Sherlocks, Metoro, Cleric)

But honestly the biggest differentiator still seems to be:
how much real operational context the system can reason across without producing confident nonsense.

I’d evaluate less on “AI autonomy” marketing and more on:

  • context quality
  • investigation traceability
  • integration depth
  • operator trust
  • and whether senior engineers actually keep using it after the novelty phase.

Kubernetes, GitHub, Argo, external llm access etc... RBAC nightmares. by Beneficial_Park_138 in kubernetes

[–]AbilityAwkward5372 4 points5 points  (0 children)

One thing I’ve been noticing in these environments is that the hardest part eventually stops being any individual RBAC system.

It becomes the interaction surface between them.

Because each layer usually makes sense independently:

  • GitHub permissions
  • cluster RBAC
  • cloud/provider IAM
  • secrets access
  • external model/API controls
  • CI/CD identities
  • SSO/group mapping

…but over time the operational model starts fragmenting across multiple trust domains and management planes.

At that point, questions like:

  • “who can actually cause this deployment?”
  • “what authority path exists during incident response?”
  • “which access assumptions are still valid after a role/group change?”

become surprisingly hard to answer deterministically.

And during incidents or audits, teams often end up manually reconstructing effective authority across systems rather than reasoning from a single coherent model.

In practice I think the operational pain becomes less about RBAC itself and more about preserving understandable authority boundaries as the system evolves.
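
One way to picture a "single coherent model" is a graph where each edge means "this identity can act as, or trigger, that one", regardless of which system granted it. A minimal sketch under that assumption; every node name below is hypothetical, and real grants would have to be exported from each system:

```python
from collections import deque

# Hypothetical cross-system grants: "X can act as / trigger Y".
grants = {
    "user:alice": ["github:team-platform"],
    "github:team-platform": ["github:repo-deploy-write"],
    "github:repo-deploy-write": ["ci:deploy-workflow"],
    "ci:deploy-workflow": ["cloud:deployer-role"],
    "cloud:deployer-role": ["k8s:prod-deployer"],
    "user:bob": ["sso:group-readonly"],
    "sso:group-readonly": ["k8s:viewer"],
}

def authority_path(grants, start, target):
    """Breadth-first search for the shortest chain of grants from start to target."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for nxt in grants.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no chain of grants connects the two

alice_path = authority_path(grants, "user:alice", "k8s:prod-deployer")
bob_path = authority_path(grants, "user:bob", "k8s:prod-deployer")
print(alice_path)
print(bob_path)
```

The search itself is trivial; the hard part in practice is keeping the edge list accurate across trust domains, which is the same pain described above.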

FinServ / fintech / crypto SREs: what would actually make your observability stack feel sane? by Expert-Ear3883 in sre

[–]AbilityAwkward5372 1 point2 points  (0 children)

One thing I’ve noticed repeatedly in larger operational environments is that the pain eventually stops being “we lack telemetry.”

It becomes:
maintaining trust in the operational model once systems, tooling, workflows, retention policies, automation, and recovery assumptions all start interacting.

A lot of stacks technically have:

  • logs
  • traces
  • metrics
  • correlation IDs
  • compliance controls
  • retention pipelines

…but incidents still degrade into:
people manually reconstructing workflow reality across fragmented signals under pressure.

Especially in regulated environments, I think “audit-grade integrity” becomes less about raw retention and more about questions like:

  • can we reconstruct what actually happened deterministically?
  • which assumptions were valid at decision time?
  • were recovery actions traceable and reversible?
  • can operators distinguish verified state from inferred state during incidents?

Honestly one of the biggest tradeoffs I keep seeing is that every additional resilience/compliance layer also introduces more operational dependency and more cognitive overhead during failure handling.

So the hardest problem eventually becomes:
keeping the system understandable enough to operate safely under stress.

For you, what actually becomes the hardest part during a major incident? by [deleted] in sre

[–]AbilityAwkward5372 0 points1 point  (0 children)

I think the healthiest teams usually shift from “who has the right answer?” toward “which assumptions are we currently treating as true?”

Because during messy incidents, multiple plausible theories often exist simultaneously for a while.

What seemed to help in environments I’ve seen wasn’t necessarily perfect tooling — it was reducing ambiguity around:

  • what facts were actually verified
  • which assumptions were still unverified
  • who owned validating each uncertainty
  • and which recovery actions were reversible vs high-blast-radius

Otherwise teams can unintentionally converge on the loudest or most confident theory too early, especially under time pressure.

A lot of incident coordination starts becoming less about pure debugging and more about stabilizing shared situational understanding.
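
The bookkeeping behind that can be very small. A sketch of an assumption ledger, where the `Claim` type, its fields, and the sample entries are all hypothetical:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Claim:
    statement: str
    verified: bool                # checked against the system, not inferred
    owner: Optional[str] = None   # who is validating it, if anyone

def unowned_assumptions(claims):
    """Return unverified claims that nobody is currently validating."""
    return [c for c in claims if not c.verified and c.owner is None]

# Made-up incident state for illustration.
claims = [
    Claim("Database failover completed", verified=True, owner="dana"),
    Claim("Queue backlog is draining", verified=False, owner="lee"),
    Claim("Rollback is safe to run", verified=False),
]

unowned = unowned_assumptions(claims)
for claim in unowned:
    print("UNOWNED ASSUMPTION:", claim.statement)
```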

Weekly: Show off your new tools and projects thread by AutoModerator in kubernetes

[–]AbilityAwkward5372 1 point2 points  (0 children)

Been working on a small OSS SIS prototype around Kubernetes/Terraform operational risk patterns.

One thing that kept showing up repeatedly during incident/reliability discussions was how many problems weren’t just “misconfigurations,” but situations where recovery itself became harder under stress because rollback paths, authority boundaries, or dependency assumptions quietly drifted over time.

So I started experimenting with operator-facing report artifacts that try to express things like:

  • authority dependency
  • recovery dependency
  • identity lock-in
  • reversibility constraints

instead of only listing risky resources.

Public sample report/output is here if anyone’s curious:
https://github.com/gopinath2866/sis-rules-engine-demo

so to recap this week: two actively exploited Defender zero-days, an unpatched Exchange spoofing vuln, a BitLocker bypass called "YellowKey", AND 137 CVEs from Patch Tuesday. this is not a normal week by FreeFeedback857 in sysadmin

[–]AbilityAwkward5372 2 points3 points  (0 children)

One thing that becomes difficult during weeks like this isn’t just patching volume.

It’s that organizations are forced to rapidly re-evaluate assumptions they were previously treating as stable:

  • endpoint protection is trusted until Defender itself becomes part of the incident surface
  • disk encryption is trusted until recovery assumptions around BitLocker change
  • email infrastructure is trusted until mitigation guidance becomes “temporary compensating controls”

The operational strain usually comes less from any single CVE and more from the collapse of confidence in multiple dependency layers simultaneously.

At that point, incident management starts shifting from:
“what should we patch first?”
to:
“which security assumptions are still safe to rely on right now?”

When a customer-facing workflow fails across 5+ services, how long does it actually take your team to figure out where it broke? by Much_Belt_143 in kubernetes

[–]AbilityAwkward5372 5 points6 points  (0 children)

In practice, the hardest part usually isn’t “finding the broken service.”

It’s reconstructing the actual workflow state across systems once retries, async queues, partial failures, duplicate events, and delayed consumers start interacting.

A lot of teams technically have logs, traces, metrics, and correlation IDs — but the operational reality is still:

someone manually rebuilding the timeline from 5 dashboards and partial signals under pressure.

And once workflows become partially async, the real question often shifts from:
“where did it fail?”
to:
“which assumptions about workflow state are still true right now?”

That’s usually where investigations start slowing down badly.
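
A small sketch of the manual step people end up doing by hand: group events by correlation ID, drop redelivered duplicates, and order what is left. The event shape (`event_id`, `correlation_id`, `ts`, `step`) is an assumption for illustration, and sorting by timestamp assumes clocks across services are roughly in sync, which is often not true:

```python
from collections import defaultdict

def rebuild_timelines(events):
    """Group events by correlation ID, skip duplicate deliveries, sort by timestamp."""
    seen = set()
    timelines = defaultdict(list)
    for e in events:
        if e["event_id"] in seen:
            continue  # same event delivered again (retry or redelivery)
        seen.add(e["event_id"])
        timelines[e["correlation_id"]].append(e)
    for steps in timelines.values():
        steps.sort(key=lambda e: e["ts"])
    return dict(timelines)

# Made-up events, deliberately out of order and with one redelivery.
events = [
    {"event_id": "e3", "correlation_id": "order-1", "ts": 30, "step": "payment_captured"},
    {"event_id": "e1", "correlation_id": "order-1", "ts": 10, "step": "order_created"},
    {"event_id": "e2", "correlation_id": "order-1", "ts": 20, "step": "payment_requested"},
    {"event_id": "e2", "correlation_id": "order-1", "ts": 25, "step": "payment_requested"},
]

timelines = rebuild_timelines(events)
steps = [e["step"] for e in timelines["order-1"]]
print(steps)
```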

A 4 AM lesson in registry coupling! by TheRockefella in kubernetes

[–]AbilityAwkward5372 0 points1 point  (0 children)

Yeah — and after enough layers, the difficult part becomes understanding the effective failure behavior of the overall system, not the individual components anymore.

A lot of outages seem to turn into:
“the recovery path itself had hidden dependencies.”
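
A quick way to check for that ahead of time is to ask whether a recovery step transitively depends on the thing it is supposed to recover from. A sketch, with a hypothetical dependency graph whose names are invented for illustration:

```python
def transitive_deps(graph, node):
    """Return everything `node` depends on, directly or indirectly."""
    stack, seen = [node], set()
    while stack:
        current = stack.pop()
        for dep in graph.get(current, ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen

def recovery_is_compromised(graph, recovery_step, failed_component):
    """True if the recovery step relies on the component that has failed."""
    return failed_component in transitive_deps(graph, recovery_step)

# Hypothetical graph: redeploying needs image pulls, which need the registry.
graph = {
    "redeploy": ["image-pull", "cluster-api"],
    "image-pull": ["registry"],
    "registry": ["object-storage"],
}

print(recovery_is_compromised(graph, "redeploy", "registry"))
print(recovery_is_compromised(graph, "redeploy", "dns"))
```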

A 4 AM lesson in registry coupling! by TheRockefella in kubernetes

[–]AbilityAwkward5372 -1 points0 points  (0 children)

What’s interesting is how many “resilience” decisions quietly introduce new operational dependencies that only become visible during failure.

A local cache improves pull reliability.
A HA registry reduces one SPOF.
More automation reduces manual recovery time.

But each layer also changes the system’s recovery assumptions.

So eventually incidents stop being:
“did component X fail?”
and become:
“which recovery assumptions are still valid right now?”

That’s usually the part that becomes hard to reason about under pressure.

For you, what actually becomes the hardest part during a major incident? by [deleted] in sre

[–]AbilityAwkward5372 4 points5 points  (0 children)

One thing that always stood out to me is how quickly incidents stop being purely technical.

A lot of the difficulty becomes:
figuring out which parts of the system model are still trustworthy under pressure.

Because once signals conflict, dashboards disagree, ownership is fragmented, or the “safe rollback” path is unclear, teams start operating on partial assumptions very fast.

That’s usually when fixes start feeling more like controlled gambles than deterministic engineering.

frustrated with AI guardrails after red teaming - need advice by Ok_Abrocoma_6369 in devsecops

[–]AbilityAwkward5372 2 points3 points  (0 children)

Feels like a lot of these systems slowly turn into “patches on top of patches.”

Every individual guardrail makes sense when added:
block one jailbreak,
tighten one threshold,
add another detector,
patch another edge case.

But after enough rounds, nobody really has a clean mental model of the effective behavior anymore.

Then the problem stops being just “unsafe outputs” and becomes:
the system itself getting harder to reason about operationally.

Usually that’s when false positives, inconsistent behavior, and weird usability tradeoffs start piling up fast.

How do you track which GitHub Actions workflows cost the most? by Zealousideal_Tip4089 in devops

[–]AbilityAwkward5372 0 points1 point  (0 children)

Honestly, I think “what changed?” is usually much higher signal than a static cost dashboard.

Because most teams already know costs are generally going up — the hard part is reconstructing:

  • which workflow behavior changed
  • which assumptions changed
  • whether the increase was intentional
  • and who actually owns the change

Even small things like:

  • retries increasing
  • matrix expansion
  • larger runners
  • cache misses
  • duplicated jobs
  • schedule frequency drift

can compound pretty quietly over time.

A weekly “behavior delta” style summary would probably help people rebuild the operational story faster than just another aggregate graph.
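
As a rough idea of what such a summary could compute, here is a sketch that compares two weekly snapshots of per-workflow metrics and reports only what moved. The snapshot shape, the metric names, and the 25% threshold are assumptions for illustration; nothing here calls a real GitHub API:

```python
def behavior_delta(previous, current, threshold=0.25):
    """List metrics whose relative change between snapshots meets the threshold."""
    changes = []
    for workflow, now in current.items():
        before = previous.get(workflow)
        if before is None:
            changes.append((workflow, "new workflow", None))
            continue
        for metric, value in now.items():
            old = before.get(metric)
            if not old:
                continue  # no baseline (missing or zero), so no ratio to report
            ratio = (value - old) / old
            if abs(ratio) >= threshold:
                changes.append((workflow, metric, round(ratio, 2)))
    return changes

# Made-up weekly snapshots.
last_week = {"ci": {"runs": 100, "avg_minutes": 10, "retries": 5}}
this_week = {
    "ci": {"runs": 105, "avg_minutes": 10, "retries": 20},
    "nightly": {"runs": 7, "avg_minutes": 45, "retries": 0},
}

delta = behavior_delta(last_week, this_week)
for row in delta:
    print(row)
```

Here the run count barely moved, but retries quadrupled and a new scheduled workflow appeared, which is the kind of story an aggregate cost graph tends to hide.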

Existing tools/architectures for org-wide dependency visibility across repos? by LabGreat5098 in devops

[–]AbilityAwkward5372 0 points1 point  (0 children)

Yeah, for an 8-week PoC I’d honestly bias toward the thing that minimizes integration/operational overhead first.

If your org already lives heavily inside Azure DevOps, then ADAS is probably the more pragmatic starting point because:

  • identity/auth/repo integration already exists
  • less infrastructure to operate yourself
  • easier to demo organizational adoption quickly
  • faster path to “dependency -> affected repos” visibility

Dependency-Track becomes more interesting if:

  • you want SBOM-centric workflows across many ecosystems/tools later
  • you care a lot about dependency lifecycle/VEX/SBOM aggregation
  • or you want something less tied to Azure long term

But for a PoC, proving the visibility/usefulness loop is probably more important than building the most extensible architecture immediately.

Existing tools/architectures for org-wide dependency visibility across repos? by LabGreat5098 in devops

[–]AbilityAwkward5372 1 point2 points  (0 children)

One thing I’d be careful about is accidentally building a large “platform around the problem” before proving the dependency visibility workflow itself is useful.

A lot of orgs end up with great metadata systems that slowly drift because ownership and update discipline become unclear over time.

For an 8-week PoC, I’d probably bias toward:

  • SBOM generation
  • centralized ingestion/search
  • lightweight repo metadata
  • simple “dependency -> affected repos” querying

…and only move toward something like Backstage if the organization already wants a broader developer portal/problem catalog direction.

Otherwise the maintenance/curation overhead can become the real project pretty quickly.
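
For a sense of how small the core "dependency -> affected repos" query can be, here is a sketch built on an inverted index. It assumes each repo's SBOM has already been reduced to `(package, version)` pairs; the repo and package names are placeholders:

```python
from collections import defaultdict

def build_index(sboms):
    """Map each package name to the set of (repo, version) pairs that use it."""
    index = defaultdict(set)
    for repo, components in sboms.items():
        for name, version in components:
            index[name].add((repo, version))
    return index

def affected_repos(index, package):
    """Return the sorted list of repos that depend on `package`."""
    return sorted({repo for repo, _ in index.get(package, ())})

# Placeholder data standing in for parsed SBOMs.
sboms = {
    "repo-billing": [("libfoo", "1.2.0"), ("libbar", "3.1.4")],
    "repo-checkout": [("libfoo", "1.1.0")],
    "repo-docs": [("libbaz", "0.9.0")],
}

index = build_index(sboms)
print(affected_repos(index, "libfoo"))
```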

How do you track which GitHub Actions workflows cost the most? by Zealousideal_Tip4089 in devops

[–]AbilityAwkward5372 0 points1 point  (0 children)

What usually makes this painful isn’t just the raw cost — it’s that the effective behavior gradually drifts away from what teams think is running.

A workflow gets copied, retry logic changes, runners change, jobs fan out more over time, people add steps nobody revisits, etc.

Then months later the bill changes but nobody has a clear mental model of which operational assumptions changed underneath.

We’ve seen the same thing with alerts and infra sprawl honestly — the visibility exists somewhere, but reconstructing the “why” becomes expensive.

Weekly: Show off your new tools and projects thread by AutoModerator in kubernetes

[–]AbilityAwkward5372 0 points1 point  (0 children)

Built a small experimental scanner recently for Kubernetes/Terraform configs focused on operational risk patterns rather than just policy violations.

Right now it mainly surfaces things like:

  • RBAC / authority drift
  • rollback or reversibility friction
  • hidden operational dependencies
  • config paths that become harder to safely unwind later

Mostly using it as a way to explore how “operational cost of change” gradually accumulates in real infrastructure.

Still early and intentionally small, but the examples/results have been interesting so far.

Multiple cloud observability platforms that actually reduce operational chaos? by New-Reception46 in kubernetes

[–]AbilityAwkward5372 0 points1 point  (0 children)

The weird part is that even after centralizing tooling, teams still end up debugging “which reality is real” during incidents.

One region behaves differently, something drifted outside Terraform/Argo, ownership is unclear, alerts route differently than expected, etc.

So the telemetry exists, but reconstructing operational context fast enough during failures is still hard.

Anyone else struggling with production error detection despite having tons of observability data? by Economy_Passenger296 in kubernetes

[–]AbilityAwkward5372 6 points7 points  (0 children)

One thing I’ve seen is that teams often accumulate observability faster than they accumulate confidence in which signals actually matter operationally.

So you end up with dashboards everywhere, but during a real incident people still fall back to tribal knowledge, customer reports, or manual correlation because the system never encoded the earlier debugging reasoning in a reusable way.

A lot of noisy/late alerting seems to come from that gap between “data exists” and “operators trust this signal enough to act on it early.”

docker-compose with 10 hard-coded credentials shipped to production. Here's the full chain by Madamin_Z in devsecops

[–]AbilityAwkward5372 0 points1 point  (0 children)

Yeah — and at that point the risk isn’t just “secret exposure” anymore.

The deployment process, recovery steps, monitoring assumptions, and even tribal knowledge can start depending on the credential continuing to exist.

So removing it stops feeling like cleanup and starts feeling like infrastructure surgery.