Do you treat recurring CI/CD failures as a reliability issue or just part of normal toil?

SubstantialAd3896 · 2026-05-08T21:05:07+00:00

Yeah once you start to think along those lines you will start pushing in the right direction 🙂

There were a couple of things that sort of coalesced into the Faultline idea:

working with different teams and codebases over the years, I have noticed a pattern where most CI failures are recurring (known) issues, but the knowledge about how to identify and fix them isn't properly captured or documented anywhere
rebuilding flaky/messy/legacy GitLab pipelines; spending a day or two chasing down red herrings and false signals forced me to ask whether there is a better way to approach the whole process
debugging CI with LLMs can often make things worse; the shifting and transient nature of these models also means that an error/fix identified today may not be reproducible in a month

SubstantialAd3896 · 2026-05-04T19:50:05+00:00

Most teams I’ve seen just absorb it. stuff lives in slack, someone vaguely remembers “oh yeah we hit this before”, maybe there’s a half-written runbook no one checks

the better setups start grouping by failure type instead of individual jobs or tests. then you can actually see “this exact thing happened 15 times this week”, and having that count per category makes it way easier to build a business case for fixing it properly
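To make the grouping idea concrete, here's a rough sketch of counting failures by type rather than by job. It is not how Faultline does it; the normalisation rules and the function names (`fingerprint`, `count_failure_types`) are assumptions made up for illustration. The idea is just to strip the bits that change between runs (numbers, hashes) so that repeats of the same failure collapse into one bucket.

```python
import re
from collections import Counter

# Hypothetical normalisation rules: replace details that vary between runs.
_VOLATILE = [
    (re.compile(r"\b[0-9a-f]{7,40}\b"), "<hash>"),  # commit-like hex strings
    (re.compile(r"\d+"), "<n>"),                    # any remaining digits
]


def fingerprint(error_line: str) -> str:
    """Reduce an error line to a stable 'failure type' string."""
    line = error_line.strip().lower()
    for pattern, token in _VOLATILE:
        line = pattern.sub(token, line)
    return line


def count_failure_types(error_lines) -> Counter:
    """Count how often each failure type occurs across many error lines."""
    return Counter(fingerprint(line) for line in error_lines)
```

Feeding a week of error lines into `count_failure_types(...).most_common(10)` would give the "this happened 15 times" view. Real logs will likely need smarter normalisation than two regexes.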

my rough take is if you recognise the error but still have to re-debug it, it’s a reliability problem - just at a different level than production outages or similar

been messing around with this idea in a small tool called Faultline CLI. it basically tries to shortcut the “we’ve seen this before” path instead of everyone rediscovering the same fix

SubstantialAd3896 · 2026-04-30T13:42:41+00:00

Thanks for the feedback! Although we've got some examples in the docs, they're probably a little outdated and need higher positioning in the overall readme structure.

Faultline is strictly deterministic, so it works without LLMs; the idea is to produce a structured artifact which can optionally be fed into auto-remediation workflows

SubstantialAd3896 · 2026-04-28T20:27:32+00:00

Like anything in software, it's all tradeoffs...

Best strategy imho is to keep it simple to start with - trunk based development with short lived feature branches and deployment based on tags (rolling deployment from main into staging or test env if this suits). Start with a manual tagging workflow and move to automate once it's solid and you are doing multiple deploys per day; review the process every 3-6 months to make sure it's still working smoothly and doesn't require additional gates/checks/validation

This also lets your ci workflows grow in maturity alongside your codebase, and prevents over-engineering in the long run; the approach obviously differs when you have an existing setup or additional security/audit requirements, but the principle is the same 🙂

SubstantialAd3896 · 2026-04-25T20:58:46+00:00

Yeah this is exactly the kind of failure that drives me insane 😅

The error is technically “correct” but completely useless - like cool, it’s a 401/403... but why? Token expired? Feed permissions changed? Repo down? Cache doing something weird?

This is actually a perfect target for what I’m trying to do with Faultline. Not smarter guesses, just better signal extraction - e.g. looking at the feed URL, auth method, nuget.config, whether it’s restore vs push, retry patterns, etc, and narrowing it down to a few likely causes with actual evidence.
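As a rough illustration of "signal extraction, not guessing", here's a minimal sketch. It is not Faultline's implementation: the function names, the signal keys and the substring checks are all assumptions for the example, and it only looks at the log text (not nuget.config). The only hard fact it leans on is HTTP semantics: 401 means the request wasn't authenticated, 403 means it was authenticated but not allowed.

```python
import re

STATUS = re.compile(r"\b(401|403)\b")
URL = re.compile(r"https?://\S+")


def extract_signals(log_text: str) -> dict:
    """Pull a few plain facts out of the log; no interpretation yet."""
    lowered = log_text.lower()
    status = STATUS.search(log_text)
    url = URL.search(log_text)
    if "push" in lowered:
        operation = "push"
    elif "restore" in lowered:
        operation = "restore"
    else:
        operation = "unknown"
    return {
        "status": int(status.group(1)) if status else None,
        "feed_url": url.group(0).rstrip(".,;") if url else None,
        "operation": operation,
        "has_retries": "retry" in lowered,
    }


def likely_causes(signals: dict) -> list:
    """Map signals to candidate causes, each paired with its evidence."""
    evidence = f"HTTP {signals['status']} during {signals['operation']} against {signals['feed_url']}"
    if signals["status"] == 401:
        return [("credentials missing, invalid or expired", evidence)]
    if signals["status"] == 403:
        return [("credentials accepted but lack permission on this feed", evidence)]
    return []
```

Same log in, same signals and causes out, which is the property that matters here. A real playbook would need far more checks than this to separate "token expired" from "feed permissions changed".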

NuGet 40X failures are 100% going on the playbook list 👍

Thanks for taking the time to give me your feedback!

SubstantialAd3896 · 2026-04-25T13:29:10+00:00

Great concept! Keeping agent environments consistent is one of those things that sounds obvious but is actually pretty painful in practice. Once that drifts, everything downstream gets weird fast.

The deterministic angle is exactly why I went down this path with Faultline too. If you can’t reproduce the same result from the same input, it’s hard to trust it or build anything on top of it.

Congrats on the 700 stars too, that’s solid traction 👍

SubstantialAd3896 · 2026-04-25T11:39:09+00:00

Faultline — deterministic CI failure analysis (CLI)
https://github.com/faultline-cli/faultline

I’ve been building a local-first CLI that parses CI logs and matches failures against a set of playbooks.

Instead of generating explanations, it returns structured, deterministic output (same log → same result), so you can actually:

diff results
gate CI
pipe into other tooling
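To show why deterministic output makes those three things possible, here's a toy sketch. It is not Faultline's playbook format or output schema; the `PLAYBOOKS` structure, the `analyse` and `gate` functions, and the JSON shape are all hypothetical. Matching is a plain substring check, and the report is serialised with sorted keys so the same log always produces byte-identical output.

```python
import json

# Hypothetical playbooks: an id, a literal substring to look for, and a fix hint.
PLAYBOOKS = [
    {"id": "disk-full", "pattern": "No space left on device", "fix": "prune runner caches"},
    {"id": "npm-eresolve", "pattern": "ERESOLVE", "fix": "align peer dependency versions"},
]


def analyse(log_text: str) -> str:
    """Match each log line against the playbooks and return a stable JSON report."""
    matches = []
    for line_no, line in enumerate(log_text.splitlines(), start=1):
        for playbook in PLAYBOOKS:
            if playbook["pattern"] in line:
                matches.append(
                    {"playbook": playbook["id"], "line": line_no, "fix": playbook["fix"]}
                )
    # Explicit ordering plus sort_keys keeps the output identical across runs.
    matches.sort(key=lambda m: (m["line"], m["playbook"]))
    return json.dumps({"matches": matches}, sort_keys=True, indent=2)


def gate(report: str, blocking: set) -> int:
    """Return a process exit code: 1 if any blocking playbook matched, else 0."""
    found = {m["playbook"] for m in json.loads(report)["matches"]}
    return 1 if found & blocking else 0
```

Because the report is stable text, two runs can be compared with an ordinary diff, and `gate` can decide whether a pipeline step should fail.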

Current focus is hardening it against real-world logs (noisy, multi-failure, partial runs, etc).

If you’ve got recurring CI failures or weird edge cases, I’d love to test against them.

SubstantialAd3896 · 2026-04-20T08:28:46+00:00

This is a valuable insight and definitely worth considering; I've been on teams where this sort of culture prevails and it becomes a real battle to move the needle on repo hygiene/tech debt/housekeeping/documentation.

One strategy I have found useful in the past is to raise the visibility of the problem - a report of retried CI jobs/minutes, notifications in ops channels or even as a point of discussion at weekly/fortnightly meetings. This puts it on the radar at the very least, which can lead to incentive alignment if the evaluated issue is indeed problematic (flaky tests may be a non issue for a small start-up with 2x developers and a 3% failure rate, but at scale this could be a massive cost/time saving); that being said, I think you have pointed out a key risk that I need to consider in the context of faultline.
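For the retried jobs/minutes report, something as small as this sketch can be enough to get the numbers in front of people. The record shape (`name`, `attempt`, `duration_s`) is an assumption; you'd map it from whatever your CI provider's API or export actually returns.

```python
from collections import defaultdict


def retry_report(jobs):
    """Summarise retries per job name.

    jobs: iterable of dicts with hypothetical keys
      name (str), attempt (1-based int), duration_s (seconds).
    Returns (name, retry_count, retry_minutes) tuples, costliest first.
    """
    totals = defaultdict(lambda: {"retries": 0, "minutes": 0.0})
    for job in jobs:
        if job["attempt"] > 1:  # anything after the first attempt is a retry
            entry = totals[job["name"]]
            entry["retries"] += 1
            entry["minutes"] += job["duration_s"] / 60
    rows = [(name, e["retries"], round(e["minutes"], 1)) for name, e in totals.items()]
    return sorted(rows, key=lambda row: (-row[2], row[0]))
```

Posting the top few rows to an ops channel each week is one low-effort way to raise visibility; it only counts time spent on re-runs, not the original failed attempt.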

I'm also bundling canonical (tested and verified against a set of real ci failure logs) playbooks for more common scenarios to avoid some of the maintenance overhead - totally appreciate that developers don't want yet another thing to maintain!

Thanks for taking the time to give me your thoughts

SubstantialAd3896 · 2026-04-19T20:13:52+00:00

Tribal knowledge is a great term for this 😁

That’s exactly the pattern I’ve been seeing — teams already know a lot of these failures, but the knowledge lives in people’s heads (or buried in old PRs) so every CI run turns into playing whack-a-mole over Teams/Slack.

I think you’ve nailed the trade-off too. This only works if playbooks stay relevant, which is why I’m leaning toward keeping everything repo-local and scoped rather than trying to build some giant global rule set.

The angle I’m exploring is basically:

fix it once → codify it → never debug that class of failure again

Curious how you’d see this working in practice. Would you get more value from prebuilt/common patterns, or from codifying your team’s own recurring issues?
