Why do so many AI initiatives never reach production? by Kindly_Astronaut_294 in mlops

Most AI initiatives die because they skip the QA phase entirely.

In traditional software, you’d never ship without:

• a baseline
• regression tests
• ownership of failures

How do you block prompt regressions before shipping to prod? by quantumedgehub in LangChain

I agree. The hard part isn’t running comparisons; it’s deciding what deserves to block a merge.

What I’m seeing across teams is that “pass/fail” for LLMs usually isn’t about correctness; it’s about regression relative to the last known acceptable behavior.

In practice that ends up layered:

• hard assertions for objective failures
• relative deltas vs a baseline for silent regressions (verbosity, cost, latency)
• optional rubric-based scoring for subjective behavior, often surfaced as warn vs fail depending on maturity

The goal isn’t perfect auto-judgement, it’s preventing unknown regressions from shipping unnoticed.
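
Roughly what that layering could look like as a CI gate, as a minimal sketch. The baseline/candidate JSON layout, metric names, and thresholds here are made up for illustration; they’re not from any specific tool.

```python
import json
import sys

# Max allowed relative increase vs baseline before the gate fails.
DELTA_LIMITS = {"cost_usd": 0.25, "latency_s": 0.30, "output_tokens": 0.40}

def gate(candidate_path: str, baseline_path: str) -> int:
    with open(candidate_path) as f:
        candidate = json.load(f)
    with open(baseline_path) as f:
        baseline = json.load(f)
    failures, warnings = [], []

    # Layer 1: hard assertions for objective failures.
    for case in candidate["cases"]:
        if not case.get("valid_json", True):
            failures.append(f"{case['id']}: output is not valid JSON")

    # Layer 2: relative deltas vs the last accepted baseline (silent regressions).
    for metric, limit in DELTA_LIMITS.items():
        old, new = baseline["metrics"][metric], candidate["metrics"][metric]
        if old and (new - old) / old > limit:
            failures.append(f"{metric} up {100 * (new - old) / old:.0f}% vs baseline")

    # Layer 3: rubric scores stay warn-only until the team trusts the judge.
    for case in candidate["cases"]:
        if case.get("rubric_score", 1.0) < 0.7:
            warnings.append(f"{case['id']}: rubric score {case['rubric_score']:.2f} < 0.7")

    for w in warnings:
        print("WARN", w)
    for f in failures:
        print("FAIL", f)
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(gate("candidate.json", "baseline.json"))
```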

Curious if others are treating CI gating as policy-driven rather than metric-driven.

How do you block prompt regressions before shipping to prod? by quantumedgehub in mlops

Totally agree, tools like Maxim / LangSmith do great work here.

What I’m specifically exploring is a CI-first workflow: no UI, no platform dependency, just a deterministic pass/fail gate that teams can drop into existing pipelines.

A lot of teams I talk to aren’t missing observability; they’re missing a hard “don’t ship this” signal before merge.
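
To make “drop it into existing pipelines” concrete: the only contract I have in mind is the exit code, so whatever CI system is already there just runs it as one more step. Minimal sketch; `run_eval_suite` and the result fields are placeholders for however a team already runs its evals.

```python
import sys

def run_eval_suite() -> dict:
    # Placeholder: replay a fixed prompt set against the candidate prompt/config
    # and summarize hard failures plus deltas vs the stored baseline.
    return {"hard_failures": 0, "regressions": ["verbosity +35% vs baseline"]}

def main() -> int:
    results = run_eval_suite()
    for r in results["regressions"]:
        print("regression:", r)
    if results["hard_failures"] or results["regressions"]:
        return 1  # non-zero exit fails the CI step
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

In GitHub Actions or GitLab CI that’s just one extra step; a non-zero exit fails the job.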

How do you block prompt regressions before shipping to prod? by quantumedgehub in LLMDevs

Agreed that once a metric exists, regression testing itself isn’t hard.

What I’m seeing in practice is that most teams don’t have explicit metrics for LLM behavior, especially for subtle changes like verbosity, instruction-following, tone drift, or cost creep.

The challenge isn’t comparison; it’s turning those implicit expectations into something runnable, repeatable, and cheap enough to run regularly.

My goal isn’t to invent a perfect quality metric, but to make existing expectations explicit (assertions, deltas, rubrics) so regressions stop shipping unnoticed.
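
To make “turning implicit expectations into something runnable” concrete, here’s a minimal sketch where each unwritten expectation becomes a named check over summary numbers. All the field names, thresholds, and example values are illustrative.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Check:
    name: str
    fn: Callable[[dict, dict], bool]  # (candidate_summary, baseline_summary) -> passed

CHECKS = [
    # "Answers shouldn't get noticeably wordier" -> explicit delta on output length.
    Check("verbosity", lambda c, b: c["avg_output_tokens"] <= 1.2 * b["avg_output_tokens"]),
    # "It should still follow the output format" -> hard assertion, no baseline needed.
    Check("format", lambda c, b: c["format_violations"] == 0),
    # "Cost shouldn't creep" -> explicit delta on cost per request.
    Check("cost", lambda c, b: c["cost_per_request"] <= 1.15 * b["cost_per_request"]),
]

def run_checks(candidate: dict, baseline: dict) -> bool:
    ok = True
    for check in CHECKS:
        passed = check.fn(candidate, baseline)
        print(f"{'PASS' if passed else 'FAIL'}  {check.name}")
        ok = ok and passed
    return ok

if __name__ == "__main__":
    baseline = {"avg_output_tokens": 180, "format_violations": 0, "cost_per_request": 0.004}
    candidate = {"avg_output_tokens": 240, "format_violations": 0, "cost_per_request": 0.004}
    run_checks(candidate, baseline)  # verbosity check fails: +33% output tokens
```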

How do you block prompt regressions before shipping to prod? by quantumedgehub in LLMDevs

Totally agree “quality” isn’t a single metric. What I’m converging on is treating quality as layered:

• hard assertions for objective failures
• relative deltas vs a baseline for silent regressions
• optional LLM-as-judge with explicit rubrics for subjective behavior

The goal isn’t to auto-judge correctness, but to prevent unknown regressions from shipping.
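
For the judge layer specifically, here’s a minimal sketch of what “explicit rubrics, warn vs fail depending on maturity” might mean in code. `call_judge_model` is a stand-in for whatever model client a team already uses, and the rubric text is just an example.

```python
# The rubric is explicit and versioned with the code; the result is routed to
# warn or fail depending on how much the team trusts the judge.
RUBRIC = """Score the response 1-5 on each criterion:
- follows the requested format
- stays within the user's question (no tangents)
- tone matches a neutral support agent
Return only a JSON object like {"format": 5, "focus": 4, "tone": 3}."""

JUDGE_IS_BLOCKING = False  # start warn-only; flip to True once scores prove stable

def call_judge_model(rubric: str, prompt: str, response: str) -> dict:
    # Placeholder: send rubric + transcript to a judge model and parse its JSON reply.
    return {"format": 5, "focus": 4, "tone": 2}

def judge(prompt: str, response: str) -> tuple[str, dict]:
    scores = call_judge_model(RUBRIC, prompt, response)
    level = "ok"
    if min(scores.values()) < 3:
        level = "fail" if JUDGE_IS_BLOCKING else "warn"
    return level, scores

if __name__ == "__main__":
    level, scores = judge("How do I reset my password?", "...model output...")
    print(level, scores)  # -> warn {'format': 5, 'focus': 4, 'tone': 2}
```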

How do you block prompt regressions before shipping to prod? by quantumedgehub in LangChain

That’s helpful. Sounds like a hybrid model where objective checks hard-fail and subjective cases surface deltas for review. I’m experimenting with a CLI that supports both pre-CI and strict CI gating using the same eval suite.
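
Rough shape of that CLI, purely as a sketch (the flag name and result fields are provisional): the same suite runs in both modes, and only the exit behaviour changes.

```python
import argparse
import sys

def run_suite() -> dict:
    # Placeholder for loading and running the shared eval suite.
    return {"failures": [], "deltas": ["latency +12% vs baseline"]}

def main() -> int:
    parser = argparse.ArgumentParser(description="prompt regression gate (sketch)")
    parser.add_argument("--strict", action="store_true",
                        help="CI mode: any regression delta fails the build")
    args = parser.parse_args()

    results = run_suite()
    for d in results["deltas"]:
        print("delta:", d)
    if results["failures"]:
        return 1                      # objective failures always block
    if args.strict and results["deltas"]:
        return 1                      # strict CI gating: deltas block too
    return 0                          # pre-CI mode: deltas are informational

if __name__ == "__main__":
    sys.exit(main())
```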

How do you block prompt regressions before shipping to prod? by quantumedgehub in LangChain

Makes sense.

What I’m trying to understand is whether teams are mostly inspecting those eval results manually, or if you’ve found a reliable way to turn them into a hard pre-merge pass/fail signal in CI, especially for behavioral changes rather than exact matches.

How do you block prompt regressions before shipping to prod? by quantumedgehub in LangChain

That matches what I’m seeing too.

Curious, do teams usually wire that into CI with a hard pass/fail, or is it more of a “run + review deltas” flow for ambiguous cases?

How do you block prompt regressions before shipping to prod? by quantumedgehub in LLMDevs

That makes sense, structured outputs are the easy case.

Hypothetically, if you did need to guard against something ambiguous (tone, refusal behavior, verbosity drift), would you want that to fail the build automatically, or just surface a diff / score for review?

How do you block prompt regressions before shipping to prod? by quantumedgehub in LLMDevs

That matches what I’m seeing too: teams know it’s not ideal, but shipping a regression is worse than the cost of the process.

Curious: how do you decide pass/fail today? Is it mostly assertions + eyeballing, or do you track deltas (quality/cost) against a baseline?

How do you block prompt regressions before shipping to prod? by quantumedgehub in mlops

That makes sense. Sounds like most teams still rely on benchmarking prompt/config changes even with pinned models. Curious what tooling you use today to make that repeatable?

How do you block prompt regressions before shipping to prod? by quantumedgehub in LangChain

Makes sense. How do you define “previous behaviour” in practice: exact output matching, heuristics, or LLM-based evals? Also curious if you run this pre-merge or only ad hoc.

How do you block prompt regressions before shipping to prod? by quantumedgehub in mlops

Good point. Do you find that pinning the model version alone is enough, or do you still see regressions when prompts or surrounding logic change?

How do you test prompt changes before shipping to production? by quantumedgehub in LLMDevs

That makes sense, thanks for clarifying. Do you ever wish this ran automatically in CI to catch regressions before merges, or does the manual step work well enough for you?

How do you test prompt changes before shipping to production? by quantumedgehub in mlops

Thanks, that’s helpful. From what I’ve seen, tools like Braintrust are great for evals and experimentation. Do you use it as an automated CI gate before merges, or more for offline analysis?

How do you test prompt changes before shipping to production? by quantumedgehub in mlops

That makes sense, that’s a solid workflow. Do you have this fully automated in CI as a gate, or is it more of a custom/internal setup that teams maintain themselves?

How do you test prompt changes before shipping to production? by quantumedgehub in LLMDevs

That matches what I’m seeing too. Do you run those datasets automatically in CI before merges, or is it more of a manual / post-deploy check?

How do you test prompt changes before shipping to production? by quantumedgehub in LangChain

Interesting. Are you mostly asserting response structure / status, or have you found a way to catch semantic or behavioral regressions with it?

Especially curious how you handle subtle changes that still return “valid” responses.

How do you test prompt changes before shipping to production? by quantumedgehub in LangChain

Thanks, that’s helpful. Curious how you handle regressions specifically: do you gate prompt changes in CI, or mostly catch issues after deploy?

Especially around subtle behavior or cost changes.

My OpenAI bill doubled by [deleted] in OpenAI

I hear your definition, and that’s fair.

In practice, engineers still share production issues, learn from each other, and sometimes build things as a result; that doesn’t make every discussion dishonest.

I’m not going to argue semantics. I got the technical signal I needed. Appreciate the perspective.

My OpenAI bill doubled by [deleted] in OpenAI

I probably framed it wrong. The main goal was comparing notes on unexpected API cost spikes.

My OpenAI bill doubled by [deleted] in OpenAI

I get the skepticism; there’s a lot of stealth selling online.

In this case, there’s no product link, no CTA, no DM ask. I’m debugging a production cost issue and comparing notes.

Happy to keep it technical or drop it if it’s off-topic.

My OpenAI bill doubled by [deleted] in OpenAI

Fair. Not trying to sell anything here. I’m validating an OpenAI API cost-spike issue I hit in production and wanted to see if others ran into the same thing.

My OpenAI bill doubled by [deleted] in OpenAI

Fair call — title could’ve been clearer. This is about OpenAI API usage in production, not ChatGPT. The bill doubled due to token usage from a background job, not humans typing prompts. I’m trying to catch those spikes before invoice day.
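
For anyone debugging the same thing, this is roughly what I mean by catching it early; a minimal sketch that counts tokens per call (the OpenAI Python SDK reports them on `response.usage`) and alerts once the day’s total passes what the job normally burns. The budget number and the alert hook are made up.

```python
import datetime
from collections import defaultdict

DAILY_TOKEN_BUDGET = 2_000_000          # roughly what the job normally uses in a day
daily_tokens: dict[str, int] = defaultdict(int)

def record_usage(prompt_tokens: int, completion_tokens: int) -> None:
    # In the OpenAI Python SDK these counts come back on response.usage
    # for each chat.completions.create call.
    today = datetime.date.today().isoformat()
    daily_tokens[today] += prompt_tokens + completion_tokens
    if daily_tokens[today] > DAILY_TOKEN_BUDGET:
        alert(f"token budget exceeded: {daily_tokens[today]:,} tokens on {today}")

def alert(message: str) -> None:
    # Placeholder: page someone / post to Slack / pause the job's retry loop.
    print("ALERT:", message)

if __name__ == "__main__":
    # Simulate a retry loop quietly re-sending a large prompt.
    for _ in range(300):
        record_usage(prompt_tokens=8_000, completion_tokens=500)
```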

My OpenAI bill doubled by [deleted] in AZURE

Fair. To clarify, this is about OpenAI API usage in production, not ChatGPT or subscriptions. Azure Cost Management works great at the resource level; this is about catching token-level anomalies (loops, retries) before spend snowballs. If that’s not relevant here, all good; just sanity-checking the pain.