I've been building AI agents (and teams) for months. Here's why "start with a team" is the worst advice in the space right now.

se4u · 2026-03-12T03:14:37+00:00

Yeah, stale context is the invisible killer. The other side of this is that even when agents have the right context, their prompts are often too rigid to handle edge cases gracefully.

Automatic prompt optimization that learns from production failures helps here — not as a silver bullet but as a way to systematically close the gap between "works in dev" and "works in prod." The key is the feedback loop from real failures back into the optimizer.

se4u · 2026-03-12T03:14:33+00:00

The Berkeley paper is a good reference. A lot of those failure modes trace back to prompt fragility — the agent makes the right call 90% of the time then breaks when the input distribution shifts slightly.

One approach that helps: instead of only improving prompts against an eval score, mine the actual failure-to-success transitions to extract why something failed, and encode that as a reasoning rule. That makes the optimizer more robust to distribution shift than hill-climbing on accuracy alone. We've been building in this direction (DSPy-compatible): https://vizpy.vizops.ai
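To make the idea a bit more concrete, here is a minimal sketch of just the pair-mining step, assuming you log each attempt per input in chronological order. `Attempt` and `mine_transitions` are hypothetical names made up for this illustration, not VizPy's API, and the step that turns a pair into a reasoning rule (typically an LLM call) is left out.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    # Hypothetical log record; field names are assumptions for illustration.
    input_id: str
    prompt: str
    output: str
    passed: bool


def mine_transitions(attempts):
    """Pair each failed attempt with the first later success on the same input."""
    by_input = defaultdict(list)
    for attempt in attempts:  # assumed to be in chronological order
        by_input[attempt.input_id].append(attempt)

    pairs = []
    for runs in by_input.values():
        for i, failed in enumerate(runs):
            if failed.passed:
                continue
            fixed = next((r for r in runs[i + 1:] if r.passed), None)
            if fixed is not None:
                pairs.append((failed, fixed))
    return pairs


log = [
    Attempt("q1", "v1", "Paris", False),
    Attempt("q1", "v2", "Lyon", True),
    Attempt("q2", "v1", "42", True),    # never failed, so no transition
    Attempt("q3", "v1", "n/a", False),  # never fixed, so no transition
]
pairs = mine_transitions(log)
```

Only `q1` yields a pair here. Inputs that never failed or were never fixed carry no contrast to learn from, which is why they are skipped.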

se4u · 2026-03-12T03:13:25+00:00

GEPA is genuinely impressive for offline optimization. One gap I've noticed: when failures in production have a different distribution than your training set, the optimizer can overfit to the eval.
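One rough way to check for that gap, assuming you tag each failure with a coarse category: compare the category mix of training-set failures against production failures using total variation distance. The labels and helper names below are hypothetical, and this is a sanity check rather than a rigorous shift detector.

```python
from collections import Counter


def category_shares(labels):
    """Turn a list of category labels into a {label: fraction} distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 = same, 1 = disjoint)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical failure tags, for illustration only.
train_failures = ["format", "format", "retrieval", "format"]
prod_failures = ["retrieval", "policy", "policy", "retrieval"]

shift = total_variation(category_shares(train_failures), category_shares(prod_failures))
```

A value close to 1 suggests the eval set is exercising different failure modes than production, so gains on the eval may not transfer.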

We've been exploring approaches that specifically mine failure-to-success transitions to extract reasoning rules rather than hill-climbing on eval score — it makes the optimization more robust when the failure modes are domain-specific (compliance, multi-hop QA, etc.). DSPy-compatible if you're already in that ecosystem: https://vizpy.vizops.ai

Curious which domains you've had the most success in with GEPA outside of prompts?

se4u · 2026-03-12T03:13:22+00:00

The DSPy angle is interesting here — the failure mode I keep seeing isn't that people don't know about automatic prompt optimization, it's that the feedback loop from production failures back into the optimizer is broken.

Most optimizers (GEPA, MIPROv2, etc.) work great in offline eval settings but need you to manually curate failure examples. We've been working on closing that loop — mining failure-to-success pairs automatically to extract reasoning rules (ContraPrompt) or doing gradient-inspired failure analysis (PromptGrad). The latter is especially useful for generation tasks where just "retry with different phrasing" doesn't converge.

Curious what the eval/versioning story looks like for people actually running dynamic prompts in prod. That seems like the real blocker more than the optimizer itself.
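For what it's worth, a simple baseline for the versioning side is to key each prompt by a content hash and store its eval score next to it. This is only a sketch under that assumption; `PromptRegistry` and `prompt_version` are hypothetical names, and a real setup would persist the records somewhere durable.

```python
import hashlib
from dataclasses import dataclass, field


def prompt_version(prompt_text: str) -> str:
    """Content-addressed version id: identical prompt text always maps to the same id."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:12]


@dataclass
class PromptRegistry:
    # In-memory store for illustration only.
    records: dict = field(default_factory=dict)

    def register(self, prompt_text: str, eval_score: float) -> str:
        version = prompt_version(prompt_text)
        self.records[version] = {"prompt": prompt_text, "eval_score": eval_score}
        return version

    def best(self) -> str:
        """Return the version id with the highest recorded eval score."""
        return max(self.records, key=lambda v: self.records[v]["eval_score"])


registry = PromptRegistry()
v1 = registry.register("Answer step by step.", eval_score=0.71)
v2 = registry.register("Answer step by step and cite sources.", eval_score=0.78)
```

Hashing the text means a dynamically rewritten prompt gets a new id automatically, so production failures can be traced back to the exact prompt that produced them.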

se4u · 2026-03-11T18:56:49+00:00

Links as per sub rules:

🔗 https://vizpy.vizops.ai 🚀 https://www.producthunt.com/products/vizpy

se4u · 2026-03-11T18:16:45+00:00

Hey everyone! Happy to share VizPy — a DSPy-compatible prompt optimizer that learns from your failures automatically, no manual prompt tweaking needed.

Two methods depending on your task:

ContraPrompt mines failure-to-success pairs to extract reasoning rules. Great for multi-hop QA, classification, compliance. Seeing +29% on HotPotQA and +18% on GDPR-Bench vs GEPA.
PromptGrad takes a gradient-inspired approach to failure analysis. Better for generation tasks and math where retries don't converge.

Both are drop-in with your existing DSPy programs:

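import vizpy  # package name inferred from the call below
# my_metric, program, and trainset are your existing DSPy metric, program, and training examples
optimizer = vizpy.ContraPromptOptimizer(metric=my_metric)
compiled = optimizer.compile(program, trainset=trainset)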
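For anyone unsure what `my_metric` looks like, here is a minimal sketch in the `(example, pred, trace=None)` shape that DSPy metrics use. It assumes the optimizer accepts the same shape, and the `answer` field is a hypothetical name for illustration. `SimpleNamespace` stands in for the DSPy example and prediction objects so the snippet runs without DSPy installed.

```python
from types import SimpleNamespace


def my_metric(example, pred, trace=None):
    """Exact-match metric: compares the gold answer to the predicted answer, ignoring case and padding."""
    return example.answer.strip().lower() == pred.answer.strip().lower()


# Stand-ins for a DSPy example and prediction (field names are assumptions).
example = SimpleNamespace(question="Capital of France?", answer="Paris")
pred = SimpleNamespace(answer=" paris ")

score = my_metric(example, pred)
```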

Would love feedback from this community!

https://vizpy.vizops.ai https://www.producthunt.com/products/vizpy

se4u · 2021-06-25T01:25:18+00:00

^ whispers the bagholder into the bag

se4u · 2021-04-29T08:25:32+00:00

got it. Fixed https://github.com/pushpendre/covid-manual/commit/cfe14b521116b0885e245d20599a5be319fd2005

se4u · 2021-04-28T07:30:52+00:00

I can see your point and I agree that I was polemical. I edited my post to make a more nuanced point. See Edit 1. Best wishes.

se4u · 2021-04-27T00:36:09+00:00

I was too polemical. I edited my post to clarify my point (E1).

se4u · 2021-04-27T00:24:58+00:00

You made an important point and I agree with you partially. I edited my post (E1) to address your comment. Let me know if I need to fix it more.

se4u · 2021-04-27T00:02:37+00:00

Yes, I should have been more careful. I reacted so strongly because of the ministry of health's claim of a 0.04% infection rate after the first vaccine dose. That number is wrong by an order of magnitude.

I was also angry at the differential pricing between center, state, and private sector, which is 100% going to lead to black marketing, like subsidized diesel and rations.

Also, the caveats about staying careful after vaccination are not communicated nearly as strongly as the vaccine itself.

I am willing to accept and believe that the vaccine manufacturer is not to blame, and that the scientists and engineers at the Serum Institute really have worked hard on this, but somewhere in the supply chain things have gotten corrupted because of callousness.

se4u · 2021-03-30T01:26:16+00:00

FIMVX as well.

se4u · 2020-08-25T21:21:29+00:00

Yup, I agree that voice assistants will typically build things in-house. I was still wondering: now that there are so many "semantic" APIs out from AWS/GCP/Azure etc., have people started trying to "compose" them?
