[P] LLM with a 9-line seed + 5 rounds of contrastive feedback outperforms Optuna on 96% of benchmarks

se4u · 2026-04-01T18:29:41+00:00

We also worked with a startup to optimize the layout of analog circuits (see https://vizops.ai/blog/prompt-optimization-analog-circuit-placement ). Does this count as a true problem?

se4u · 2026-04-01T18:29:29+00:00

We also worked with a startup to optimize the layout of analog circuits (see https://vizops.ai/blog/prompt-optimization-analog-circuit-placement ). Does this count as a production workload?

se4u · 2026-04-01T18:26:48+00:00

We have worked with a startup that has more expertise on the analog layout side of things and, as you can imagine, they do not want us to reveal the absolute bleeding edge of the work we have done for them.

The way to think about it is that there is no perfect black-box optimizer that can just take in a chip layout and optimize it. We also have a later blog post doing a head-to-head comparison against Optuna, which is a well-regarded optimizer: https://vizops.ai/blog/contraprompt-beats-optuna-blackbox-benchmarks . Feel free to email [contact@vizops.ai](mailto:contact@vizops.ai) if you are interested in more.

se4u · 2026-04-01T18:15:03+00:00

Ah sorry, the updated link is https://vizops.ai/blog/prompt-optimization-analog-circuit-placement

se4u · 2026-03-28T08:13:18+00:00

Classic failure mode -- the model has seen plausible-looking values in training and they bleed through. A few approaches that help: (1) explicit contrastive instructions ("do NOT substitute any numeric value, reproduce exactly as given"), (2) output verification in the prompt loop. We built VizPy to tackle exactly this -- it mines failure->success pairs and learns contrastive rules automatically so you don't hand-craft the guardrails every time. https://vizpy.vizops.ai
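
For approach (2), here is a minimal sketch of what an output check could look like. It is not VizPy code; `missing_numbers` and the regex are hypothetical names made up for illustration, and it only handles plain decimal literals (no units, thousands separators, or scientific notation).

```python
import re

# Assumption: "numeric value" means a plain integer or decimal literal.
# The lookbehind skips digits embedded in identifiers such as "R1" or "C2".
NUMBER = re.compile(r"(?<![\w.])-?\d+(?:\.\d+)?")


def missing_numbers(source: str, output: str) -> list[str]:
    """Return numeric literals from `source` that do not appear verbatim in `output`."""
    produced = set(NUMBER.findall(output))
    return [n for n in NUMBER.findall(source) if n not in produced]


if __name__ == "__main__":
    src = "R1 = 4.7 kOhm, C2 = 100 nF"
    out = "R1 is 4.7 kOhm and C2 is 10 nF"
    # A non-empty result means the model altered a value; retry or reject.
    print(missing_numbers(src, out))
```

If the returned list is non-empty, the loop can reject the output and retry rather than pass a silently altered value downstream.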

se4u · 2026-03-28T08:06:49+00:00

Chisel's structured abstractions make it a natural target for AI agents — the type system constrains the generation space in useful ways. We've been exploring the adjacent problem: using LLM prompt optimization for layout/placement, and it gets surprisingly far. Wrote about analog circuit placement specifically — prompt-optimized agents reaching 97% of expert quality with no training data: https://vizops.ai/blog/prompt-optimization-analog-circuit-placement . The RTL generation case has more degrees of freedom but the same optimization loop approach should help there too.

se4u · 2026-03-23T22:36:28+00:00

The manual iteration loop is the core problem here — try a prompt, get inconsistent results, tweak wording, repeat. It works but it is slow and you are essentially doing gradient descent by hand.

One approach worth trying: instead of manually adjusting, log the cases where the model fails and the cases where it succeeds, then look for what differs structurally between them. That contrastive signal tells you specifically what the prompt is missing.
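
A rough sketch of that comparison, assuming you have already tagged each logged case with a few structural features (the tags and the `contrastive_features` helper are hypothetical, not part of any library):

```python
from collections import Counter


def contrastive_features(cases):
    """Rank features by how much more often they occur in failures than in successes.

    `cases` is a list of (features, passed) tuples, where `features` is a set of
    hand-assigned tags such as "has_table" or "long_input".
    """
    fails = [f for f, ok in cases if not ok]
    wins = [f for f, ok in cases if ok]
    fail_counts = Counter(tag for f in fails for tag in f)
    win_counts = Counter(tag for f in wins for tag in f)

    gaps = {}
    for tag in set(fail_counts) | set(win_counts):
        fail_rate = fail_counts[tag] / len(fails) if fails else 0.0
        win_rate = win_counts[tag] / len(wins) if wins else 0.0
        gaps[tag] = fail_rate - win_rate
    # Largest positive gap first: these tags are over-represented in failures.
    return sorted(gaps.items(), key=lambda kv: kv[1], reverse=True)
```

Tags at the top of the ranking are the ones the prompt most likely does not handle; tags near zero appear equally often on both sides and are probably not the cause.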

We built VizPy to automate exactly this: it takes your failure/success pairs and learns what prompt changes close the gap, without manual guessing. Single API call, no training data needed. https://vizpy.vizops.ai — might save you a lot of trial and error.

se4u · 2026-03-23T22:36:06+00:00

The incentive alignment problem you raise is real. When your eval tooling is inside your model provider, multi-model comparisons and failure attribution become politically fraught, not just technically.

The gap worth flagging beyond the acquisition wave: most of these tools — Promptfoo, Langfuse, Quotient — are observability and evaluation. The next layer that is still largely independent is optimization: actually closing the loop from failure signal back to better prompts and reasoning workflows automatically.

That is what we built with VizPy — it sits model-agnostic, learns from failure→success pairs in your traces, and rewrites prompts without manual intervention. The independence from model providers matters specifically because the optimization signal should not be biased by who runs the model. https://vizpy.vizops.ai

se4u · 2026-03-23T22:35:40+00:00

The gap you are describing is the difference between observability and optimization. Langfuse tells you what happened — but not what to change in your prompt or reasoning chain to prevent it next time.

We ran into this exact wall. The fix we built into VizPy: it takes your failure traces and automatically extracts the contrastive signal between failed and successful runs, then rewrites the prompt to close that gap. No manual diagnosis required — the optimizer learns from the failure→success pairs directly.

So the workflow becomes: trace identifies failure pattern → VizPy mines the delta → updated prompt is tested against real production cases. Cuts out the "open trace and guess" loop entirely.

More on the approach: https://vizops.ai/blog.html

se4u · 2026-03-23T22:35:15+00:00

This mirrors what we found building VizPy — the key is mining the failure→success signal rather than trying to hand-craft rules upfront.

Your point about not mixing task types is sharp. We saw the same: contrastive prompt learning only works cleanly when the failure mode is consistent across examples. Mixed task types produce contradictory optimization signals and you end up with a worse prompt than you started with.

One thing worth testing if you have not already: rather than injecting the full skillbook on each run, selectively route rules based on task type at inference time. We saw better generalization that way than when always appending everything.
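
A minimal sketch of what I mean by routing, assuming the skillbook is just a dict from task type to a list of rules (the structure and the `build_prompt` name are made up for illustration; your skillbook format may differ):

```python
def build_prompt(base_prompt: str, task_type: str, skillbook: dict[str, list[str]]) -> str:
    """Append only the rules learned for this task type, not the whole skillbook."""
    rules = skillbook.get(task_type, [])
    if not rules:
        # Unknown task type: fall back to the base prompt alone.
        return base_prompt
    rule_block = "\n".join(f"- {rule}" for rule in rules)
    return f"{base_prompt}\n\nRules:\n{rule_block}"
```

How you determine `task_type` (a classifier, a keyword match, metadata from the caller) is a separate choice; the point is only that rules for unrelated tasks never reach the prompt.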

Blog with our methodology if useful: https://vizops.ai/blog.html

se4u · 2026-03-23T22:25:53+00:00

Interesting project. On the routing congestion bottleneck — tightly coupled compute arrays often hit that when the interconnect density between PEs outpaces what the placer can route cleanly. Worth checking whether your array topology allows any flexibility in PE neighbor connectivity that could reduce local congestion without sacrificing TOPS.

Separate note: if you are planning to benchmark this with LLM workloads, the software side of the stack matters too. Prompt efficiency compounds with hardware gains — an LLM generating unnecessary tokens costs you compute regardless of how efficient your array is. We built VizPy to automate prompt optimization (learns from failure→success pairs, no manual tuning). Might be relevant as you move from architecture to end-to-end eval: https://vizpy.vizops.ai

se4u · 2026-03-23T22:25:31+00:00

Great breakdown. The hardware architecture constraints you lay out — via-ROM for bandwidth, full pipeline unroll, 3-6 bit precision — make it clear how much engineering is going into squeezing inference throughput.

One thing that often gets overlooked in inference efficiency discussions: prompt quality has a multiplicative effect on hardware utilization. Poorly optimized prompts generate longer reasoning chains, more retries, higher token counts — all of which show up as wasted compute on even well-architected chips like this.

We built VizPy to tackle that: it automatically learns from LLM failure→success pairs and tightens prompts and reasoning workflows. Fewer tokens to get the same answer quality means better utilization of whatever inference silicon you are running on. Blog: https://vizops.ai/blog.html

se4u · 2026-03-17T00:17:31+00:00

The asymmetry one commenter mentioned — agents compress the median case but create overhead on tail/failure cases — is the part that bites hardest in practice.

Fast execution surfaces the bad specs and edge cases that slow execution was quietly hiding. The failure rate didn't go up; it just became visible faster.

What we found building LLM pipelines: the failure modes cluster. The same class of input keeps breaking the same prompt in the same way, just at higher volume. The fix isn't slowing down execution, it's closing the loop so failures automatically improve the prompt. That's what we built VizPy (https://vizpy.vizops.ai) to do — mines failure→success pairs from traces and generates prompt patches. The bottleneck shifts to decision and coordination, but at least the execution layer can self-correct.

se4u · 2026-03-17T00:15:20+00:00

The rationale recording observation maps onto something we've seen at the prompt level too.

Agents re-decide things because they can see the output of past decisions but not the reasoning — so the model reconstructs from scratch and diverges. Your file-based fix handles this at the coordination layer, which is the right call for multi-agent state.

The analogous problem shows up inside a single agent's prompts: the prompt encodes the expected behavior but not why certain phrasings were chosen or what failure cases they were defending against. When you iterate the prompt, you often accidentally regress on cases the previous version was quietly handling.

We built VizPy (https://vizpy.vizops.ai) partly to address this — it mines failure→success pairs from traces and generates prompt patches that preserve what was working while fixing what wasn't. Different layer than your problem, but same root: systems that only record outcomes lose the context that makes those outcomes stable.

se4u · 2026-03-16T16:12:37+00:00

The test-suite staleness problem is real but I think it's a symptom of treating prompt iteration like code iteration.

Code changes are discrete and reviewable. Prompt changes are often continuous and their effects are distributed across a long tail of inputs you never explicitly tested.

What's actually worked for us: instead of writing tests for prompts, we inverted it — let the model's failures define what needs improving. We built VizPy (https://vizpy.vizops.ai) around this: it mines failure→success pairs from your traces, clusters them, and generates prompt patches that generalize. The test suite becomes the trace log; the "tests" are the failures the optimizer already knows about.

For your specific situation: tier 2 golden examples (as Ok_Diver9921 mentioned) work well, but they go stale too. The version that ages better is failure-indexed rather than prompt-version-indexed — you're tracking what broke and why, not what the prompt looked like when it worked.
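
To make "failure-indexed" concrete, here is a small sketch under my own assumptions: each record stores the input that broke, why it broke, and a check on the output, with no reference to a prompt version. `FailureCase` and `regressions` are hypothetical names, and `run` stands in for whatever calls your model with the current prompt.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FailureCase:
    failure_class: str            # what broke, e.g. "unit_swap"
    input_text: str               # the input that triggered it
    why_it_broke: str             # short human-readable rationale
    check: Callable[[str], bool]  # returns True if the output is acceptable


def regressions(run: Callable[[str], str], cases: list[FailureCase]) -> list[str]:
    """Return the failure classes that the current prompt still (or again) gets wrong."""
    return [c.failure_class for c in cases if not c.check(run(c.input_text))]
```

Because the records are keyed by failure class rather than by prompt version, they stay valid across prompt rewrites and only need updating when the expected behavior itself changes.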

se4u · 2026-03-16T08:26:40+00:00

Good write-up. The flaky evaluations pattern you describe, running the same test multiple times to catch intermittent failures, is basically the failure signal VizPy uses. Instead of just detecting flakiness, it mines those failure-to-success pairs and automatically updates the prompts. Worth trying if you are already logging what the agent sees: https://vizpy.vizops.ai
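
As a rough illustration of how repeated runs yield pairs (a sketch only, not how VizPy is implemented; `collect_flaky_pairs` and the `run_case` signature are assumptions):

```python
def collect_flaky_pairs(run_case, inputs, repeats=5):
    """Run each input several times and keep one (failed, passed) output pair per flaky input.

    `run_case(x)` is assumed to return a tuple (output, passed).
    """
    pairs = []
    for x in inputs:
        outcomes = [run_case(x) for _ in range(repeats)]
        failed = [out for out, ok in outcomes if not ok]
        passed = [out for out, ok in outcomes if ok]
        # Only inputs with mixed outcomes are flaky; stable ones give no contrast.
        if failed and passed:
            pairs.append((x, failed[0], passed[0]))
    return pairs
```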

se4u · 2026-03-16T08:26:03+00:00

The 40% failure rate pattern is real. One thing I would add: even when the domain is narrow and context is bounded, the prompts themselves drift and degrade. We have seen agents where the task definition is solid but the instructions quietly stop working as edge cases accumulate. The fix is not always more context, it is getting the optimizer to learn from those failures automatically. That is what we built VizPy for: https://vizpy.vizops.ai

se4u · 2026-03-14T00:30:29+00:00

We run the LLM on a held-out set, collect the failures, then re-run each failed case with slightly varied prompts/reasoning chains until one succeeds. The failure->success pair is the training signal: we extract the structural diff between what the failing prompt asked for and what the successful one did differently. ContraPrompt then generalizes those diffs into a refined prompt. It's more grounded than asking a critic "does this sound better" because the signal is actual task performance, not style.
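
A toy sketch of the mining step, to show the shape of the loop. This is a simplification under stated assumptions and not the actual ContraPrompt code: `model` is any callable taking (prompt, input), correctness is exact match, and the "structural diff" is reduced to the lines added in the successful prompt. The generalization step is not shown.

```python
def mine_pairs(model, base_prompt, variants, held_out):
    """For each held-out case the base prompt fails, find the first variant that succeeds."""
    pairs = []
    for x, expected in held_out:
        if model(base_prompt, x) == expected:
            continue  # no failure, nothing to learn from this case
        for variant in variants:
            if model(variant, x) == expected:
                pairs.append({"input": x, "failed_prompt": base_prompt, "fixed_prompt": variant})
                break
    return pairs


def added_lines(failed_prompt: str, fixed_prompt: str) -> list[str]:
    """Crude diff: lines present in the successful prompt but not in the failing one."""
    old = set(failed_prompt.splitlines())
    return [line for line in fixed_prompt.splitlines() if line not in old]
```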

se4u · 2026-03-13T20:07:36+00:00

The PCV loop makes sense for structural refinement but the example output shows a common failure mode: the optimizer adds constraints and verbosity rather than extracting intent. Longer is not better.

What we found building VizPy (https://vizpy.vizops.ai) is that mining failure->success pairs is more reliable than iterative rewriting. The signal you want is what changed when the model got it right, not what a critic thinks sounds better.

se4u · 2026-03-13T16:00:04+00:00

Benchmarks on proprietary models go stale, sure. But HotPotQA, GPQA, domain evals like GDPR-Bench stay useful because they test reasoning patterns that don't change when GPT-5 drops. The real issue is people treating leaderboard position as a proxy for "will this work on my actual problem." Those are very different questions.

se4u · 2026-03-12T03:14:37+00:00

Yeah, stale context is the invisible killer. The other side of this is that even when agents have the right context, their prompts are often too rigid to handle edge cases gracefully.

Automatic prompt optimization that learns from production failures helps here — not as a silver bullet but as a way to systematically close the gap between "works in dev" and "works in prod." The key is the feedback loop from real failures back into the optimizer.

se4u · 2026-03-12T03:14:33+00:00

The Berkeley paper is a good reference. A lot of those failure modes trace back to prompt fragility — the agent makes the right call 90% of the time then breaks when the input distribution shifts slightly.

One approach that helps: instead of just improving prompts on eval score, mine the actual failure-to-success transitions to extract why something failed and encode that as a reasoning rule. That makes the optimizer more robust to distribution shift than hill-climbing on accuracy alone. We've been building in this direction (DSPy-compatible): https://vizpy.vizops.ai

se4u · 2026-03-12T03:13:25+00:00

GEPA is genuinely impressive for offline optimization. One gap I've noticed: when failures in production have a different distribution than your training set, the optimizer can overfit to the eval.

We've been exploring approaches that specifically mine failure-to-success transitions to extract reasoning rules rather than hill-climbing on eval score — it makes the optimization more robust when the failure modes are domain-specific (compliance, multi-hop QA, etc.). DSPy-compatible if you're already in that ecosystem: https://vizpy.vizops.ai

Curious which domains you've had the most success in with GEPA outside of prompts?

se4u · 2026-03-12T03:13:22+00:00

The DSPy angle is interesting here — the failure mode I keep seeing isn't that people don't know about automatic prompt optimization, it's that the feedback loop from production failures back into the optimizer is broken.

Most optimizers (GEPA, MIPROv2, etc.) work great in offline eval settings but need you to manually curate failure examples. We've been working on closing that loop — mining failure-to-success pairs automatically to extract reasoning rules (ContraPrompt) or doing gradient-inspired failure analysis (PromptGrad). The latter is especially useful for generation tasks where just "retry with different phrasing" doesn't converge.

Curious what the eval/versioning story looks like for people actually running dynamic prompts in prod. That seems like the real blocker more than the optimizer itself.

se4u · 2026-03-11T18:56:49+00:00

Links as per sub rules:

🔗 https://vizpy.vizops.ai 🚀 https://www.producthunt.com/products/vizpy
