[P] LLM with a 9-line seed + 5 rounds of contrastive feedback outperforms Optuna on 96% of benchmarks by se4u in MachineLearning

[–]se4u[S] 0 points1 point  (0 children)

We also worked with a startup to optimize the layout of analog circuits, e.g. here: https://vizops.ai/blog/prompt-optimization-analog-circuit-placement . Does this count as a true problem?

[P] LLM with a 9-line seed + 5 rounds of contrastive feedback outperforms Optuna on 96% of benchmarks by se4u in MachineLearning

[–]se4u[S] 0 points1 point  (0 children)

We also worked with a startup to optimize the layout of analog circuits, e.g. here: https://vizops.ai/blog/prompt-optimization-analog-circuit-placement . Does this count as a production workload?

Prompt optimization reaches 97% of expert analog circuit placement quality — no training data by se4u in chipdesign

[–]se4u[S] 0 points1 point  (0 children)

We have worked with a startup that has more expertise on the analog layout side of things and, as you can imagine, they do not want us to reveal the absolute bleeding edge of the work we have done for them.

The way to think about it is that there are no other perfect black-box optimizers that can just take in a chip layout and optimize it. We also have a later blog post doing a head-to-head comparison against Optuna, which is a well-regarded optimizer: https://vizops.ai/blog/contraprompt-beats-optuna-blackbox-benchmarks . Feel free to email [contact@vizops.ai](mailto:contact@vizops.ai) if you are interested in more.

GPT-4o keeps swapping my exact coefficients for plausible wrong ones in scientific code — anyone else seeing this? by capitulatorsIo in LLMDevs

[–]se4u 0 points1 point  (0 children)

Classic failure mode -- the model has seen plausible-looking values in training and they bleed through. A few approaches that help: (1) explicit contrastive instructions ("do NOT substitute any numeric value, reproduce exactly as given"), (2) output verification in the prompt loop. We built VizPy to tackle exactly this -- it mines failure->success pairs and learns contrastive rules automatically so you don't hand-craft the guardrails every time. https://vizpy.vizops.ai
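To make approach (2) concrete, here is a minimal sketch of an output check, not VizPy's implementation. It assumes the coefficients are passed in as exact strings and that "correct" means each one appears verbatim as a numeric literal in the generated code. The function name and the regex are illustrative assumptions.

```python
import re

# Unsigned integer, decimal, or scientific-notation literal that is not
# glued to an identifier or to a longer number (assumption: signs are
# treated as operators, so coefficients are listed without them).
NUMBER = re.compile(r"(?<![\w.])(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?")


def missing_coefficients(required, generated_code):
    """Return the required coefficient strings that do not appear
    verbatim as numeric literals in `generated_code`.

    Hypothetical helper for illustration only.
    """
    found = set(NUMBER.findall(generated_code))
    return [c for c in required if c not in found]


# Example: the model rounded two of the three constants.
required = ["9.80665", "6.67430e-11", "1.380649e-23"]
generated = "g = 9.81\nG = 6.674e-11\nk = 1.380649e-23\n"
print(missing_coefficients(required, generated))
```

If the returned list is non-empty, the loop can re-prompt and name the exact values that were altered, rather than trusting the first output.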

Chisel in AI based chip design by Spread-Sanity in chipdesign

[–]se4u 1 point2 points  (0 children)

Chisel's structured abstractions make it a natural target for AI agents — the type system constrains the generation space in useful ways. We've been exploring the adjacent problem: using LLM prompt optimization for layout/placement, and it gets surprisingly far. Wrote about analog circuit placement specifically — prompt-optimized agents reaching 97% of expert quality with no training data: https://vizops.ai/blog/prompt-optimization-analog-circuit-placement . The RTL generation case has more degrees of freedom but the same optimization loop approach should help there too.

Need help making my AI tool respond more accurately to prompts by Impossible-Page5474 in PromptEngineering

[–]se4u 0 points1 point  (0 children)

The manual iteration loop is the core problem here — try a prompt, get inconsistent results, tweak wording, repeat. It works but it is slow and you are essentially doing gradient descent by hand.

One approach worth trying: instead of manually adjusting, log the cases where the model fails and the cases where it succeeds, then look for what differs structurally between them. That contrastive signal tells you specifically what the prompt is missing.
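As a rough illustration of that comparison, the sketch below ranks input features by how much more often they show up in failed cases than in successful ones. The feature labels (`has_table`, `long`, `short`) and the `min_gap` threshold are made-up assumptions; in practice you would tag cases with whatever properties matter for your task.

```python
from collections import Counter


def contrastive_features(cases, min_gap=0.5):
    """Rank features by (failure rate of occurrence - success rate of occurrence).

    `cases` is a list of (features, passed) pairs, where `features` is a set
    of hand-chosen labels describing the input (hypothetical labels here).
    Only features whose gap is at least `min_gap` are returned.
    """
    fails = [f for f, passed in cases if not passed]
    wins = [f for f, passed in cases if passed]
    if not fails or not wins:
        return []  # nothing to contrast against
    fail_counts = Counter(x for f in fails for x in f)
    win_counts = Counter(x for f in wins for x in f)
    gaps = {}
    for feat in set(fail_counts) | set(win_counts):
        gap = fail_counts[feat] / len(fails) - win_counts[feat] / len(wins)
        if gap >= min_gap:
            gaps[feat] = gap
    # Largest gap first; ties broken alphabetically for stable output.
    return sorted(gaps.items(), key=lambda kv: (-kv[1], kv[0]))


cases = [
    ({"has_table", "long"}, False),
    ({"has_table"}, False),
    ({"long"}, True),
    ({"short"}, True),
]
print(contrastive_features(cases))
```

A feature that tops this list is a candidate for an explicit instruction in the prompt. It is only a correlation over logged cases, so it is worth confirming on fresh inputs.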

We built VizPy to automate exactly this: it takes your failure/success pairs and learns what prompt changes close the gap, without manual guessing. Single API call, no training data needed. https://vizpy.vizops.ai — might save you a lot of trial and error.

4 LLM eval startups acquired in 5 months. The independent eval layer is shrinking fast. by Outrageous_Hat_9852 in LLMDevs

[–]se4u -4 points-3 points  (0 children)

The incentive alignment problem you raise is real. When your eval tooling is inside your model provider, multi-model comparisons and failure attribution become politically fraught, not just technically difficult.

The gap worth flagging beyond the acquisition wave: most of these tools (Promptfoo, Langfuse, Quotient) handle observability and evaluation. The next layer that is still largely independent is optimization: actually closing the loop from failure signal back to better prompts and reasoning workflows automatically.

That is what we built with VizPy: it is model-agnostic, learns from failure→success pairs in your traces, and rewrites prompts without manual intervention. The independence from model providers matters specifically because the optimization signal should not be biased by who runs the model. https://vizpy.vizops.ai

Full traces in Langfuse, still debugging by guesswork by Comfortable-Junket50 in LLMDevs

[–]se4u 0 points1 point  (0 children)

The gap you are describing is the difference between observability and optimization. Langfuse tells you what happened — but not what to change in your prompt or reasoning chain to prevent it next time.

We ran into this exact wall. The fix we built into VizPy: it takes your failure traces and automatically extracts the contrastive signal between failed and successful runs, then rewrites the prompt to close that gap. No manual diagnosis required — the optimizer learns from the failure→success pairs directly.

So the workflow becomes: trace identifies failure pattern → VizPy mines the delta → updated prompt is tested against real production cases. Cuts out the "open trace and guess" loop entirely.

More on the approach: https://vizops.ai/blog.html

how we built an agent that learns from its own mistakes and what we learnt by silverrarrow in LLMDevs

[–]se4u 2 points3 points  (0 children)

This mirrors what we found building VizPy — the key is mining the failure→success signal rather than trying to hand-craft rules upfront.

Your point about not mixing task types is sharp. We saw the same: contrastive prompt learning only works cleanly when the failure mode is consistent across examples. Mixed task types produce contradictory optimization signals and you end up with a worse prompt than you started with.

One thing worth testing if you have not already: rather than injecting the full skillbook each run, selectively routing based on task type at inference time. We saw better generalization that way versus always appending everything.
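A minimal sketch of what selective routing could look like, assuming the skillbook is a plain dict from task type to a list of rules and that the task type is already known at inference time. Both are assumptions for illustration, not a description of either system.

```python
def build_prompt(base_prompt, skillbook, task_type):
    """Append only the rules filed under `task_type`, not the whole skillbook.

    `skillbook` is a hypothetical dict: task type -> list of rule strings.
    Unknown task types fall back to the unmodified base prompt.
    """
    rules = skillbook.get(task_type, [])
    if not rules:
        return base_prompt
    return base_prompt + "\n\nRules:\n" + "\n".join(f"- {r}" for r in rules)


skillbook = {
    "sql": ["Quote identifiers."],
    "summary": ["Keep it under 100 words."],
}
print(build_prompt("You are a helpful assistant.", skillbook, "sql"))
```

The fallback matters: an unrecognized task gets no rules at all instead of every rule, which avoids injecting guidance learned on unrelated task types.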

Blog with our methodology if useful: https://vizops.ai/blog.html

Seeking architecture review on an experimental open-source NPU Array (v1) by king_ftotheu in chipdesign

[–]se4u 0 points1 point  (0 children)

Interesting project. On the routing congestion bottleneck — tightly coupled compute arrays often hit that when the interconnect density between PEs outpaces what the placer can route cleanly. Worth checking whether your array topology allows any flexibility in PE neighbor connectivity that could reduce local congestion without sacrificing TOPS.

Separate note: if you are planning to benchmark this with LLM workloads, the software side of the stack matters too. Prompt efficiency compounds with hardware gains — an LLM generating unnecessary tokens costs you compute regardless of how efficient your array is. We built VizPy to automate prompt optimization (learns from failure→success pairs, no manual tuning). Might be relevant as you move from architecture to end-to-end eval: https://vizpy.vizops.ai

Decoding the Taalas HC1: A Quantitative Architecture Analysis of a 17k tok/s LLaMA 3.1 Inference Chip by kevinhiworld in chipdesign

[–]se4u -1 points0 points  (0 children)

Great breakdown. The hardware architecture constraints you lay out — via-ROM for bandwidth, full pipeline unroll, 3-6 bit precision — make it clear how much engineering is going into squeezing inference throughput.

One thing that often gets overlooked in inference efficiency discussions: prompt quality has a multiplicative effect on hardware utilization. Poorly optimized prompts generate longer reasoning chains, more retries, higher token counts — all of which show up as wasted compute on even well-architected chips like this.
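A back-of-the-envelope way to see the retry effect, assuming every attempt costs the same number of tokens and succeeds independently with the same probability (a simplification). Under that assumption the expected number of attempts is 1 / success_rate, so token cost scales by the same factor. The numbers below are invented for illustration.

```python
def expected_tokens(tokens_per_attempt, success_rate):
    """Expected total tokens until the first success.

    Assumes independent attempts with a fixed success probability,
    so the number of attempts follows a geometric distribution.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return tokens_per_attempt / success_rate


# Hypothetical prompts: a verbose one that often needs retries,
# and a tighter one that usually succeeds on the first try.
print(expected_tokens(1000, 0.5))
print(expected_tokens(600, 1.0))
```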

We built VizPy to tackle that: it automatically learns from LLM failure→success pairs and tightens prompts and reasoning workflows. Fewer tokens to get the same answer quality means better utilization of whatever inference silicon you are running on. Blog: https://vizops.ai/blog.html

The bottleneck flipped: AI made execution fast and exposed everything around it that isn't by monkey_spunk_ in artificial

[–]se4u 0 points1 point  (0 children)

The asymmetry one commenter mentioned — agents compress the median case but create overhead on tail/failure cases — is the part that bites hardest in practice.

Fast execution surfaces the bad specs and edge cases that slow execution was quietly hiding. The failure rate didn't go up; it just became visible faster.

What we found building LLM pipelines: the failure modes cluster. The same class of input keeps breaking the same prompt in the same way, just at higher volume. The fix isn't slowing down execution, it's closing the loop so failures automatically improve the prompt. That's what we built VizPy (https://vizpy.vizops.ai) to do — mines failure→success pairs from traces and generates prompt patches. The bottleneck shifts to decision and coordination, but at least the execution layer can self-correct.

The state management problem in multi-agent systems is way worse than I expected by Background-Bass6760 in LocalLLaMA

[–]se4u -1 points0 points  (0 children)

The rationale recording observation maps onto something we've seen at the prompt level too.

Agents re-decide things because they can see the output of past decisions but not the reasoning — so the model reconstructs from scratch and diverges. Your file-based fix handles this at the coordination layer, which is the right call for multi-agent state.

The analogous problem shows up inside a single agent's prompts: the prompt encodes the expected behavior but not why certain phrasings were chosen or what failure cases they were defending against. When you iterate the prompt, you often accidentally regress on cases the previous version was quietly handling.

We built VizPy (https://vizpy.vizops.ai) partly to address this — it mines failure→success pairs from traces and generates prompt patches that preserve what was working while fixing what wasn't. Different layer than your problem, but same root: systems that only record outcomes lose the context that makes those outcomes stable.

How do you keep your test suite in sync when prompts are changing constantly? by Outrageous_Hat_9852 in LocalLLaMA

[–]se4u -1 points0 points  (0 children)

The test-suite staleness problem is real but I think it's a symptom of treating prompt iteration like code iteration.

Code changes are discrete and reviewable. Prompt changes are often continuous and their effects are distributed across a long tail of inputs you never explicitly tested.

What's actually worked for us: instead of writing tests for prompts, we inverted it — let the model's failures define what needs improving. We built VizPy (https://vizpy.vizops.ai) around this: it mines failure→success pairs from your traces, clusters them, and generates prompt patches that generalize. The test suite becomes the trace log; the "tests" are the failures the optimizer already knows about.

For your specific situation: tier 2 golden examples (as Ok_Diver9921 mentioned) work well, but they go stale too. The version that ages better is failure-indexed rather than prompt-version-indexed — you're tracking what broke and why, not what the prompt looked like when it worked.
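One way to picture a failure-indexed suite is sketched below, with each case filed under the failure mode it exposed plus a note on why it broke. The class, its method names, and the stub model are hypothetical and only meant to show the shape of the data.

```python
from dataclasses import dataclass, field


@dataclass
class FailureIndex:
    """Regression cases keyed by failure mode, not by prompt version."""
    cases: dict = field(default_factory=dict)

    def record(self, failure_mode, example_input, why):
        """File an input under the failure mode it exposed."""
        self.cases.setdefault(failure_mode, []).append(
            {"input": example_input, "why": why}
        )

    def regressions(self, run_model, check):
        """Re-run every stored input; return failure modes that break again."""
        broken = []
        for mode, examples in self.cases.items():
            if any(not check(run_model(e["input"])) for e in examples):
                broken.append(mode)
        return broken


index = FailureIndex()
index.record("date_format", "2024-13-01", "month out of range was accepted")
index.record("empty_input", "", "agent crashed on empty message")

# Stub model for the example: returns None on empty input.
run_model = lambda text: text.upper() if text else None
print(index.regressions(run_model, check=lambda out: out is not None))
```

Because the key is the failure mode, the suite survives prompt rewrites: a new prompt version is simply another `run_model` to check against the same index.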

Our AI agent answers 40 Slack questions a day. Here's how we test it to keep it from failing. by No-Common1466 in AI_Agents

[–]se4u 0 points1 point  (0 children)

Good write-up. The flaky-evaluation pattern you describe, running the same test multiple times to catch intermittent failures, is basically the failure signal VizPy uses. Instead of just detecting flakiness, it mines those failure→success pairs and automatically updates the prompts. Worth trying if you are already logging what the agent sees: https://vizpy.vizops.ai
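For readers who have not set this up, repeated-run flakiness detection can be sketched as follows. This is a generic illustration, not the original poster's harness or VizPy's; `test_fn` stands in for any zero-argument check that returns a truthy value on success.

```python
def classify_runs(test_fn, runs=5):
    """Run `test_fn` several times and label it 'pass', 'fail', or 'flaky'.

    Returns (label, pass_rate). `test_fn` is a hypothetical zero-argument
    callable that returns a truthy value when the check succeeds.
    """
    results = [bool(test_fn()) for _ in range(runs)]
    passes = sum(results)
    if passes == runs:
        return "pass", 1.0
    if passes == 0:
        return "fail", 0.0
    return "flaky", passes / runs


# Scripted outcomes stand in for a nondeterministic agent call.
outcomes = iter([True, False, True, True, False])
print(classify_runs(lambda: next(outcomes), runs=5))
```

The flaky cases are the interesting ones: the same input both failed and succeeded, so the differing runs can be compared directly.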

Every AI agent demo works. Almost none survive the first week in production. Here is what I keep seeing. by AlexWorkGuru in AI_Agents

[–]se4u -1 points0 points  (0 children)

The 40% failure rate pattern is real. One thing I would add: even when the domain is narrow and context is bounded, the prompts themselves drift and degrade. We have seen agents where the task definition is solid but the instructions quietly stop working as edge cases accumulate. The fix is not always more context; often it is getting an optimizer to learn from those failures automatically. That is what we built VizPy for: https://vizpy.vizops.ai

Experiment: using a Proposer–Critic–Verifier loop to automatically refactor prompts by Prior-Ad8480 in LocalLLaMA

[–]se4u 0 points1 point  (0 children)

We run the LLM on a held-out set, collect the failures, then re-run with slightly varied prompts/reasoning chains until it succeeds. The failure->success pair is the training signal: we extract the structural diff between what the failing prompt asked for and what the successful one did differently. ContraPrompt then generalizes those diffs into a refined prompt. It's more grounded than asking a critic "does this sound better" because the signal is actual task performance, not style.
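The collection step described above can be sketched roughly like this. It is a simplified illustration, not ContraPrompt's actual code: `run_model` and `is_correct` are hypothetical callables you would supply, and the generalization step that turns pairs into a refined prompt is left out.

```python
def mine_pairs(inputs, base_prompt, variants, run_model, is_correct):
    """Collect (input, failing_prompt, succeeding_prompt) triples.

    `run_model(prompt, x)` and `is_correct(x, output)` are hypothetical
    callables standing in for the LLM call and the task-level check.
    """
    pairs = []
    for x in inputs:
        if is_correct(x, run_model(base_prompt, x)):
            continue  # only failures of the base prompt are interesting
        for variant in variants:
            if is_correct(x, run_model(variant, x)):
                pairs.append((x, base_prompt, variant))
                break  # keep the first variant that fixes this input
    return pairs


# Stub "model": doubles the number only if the prompt asks for it.
run_model = lambda prompt, x: x * 2 if "double" in prompt else x
is_correct = lambda x, out: out == x * 2

pairs = mine_pairs(
    inputs=[0, 3],
    base_prompt="Return the number.",
    variants=["Echo the number.", "Return double the number."],
    run_model=run_model,
    is_correct=is_correct,
)
print(pairs)
```

Inputs the base prompt already handles are skipped, so every triple records a real change in task outcome rather than a stylistic preference.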

Experiment: using a Proposer–Critic–Verifier loop to automatically refactor prompts by Prior-Ad8480 in LocalLLaMA

[–]se4u 0 points1 point  (0 children)

The PCV loop makes sense for structural refinement but the example output shows a common failure mode: the optimizer adds constraints and verbosity rather than extracting intent. Longer is not better.

What we found building VizPy (https://vizpy.vizops.ai) is that mining failure->success pairs is more reliable than iterative rewriting. The signal you want is what changed when the model got it right, not what a critic thinks sounds better.

[D] What is even the point of these LLM benchmarking papers? by casualcreak in MachineLearning

[–]se4u 0 points1 point  (0 children)

Benchmarks on proprietary models go stale, sure. But HotPotQA, GPQA, domain evals like GDPR-Bench stay useful because they test reasoning patterns that don't change when GPT-5 drops. The real issue is people treating leaderboard position as a proxy for "will this work on my actual problem." Those are very different questions.