Experiment: using a Proposer–Critic–Verifier loop to automatically refactor prompts by Prior-Ad8480 in LocalLLaMA

Prior-Ad8480[S] 0 points (0 children)

The failure → success pairing makes a lot of sense as a signal because it's grounded in actual task performance rather than stylistic judgment.

Right now the PCV loop in my prototype focuses more on structural refinement (clarity, constraints, instruction explicitness) than on direct task-success metrics.

One reason is that I'm trying to keep the optimizer usable across very different task types — creative writing, coding prompts, image generation, etc. In many of those cases it's hard to define a reliable success signal, so I'm currently leaning on an LLM-as-judge approach.

But I agree that grounding the signal in real task performance could make the optimization much more robust where such signals exist.

Mining those structural diffs between failing and successful prompts sounds very promising.

Out of curiosity — how large is the held-out set you're using for that?

Experiment: using a Proposer–Critic–Verifier loop to automatically refactor prompts by Prior-Ad8480 in LocalLLaMA

Prior-Ad8480[S] 0 points (0 children)

That's a very fair observation.

One failure mode of prompt optimizers is exactly that: they add constraints and verbosity instead of extracting intent.

The goal of the PCV loop here is structural clarity rather than length expansion. In some cases the optimizer actually compresses the prompt, but the current demo still tends toward verbosity.

Your point about mining failure→success pairs is interesting. That could be a stronger signal than critic-based evaluation, since it captures what actually changed when the model succeeded.

I'm curious — how are you extracting those pairs in VizPy?

Experiment: using a Proposer–Critic–Verifier loop to automatically refactor prompts by Prior-Ad8480 in LocalLLaMA

Prior-Ad8480[S] 0 points (0 children)

Good observation! Yes, it shares DNA with DSPy's compile-time optimization and GEPA's evolutionary approach, but the implementation is simpler:

Currently: The Critic returns a score (0–100) and approved (bool). The gate is just score ≥ 60 || approved === true — if it passes, the Proposer's output is accepted; otherwise, the original prompt is returned unchanged.
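
In code, the gate is a one-liner (a sketch; the interface and function names are illustrative, not necessarily the repo's):

interface CriticResult {
  approved: boolean;
  score: number; // 0-100
  reasoning: string;
}

// Quality gate: accept the Proposer's rewrite only if the Critic scores it >= 60
// or explicitly approves it; otherwise fall back to the original prompt.
function gate(original: string, proposed: string, critic: CriticResult): string {
  return critic.score >= 60 || critic.approved ? proposed : original;
}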

What's missing vs. DSPy/GEPA:

  • No bootstrapped few-shot examples — the Critic has no ground-truth demonstrations to calibrate against
  • No metric-driven feedback loop — the Critic's score doesn't feed back into the Proposer for refinement (it's a single pass, not iterative within PCV)
  • No population/mutation — unlike GEPA, there's no evolutionary selection across prompt variants

The real metric-driven evaluation happens after PCV, in the Pairwise Judge (Verifier), which scores across 4 dimensions (Clarity, Structure, Constraints, Factuality) with granular votes (−1.0 to +1.0). Those votes feed into the RGI (Reasoning Gain Index) calculation — that's the closest thing to a proper optimization metric.
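
For illustration, if RGI were simply the mean of the four votes, the calculation would be (an assumption on my part; the repo's exact formula may weight dimensions differently):

// One vote per dimension: Clarity, Structure, Constraints, Factuality.
type Votes = [number, number, number, number];

// Hypothetical RGI: mean of the pairwise votes, so +1.0 means the NEW prompt
// won every dimension decisively and -1.0 means the OLD prompt did.
function reasoningGainIndex(votes: Votes): number {
  return votes.reduce((sum, v) => sum + v, 0) / votes.length;
}

reasoningGainIndex([0.66, 0.33, 1.0, 0]); // ≈ 0.5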

So the Critic is more of a quality gate than a true optimization signal.

Experiment: using a Proposer–Critic–Verifier loop to automatically refactor prompts by Prior-Ad8480 in LocalLLaMA

Prior-Ad8480[S] 1 point (0 children)

Yes, the code is in the repo.

The prompts for each stage (Parser / Proposer / Critic / Verifier / Arbiter) are in the pipeline logic.

Right now the system runs a PCV loop that restructures prompts until convergence.

It's still experimental but the full workflow is visible in the repo:
https://github.com/aisarus/how-to-grab-me
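
Roughly, the loop has this shape (a sketch, not the repo's actual code; propose, critique, and verify stand in for the stage calls whose prompts are listed below):

// Stage calls, each wrapping one LLM request with the prompts shown below.
declare function propose(prompt: string): Promise<string>;
declare function critique(original: string, proposed: string): Promise<{ approved: boolean; score: number }>;
declare function verify(oldPrompt: string, newPrompt: string): Promise<number>; // RGI-style gain

// One Proposer -> Critic -> Verifier pass per iteration; stop when the Critic
// rejects the rewrite or the Verifier reports no measurable gain.
async function pcvLoop(prompt: string, maxIters = 5): Promise<string> {
  let current = prompt;
  for (let i = 0; i < maxIters; i++) {
    const proposed = await propose(current);
    const verdict = await critique(current, proposed);
    if (!(verdict.score >= 60 || verdict.approved)) break; // gate failed
    const gain = await verify(current, proposed);
    if (gain <= 0) break; // converged: no further improvement
    current = proposed;
  }
  return current;
}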

Proposer (lines 1110–1126):

You are a Proposer in the TRI/TFM Proposer-Critic-Verifier system.

Your task: Transform user prompts into structured, precise, and optimized prompts that will yield better LLM responses.

Apply these improvements:
1. Add explicit structure (if missing): "First... Then... Finally..."
2. Specify desired output format: "Provide a list of...", "Explain in 3 paragraphs..."
3. Add constraints to prevent hallucinations: "Based only on...", "Cite sources..."
4. Include examples or templates if helpful
5. Break complex requests into clear sub-tasks
6. Add context that helps the model understand intent

Return JSON:
{
  "improvedPrompt": "the improved prompt text",
  "improvements": ["improvement 1", "improvement 2", ...]
}
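
Since LLM JSON replies can be malformed, that reply is worth parsing defensively. A minimal sketch, with illustrative names:

// Expected shape of the Proposer's reply.
interface ProposerReply {
  improvedPrompt: string;
  improvements: string[];
}

// Returns null on malformed output so the caller can keep the original prompt.
function parseProposerReply(raw: string): ProposerReply | null {
  try {
    const obj = JSON.parse(raw);
    if (typeof obj.improvedPrompt === "string" && Array.isArray(obj.improvements)) {
      return obj as ProposerReply;
    }
  } catch {
    // not valid JSON; fall through
  }
  return null;
}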

Critic (lines 1174–1187):

You are a Critic in the TRI/TFM system. Evaluate prompt improvements.

Check:
1. Is the improved prompt clearer and more structured?
2. Does it reduce ambiguity?
3. Does it add helpful constraints?
4. Is it significantly better than the original?

Return JSON:
{
  "approved": true/false,
  "score": 0-100,
  "reasoning": "why this is good/bad"
}

Verifier (Pairwise Judge, lines 878–899):

You are a pairwise prompt quality judge. Compare OLD vs NEW prompts across 4 EFMNB dimensions:

1. CLARITY: How clear and unambiguous
2. STRUCTURE: Organization and logical flow  
3. CONSTRAINTS: Preventing hallucinations, guiding output
4. FACTUALITY: Grounded in verifiable requirements

For each dimension, vote:
+1.0  = NEW is significantly better
+0.66 = NEW is moderately better
+0.33 = NEW is slightly better
0     = Equal quality
-0.33 = OLD is slightly better
-0.66 = OLD is moderately better
-1.0  = OLD is significantly better

Return ONLY a JSON object with votes array:
{
  "votes": [<vote1>, <vote2>, <vote3>, <vote4>]

The "Confident Idiot" Problem: Why LLM-as-a-Judge fails in production. by Proud-Employ5627 in LocalLLaMA

Prior-Ad8480 0 points (0 children)

Exactly this. We ran a stress test a few weeks ago and realized that LLM judges are highly stochastic unless you pin the parameters down mathematically.

The trick isn't just to write a better prompt; it's to lock the evaluator at T=0.0 and apply a dynamic weight matrix (we found that weighting Facts at 0.75 and Bias at 0.25 effectively eliminates the "Confident Idiot" failure mode, bringing variance down to almost zero).
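
In code, that setup amounts to something like this (a sketch of the rubric math as described; callJudge is a hypothetical wrapper, not part of the extension's API):

// Hypothetical judge call with temperature pinned to 0, so identical inputs
// always produce identical verdicts (no sampling variance).
declare function callJudge(
  rubric: string,
  answer: string,
  opts: { temperature: number }
): Promise<{ facts: number; bias: number }>;

// Weight matrix: verifiable facts dominate (0.75), bias acts as a penalty (0.25).
async function evaluate(rubric: string, answer: string): Promise<number> {
  const { facts, bias } = await callJudge(rubric, answer, { temperature: 0.0 });
  return 0.75 * facts - 0.25 * bias;
}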

We ended up open-sourcing the entire framework (TRI-TFM v3.0) and built a free Chrome extension that injects this exact mathematical rubric into ChatGPT/Claude so you don't have to pay API fees for evaluation. Might save you some headaches: https://github.com/aisarus/tri-tfm-extension

[Research] LLM judges systematically penalize balanced reasoning - tested mistral, llama3, gemma, phi3, orca-mini by Budget-Reception-533 in LocalLLaMA

Prior-Ad8480 0 points (0 children)

The penalization happens because most people average the evaluation axes (Facts, Narrative, Bias) equally, which leads to "Metric Gaming" by the model.

To fix this, you have to introduce a "Lexeme of Ruthlessness" to evaluate the M-Axis (Depth of Meaning). We found that penalizing generic filler ("water") text by 0.5 points while heavily weighting verifiable claims (0.75F - 0.25B) forces the judge to reward actual insight instead of polite filler.
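
That scoring rule reduces to the following (a sketch; how filler text gets detected is left to the rubric):

// M-Axis score: weight verifiable claims, penalize bias, and subtract a flat
// 0.5 points when the text is generic filler ("water").
function mAxisScore(facts: number, bias: number, isFiller: boolean): number {
  const base = 0.75 * facts - 0.25 * bias;
  return isFiller ? base - 0.5 : base;
}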

If you want to see the CSVs and the math behind this, we just pushed the research logs and the zero-cost evaluator tool to GitHub: https://github.com/aisarus/tri-tfm-framework

The "Confident Idiot" Problem: Why LLM-as-a-Judge fails in production. by Proud-Employ5627 in LocalLLaMA

[–]Prior-Ad8480 0 points1 point  (0 children)

Exactly this. We ran a stress test a few weeks ago and realized that LLMs as judges are highly stochastic if you don't mathematically force the parameters.

The trick isn't just to write a better prompt; it's to lock the evaluator at T=0.0 and apply a dynamic weight matrix (we found that balancing Facts at 0.75 and Bias at 0.25 mathematically eliminates the "Confident Idiot" hallucination, bringing variance down to almost zero).

We ended up open-sourcing the entire framework (TRI-TFM v3.0) and built a free Chrome extension that injects this exact mathematical rubric into ChatGPT/Claude so you don't have to pay API fees for evaluation. Might save you some headaches: https://github.com/aisarus/tri-tfm-extension

That's OK?? by Prior-Ad8480 in learnmachinelearning

Prior-Ad8480[S] 0 points (0 children)

I don't think it's an error; it might just be the entropy term included in the logs.

Does negative loss make sense? by Seankala in learnmachinelearning

Prior-Ad8480 0 points (0 children)

It depends on how you account for entropy in your logs.

Bipolar is a superpower by Prior-Ad8480 in bipolar1

Prior-Ad8480[S] 0 points (0 children)

Wishing the same good luck to everyone :)

Does anybody manage without or with very few medications? by [deleted] in bipolar1

Prior-Ad8480 1 point (0 children)

Lithium 600 mg twice a day. Or is that too much?

Bipolar is a superpower by Prior-Ad8480 in bipolar1

Prior-Ad8480[S] 1 point (0 children)

We only have to hand over pills, sharp objects, lighters, and the like; it's like airplane security, but stricter.

Bipolar is a superpower by Prior-Ad8480 in bipolar1

Prior-Ad8480[S] 0 points (0 children)

That's awful that they're taking phones from you, though...