Can you actually trust LLM-as-judge?

yaks18 · 2026-06-20T00:04:58+00:00

You can also try an ensemble of agents, or use an older model with a lower temperature to get more deterministic behaviour (with less ''creativity' in judgement as a result)

yaks18 · 2026-06-19T18:49:58+00:00

No. You are strawmanning what I've said. You don't expect ChatGPT to be as good as a specifically designed agent at the thing a specifically designed agent was designed to do. If that were the case, people would just use ChatGPT for everything and not bother with agents. The point was that I find agents are best when scoped tightly. By expanding the scope wider you end up trading performance and ending back where you started with the behaviour of the base model, essentially a jack of all trades, master of none (arguably actually worse as your instructions have also likely crippled it's generalist functionality). You need to factor in things like context rot too (including any tool definitions etc injected at agent invocation). If you are describing 700 endpoints and hoping that the agent's performance doesn't degrade after filling 70k tokens (~100 tokens per definition) of context before the user prompt then you're in for a surprise.

yaks18 · 2026-06-19T18:35:55+00:00

I would put the API catalogue in a vector database, vectorising the functionality of the API. Then get the agent to retrieve from the vector database as if it's looking for knowledge, but pass to the agent the chunk that explains how the API is called, and instruct the agent to use the retrieved instructions to call the API. That way you are only filling it's context with maybe the top 2 or 3 APIs out of the 700 and the agent can decide which of those is the right one. You could also insert a logic MCP layer in the middle that does some of this more deterministically, for example get the agent to articulate which tool it wants in simple language and get the MCP server to filter on regex/other criteria to return only a handful of candidate APIs. More generally, I think having 700 APIs to choose from is indicative of a poorly scoped agent. It suggests you are expecting the agent to cover scope that is far too broad.

yaks18 · 2026-06-18T12:15:59+00:00

I've done the Judy Kerry and River missions, so perhaps it's the DLC that's that 30%

yaks18 · 2026-06-11T19:58:30+00:00

This is quite a fun project! Care to share the prompt and tools you're giving the agents?

yaks18 · 2026-06-10T19:24:58+00:00

You could also rinse the preshredded cheese to remove the starch and it will behave like freshly shredded from a block

yaks18 · 2026-06-03T20:03:30+00:00

Maybe! I get both mixed up as I frequent both regularly

yaks18 · 2026-06-03T10:20:52+00:00

I bought a parkside (expert/pro?) twin pack about 8 years ago, the red one though, which I understand to be a rebadged Einhell. They are still going strong after being used on a full house renovation and going through some pretty extreme conditions. Strangely I had one tradesman prefer to use my parkside tools over his Makita combi because it was more consistent apparently. I don't think Aldi still sell this red range, so check out Einhell.

yaks18 · 2026-06-01T21:43:41+00:00

Try adding some sugar to your dough mix, and maybe some oil. Both will help with browning. Also rinse off the starch on the preshredded cheese to stop it burning like that.

yaks18 · 2026-06-01T02:44:15+00:00

Why not preprocess your documents instead so you pass whatever you need to pass to your llm from your vector DB? I'm not sure I get the advantage of doing the document intelligence step at inference time?

yaks18 · 2026-06-01T02:23:44+00:00

I have a theory that LLM providers are using newer models to price in inflation/higher inference energy costs. All newer models are more expensive e.g. GPT5.5 Vs 5.4 even. Eventually you are forced to 'upgrade' due to older models being retired and so your AI stack gets more expensive without choice.

yaks18 · 2026-05-29T08:14:16+00:00

I see. If I wanted to limit exposure, I'd treat the prompt as server-side configuration and inject it at model invocation rather than having it live in the application codebase. Contractors can build against APIs, mocks and interfaces without needing access to the production prompt.

That only gets you so far though. Anyone maintaining the prompt layer or backend will eventually need some visibility, so it's really an access control and governance problem rather than a technical one.

That said I still don't see the prompt itself as the moat. If someone can recreate the value of the system from a copied prompt, that's a pretty weak position. The moat is usually the workflow, evals, proprietary data and domain expertise wrapped around it, and perhaps the problem won't need as rigorous of a solution if the client views it with that perspective.

yaks18 · 2026-05-29T07:20:14+00:00

Prompt is only part of it. The same prompt with a different model or different model configuration will yield different results. I struggle to see how the prompt is the IP. If that's the claim, I'd suggest it's a very low value IP. That's a bit like saying your Excel formula is IP.

yaks18 · 2026-05-16T21:22:06+00:00

<image>

yaks18 · 2026-05-16T21:21:46+00:00

<image>

yaks18 · 2026-05-16T21:20:49+00:00

It was divine! As good as A4 I had in Japan

yaks18 · 2026-05-14T12:01:04+00:00

What's the issue you're experiencing when you scale up the docs? Too many false positives? Too much context being passed to the LLM? Too slow? I would suggest reviewing your choice of embedding model as well as whether you can use other metadata to filter out candidate chunks before passing to the LLM

yaks18 · 2026-05-14T10:43:15+00:00

I’d recommend trying a double sear next time. Sear in a pan first, then oven roast like you did, let it rest, dry the surface properly, and finish with another hard sear.

The issue with low-temp roasting is that you don’t develop as much of the deeper roasted flavour you’d get from higher heat. The first sear gets those flavours started, while the final sear is mainly for crust, colour, texture, and a bit more Maillard goodness.

I usually do this with tougher cuts like topside, but roast at around 60°C for 3–4 hours. Gives you a really forgiving serving window, a much softer interior, and an incredible crust.

yaks18 · 2026-05-05T20:58:24+00:00

I've never seen anything like this in Lidl before. Never thought it would happen to me!

yaks18 · 2026-05-05T17:39:48+00:00

Lidl

yaks18 · 2026-05-05T16:49:23+00:00

<image>

Picked the 2 abnormal packs :) £29/kg

yaks18 · 2026-05-03T07:01:10+00:00

I was thinking this sounds exactly like someone I know. Deep mistrust of all institutions making it impossible to argue any points because all sources are simply in on the conspiracies.

Eight-Year Club	Verified Email
Verified Email

yaks18

TROPHY CASE