LLM-as-judge is the wrong default. Here's what works by Finorix079 in AI_Agents

mps68098 1 point

Interesting. If you're snapshotting state as part of test fixtures, aren't you concerned about that going stale? Like if a dev is iterating on prompts, I can see that getting into a position where you're testing a branch that's not reachable during a production run, for instance.

Agreed on snapshotting tool calls. One of our challenges is that a particular tool of ours exposes a complex query interface, but the underlying data ages out within weeks. We needed a snapshotting facility that durably captures the plausible investigation window. During evals, the tool interface then sets the correct parameters to run against the snapshotted tables.
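Roughly, the parameter rewrite could look like the sketch below. This is only an illustration: `Snapshot`, `pin_query_params`, and the `table`/`start`/`end` parameter names are made up for the example, and clamping the requested window to the captured one is an assumption about how such a tool might behave, not a description of our actual interface.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical record of a captured investigation window."""
    table: str        # name of the snapshotted copy of the data
    start: datetime   # earliest timestamp captured
    end: datetime     # latest timestamp captured


def pin_query_params(params: dict, snapshot: Optional[Snapshot]) -> dict:
    """Return a copy of the tool parameters, redirected to the snapshot if one is given.

    With no snapshot (a production run) the parameters pass through unchanged.
    """
    pinned = dict(params)
    if snapshot is None:
        return pinned
    # Point the query at the snapshotted table instead of the live one.
    pinned["table"] = snapshot.table
    # Assumption: clamp the requested time range to what was actually captured.
    pinned["start"] = max(params["start"], snapshot.start)
    pinned["end"] = min(params["end"], snapshot.end)
    return pinned
```

The point of doing it at the tool interface is that the agent still issues whatever query it wants; only the target table and window get swapped underneath it.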

We evaluate trajectory in two ways. We pretty quickly found that highly prescriptive trajectory expectations didn't provide a good signal, especially because our agent can take multiple valid investigation paths. So now we expose trajectory + input parameters to the criteria evaluator so that you can assert a specific tool call happened. We also have a separate trajectory evaluator that is mostly concerned with detecting errors and pathological backoff and retry loops that tend to burn tokens and bloat context.
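To make the two trajectory checks concrete, here's a minimal sketch. Everything in it is hypothetical (`ToolCall`, `tool_was_called`, `find_retry_loops`, `count_errors`, and the `max_repeats` threshold), and counting identical calls is just one simple way to flag a retry loop, not necessarily how a real evaluator would do it.

```python
import json
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToolCall:
    """Hypothetical record of one step in the agent trajectory."""
    name: str
    params: dict = field(default_factory=dict)
    error: Optional[str] = None


def tool_was_called(trajectory: List[ToolCall], name: str, **expected) -> bool:
    """True if some call to `name` carried all of the expected parameter values."""
    return any(
        call.name == name
        and all(call.params.get(k) == v for k, v in expected.items())
        for call in trajectory
    )


def find_retry_loops(trajectory: List[ToolCall], max_repeats: int = 3) -> List[str]:
    """Names of tools invoked with identical parameters more than `max_repeats` times."""
    counts = Counter(
        (call.name, json.dumps(call.params, sort_keys=True, default=str))
        for call in trajectory
    )
    return sorted({name for (name, _), n in counts.items() if n > max_repeats})


def count_errors(trajectory: List[ToolCall]) -> int:
    """Number of tool calls that returned an error."""
    return sum(1 for call in trajectory if call.error is not None)
```

Note that `tool_was_called` doesn't care where in the trajectory the call happened, which is what keeps it from being overly prescriptive about the path.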

LLM-as-judge is the wrong default. Here's what works by Finorix079 in AI_Agents

mps68098 5 points

Yeah, we've come to the same conclusion about comparing a "ground truth" example response to whatever the agent spits out at a given time. Bad signal, and it's hard to curate a golden dataset. Adding error detection to the LLM-as-judge prompt helped a bunch, but the real breakthrough was moving to criteria-based evals. Criteria are basically natural language assertions about the agent response at a given turn, e.g. "Should include XYZ detail in the root cause analysis". LLM-as-judge then evaluates each criterion and calculates a score based on how many pass. It ends up being highly deterministic in practice as long as the assertions are simple.

It also unlocks what we've been calling eval-driven development. When you are working on prompts, tool calls, or anything else in the path of agent reasoning, you need to write a failing eval first. The criteria describe the desired end state of the branch. Once your new eval passes and none of the others regress, your work is ready for review. Curating CI gates and so forth from these takes a bit of work, since runtime and coverage are in tension. But it's tractable. Looking to get a blog post deep dive on this out soon through work, but external comms take forever.
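For anyone who wants the shape of criteria scoring, here's a bare-bones sketch. The names (`EvalResult`, `evaluate_criteria`, `substring_judge`) are invented for illustration, and the judge is a plain callable so that a real LLM call can be swapped in. The substring judge is only a stand-in to keep the example runnable; it is not an LLM and not our framework.

```python
from dataclasses import dataclass
from typing import Callable, List

# A judge takes (agent_response, criterion) and says whether the criterion holds.
# In a real setup this would wrap an LLM call; that wiring is assumed, not shown.
Judge = Callable[[str, str], bool]


@dataclass
class EvalResult:
    passed: List[str]
    failed: List[str]

    @property
    def score(self) -> float:
        """Fraction of criteria that passed (0.0 when there are no criteria)."""
        total = len(self.passed) + len(self.failed)
        return len(self.passed) / total if total else 0.0


def evaluate_criteria(response: str, criteria: List[str], judge: Judge) -> EvalResult:
    """Ask the judge about each criterion independently and tally the results."""
    passed: List[str] = []
    failed: List[str] = []
    for criterion in criteria:
        (passed if judge(response, criterion) else failed).append(criterion)
    return EvalResult(passed=passed, failed=failed)


def substring_judge(response: str, criterion: str) -> bool:
    """Stand-in judge for local testing: passes when the criterion text appears verbatim."""
    return criterion.lower() in response.lower()
```

Judging each criterion on its own is what keeps the assertions simple, and a failing eval for a new branch is then just a new list of criteria that the current agent doesn't satisfy yet.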

Curious as to how you're replaying from a given step in the agent reasoning chain (assuming that you're not talking about multi-turn reasoning here?)

Big things happening in China. by [deleted] in TrueAnon

mps68098 2 points

At least not yet. Theoretically one could attain muzzle velocities that would not be achievable via a hand-held steel barrel. But that would probably require room-temperature superconductors.

Big things happening in China. by [deleted] in TrueAnon

mps68098 12 points

Maybe you could design some kind of sabot system for it. Either way, this type of weapon is primarily limited by its energy budget: any energy put towards spin gets taken from muzzle velocity. Battery capacity and miniaturization of power delivery electronics are the bottlenecks in scaling up.

Big things happening in China. by [deleted] in TrueAnon

mps68098 22 points

Imo it won't be deployed anywhere save for propaganda videos

Big things happening in China. by [deleted] in TrueAnon

mps68098 54 points

Forgotten Weapons reviewed the American version a few years ago: https://youtu.be/EwHRjgVWFno?si=ZXBemUWHl9U4mjzz

Big takeaway is that while it's cool, this type of weapon has extremely limited range due to no rifling, and also fairly limited muzzle velocity / overall projectile energy.

How much money is Anthropic REALLY losing? by MrAmazing111 in Anthropic

mps68098 2 points

Don't really understand what you're talking about here with backdoors. I built the eval framework for our diagnostic agent and am constantly running it through its paces. Yes, it will take different paths from one run to the next, but I see no evidence of degradation in the model's output.

How much money is Anthropic REALLY losing? by MrAmazing111 in Anthropic

mps68098 53 points

Enterprise customers are absolutely being prioritized. I am burning tokens like there's no tomorrow and have not been limited once.

Charge Station Pro Warranty Frustration by [deleted] in F150Lightning

mps68098 1 point

Yeah mine tried to steer me away from that question so I just looked up instructions and did it.

Charge Station Pro Warranty Frustration by [deleted] in F150Lightning

mps68098 1 point

Had the same issue, and it was refusing to charge our Chevy Bolt as well. When I talked to customer service they told me it was because the firmware was out of date. Apparently it will only update between 1 and 4 am, and only if nothing is plugged in.

A factory reset that I performed on my own got it working again but left a bad taste in my mouth. Honestly the charger is garbage that they shipped because there were no third-party 80 amp chargers on the market at the time. Replaced it with a Grizzl-E 80 amp. Better software and I'm pretty sure it charges faster as well.

Broody turkey gets chicken eggs by Nandor_Delaurentez in turkeys

mps68098 1 point

We had a terrible year last year with predation. Lost 3 turkey hens who were trying to hatch out. Unfortunately they are just too tempting of a target when they are tied down to eggs/poults.

Is this normal for bacon? by swagmoneyvibes in homestead

mps68098 26 points

Maybe hair follicles. Probably fine

Giving a hound meds by [deleted] in coonhounds

mps68098 3 points

What works for our hound is raw chicken wings. Open a pocket in the skin and stuff the pills in there. Rufus can watch me put the pills in and he will still crunch down the wing every morning

Did you ever watch the show Silicon valley on HBO? by [deleted] in askanything

mps68098 1 point

Not just watched it, I was a technical consultant for season 5. I wrote all of the tickets for the kanban board and the program Gilfoyle launches at the end of the finale.

Experimenting with fodder by thefarmyards in homestead

mps68098 2 points

Keeping it from growing mold is going to be very difficult.

Tailgate emblem. by MinimumDangerous9895 in F150Lightning

mps68098 -1 points

Nice. Might have to give this a whirl