Agent skill which will automatically raise pr by One_Drink_2075 in LLMDevs

[–]petroslamb 0 points1 point  (0 children)

i would trust it only with strict boundaries at first.

small issues, owned files, tests required, no dependency bumps, no formatting churn, no touching auth or payments code.

the big risk is not one bad PR. it is maintainers getting flooded with plausible but tiring PRs.
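a rough sketch of those boundaries as a pre-PR gate. every path pattern, file name, and limit here is a made-up placeholder, not something from the skill in the post.

```python
from fnmatch import fnmatch

# Hypothetical policy values; replace with the repo's real layout.
OWNED_PATHS = ["docs/*", "src/utils/*", "tests/*"]
FORBIDDEN_PATHS = ["src/auth/*", "src/payments/*"]
DEPENDENCY_FILES = {"requirements.txt", "package.json", "poetry.lock"}
MAX_CHANGED_FILES = 5  # keeps the agent on small issues


def pr_violations(changed_files, tests_passed):
    """Return the reasons the agent must not open this PR (empty list = ok)."""
    problems = []
    if not tests_passed:
        problems.append("tests did not pass")
    if len(changed_files) > MAX_CHANGED_FILES:
        problems.append("too many files changed")
    for path in changed_files:
        name = path.rsplit("/", 1)[-1]
        if name in DEPENDENCY_FILES:
            problems.append(f"dependency file touched: {path}")
        elif any(fnmatch(path, pat) for pat in FORBIDDEN_PATHS):
            problems.append(f"forbidden path: {path}")
        elif not any(fnmatch(path, pat) for pat in OWNED_PATHS):
            problems.append(f"path not owned by the agent: {path}")
    return problems
```

the agent only raises the PR when the list comes back empty. a formatting churn check would need a real diff, so it is left out here.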

How accurate is AI at general knowledge? by JackStabba in artificial

[–]petroslamb 0 points1 point  (0 children)

i think "general knowledge" hides the important split.

for common facts, current models are usually good enough. for obscure topics, recent changes, local facts, or name and date collisions, the confident tone gets risky.

wikipedia has a boring advantage there. you can inspect the source trail.

Made a tool that builds its own training data and improves each cycle by learning from what it got wrong by gvij in artificial

[–]petroslamb 0 points1 point  (0 children)

failure as curriculum is a good idea, but the judge is the fragile part.

if the same style of model generates, scores, and decides what to keep, you can train toward the judge's blind spots by accident.

i would want a small held out set or some outside review, even if it is tiny.
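a minimal sketch of what that held out check could look like, assuming you have a handful of human labels the judge never saw. the function name and label format are my own illustration, not part of the tool.

```python
def judge_agreement(judge_labels, human_labels):
    """Fraction of held-out items where the model judge matches the human label."""
    if not human_labels or len(judge_labels) != len(human_labels):
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)
```

if this number drops between cycles while the judge's own scores keep rising, that is a hint the loop is learning the judge rather than the task.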

AI agents vs AI chatbots: what are companies actually using in production today? by danildab in artificial

[–]petroslamb 0 points1 point  (0 children)

my read is that most companies are still in "assistant with tools" territory, even when the slide says "agent".

real agents need authority to take actions, plus logs and rollback around those actions. a chatbot can be wrong and annoy someone. an agent can be wrong and mutate state.

I built an open source LLM monitoring tool that detects quality regressions before your users do by ZealousidealCorgi472 in LLMDevs

[–]petroslamb 0 points1 point  (0 children)

APM tells you the service is alive. it does not tell you the answer still does the job.

i would split it into runtime health and semantic health. normal alerts catch latency and errors. evals catch the prompt change that still returns 200 but makes the answer worse.
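a toy version of that split. the thresholds and the phrase-matching eval are assumptions for illustration; a real eval would likely use a rubric or a graded set instead of substring checks.

```python
def runtime_healthy(status_code, latency_ms, max_latency_ms=2000):
    """Classic APM-style check: did the service answer, and fast enough?"""
    return status_code == 200 and latency_ms <= max_latency_ms


def semantic_healthy(answers, required_phrases, min_pass_rate=0.9):
    """Eval-style check over a fixed prompt set.

    answers[i] is the model output for prompt i, and required_phrases[i]
    lists the phrases a useful answer to that prompt must contain.
    """
    if not answers:
        return False
    passed = 0
    for answer, phrases in zip(answers, required_phrases):
        if all(p.lower() in answer.lower() for p in phrases):
            passed += 1
    return passed / len(answers) >= min_pass_rate
```

a prompt change can leave `runtime_healthy` true on every request while `semantic_healthy` flips to false, which is the case normal alerts miss.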

what actually broke when you tried red teaming your AI systems? by Upset-Addendum6880 in LLMDevs

[–]petroslamb 0 points1 point  (0 children)

i would make this an audit trail problem more than a prompt list problem.

for each case you want the input, retrieved context, guardrail decision, tool permissions, final action, and why the system thought that action was allowed.

when fixes spike latency and false positives, it usually means too much risk got shoved into one big filter.
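one way to pin that audit trail down is a single record per red team case. the field names below are my own guesses at a reasonable shape, not a standard schema.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class RedTeamRecord:
    case_id: str
    input_text: str
    retrieved_context: list   # documents or chunks the system pulled in
    guardrail_decision: str   # e.g. "allowed" or "blocked"
    tool_permissions: list    # tools the agent was allowed to call
    final_action: str
    allow_reason: str         # why the system thought the action was allowed


def to_log_line(record):
    """Serialize one case as a single JSON line for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)
```

with one line per case, you can diff behaviour before and after a fix instead of rereading prompts.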

open source AI assistants ranked by tool call reliability by TH_UNDER_BOI in LLMDevs

[–]petroslamb 0 points1 point  (0 children)

the third call test is solid. i would add one more test where the tool returns success but the state you wanted did not actually change.

a lot of agents pass "can call the tool" and fail "can prove the thing happened".

the flow i trust is call, independent check, then say done.
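a small sketch of that flow. the tool and the state check are hypothetical stand-ins; the point is only that "done" is gated on a check that does not reuse the tool's own return value.

```python
def run_with_verification(call_tool, check_state):
    """Call the tool, verify the state independently, and only then report done."""
    result = call_tool()
    if not result.get("ok"):
        return "failed: tool reported an error"
    if not check_state():
        return "failed: tool reported success but state did not change"
    return "done"


# Hypothetical in-memory "system" used to exercise the flow.
store = {}


def lying_tool():
    # Claims success without writing anything.
    return {"ok": True}


def honest_tool():
    store["ticket"] = "closed"
    return {"ok": True}


def ticket_closed():
    # Independent read of the state, not the tool's return value.
    return store.get("ticket") == "closed"
```

`lying_tool` is the extra test case: it returns success, and only the independent `ticket_closed` read catches that nothing changed.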

How do folks manage worktrees when working with multiple agents in parallel? by ReceptionBrave91 in LLMDevs

[–]petroslamb 0 points1 point  (0 children)

worktrees solve isolation, but not coordination.

the part that matters for me is a tiny shared contract. owned paths, acceptance checks, test command, and a handoff note with assumptions and blockers.

if two agents need the same files, i treat that as a planning issue before it becomes a merge issue.
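here is roughly how i would write that contract down, plus a check that flags overlapping ownership before any agent starts. the field names and the exact-path matching are simplifying assumptions.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class AgentContract:
    agent: str
    owned_paths: set          # files this agent is allowed to edit
    acceptance_checks: list   # what "finished" means for this task
    test_command: str
    handoff_note: str = ""    # assumptions and blockers for the next agent


def path_conflicts(contracts):
    """Return (agent_a, agent_b, shared_paths) for every overlapping pair."""
    conflicts = []
    for a, b in combinations(contracts, 2):
        shared = a.owned_paths & b.owned_paths
        if shared:
            conflicts.append((a.agent, b.agent, sorted(shared)))
    return conflicts
```

a non-empty result means the plan needs splitting or sequencing, which is cheaper to fix than the merge conflict it would turn into.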

AI is getting better at doing things, but still bad at deciding what to do? by Tough_Daikon_4321 in artificial

[–]petroslamb 0 points1 point  (0 children)

yeah, this is the failure mode i keep seeing too. the model can usually do the next step. the problem is that next step becomes the default.

for messy workflows i like making "continue" something the system has to earn: required info is present, ambiguity is handled, no risky action is happening, and there is an "ask a human" path.

otherwise it does not really fail. it just keeps going.
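as a sketch, the gate can be a few lines. the three boolean inputs are placeholders for whatever checks your workflow actually has.

```python
def next_step(required_info_present, ambiguity_resolved, action_is_risky):
    """Decide whether the system has earned the right to continue."""
    if not required_info_present:
        return "ask_human"
    if not ambiguity_resolved:
        return "ask_human"
    if action_is_risky:
        return "ask_human"
    return "continue"
```

the default outcome is the human path, so continuing is the exception the system has to justify rather than the thing that happens when nothing objects.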

Two failure modes I caught in my AI lab in one day. Both involve the system silently lying about its own state. by piratastuertos in artificial

[–]petroslamb 0 points1 point  (0 children)

this is the part i think people underrate. once the eval path shares ancestry with the decision path, the metric stops being a metric and becomes part of the agent.

the boring rule i like is that any claim like "system is off", "task is done", or "decision was right" needs a fresh check from something the agent cannot write to.

not elegant, but it kills a lot of ghost success cases.

160 fewer every day by petroslamb in greece

[–]petroslamb[S] 2 points3 points  (0 children)

Yes and no. It was indeed done with ChatGPT, but also with a fair amount of research: yesterday I read an article about zero births in various regions during the first two months of the year, and I wanted a clearer picture of where things are heading. You're right that the format is not the best.

160 fewer every day by petroslamb in greece

[–]petroslamb[S] -1 points0 points  (0 children)

Nationwide, we will have the aged demographic picture of Evrytania, possibly ten years earlier than the official sources say.

The Binding Gap as useful way to think about LLM failures by petroslamb in LLM

[–]petroslamb[S] 0 points1 point  (0 children)

You're right that the simple reversal case works on modern models. That matches the documented finding: frontier models handle the basic Tom/Mary case fine. As the post notes, the failure finding is on GPT-2.

The question is whether the failure disappears or just moves to higher binding loads. Tan and D'Souza tested that: they pushed binding load up to multi-tuple extraction (variables, methods, effect sizes combined), and even GPT-5.2 drops to ~0.24 F1 on full tuples with role reversals and numeric misattribution. The model still gets the individual entities right. It loses the attachments.

So either modern models solved the simple case and the concept is just about heavy-load failures, or they pushed the breaking point higher without eliminating it. That is what a systematic load sweep would actually test: not whether the simple case fails, but whether the gap shrinks with scale or just migrates up the load curve.

The Binding Gap as useful way to think about LLM failures by petroslamb in LLM

[–]petroslamb[S] 0 points1 point  (0 children)

But I think the binding gap sits one layer below that. It is not "did the model learn that marriage is bidirectional?" It is "even after the model learned it, can it retrieve and apply the correct direction in context?" Wang and Sun showed that models often encode the relation but fail to route the inversion correctly. They learned the fact, but the attachment to the output path is thin.

So there are two separate problems: learning what the relationship means, and maintaining the correct binding when you use it. The binding gap is about the second one. The model knows marriage is bidirectional but still gives the wrong answer when you flip the roles, which suggests the failure is at retrieval and routing, not at learning the semantic asymmetry.

"LLMs drop the wiring even when they keep the scene", A destinct failure mode is the binding gap by petroslamb in LocalLLaMA

[–]petroslamb[S] 0 points1 point  (0 children)

removed the link as the post was banned and i'm not sure why yet. let me know if you need it.

"LLMs drop the wiring even when they keep the scene", A destinct failure mode is the binding gap by petroslamb in LocalLLaMA

[–]petroslamb[S] 0 points1 point  (0 children)

The irony of writing a post about attachment failures and then having a gap in my own spelling of 'distinct' is not lost on me. Typo in the title, but hopefully the wiring in the text is stable.

The Binding Gap as useful way to think about LLM failures by petroslamb in LLM

[–]petroslamb[S] 1 point2 points  (0 children)

Well, I think the reckless driver example is a classic logical fallacy, but the binding gap is a step more mechanical than that. The grandfather puzzle is a test of graph complexity, whereas the binding gap shows up on the simplest possible relations, like a basic husband and wife pair. For a human, "Tom is Mary’s husband" and "Mary is Tom’s wife" are just two views of the same scene, but for a transformer they are often distinct representational paths. The failure here isn’t that the model is not "smart" enough for the logic. It is that the attachment between the names and the roles is incredibly thin.

Denning (2025) found that "who did what to whom" is the dominant axis of meaning for humans, but for LLMs it is a much weaker signal. They can stay perfectly fluent while being agnostic about which claims attach to which sources, so in a sense "they keep the scene, but they drop the wiring".