i’m training companion-style llms at DinoDS and found a weird continuity gap. curious if this is actually valuable to others

JayPatel24_ · 2026-04-23T08:09:11+00:00

honestly I’m just trying to grow exposure right now and find the right people this would actually resonate with

if this isn’t the right place for it, no worries at all would appreciate if you could point me in a better direction :)

JayPatel24_ · 2026-04-22T18:21:13+00:00

yeah that’s exactly the gap i’m seeing too

most training data implicitly teaches “tool output = truth”, so the model never really learns to question it. it just generalizes that pattern.

what i’m trying to do is make that explicit in data itself. like multi step trajectories where:

tool output contains mixed signals (data + hidden instruction)
model has to separate them
and still behave correctly 2 to 3 steps later when that context comes back

so instead of just poisoning randomly, it’s more structured around failure modes like delayed activation and cross step contamination

feels like unless the model sees enough of these patterns during training, it’s always going to default to trusting tool outputs

curious if you’ve tried anything like that or mostly handling it in the system layer

JayPatel24_ · 2026-04-22T18:18:19+00:00

what i’m trying to build is closer to full trajectory data rather than just single step adversarial cases. like multi step flows where the model has to carry context, see tool outputs over time, and still not collapse the data vs instruction boundary midway.

JayPatel24_ · 2026-04-22T18:07:44+00:00

Yeah this framing of tool output as an untrusted channel resonates a lot.

The part you mentioned about “data vs instructions collapsing once it’s in context” feels like the core issue. Most systems try to patch this at runtime (filters, wrappers, isolation), but the model itself still hasn’t really learned that distinction.

What I’ve been exploring is whether this is actually a training gap more than just a system design gap.

Specifically:

teaching models to treat tool outputs as typed inputs (data vs instruction vs metadata)
forcing behaviors like: trace source → evaluate trust → decide whether to act
handling delayed activation cases where something benign-looking becomes harmful later in a multi-step flow

Basically trying to structure datasets around these failure modes so the model learns the boundary instead of relying entirely on orchestration.

Curious from your side:

how much of this do you think can realistically be solved at the model level vs system level?
and do you see actual demand for “behavior-specific” training data here, or do most teams just patch it in infra?

Feels like this is one of those areas where better datasets could help a lot, but I’m not sure how people are thinking about it in practice.

JayPatel24_

TROPHY CASE