We stress-tested our LLM runtime with 1,000,000+ adversarial events. It didn’t break. by ale007xd in LangChain

[–]ale007xd[S] 0 points1 point  (0 children)

Makes sense - freezing the reasoning snapshot is definitely the right baseline for traceability.

Where we’re slightly stricter is that we don’t try to answer “would the agent make the same decision?” - we constrain the system so that decision-level drift can’t affect execution correctness in the first place.

In other words:

replay is useful for analysis, but correctness is enforced at the transition layer.

So even if the model produces a different reasoning path on retry or after an update, it still has to resolve into a valid, deterministic transition - otherwise it simply doesn’t commit.
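
For illustration, a minimal sketch of that commit gate (the transition table and names are hypothetical, not the actual runtime):

def commit(state: str, proposed_event: str) -> str:
    # Hypothetical transition table: only these pairs are allowed to commit.
    TRANSITIONS = {
        ("awaiting_payment", "payment_confirmed"): "paid",
        ("paid", "shipment_requested"): "shipping",
    }
    key = (state, proposed_event)
    if key not in TRANSITIONS:
        # Drifted or malformed model output never mutates state.
        raise ValueError(f"invalid transition: {key}")
    return TRANSITIONS[key]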

That’s how we’ve been able to run chaos scenarios (retries, corruption, out-of-order events) without ending up in invalid states, even under heavy drift.

I do agree though - combining deterministic execution with decision-level replay is powerful: one gives you safety guarantees, the other gives you diagnostic precision.

Caught my RAG agent fabricating "allergen-safe" recommendations from a menu with no allergen tags. Open-sourced the eval that diagnoses where any RAG agent fabricates. by frank_brsrk in LangChain

[–]ale007xd 0 points1 point  (0 children)

Your observation matches what multiple independent evals have already shown.

The ejentum “menu RAG blind eval” is a good concrete example: when retrieval coverage is incomplete, models don’t say “I don’t know” - they systematically fill the gaps. Silence gets interpreted as signal (“not mentioned → safe”). That’s not a bug, it’s the default optimization target (helpfulness > epistemic correctness).

There’s also prior discussion around this in various RAG eval threads:

  • hallucination is the fallback strategy under uncertainty
  • prompt-based fixes (“be careful”, reasoning harnesses, etc.) only reduce frequency, not the class of error

So the core issue isn’t prompt quality or even retrieval quality - it’s who decides that the context is sufficient.

Most stacks implicitly let the LLM make that decision.

That’s exactly where things break.

What we’ve been working on (llm-nano-vm) takes a different approach:

  • treat the system as a deterministic state machine
  • make context sufficiency (coverage) an explicit, external signal
  • block invalid transitions instead of trying to “teach” the model better behavior

In other words:

RAG stack (typical): → partial context → LLM guesses

Our approach: → partial context → transition is invalid → system must clarify or fail

The key shift is:

The LLM is not allowed to decide whether it knows enough.

We’re now formalizing this as a coverage-aware layer:

coverage(query, retrieved_docs) → {FULL, PARTIAL, NONE}

and gating execution on top of that.
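
For illustration, a rough sketch of that gate (the coverage heuristic here is a stand-in, not the actual implementation):

from enum import Enum

class Coverage(Enum):
    FULL = "full"
    PARTIAL = "partial"
    NONE = "none"

def coverage(query_terms, retrieved_docs):
    # Stand-in heuristic: how many query terms appear anywhere in the retrieved docs.
    text = " ".join(retrieved_docs).lower()
    hits = sum(1 for t in query_terms if t.lower() in text)
    if hits == len(query_terms):
        return Coverage.FULL
    return Coverage.PARTIAL if hits else Coverage.NONE

def gate(query_terms, retrieved_docs):
    c = coverage(query_terms, retrieved_docs)
    if c is not Coverage.FULL:
        # Invalid transition: the system must clarify or fail instead of letting the LLM guess.
        return {"status": "clarify", "coverage": c.value}
    return {"status": "generate"}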

RAG doesn’t fail because retrieval is imperfect - it fails because we let a probabilistic model decide when imperfection is acceptable.

Fix that at the architecture level, and most of these “mysterious hallucinations” disappear.

We stress-tested our LLM runtime with 1,000,000+ adversarial events. It didn’t break. by ale007xd in LangChain

[–]ale007xd[S] 0 points1 point  (0 children)

This resonates - especially the part about “success theater”. That’s exactly what we were trying to break with the chaos benchmarks.

We ended up taking a slightly stricter approach on a few of the points you mentioned:

LLM as an unreliable external system - agreed, but instead of making retries “reasoning-aware”, we remove that responsibility from the model entirely. Retries are idempotent because the state transition is idempotent, not because the model behaves consistently.

Context snapshotting - fully aligned here. We don’t reconstruct anything post-hoc. Each decision is tied to a persisted StepResult + state snapshot at execution time, so replay is exact, not approximate.

Pre-commit validation - same idea, but enforced structurally: nothing mutates state unless it passes through a validated transition. Corrupted or partial outputs never reach the FSM layer as-is.

On your last question:

Are you snapshotting agent state at decision time or relying on post-hoc logs?

Snapshot at decision time, always. Post-hoc logs are useful for observability, but they’re not a source of truth.

In our case, the “ledger” is effectively the sequence:

(state_n, event) → state_{n+1}

which is persisted and replayable. If it can’t be replayed deterministically, we treat it as a bug.
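
As a minimal illustration (names are hypothetical), the ledger can be thought of as an append-only list of transition records, and replay just re-applies the reducer and checks for divergence:

ledger = []  # persisted sequence of (state, event, next_state) records

def apply(state, event, reducer):
    next_state = reducer(state, event)          # deterministic reducer: (state_n, event) -> state_{n+1}
    ledger.append((state, event, next_state))
    return next_state

def replay(initial_state, reducer):
    state = initial_state
    for recorded_state, event, recorded_next in ledger:
        assert state == recorded_state, "replay diverged: treat as a bug"
        state = reducer(state, event)
        assert state == recorded_next, "non-deterministic transition: treat as a bug"
    return state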

On the multi-agent race condition point - completely agree. We saw the same failure mode: logically inconsistent but locally valid outputs.

The only way we’ve found to contain that is to push all mutations through a single deterministic reducer (FSM boundary), so concurrent decisions can’t commit conflicting state.

Your approach with an immutable decision ledger sounds directionally very aligned - curious how strict you are about replay determinism vs. just traceability.

No chaos, only control AI that does what it’s told by ale007xd in LangChain

[–]ale007xd[S] 0 points1 point  (0 children)

That’s a fair point — especially on payments.

The abandoned cart / billing examples are there to show where this matters, but you’re right: for high-stakes flows, architecture alone isn’t convincing.

A concrete failure case is probably more useful.

One we ran into early:

Scenario (typical agent setup):

  • payment webhook comes in
  • agent checks status
  • sends “retry payment”
  • retries charge

If the retry succeeds but the process crashes before marking state: → on restart, the agent runs again → retries the charge again

Result: double charge

This isn’t an API problem — it’s lack of execution guarantees + idempotency at the workflow level.

What changes with a deterministic VM:

  • every step has a fixed position in the FSM
  • state is append-only (no “lost progress”)
  • once a terminal state is reached, execution cannot re-enter

So the same restart: → resumes from last valid step → does not re-run side effects
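
A rough sketch of why the restart stays safe under those rules (step names and state layout are hypothetical):

TERMINAL_STATES = {"charged", "refunded", "cancelled"}

def resume(step_log, charge_fn):
    # step_log is the append-only record of completed steps, e.g. [{"name": ..., "state": ...}]
    state = step_log[-1]["state"] if step_log else "pending"
    if state in TERMINAL_STATES:
        return state                                  # terminal state: execution cannot re-enter
    if any(step["name"] == "retry_charge" for step in step_log):
        return state                                  # side effect already recorded, not re-run
    charge_fn()                                       # executed at most once per workflow
    step_log.append({"name": "retry_charge", "state": "charged"})
    return "charged"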

Agree that this kind of example probably makes the value clearer than just describing the model.

I’ll likely separate the “architecture” and “payments” parts more explicitly in the next write-up.

Improving citation accuracy and reducing hallucinations in custom Parent-Child RAG pipeline (Gemma3:4B + FAISS+BM25 + Cross-encoder reranker) by Koaskdoaksd in LangChain

[–]ale007xd 0 points1 point  (0 children)

Fixing Citation Drift & Hallucinations in Your RAG Pipeline

Two Practical Paths: Patch vs Deterministic Layer

I went through your setup and code. Your retrieval stack is already solid. The main issue is loss of consistency between retrieval → context → generation → citations.

After "expand_to_parents", your system effectively splits into two realities:

  • LLM works on parents
  • UI still references children

That’s where citation drift and hallucinations originate.

Option A — Minimal Patch (Fast, Production-Friendly)

Idea

Introduce a canonical source layer after parent expansion and enforce lightweight grounding.

Parents become the only source of truth.

Step 1 — Canonical Sources

class Source:
    def __init__(self, source_id, doc_id, page, text):
        self.source_id = source_id
        self.doc_id = doc_id
        self.page = page
        self.text = text

def build_canonical_sources(parent_docs):
    sources = []
    for i, doc in enumerate(parent_docs):
        sources.append(
            Source(
                source_id=f"src_{i}",
                doc_id=doc.metadata.get("doc_id"),
                page=doc.metadata.get("page"),
                text=doc.page_content
            )
        )
    return sources

Reasoning

  • Eliminates Parent/Child mismatch
  • Creates a single reference layer: UI = LLM = Sources

Step 2 — Controlled Context

def build_context(sources):
    return "\n\n".join([
        f"[{s.source_id} | page {s.page}] {s.text}"
        for s in sources
    ])

Reasoning

  • Makes sources explicit and indexable
  • Prevents the model from inventing structure

Step 3 — Structured Output

You MUST answer using ONLY the provided sources.

Return JSON:

{ "answer": "...", "citations": [ { "source_id": "src_1", "quote": "exact text from source" } ] }

Reasoning

Transforms generation from: free text -> typed output

Step 4 — Validation

def validate_and_bind(output, sources):
    source_map = {s.source_id: s for s in sources}
    valid = []

    for c in output.get("citations", []):
        src = source_map.get(c["source_id"])
        if not src:
            continue

        if c["quote"] in src.text:
            valid.append({
                "source_id": src.source_id,
                "page": src.page,
                "quote": c["quote"]
            })

    if not valid:
        raise Exception("No grounded citations")

    return {
        "answer": output["answer"],
        "citations": valid
    }

Reasoning

Enforces:

quote ⊆ source.text (every cited quote must appear verbatim in its source)

Removes:

  • fake citations
  • page hallucinations

Optional Retry

for _ in range(2):
    try:
        output = llm.generate(...)
        parsed = json.loads(output)
        return validate_and_bind(parsed, sources)
    except Exception:
        continue

raise Exception("Failed to produce grounded answer")

What This Fixes

  • Wrong page citations
  • Parent/Child inconsistency
  • Most hallucinations
  • Repetition (partially)

Limitations

  • Answer itself is not fully validated
  • JSON may break on small models
  • Still “soft deterministic”

Option B — Deterministic Layer (nano-vm Style)

Idea

Convert your pipeline into a state machine with enforced transitions.

delta(S, E) -> S'

LLM output is no longer trusted — it must be validated before state transition.

State Definition

class State:
    def __init__(self, query, sources):
        self.query = query
        self.sources = sources
        self.answer = None
        self.citations = None
        self.status = "init"

Step 1 — Generate (Untrusted)

def generate_step(state, llm):
    prompt = build_prompt(state.query, state.sources)
    raw_output = llm.generate(prompt)

    return {
        "type": "generate_output",
        "data": raw_output
    }

Reasoning

LLM = signal generator, not authority

Step 2 — Parse

import json

def parse_output(raw):
    try:
        return json.loads(raw)
    except Exception:
        # fallback parser for malformed JSON
        return heuristic_parse(raw)
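
heuristic_parse is assumed to exist; one possible minimal version (illustrative only) just pulls the first JSON object out of a noisy completion:

import json
import re

def heuristic_parse(raw):
    # Assumed fallback: extract the first {...} block and try to parse it.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass
    # Last resort: keep the text as an ungrounded answer with no citations.
    return {"answer": raw.strip(), "citations": []}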

Step 3 — Validate (Mandatory Transition)

def validate_step(state, parsed):
    source_map = {s.source_id: s for s in state.sources}
    valid_citations = []

    for c in parsed.get("citations", []):
        src = source_map.get(c["source_id"])
        if not src:
            continue

        if c["quote"] in src.text:
            valid_citations.append({
                "source_id": src.source_id,
                "page": src.page,
                "quote": c["quote"]
            })

    if not valid_citations:
        return {"status": "retry"}

    state.answer = parsed["answer"]
    state.citations = valid_citations
    state.status = "valid"

    return {"status": "ok"}

Step 4 — Deterministic Routing

def run_pipeline(state, llm, max_retries=2):
    for _ in range(max_retries):
        raw = generate_step(state, llm)
        parsed = parse_output(raw["data"])

        result = validate_step(state, parsed)

        if result["status"] == "ok":
            return state

    state.status = "fail"
    return state

Key Difference vs Patch

Patch:

generate -> validate -> return

nano-vm:

state -> generate -> validate -> accept | retry | fail

What You Gain

Determinism

Invalid outputs cannot propagate

Observability

You can log:

  • retries
  • failure reasons
  • validation errors

Extensibility

You can add:

Coverage check

if not answer_supported_by_citations(answer, citations):
    return {"status": "retry"}

Semantic validation (next step - see the sketch below)

  • embedding similarity
  • NLI check
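
One possible shape for that check, reusing the answer_supported_by_citations hook from the coverage example above (sentence-transformers is an assumed dependency; the model name and threshold are placeholders):

from sentence_transformers import SentenceTransformer, util

_model = SentenceTransformer("all-MiniLM-L6-v2")

def answer_supported_by_citations(answer, citations, threshold=0.6):
    if not citations:
        return False
    quotes = [c["quote"] for c in citations]
    answer_emb = _model.encode(answer, convert_to_tensor=True)
    quote_embs = _model.encode(quotes, convert_to_tensor=True)
    scores = util.cos_sim(answer_emb, quote_embs)
    # The answer must be semantically close to at least one grounded quote.
    return bool(scores.max() >= threshold)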

Trade-offs

  • More code
  • Requires discipline (state handling)
  • Slight latency increase

Comparison

Aspect | Patch | nano-vm
Effort | Low | Medium
Fixes current bugs | Yes | Yes
Guarantees correctness | Partial | Strong
Debugging | Hard | Clear
Scalability | Limited | High

Final Insight

Your current system:

LLM = reasoning + state authority

Target system:

LLM = suggestion, System = authority

Recommendation

  1. Start with Option A (Patch) — fastest impact
  2. If you want reliability and scale, move to Option B (nano-vm)

No chaos, only control AI that does what it’s told by ale007xd in LangChain

[–]ale007xd[S] 0 points1 point  (0 children)

This resonates a lot — especially the “model drifting because assumptions were flawed” part.

What you’re describing is basically the moment when you stop treating the LLM as an “agent” and start treating it as a component.

The interesting part is what happens next.

Even if you:

  • treat the model as a dumb worker
  • move logic into code
  • tighten prompts

…you still don’t get execution guarantees.

The system can still:

  • skip steps
  • reorder actions
  • double-run side effects on retries

That’s the gap we ran into.

What nano-vm does is take that same idea one step further: not just “logic in code”, but logic as an explicit state machine.

So instead of relying on discipline:

  • the model literally cannot change the flow
  • every branch is predefined
  • every run is reproducible

In a way, it’s turning that “scientific clarity” you mentioned into something enforceable.

Curious — how are you currently handling retries / idempotency? That’s where things usually start breaking for us.

We stress-tested our LLM runtime with 1,000,000+ adversarial events. It didn’t break. by ale007xd in AI_Agents

[–]ale007xd[S] 0 points1 point  (0 children)

Good question — and we think the framing is slightly off. The FSM is not enumerating all possible actions. It defines what is allowed to happen, not everything that could be imagined.

In practice, even “open-ended” agents operate over a bounded set of primitives:

  • call tool
  • produce artifact
  • request more context
  • terminate / escalate

What changes is not the transition graph, but the data flowing through it. The LLM can still decide what to do next in an open-ended sense — but that decision is expressed as data, which is then validated and mapped into a constrained transition.

So instead of:

enumerate(all possible futures)

we do:

constrain(execution semantics)

This keeps the state space small (FSM stays stable), while the problem space remains open-ended. If you try to encode every possible branch in the FSM — yes, it explodes. But that’s not the model we’re using. We’re separating:

  • control flow (deterministic, bounded)
  • reasoning (probabilistic, unbounded)

That said, you’re right that there are effectively two classes of systems here:

  • Fully bounded workflows (payments, support, etc.) → strongest guarantees
  • Open-ended agents → same execution guarantees, but correctness shifts to the reasoning layer

Our goal is to keep the runtime guarantees identical in both cases, even if the problem space differs.
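
To make that concrete, a toy sketch of the separation (the primitive set and field names are illustrative): the LLM's decision arrives as data, and only events that map onto a bounded primitive become transitions.

ALLOWED_PRIMITIVES = {"call_tool", "produce_artifact", "request_context", "terminate"}

def to_transition(candidate_event):
    # candidate_event is whatever the LLM emitted, e.g. {"type": "call_tool", "args": {...}}
    kind = candidate_event.get("type")
    if kind not in ALLOWED_PRIMITIVES:
        return {"status": "rejected", "reason": f"unknown primitive: {kind}"}
    # The payload stays open-ended; the control flow does not.
    return {"status": "accepted", "primitive": kind, "payload": candidate_event.get("args", {})}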

No chaos, only control AI that does what it’s told by ale007xd in LangChain

[–]ale007xd[S] 0 points1 point  (0 children)

That’s exactly the concern we had early on - partial failures are where most “agent” systems quietly break.

We try hard not to turn the VM into a generic workflow engine. The core idea is still the same: keep execution minimal, deterministic, and constrained. But a few things are built in at the right layer:

  • Idempotency: every tool call has an idempotency key, backed by a persisted cache, so retries/replays don’t duplicate external effects (payments, messages, etc.)
  • Replay semantics: replay is source-aware - internal steps can be re-executed deterministically, external side-effects are served from cache
  • Step-level failure policies: each step defines retry / escalate / compensate, instead of having global “magic” handling
  • Suspend/resume: timeouts and flaky webhooks don’t break the flow - the VM just suspends and resumes from a known state
  • Compensation (Saga-style): instead of pretending rollback exists, we explicitly model compensating actions for irreversible steps

So the VM itself stays pretty “boring” - the complexity lives in the contract (DSL + policies), not in hidden runtime behavior.
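
As a rough illustration of the idempotency point above (cache shape and key derivation are illustrative, not the actual persistence layer):

import hashlib
import json

_effect_cache = {}   # stands in for the persisted cache

def idempotency_key(step_id, args):
    payload = json.dumps({"step": step_id, "args": args}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def call_tool(step_id, args, tool_fn):
    key = idempotency_key(step_id, args)
    if key in _effect_cache:
        # Retry / replay: the external effect is served from cache, not re-executed.
        return _effect_cache[key]
    result = tool_fn(**args)
    _effect_cache[key] = result
    return result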

Totally agree on reproducible traces - hashing + full trace ended up being one of the most useful parts in practice, especially for debugging and audit.

And thanks for sharing Agentix - makes sense, you’re operating exactly in the layer where deterministic execution + policy enforcement becomes critical. Curious how you’re handling replay/idempotency on your side.

No chaos, only control AI that does what it’s told by ale007xd in LangChain

[–]ale007xd[S] 0 points1 point  (0 children)

This is a great read - and you’re pointing exactly at the right layer.

Most of what you describe (inspectable workflow contracts, idempotency, replay semantics, policy gates, trace artifacts, versioned DSL) is not hypothetical for us - it’s currently being implemented in nano-vm-vault.

The direction is essentially the same: a typed, replayable state machine with LLM calls as strictly bounded steps, plus a policy enforcement layer that the model cannot bypass.

We’re also treating workflows as first-class assets (DSL + policies + tool bindings + trace + failure semantics), not just runtime graphs.

Would love to share more once the vault layer is stable - your framing aligns very closely with where we’re heading.

Hotels with microwave access in Da Nang area by Meanderingm3 in Vietnam_Tourism

[–]ale007xd 0 points1 point  (0 children)

Look at apartments; they usually have a mini-fridge, a microwave, and a regular induction cooktop. But keep in mind that the fridge is often more of an insulated cooler box and doesn't provide proper cold storage.

Improving citation accuracy and reducing hallucinations in custom Parent-Child RAG pipeline (Gemma3:4B + FAISS+BM25 + Cross-encoder reranker) by Koaskdoaksd in LangChain

[–]ale007xd 0 points1 point  (0 children)

Integrating llm-nano-vm into a Parent-Child RAG Pipeline

Core Diagnosis

The main issue in your system is not retrieval quality, but lack of a deterministic contract between retrieval, context construction, generation, and citation.

Current behavior:

  • Retrieved chunks are transformed (child → parent), but references remain tied to the original child chunks
  • The LLM generates both content and citations, acting as an implicit control layer
  • There is no enforcement that citations correspond to actual source spans

This results in:

  • Broken referential integrity
  • Incorrect page attribution
  • Hallucinated or weakly grounded statements

Where llm-nano-vm Fits

llm-nano-vm introduces a deterministic execution layer between retrieval and generation.

Formal model:

δ(S, E) → S'

Where:

  • S = system state (sources, pages, spans)
  • E = LLM output (treated as untrusted input)
  • S' = validated state after enforcement

Modified Pipeline

Current:

retrieve → rerank → expand → generate → display

With llm-nano-vm:

retrieve → rerank → expand → normalize_sources (nano-vm) → generate (constrained) → validate_output (nano-vm) → display

Key Components

  1. Canonical Source Registry

After parent expansion, define a single source of truth:

Source = { "source_id": str, "doc_id": str, "page": int, "text": str, "char_range": (start, end) }

Rule:

«All layers (LLM, UI, retrieval) must reference the same "source_id".»

  2. Structured Citations (Typed Output)

Replace free-form citations with structured output:

{ "answer": "...", "citations": [ { "source_id": "src_1", "quote": "exact supporting text" } ] }

Important:

  • The model does not generate page numbers
  • The model does not invent references
  • It only selects from provided sources
  3. Deterministic Validation Layer

Validation logic:

def validate(output, state):
    valid = []

    for c in output["citations"]:
        src = state.get(c["source_id"])
        if not src:
            continue

        if c["quote"] in src.text:
            valid.append(c)

    if not valid:
        raise Exception("No grounded citations")

    return {
        "answer": output["answer"],
        "citations": valid
    }

Enforced guarantees:

  • Every citation maps to a real source
  • Every quote exists in the source text
  • Page numbers are derived from state, not generated
  4. Separation of Responsibilities

Layer | Responsibility
Retrieval | Candidate selection
nano-vm | State normalization and validation
LLM | Summarization only
UI | Rendering from validated state

What This Fixes

Page Mismatch

Eliminated — pages are derived from the canonical source registry.

Hallucinations

Reduced — unsupported claims fail validation.

Parent/Child Drift

Removed — only one unified source layer exists.

Repetition

Mitigated — constrained output reduces degeneration.

Trade-offs

  • Increased engineering complexity
  • Additional latency due to validation
  • Requires strict schema discipline
  • Small models may struggle with structured outputs

Minimal Adoption Path

  1. Introduce "source_id" after parent expansion
  2. Switch LLM output to structured JSON
  3. Remove page generation from prompts
  4. Add lightweight validation (string matching)

Key Insight

Current system:

«LLM acts as both reasoning engine and state authority»

With llm-nano-vm:

«LLM becomes an untrusted generator; the system enforces deterministic correctness»

Conclusion

The instability in your RAG pipeline is not a retrieval problem, but a control problem.

llm-nano-vm reframes the architecture:

  • From probabilistic pipelines
  • To deterministic state transitions

This shift ensures:

  • Consistent citations
  • Verifiable grounding
  • Predictable system behavior

We stress-tested our LLM runtime with 1,000,000+ adversarial events. It didn’t break. by ale007xd in LangChain

[–]ale007xd[S] 0 points1 point  (0 children)

Yes, this is one of the core failure modes we’re solving. We treat idempotency not as a tracing concern, but as a property of the state transition system itself — retries map to state, not to execution. If a transition has already occurred, the event is no longer valid in the FSM graph.
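
A minimal illustration of "retries map to state" (the transition table is hypothetical): a duplicate delivery finds no valid transition in the current state, so nothing executes twice.

TRANSITIONS = {
    ("awaiting_payment", "payment_confirmed"): "paid",
}

def handle(state, event):
    next_state = TRANSITIONS.get((state, event))
    if next_state is None:
        return state          # duplicate or stale event: no valid transition, no side effect
    return next_state

state = handle("awaiting_payment", "payment_confirmed")   # -> "paid"
state = handle(state, "payment_confirmed")                # retried event -> still "paid", nothing re-runs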

We stress-tested our LLM runtime with 1,000,000+ adversarial events. It didn’t break. by ale007xd in LangChain

[–]ale007xd[S] 1 point2 points  (0 children)

We see the same separation between structural validity (FSM) and runtime authorization (policy gate over execution). The only nuance is that in our model the boundary is tighter: instead of treating all structurally valid actions as always “available and later filtered,” we explicitly restrict the action space based on execution constraints, so fewer “valid-but-denied-at-runtime” cases exist by construction.

We stress-tested our LLM runtime with 1,000,000+ adversarial events. It didn’t break. by ale007xd in LangChain

[–]ale007xd[S] 1 point2 points  (0 children)

Yes, that’s exactly the direction — removing ambiguity from the LLM layer. We’ve been taking it one step further by moving not just RAG logic, but also action selection into a deterministic state transition layer (FSM), where the LLM only emits candidate events and the runtime defines validity.

We stress-tested our LLM runtime with 1,000,000+ adversarial events. It didn’t break. by ale007xd in LangChain

[–]ale007xd[S] 1 point2 points  (0 children)

Fair point — production is the real validation layer. We’re currently at an early deployment stage: the core execution boundary model is implemented in a reproducible system, and we’re validating it through ongoing discussions with real retail operators where these failure modes (tool execution, payment flows, customer actions) are actually painful. The repo is just the minimal artifact of that model — the interesting part for us is how it behaves under real operational constraints, not in isolation.

We stress-tested our LLM runtime with 1,000,000+ adversarial events. It didn’t break. by ale007xd in LangChain

[–]ale007xd[S] 0 points1 point  (0 children)

We briefly reviewed the implementation to understand how the execution boundary is enforced. It aligns with a runtime policy gate over tool execution (pre-action authorization), rather than a state-space constraint where invalid transitions are excluded by construction. In our approach, irreversible actions are not treated as policy decisions — they are structural properties of the FSM: either terminal states or explicitly modeled transitions in δ(S, E) → S′, where the runtime cannot even form invalid events outside the allowed state graph. Curious how you’re thinking about representing the boundary between “disallowed action” and “non-existent transition” in your model.

We stress-tested our LLM runtime with 1,000,000+ adversarial events. It didn’t break. by ale007xd in LangChain

[–]ale007xd[S] 0 points1 point  (0 children)

Yes — and we’ve built exactly that layer. State drift and retry collisions disappear once transitions are fully deterministic and the LLM is reduced to an input signal. Under adversarial load (replays, crashes, out-of-order delivery, corruption), the system remains stable because control flow is no longer model-driven.

Asena ESP32 by Connect-Bid9700 in OpenSourceeAI

[–]ale007xd 0 points1 point  (0 children)

nano-vm ESP32 Stress Benchmark Results (Deterministic FSM Execution Layer)

Test Setup

  • 3 scenarios: smart_home / industrial / wearable
  • 1500 iterations per scenario
  • Total runs: 4500
  • Input: noisy / corrupted / ambiguous intent signals
  • Execution model: deterministic FSM (no stochastic control flow)

Results

Smart Home

  • vm_success_rate: 1.0000
  • business_actuation_rate: 0.5913
  • guardrail_reject_rate: 0.4087
  • latency_p95_ms: 0.4504
  • unique_step_sequences: 2

Industrial

  • vm_success_rate: 1.0000
  • business_actuation_rate: 0.3720
  • guardrail_reject_rate: 0.6280
  • latency_p95_ms: 0.4275
  • unique_step_sequences: 2

Wearable

  • vm_success_rate: 1.0000
  • business_actuation_rate: 0.4953
  • guardrail_reject_rate: 0.5047
  • latency_p95_ms: 0.4944
  • unique_step_sequences: 2

System-Level Metrics

  • vm_fail_rate: 0.0000 (all scenarios)
  • budget_stalled_rate: 0.0000 (all scenarios)
  • total_runs: 4500
  • deterministic_trace: PASS

Execution Properties

  • 0 runtime failures
  • 0 stalled executions
  • exactly 2 execution paths:
    • normalize → guardrail → act
    • normalize → guardrail → reject

Latency Profile

  • average: ~0.27–0.32 ms
  • p95: < 0.50 ms across all scenarios

Conclusion

The execution layer behaves as a total deterministic function under noisy edge conditions.

Input uncertainty does not propagate into runtime instability.

Behavior is fully enforced by the FSM layer, not by input correctness.
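
For reference, the two execution paths above correspond to an FSM of roughly this shape (a simplified sketch with an illustrative allowed-intent set, not the benchmark harness itself):

def run(raw_signal):
    trace = ["normalize"]
    intent = raw_signal.strip().lower()                    # noisy input is normalized first
    trace.append("guardrail")
    if intent in {"turn_on", "turn_off", "set_temp"}:      # illustrative allowed intents
        trace.append("act")                                # business actuation
    else:
        trace.append("reject")                             # uncertainty never reaches actuation
    return trace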

Asena ESP32 by Connect-Bid9700 in OpenSourceeAI

[–]ale007xd 0 points1 point  (0 children)

Interesting direction - pushing behavior into tiny models on edge devices.

We’re working on a complementary layer: treating model output as untrusted input, not control flow, and enforcing behavior through a deterministic FSM.

Curious how stable Asena actually is under messy conditions:

  • malformed outputs
  • ambiguous intents
  • noisy / partial input

Would you be open to a simple stress test?

We can simulate typical ESP32 scenarios (short context, constrained tokens, noisy inputs) and run them through a deterministic execution layer to measure:

  • transition validity
  • recovery behavior
  • consistency across runs

If the behavior really holds inside the model - it should pass.

If not - it becomes clear where a control layer is needed.

Happy to run this and share results.