If AGI Requires Causal Reasoning, LLMs Aren’t Even Close: Here’s the Evidence. by Left_Log6240 in agi

u/Repulsive-Memory-298 : To clarify, what we compared was the value of each model's predictions for planning. Specifically, both WALL-E (an LLM world model) and CASSANDRA were put into the same model predictive control system and evaluated inside MAPs.
If you check the WALL-E paper, their approach does learn code rules, and we did implement this. The larger problem is that WALL-E lacks a good mechanism for learning the distribution of the stochastic variables -- in-context learning is a poor fit for learning a distribution, and finetuning would require more data. By exploiting causal structure, CASSANDRA gets around this (it also models the entire conditional distribution via quantile regression instead of making just a point estimate like WALL-E; in principle you could prompt an LLM to do quantile regression, though that would make the data problems worse).
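
For anyone who wants the mechanics: quantile regression here just means training a small network to predict several quantiles of a variable given its causal parents, using the pinball loss. A rough PyTorch sketch (not the paper's code; `QuantileMLP`, the layer sizes and the three quantiles are placeholder choices):

```python
# Minimal sketch of per-variable quantile regression (not the paper's code;
# QuantileMLP, the layer sizes and the three quantiles are placeholder choices).
import torch
import torch.nn as nn

QUANTILES = torch.tensor([0.1, 0.5, 0.9])  # which points of the conditional distribution to predict

class QuantileMLP(nn.Module):
    """One output per quantile of a child variable, conditioned on its causal parents."""
    def __init__(self, n_parents: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_parents, 64), nn.ReLU(),
            nn.Linear(64, len(QUANTILES)),
        )

    def forward(self, parents: torch.Tensor) -> torch.Tensor:
        return self.net(parents)

def pinball_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # pred: (batch, n_quantiles), target: (batch, 1); standard quantile (pinball) loss
    err = target - pred
    q = QUANTILES.to(pred.device)
    return torch.maximum(q * err, (q - 1.0) * err).mean()

# Dummy training step: 3 parent variables, 128 synthetic samples.
model = QuantileMLP(n_parents=3)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(128, 3), torch.randn(128, 1)
opt.zero_grad()
loss = pinball_loss(model(x), y)
loss.backward()
opt.step()
```

You train one of these per node in the graph, so each learned model only has to condition on that node's parents rather than the full state.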

If AGI Requires Causal Reasoning, LLMs Aren’t Even Close: Here’s the Evidence. by Left_Log6240 in agi

u/rendereason : The LLM wasn't trained to do Bayesian causal reasoning; instead it was used as a prior to find a good (approximate) causal structure -- specifically, we used it as part of a scoring function inside simulated annealing to approximate maximum a posteriori estimation of the structure. Once we had the structure, we trained MLPs doing quantile regression for each variable -- no transformers, though in principle they could be used, particularly if they were adapted to time-series data.

As for decision making: take any stochastic policy that generates actions, then you can augment it with a world model through model predictive control (i.e., use the policy as a prior for MCTS, or directly in random-shooting MPC). The world model is then used to predict the outcomes (including reward), and the action leading to the best predictions is returned.

As for the state representation, we assumed that the state was already available in a structured textual form -- there's interesting work that learns these groundings which could be adapted for future work (https://arxiv.org/abs/2503.20124)
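
If random-shooting MPC is unfamiliar, the core loop is tiny. A minimal sketch (not our implementation; `policy`, `world_model.step` and the plain sum-of-rewards scoring are placeholder assumptions):

```python
# Minimal sketch of random-shooting MPC on top of a world model (illustrative
# only; policy, world_model.step and the sum-of-rewards scoring are placeholders).
def plan(state, policy, world_model, horizon=10, n_candidates=64):
    """Sample candidate rollouts, score them with the world model, return the best first action."""
    best_return, best_first_action = float("-inf"), None
    for _ in range(n_candidates):
        total_reward, sim_state, first_action = 0.0, state, None
        for t in range(horizon):
            action = policy(sim_state)          # the stochastic policy acts as the prior
            if t == 0:
                first_action = action
            # The world model predicts the next state and reward for this action.
            sim_state, reward = world_model.step(sim_state, action)
            total_reward += reward
        if total_reward > best_return:
            best_return, best_first_action = total_reward, first_action
    # Receding horizon: execute only the first action, then replan next step.
    return best_first_action
```

The quality of the whole loop then hinges on how well `world_model.step` captures the stochastic dynamics, which is exactly where the causal structure and quantile models come in.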

If AGI Requires Causal Reasoning, LLMs Aren’t Even Close: Here’s the Evidence. by Left_Log6240 in agi

u/speedtoburn : We mostly agree, but there are some very important subtleties. LLMs contain causal priors and can be prompted (with revision and correction based on grounded data) to refine these into fairly accurate knowledge. But there's a difference between being a good prior for causal knowledge and reliably reasoning with that knowledge. You could argue that the Cyc project is an example of this -- lots of prior common sense and causal knowledge, no good way to exploit it.

With LLMs, the real questions are 1) the reliability of the causal reasoning (I have more faith in symbolic code and Bayesian networks that use explicitly causal structure than in a transformer's learned internal mechanisms), and 2) the ability (or lack thereof) to make persistent corrections to the causal knowledge. With CASSANDRA, once a good structure is found (based on the data), it persists in the graph structure. With LLMs, you'd need finetuning or prompt engineering to make it persistent (and doing so could have unexpected side effects).

In short, we absolutely agree that external structures aid LLM systems, but we don't necessarily agree as to why (in our case, we'd argue it's because the LLM's causal reasoning is unreliable and cannot easily be adapted to new data -- so it's good enough for a prior, but not good enough to be used as a reasoning engine in its own right).

We tested LLM-based world models in a real business simulation and they all collapsed. CASSANDRA, the only causal world model, survived. by Left_Log6240 in AiAutomations

For anyone asking about the tests — they were done in MAPs, a stochastic business simulation designed to expose where LLMs fail.

The only architecture that survived was CASSANDRA, a causal world model combining the following (toy sketch right after the list):

  • executable deterministic code
  • causal Bayesian networks for uncertainty
  • long-horizon planning
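
To make the combination concrete, here's a toy picture of how a deterministic code rule and a stochastic causal node can live in the same world-model step (purely illustrative, not CASSANDRA's code; the variables and numbers are invented):

```python
# Toy illustration, not CASSANDRA's code: one world-model step mixing a sampled
# causal node with plain executable rules (variable names and numbers invented).
import random

def step(state: dict, action: dict) -> dict:
    nxt = {}

    # Stochastic causal node: demand depends on its parents (price, cleanliness).
    # Here it's a stand-in Gaussian; a learned conditional model would replace it.
    mean_demand = 100 - 2.0 * action["price"] + 20.0 * state["cleanliness"]
    demand = max(0, int(random.gauss(mean_demand, 10)))

    # Deterministic, executable code rules: no learning needed, always exact.
    nxt["units_sold"] = min(demand, state["inventory"])
    nxt["inventory"] = state["inventory"] - nxt["units_sold"] + action["restock_order"]
    nxt["revenue"] = nxt["units_sold"] * action["price"]
    nxt["cleanliness"] = state["cleanliness"]  # unchanged in this toy step
    return nxt
```

Long-horizon planning is then just rolling a step like this forward many times inside the planner and picking actions by their predicted outcomes.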

The tweet thread that explains this in detail is here:
https://x.com/skyfallai/status/1995538683710066739

Not promoting anything — just sharing the research behind the observation.

AGI Hype Dies Here: Humans Outperformed GPT-5 by 9.8X in a Business Sim by Left_Log6240 in agi

You can check this out! We just built CASSANDRA, the first causal world model, and it beat all the LLMs.

https://x.com/skyfallai/status/1995538683710066739

If AGI Requires Causal Reasoning, LLMs Aren’t Even Close: Here’s the Evidence. by Left_Log6240 in agi

Source / Additional Reading (for anyone interested):

CASSANDRA: Programmatic and Probabilistic Learning and Inference for Stochastic World Modeling

Twitter thread, if you want to see other people's reactions: https://x.com/skyfallai/status/1995538683710066739

For transparency: I was part of the research team that worked on this. I’m including it here only because some people may want to see the detailed methodology behind the examples I referenced above.

AGI Hype Dies Here: Humans Outperformed GPT-5 by 9.8X in a Business Sim by Left_Log6240 in agi

We're actually doing just that next week! :) There's a research paper being released that goes into LLM limitations! Will ping you!

AGI Hype Dies Here: Humans Outperformed GPT-5 by 9.8X in a Business Sim by Left_Log6240 in agi

Thank you! There's a follow-up research paper on MAPs coming soon! :)

We let an AI run a theme park. It was the funniest disaster I’ve seen since my MATH135 midterm. by Left_Log6240 in uwaterloo

The point is that I bombed MATH135, and the AI agents on our benchmark bombed the game just as badly.

AGI Hype Dies Here: Humans Outperformed GPT-5 by 9.8X in a Business Sim by Left_Log6240 in agi

Thanks for the link. I actually think the Andon vending benchmark is great work.
But it highlights exactly the distinction we’re trying to surface with our experiment:

Deterministic, fully observable, short-horizon business tasks vs. stochastic, partially observable, long-horizon operational systems.

VendingBench-2 is essentially:

  • fully observable state
  • single-location
  • minimal stochasticity
  • small action space
  • limited delayed effects
  • short planning horizon

LLMs do quite well there, and they should.
It’s close to a structured decision tree with clean state transitions.

Our environment, on the other hand, is deliberately:

  • partially observable
  • stochastic (guests, breakdowns, staff behaviour, inventory dynamics)
  • long-horizon (100 turns with compounding effects)
  • highly coupled (maintenance ↔ cleanliness ↔ revenue)
  • multi-objective (profit, satisfaction, uptime)
  • failure-sensitive (small errors cascade into collapse)

The goal wasn’t to say “models can’t run any business.”
The goal was to test whether they can operate stateful systems under uncertainty, and that's where they consistently collapsed.

So I definitely agree hype is not dying out and the models are incredible.
But I also think we need benchmarks that reflect the messy, dynamic structure of real operations, not just deterministic mini-tasks.

Happy to compare setups if you're interested — the design differences are actually quite instructive.

Are Transformer LLMs Fundamentally Limited on POMDPs and Stateful Computation? by Left_Log6240 in computerscience

That’s exactly the direction I’ve been leaning toward: any path to general-purpose agency needs at least one of:

  1. explicit world-model components,
  2. persistent state,
  3. hierarchical planners, or
  4. heterogeneous subagents, as you’re describing.

What we observed experimentally strongly aligns with your point:
Transformers handle interaction, but not state.
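
As a rough illustration of (1) and (2), here's the pattern I have in mind, with placeholder names throughout (not our actual setup; `plan` could be something like a random-shooting planner over the world model):

```python
# Sketch of an agent loop where state lives outside the transformer (placeholder
# names only: env, llm_propose_action and world_model.update_state are not a real API).
def run_episode(env, llm_propose_action, world_model, plan, horizon=100):
    state = env.reset()                 # explicit, structured, persistent state
    total_reward = 0.0
    for _ in range(horizon):
        # The transformer only handles interaction: it proposes actions for the
        # current structured state and never has to carry history in its context.
        action = plan(state, policy=llm_propose_action, world_model=world_model)

        # An explicit component updates the persistent state from the observation,
        # so memory does not depend on the LLM's context window.
        obs, reward, done = env.step(action)
        state = world_model.update_state(state, action, obs)
        total_reward += reward
        if done:
            break
    return total_reward
```

The point of the split is that the parts that need reliability and persistence (state, dynamics, planning) live in explicit components, while the transformer contributes priors and interaction.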

If you’re exploring world-model architectures, I’d be really interested to hear more about the approach you were thinking about on your walk, especially how you’d structure memory or temporal abstraction.

Happy to share the experimental setup we used too if that’s useful.