Predictive Statistics Shape Emergent World Representations of Grid Walkers

arXiv CS, Monday 08 June 2026, 04:00 UTC. By Sasha Brenner, Thomas R. Knösche, Nico Scherf

arXiv:2603.16689v2. Abstract: Next-token predictors often appear to develop internal representations of the latent world and its rules. The probabilistic nature of these models suggests a deep connection between the structure of the world and the geometry of probability distributions. To understand this link more precisely, we use a minimal stochastic process as a controlled setting: constrained random walks on a two-dimensional lattice that must reach a fixed endpoint after a predetermined number of steps. Optimal prediction of this process depends solely on a sufficient vector determined by the walker's position relative to the target and the remaining time horizon; in other words, the probability distributions are parametrized by the world's grid geometry. We train decoder-only transformers and recurrent networks on prefixes sampled from the exact distribution of these walks and compare their hidden activations to sufficient statistics of prediction by measuring alignment and linear readability across layers. We find that the transformer's computation factors into two stages: the first attention block extracts the sufficient statistic from the input, and later layers transform it into the next-step predictive geometry. Across constraint variants, the post-attention representation is universal: a shared world-state of the lattice that can be read directly as a world model, traced to the predictive geometry of the data. Later layers then specialize it to each variant's next-step distribution. Recurrent networks reach the same Bayes-optimal loss but do not isolate this world-state as a separate stage, showing that the world-model geometry also depends on architecture. Although demonstrated in a toy system, the results suggest that the geometry of the predictive distribution is a useful lens on how neural networks internalize the structure of their data.
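To make the sufficient-statistic claim concrete, here is a minimal Python sketch of one possible instance of such a process. It rests on assumptions the abstract does not confirm: the walker takes unit steps in the four lattice directions, and all walks that hit the target after exactly the allotted number of steps are equally likely. The names `count_paths`, `next_step_distribution`, and `sample_walk` are hypothetical and not taken from the paper. Under these assumptions the exact next-step distribution is a function of only the displacement to the target and the number of steps remaining.

```python
from fractions import Fraction
from math import comb
import random

# ASSUMPTION: nearest-neighbour moves; the paper's actual step set and
# constraint variants are not specified in the abstract.
STEPS = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}


def count_paths(dx, dy, n):
    """Count n-step nearest-neighbour walks with net displacement (dx, dy).

    Rotating coordinates to u = x + y, v = x - y turns every move into an
    independent +/-1 change in both u and v, which gives a product of
    two binomial coefficients.
    """
    if n < 0 or abs(dx) + abs(dy) > n or (n + dx + dy) % 2:
        return 0
    return comb(n, (n + dx + dy) // 2) * comb(n, (n + dx - dy) // 2)


def next_step_distribution(dx, dy, n):
    """Exact next-step probabilities given displacement to target and steps left.

    The arguments (dx, dy, n) are the only inputs: the earlier history of
    the walk does not enter the computation.
    """
    total = count_paths(dx, dy, n)
    if n <= 0 or total == 0:
        raise ValueError("target cannot be reached in exactly n steps")
    return {
        name: Fraction(count_paths(dx - sx, dy - sy, n - 1), total)
        for name, (sx, sy) in STEPS.items()
    }


def sample_walk(target, n, rng):
    """Sample one walk from the origin that lands on `target` after n steps."""
    x, y = 0, 0
    walk = []
    for remaining in range(n, 0, -1):
        dist = next_step_distribution(target[0] - x, target[1] - y, remaining)
        names = list(dist)
        weights = [float(dist[name]) for name in names]
        step = rng.choices(names, weights=weights)[0]
        sx, sy = STEPS[step]
        x, y = x + sx, y + sy
        walk.append(step)
    return walk, (x, y)


if __name__ == "__main__":
    print(next_step_distribution(1, 2, 7))
    print(sample_walk((2, 1), 7, random.Random(0)))
```

In this sketch, two prefixes that leave the walker with the same displacement and the same number of remaining steps yield identical predictions, which is the sense in which the pair acts as a sufficient statistic. Moves that would make the target unreachable receive probability zero, so every sampled walk ends on the target.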


Originally published by arXiv CS
