World-Model-Based Evaluation of Robot Policies

A scalable evaluator that predicts real-world robot policy outcomes using a learned world model, eliminating the need for time-consuming manual rollouts.

Sentience Inc.

Abstract

Evaluating robot policies in the real world is expensive, slow, and often unsafe. We built a world-model-based policy evaluator that simulates policy execution directly in a learned dynamics model. This enables rapid, automated assessment of policy success and failure modes without requiring physical rollouts. We demonstrate strong qualitative alignment between real-world executions and world-model predictions across diverse manipulation tasks, substantially reducing evaluation time while preserving fidelity.

Methodology

Our approach extends a pretrained video world model to support action-conditioned prediction of robot behavior. Rather than generating unconstrained future video, the model predicts how the scene evolves when a specific robot policy is executed.

We build on NVIDIA Cosmos-Predict2.5, a diffusion-based video generation model originally designed for conditional future frame prediction. We modify the architecture to incorporate robot actions as an explicit conditioning signal, enabling the model to simulate policy rollouts over time.
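One common way to inject actions as a conditioning signal is to project each action in a chunk to the model's latent width and combine it with the corresponding frame latent (additively here; cross-attention is another option). The sketch below is illustrative only: the names, shapes, and the additive scheme are assumptions, not the actual Cosmos-Predict2.5 interface.

```python
import numpy as np

# Hypothetical sketch of action conditioning: project a robot action
# chunk into the model's latent width and add it to each frame latent.
# Shapes and the additive scheme are illustrative assumptions.
rng = np.random.default_rng(0)

ACTION_DIM = 7    # e.g. 6-DoF end-effector delta + gripper (assumed)
LATENT_DIM = 64   # latent channel size (illustrative)
CHUNK_LEN = 8     # frames generated per chunk (illustrative)

W_action = rng.normal(0, 0.02, size=(ACTION_DIM, LATENT_DIM))

def embed_actions(actions: np.ndarray) -> np.ndarray:
    """Project each action in the chunk to the latent width."""
    return actions @ W_action  # (CHUNK_LEN, LATENT_DIM)

def condition_latents(frame_latents: np.ndarray, actions: np.ndarray) -> np.ndarray:
    """Inject per-frame action embeddings additively."""
    return frame_latents + embed_actions(actions)

latents = rng.normal(size=(CHUNK_LEN, LATENT_DIM))
actions = rng.normal(size=(CHUNK_LEN, ACTION_DIM))
conditioned = condition_latents(latents, actions)
print(conditioned.shape)  # (8, 64)
```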

The model takes as input a single initial observation image together with a sequence of robot actions. Given this input, it generates a short chunk of future video frames that follow the provided action sequence. To produce full trajectories, this process is repeated autoregressively: each newly generated frame is used as the conditioning image for the next prediction step.
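The chunked autoregressive procedure above can be sketched as follows. `predict_chunk` is a hypothetical stub standing in for the world-model call, so only the control flow (re-conditioning each chunk on the last predicted frame) is meaningful.

```python
import numpy as np

CHUNK_LEN = 8           # frames per generated chunk (illustrative)
H, W, C = 256, 320, 3   # output resolution stated in the text

def predict_chunk(cond_frame, action_chunk):
    """Stub for the world model: returns CHUNK_LEN predicted frames.
    A real call would run action-conditioned video diffusion."""
    return np.repeat(cond_frame[None], len(action_chunk), axis=0)

def rollout(initial_frame, actions):
    """Generate a full trajectory autoregressively: the last frame of
    each predicted chunk conditions the next prediction step."""
    frames, cond = [], initial_frame
    for start in range(0, len(actions), CHUNK_LEN):
        chunk = predict_chunk(cond, actions[start:start + CHUNK_LEN])
        frames.append(chunk)
        cond = chunk[-1]  # re-condition on the newest frame
    return np.concatenate(frames, axis=0)

frames = rollout(np.zeros((H, W, C)), np.zeros((24, 7)))
print(frames.shape)  # (24, 256, 320, 3)
```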

We train the resulting action-conditioned world model on robot demonstration data from the LeRobot SO-101 dataset, which contains paired video observations and action sequences. The current model generates video rollouts at a resolution of 256×320 pixels and can simulate multi-step manipulation behaviors across a variety of tabletop tasks.
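For training, each demonstration episode can be sliced into (conditioning frame, action chunk, target frames) windows. The windowing scheme below is an assumption for illustration, not the exact LeRobot data pipeline.

```python
import numpy as np

CHUNK_LEN = 8  # actions/frames per training window (illustrative)

def make_samples(frames, actions):
    """Slice one episode into training triples: a conditioning image,
    the action chunk executed from that frame, and the future frames
    the model should predict."""
    samples = []
    for t in range(len(frames) - CHUNK_LEN):
        samples.append((
            frames[t],                        # conditioning image
            actions[t:t + CHUNK_LEN],         # action chunk to follow
            frames[t + 1:t + 1 + CHUNK_LEN],  # target future frames
        ))
    return samples

episode_frames = np.zeros((50, 256, 320, 3))
episode_actions = np.zeros((50, 7))
samples = make_samples(episode_frames, episode_actions)
print(len(samples))  # 42
```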

Results

The proposed world model accurately generates robot policy rollouts in cases where the real-world policy execution succeeds. In these scenarios, the predicted videos closely match the qualitative structure of real robot behavior, including object interactions, manipulation sequences, and final task outcomes.

This level of fidelity enables the model to serve as a practical evaluator for robot policies, allowing users to quickly assess whether a policy is likely to succeed before deploying it on physical hardware. In practice, this substantially reduces evaluation time and avoids the safety risks associated with real-world testing.
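One way such an evaluator could be used is to roll out each candidate policy in the world model and rank policies by a success score computed on the final predicted frame. The sketch below is a hedged illustration of that workflow; `rollout_in_world_model` and `success_score` are hypothetical stubs, not the system's actual interface.

```python
import numpy as np

rng = np.random.default_rng(1)

def rollout_in_world_model(policy, initial_frame, horizon=24):
    """Stub: a real call would execute the policy's actions in the
    action-conditioned world model; here it returns a synthetic
    final frame scaled by a toy 'skill' value for illustration."""
    return rng.random((256, 320, 3)) * policy["skill"]

def success_score(final_frame):
    """Stub success heuristic; a real system might use a learned
    success classifier on the final predicted frame."""
    return float(final_frame.mean())

policies = [{"name": "policy_a", "skill": 0.4},
            {"name": "policy_b", "skill": 0.9}]
initial = np.zeros((256, 320, 3))

ranked = sorted(
    policies,
    key=lambda p: success_score(rollout_in_world_model(p, initial)),
    reverse=True,
)
print(ranked[0]["name"])  # policy_b
```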

While the results are promising, the generated videos are not yet perfect. We observe visual artifacts and occasional inconsistencies, particularly in longer rollouts and during complex object interactions. These issues indicate that further training and larger-scale data are required to improve temporal consistency and visual realism.

Despite these limitations, the model demonstrates that action-conditioned world models can already provide meaningful signal for robot policy evaluation, especially for early-stage testing and iteration.

Tasks 1–4: side-by-side comparisons of a real-world rollout and the corresponding world-model rollout. Instruction (all tasks): Pick up the white tissue ball and place it in the black mug.

Demo

We demonstrate a world-model-based policy evaluator that predicts real-world robot behavior by simulating policy rollouts in a learned dynamics model. This enables efficient and scalable evaluation of robot policies without requiring physical execution.

Conclusion

While the results are promising, the current world-model-based policy evaluator has several important limitations. The generated videos are not fully faithful to real-world execution and exhibit visual artifacts and temporal inconsistencies, particularly in longer rollouts and during complex object interactions.

These issues are amplified by the autoregressive generation process, where errors accumulate over time, and by bias in the training data distribution. In some cases, the model hallucinates plausible outcomes instead of strictly following the provided robot action sequence. Additionally, the current output resolution (256×320) limits visual fidelity.

Addressing action adherence, reducing hallucinations, improving temporal consistency, and increasing output resolution remain key challenges for future iterations of this work.