A scalable evaluator that predicts real-world robot policy outcomes using a learned world model, eliminating the need for time-consuming manual rollouts.
Evaluating robot policies in the real world is expensive, slow, and often unsafe. We present a world-model-based policy evaluator that simulates policy execution directly in a learned dynamics model. Our approach enables rapid, automated assessment of policy success and failure modes without requiring physical rollouts. We demonstrate strong qualitative alignment between real-world executions and world-model predictions across diverse manipulation tasks; the evaluator substantially reduces evaluation time while preserving assessment fidelity.
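The core loop described above, rolling a policy out inside the learned dynamics model and scoring the imagined trajectory, can be sketched as follows. This is a minimal illustration, not the paper's actual interface: `world_model`, `policy`, and `success_fn` are hypothetical stand-ins, and the toy dynamics below merely mimic a learned model's behavior.

```python
import numpy as np

def evaluate_policy(policy, world_model, init_states, success_fn, horizon=50):
    """Roll out `policy` inside `world_model` (no physical robot) and
    return the fraction of imagined rollouts that reach success."""
    successes = 0
    for s in init_states:
        for _ in range(horizon):
            a = policy(s)
            s = world_model(s, a)  # learned dynamics: predict next state
            if success_fn(s):
                successes += 1
                break
    return successes / len(init_states)

# --- Toy stand-ins (illustrative only) ---
rng = np.random.default_rng(0)
# "World model": contract the state toward the commanded target; small
# noise mimics model prediction error.
toy_model = lambda s, a: s + 0.5 * (a - s) + 0.01 * rng.standard_normal(s.shape)
toy_policy = lambda s: np.ones_like(s)              # drive state toward 1.0
reached_goal = lambda s: np.linalg.norm(s - 1.0) < 0.05

starts = [np.zeros(3) for _ in range(20)]
rate = evaluate_policy(toy_policy, toy_model, starts, reached_goal)
print(f"imagined success rate: {rate:.2f}")
```

Because every rollout is imagined, many initial states (or many candidate policies) can be scored in parallel at the cost of forward passes through the model rather than physical robot time.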