World-Model-Based Policy Evaluation

Abstract

Evaluating robot policies in the real world is expensive, slow, and often unsafe. We present a world-model-based policy evaluator that simulates policy execution directly in a learned dynamics model. Our approach enables rapid, automated assessment of policy success and failure modes without requiring physical rollouts. Across diverse manipulation tasks, world-model predictions show strong qualitative alignment with real-world executions, and the evaluator substantially reduces evaluation time while preserving fidelity.
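
As a rough illustration of the evaluation loop summarized above, the following minimal Python sketch rolls a policy forward inside a dynamics model and scores each imagined trajectory. It is not the implementation used in this work: the function names (`rollout_in_world_model`, `evaluate_policy`), the list-of-floats state representation, the success criterion, and the toy one-dimensional task are all assumptions made for illustration, and a hand-written function stands in for the learned model.

```python
from typing import Callable, List, Sequence

State = List[float]
Action = List[float]
Policy = Callable[[State], Action]
WorldModel = Callable[[State, Action], State]  # stand-in for a learned dynamics model


def rollout_in_world_model(
    policy: Policy, world_model: WorldModel, initial_state: State, horizon: int
) -> List[State]:
    """Roll the policy forward inside the model; no physical rollout is involved."""
    states = [list(initial_state)]
    for _ in range(horizon):
        action = policy(states[-1])
        states.append(world_model(states[-1], action))
    return states


def evaluate_policy(
    policy: Policy,
    world_model: WorldModel,
    success_fn: Callable[[List[State]], bool],
    initial_states: Sequence[State],
    horizon: int,
) -> float:
    """Return the fraction of imagined rollouts that success_fn judges successful."""
    successes = 0
    for s0 in initial_states:
        trajectory = rollout_in_world_model(policy, world_model, s0, horizon)
        successes += int(success_fn(trajectory))
    return successes / len(initial_states)


# Hypothetical toy task: drive a 1-D state to the origin.
def toy_world_model(state: State, action: Action) -> State:
    # Assumed additive dynamics; a real system would query a trained network here.
    return [state[0] + action[0]]


def toy_policy(state: State) -> Action:
    # Proportional controller: halves the distance to the origin at every step.
    return [-0.5 * state[0]]


def idle_policy(state: State) -> Action:
    # Baseline that never moves, so it only succeeds if it starts at the goal.
    return [0.0]


def reached_origin(trajectory: List[State]) -> bool:
    # Assumed success criterion: final state within 0.05 of the origin.
    return abs(trajectory[-1][0]) < 0.05
```

Calling `evaluate_policy(toy_policy, toy_world_model, reached_origin, [[1.0], [-0.8]], horizon=10)` scores the proportional controller on two start states, and substituting `idle_policy` gives a failing baseline for comparison. In practice the success function would itself need to be automated, for example by a learned classifier over predicted observations, which this sketch does not attempt to model.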