DREAMSTEER: Latent World Models Can Steer VLA Policies During Deployment with Zero Finetuning

1 Fundamental AI Research (FAIR), Meta; 2 University of Minnesota Twin Cities
* Work done during an internship at Meta. Joint last authors.
0 target-environment finetuning
+42.5 percentage points OOD success rate (23.75% to 66.25%)
+17.5 percentage points instruction-following accuracy (38.75% to 56.25%)
1. Think before acting

DreamSteer evaluates candidate VLA action chunks through imagined latent rollouts before real robot execution.

2. Plug-and-play steering

DreamSteer composes a frozen latent world model and value model at deployment time, without target-environment demonstrations or parameter updates.

3. Handle deployment shift

Real-robot experiments use unseen objects, distractors, and a lab environment different from the one in the training data.

Latent world model overview

DREAMSTEER teaser figure
Heterogeneous latent action-conditioned world model. Our world model learns from diverse embodiments, maps visual inputs, proprioceptive inputs, and actions into a shared latent space, and appends context tokens to capture recent interaction history.
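
To make the caption concrete, here is a minimal sketch of how actions of different widths could be mapped into one shared latent width and how a bounded set of context tokens could be appended to the current tokens. Everything in it is an assumption for illustration: the embodiment names, `LATENT_DIM`, the random linear adapters, and the `ContextBuffer` class are hypothetical and are not taken from the DreamSteer implementation.

```python
import random
from collections import deque

LATENT_DIM = 8  # hypothetical shared latent width


def make_projection(in_dim, out_dim, seed):
    """Random linear map standing in for a learned per-embodiment adapter."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def project(weights, vector):
    """Apply the linear map: one output value per weight row."""
    if any(len(row) != len(vector) for row in weights):
        raise ValueError("input width does not match the adapter")
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


# Hypothetical embodiments whose raw actions have different widths.
ACTION_ADAPTERS = {
    "arm_7dof": make_projection(7, LATENT_DIM, seed=0),
    "hand_16dof": make_projection(16, LATENT_DIM, seed=1),
}


def encode_action(embodiment, action):
    """Map an embodiment-specific action into the shared latent width."""
    return project(ACTION_ADAPTERS[embodiment], action)


class ContextBuffer:
    """Keeps only the most recent tokens as interaction history."""

    def __init__(self, max_tokens=4):
        self.tokens = deque(maxlen=max_tokens)

    def push(self, token):
        self.tokens.append(token)

    def build_sequence(self, current_tokens):
        # Context tokens are appended after the current-step tokens.
        return list(current_tokens) + list(self.tokens)


context = ContextBuffer(max_tokens=2)
for step in range(3):  # three past steps, but only the last two are kept
    context.push(encode_action("arm_7dof", [0.01 * step] * 7))

current = encode_action("arm_7dof", [0.03] * 7)
sequence = context.build_sequence([current])
hand_token = encode_action("hand_16dof", [0.0] * 16)
```

Both `current` and `hand_token` have width `LATENT_DIM` even though the raw actions have 7 and 16 dimensions, which is the property a shared latent space needs. The fixed-size buffer is one simple way to bound the history length; the page does not say how the context tokens are actually produced.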

Model architecture

Model architecture figure
Architecture of the Spatio-Temporal Transformer Block. The model processes RGB and Control Latents through N repeated layers as shown on the left, utilizing factorized spatio-temporal self-attention for efficiency. The Spatio-Temporal Cross-Attention mechanism on the right integrates control signals by performing independent spatial cross-attention per timestep and causal temporal cross-attention per patch.
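
The attention pattern described in the caption can be illustrated with boolean masks. The sketch below is not the authors' code: it assumes, for illustration only, that queries and keys both live on the same grid of T timesteps by P patches, and it counts how many (query, key) pairs each factorized pattern allows compared with dense attention over all T x P tokens.

```python
def token_index(t, p, num_patches):
    """Flatten (timestep, patch) into a single token index."""
    return t * num_patches + p


def spatial_mask(num_steps, num_patches):
    """Each token attends only to tokens of the same timestep."""
    n = num_steps * num_patches
    mask = [[False] * n for _ in range(n)]
    for t in range(num_steps):
        for p in range(num_patches):
            for q in range(num_patches):
                mask[token_index(t, p, num_patches)][token_index(t, q, num_patches)] = True
    return mask


def causal_temporal_mask(num_steps, num_patches):
    """Each token attends to the same patch at the current and earlier timesteps."""
    n = num_steps * num_patches
    mask = [[False] * n for _ in range(n)]
    for p in range(num_patches):
        for t in range(num_steps):
            for s in range(t + 1):  # s <= t enforces causality
                mask[token_index(t, p, num_patches)][token_index(s, p, num_patches)] = True
    return mask


def count_pairs(mask):
    """Number of allowed (query, key) pairs."""
    return sum(sum(row) for row in mask)


T, P = 4, 9  # hypothetical sizes: 4 timesteps, a 3 x 3 patch grid
spatial_pairs = count_pairs(spatial_mask(T, P))            # T * P * P
temporal_pairs = count_pairs(causal_temporal_mask(T, P))   # P * T * (T + 1) / 2
dense_pairs = (T * P) ** 2
print(spatial_pairs, temporal_pairs, dense_pairs)  # prints: 324 90 1296
```

With these sizes the two factorized patterns together touch 414 pairs instead of 1296, which is the efficiency argument behind factorizing attention over space and time. Whether the self-attention layers are also causal in time is not stated in the caption, so the causal mask here mirrors only the temporal cross-attention description.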

World model rollouts

Real vs. imagined rollout comparisons, three pairs per setting: unseen episode, dexterous hands manipulation, unseen environment.

Abstract

Pretrained vision-language-action (VLA) policies show promising zero-shot generalization but often fail under deployment-time distribution shift, which reduces robustness and makes instruction following inconsistent. Prior work commonly tackles this by finetuning on in-distribution data, which assumes access to demonstrations collected on tasks in the target environment. In this work, we propose DreamSteer, a deployment-time steering framework for pretrained VLAs that requires no finetuning or parameter modification. The key insight in DreamSteer is to leverage a latent world model and a value model to steer pretrained VLA policies. During deployment, DreamSteer samples candidate action chunks from a VLA policy and from predefined motion primitives, imagines their outcomes using an action-conditioned latent world model, and ranks the imagined trajectories with a language-conditioned value model. Across four real-world manipulation benchmarks with unseen objects, DreamSteer improves task success rate from 23.75% to 66.25% and instruction-following accuracy from 38.75% to 56.25% over the base VLA policy.

DreamSteer framework

DreamSteer overview figure
DreamSteer: deployment-time policy steering. A frozen VLA policy proposes candidate action chunks, which are augmented with a small set of predefined Cartesian action primitives. A latent world model predicts future observations, and a language-conditioned value model ranks the resulting trajectories before execution. All models are kept frozen at deployment, and no target-environment data is used during model training.
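
The steering step in the caption above can be sketched as a propose, imagine, rank loop. This is a minimal sketch under stated assumptions, not the DreamSteer implementation: the latent state is a 3-vector, the primitives are fixed steps along each Cartesian axis, and `toy_policy`, `toy_world_model`, `toy_value_model`, and `GOALS` are hypothetical stand-ins for the frozen VLA, the latent world model, and the language-conditioned value model.

```python
import math


def cartesian_primitives(step=0.05, horizon=4):
    """Hypothetical primitives: repeat a fixed step along +/- x, y, z."""
    axes = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    return [[(step * x, step * y, step * z)] * horizon for x, y, z in axes]


def imagine(latent, chunk, world_model):
    """Roll the world model forward over one action chunk, without acting."""
    trajectory = [latent]
    for action in chunk:
        trajectory.append(world_model(trajectory[-1], action))
    return trajectory


def steer(latent, instruction, propose, world_model, value_model):
    """Return the best-scoring candidate chunk and its score."""
    # Candidates: policy proposals augmented with the predefined primitives.
    candidates = list(propose(latent, instruction)) + cartesian_primitives()
    scores = [
        value_model(imagine(latent, chunk, world_model), instruction)
        for chunk in candidates
    ]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]


# Toy stand-ins, all hypothetical.
GOALS = {"reach right": (0.2, 0.0, 0.0)}  # instruction -> goal latent


def toy_world_model(latent, action):
    # Additive dynamics in place of a learned latent transition.
    return tuple(s + a for s, a in zip(latent, action))


def toy_value_model(trajectory, instruction):
    # Higher is better: negative distance from the final latent to the goal.
    return -math.dist(trajectory[-1], GOALS[instruction])


def toy_policy(latent, instruction):
    # A policy whose proposals all miss the goal, mimicking an OOD failure.
    return [[(0.0, -0.05, 0.0)] * 4, [(0.0, 0.0, 0.05)] * 4]


chunk, score = steer(
    (0.0, 0.0, 0.0), "reach right", toy_policy, toy_world_model, toy_value_model
)
```

In this toy setup both policy proposals move away from the goal, so the selected chunk is the +x primitive. Only the selected chunk would be executed on the robot; every other candidate is evaluated purely in imagination. How often DreamSteer replans and how many candidates it samples are not specified on this page.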

Real robot experiments

Out-of-distribution (OOD) results

Comparisons of π0 + DreamSteer and π0 on each task:

Pick up the phone and place it in the brown box
Pick up the mustard and place it in the brown box
Pick up the whiteboard eraser and place it in the black bowl
Pick up the blue tape and place it in the black bowl

Instruction following (IF) accuracy

Comparisons of π0 + DreamSteer and π0 on each task:

Pick up the banana and place it in the black bowl
Pick up the sponge and place it in the black bowl
Pick up the apple and place it in the brown box
Pick up the pencil case and place it in the brown box

Quantitative results

OOD object performance ↑. Success rates over 20 trials per object. The last column reports the aggregate success rate over all 80 trials with 95% Wilson confidence intervals.

| Method | Phone | Mustard | Tape | Eraser | Average (%) | 95% CI (%) |
| π0 | 4/20 | 3/20 | 6/20 | 6/20 | 23.75 | [15.84, 34.07] |
| π0 + DreamSteer | 7/20 | 6/20 | 11/20 | 10/20 | 42.50 | [32.26, 53.43] |
| π0 + primitives + random | 0/20 | 0/20 | 0/20 | 0/20 | 0.00 | [0.00, 4.58] |
| primitives + DreamSteer | 0/20 | 0/20 | 0/20 | 0/20 | 0.00 | [0.00, 4.58] |
| π0 + primitives + DreamSteer | 12/20 | 11/20 | 16/20 | 14/20 | 66.25 | [55.39, 75.65] |
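
For readers who want to recompute the interval column, the following is a minimal sketch of the standard Wilson score interval for a binomial proportion. The page does not state the exact z-value or any variant used, so `z = 1.959964` (the two-sided 95% normal quantile) is an assumption, and recomputed bounds for the nonzero rows may differ slightly from the tabulated ones.

```python
import math


def wilson_interval(successes, trials, z=1.959964):
    """Wilson score interval for a binomial proportion, as fractions in [0, 1]."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Per-object success counts from the OOD table (20 trials per object).
pi0_counts = [4, 3, 6, 6]
ablation_counts = [0, 0, 0, 0]  # the two rows with 0/20 on every object

pi0_low, pi0_high = wilson_interval(sum(pi0_counts), 80)
zero_low, zero_high = wilson_interval(sum(ablation_counts), 80)
```

The zero-success rows show why a Wilson interval is a sensible choice here: a naive normal-approximation interval would collapse to zero width at 0/80, whereas the Wilson upper bound stays strictly positive (about 4.58%, matching the table).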

Instruction following performance ↑. Accuracy over 20 trials per target object. The last column reports the aggregate accuracy over all 80 trials with 95% Wilson confidence intervals.

| Method | Sponge | Banana | Pencil | Apple | Average (%) | 95% CI (%) |
| π0 | 8/20 | 9/20 | 6/20 | 8/20 | 38.75 | [28.78, 49.73] |
| π0 + primitives + DreamSteer | 14/20 | 13/20 | 9/20 | 9/20 | 56.25 | [45.34, 66.57] |
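
The headline improvements follow directly from the per-object counts in the two tables. The short check below recomputes the aggregate rates and their differences; the variable and function names are illustrative only.

```python
TRIALS_PER_OBJECT = 20


def aggregate_rate(counts):
    """Aggregate success rate in percent over all objects."""
    return 100 * sum(counts) / (TRIALS_PER_OBJECT * len(counts))


# OOD objects: phone, mustard, tape, eraser.
ood_base = [4, 3, 6, 6]
ood_steered = [12, 11, 16, 14]  # π0 + primitives + DreamSteer

# Instruction following: sponge, banana, pencil, apple.
if_base = [8, 9, 6, 8]
if_steered = [14, 13, 9, 9]  # π0 + primitives + DreamSteer

ood_gain = aggregate_rate(ood_steered) - aggregate_rate(ood_base)
if_gain = aggregate_rate(if_steered) - aggregate_rate(if_base)
print(ood_gain, if_gain)  # prints: 42.5 17.5
```

Both gains are absolute differences in percentage points over 80 trials, not relative improvements.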