Efficient-WAM

A 1B-Parameter World-Action Model with Low-Cost Future Imagination

Jiajun Li1,*, Tiecheng Guo2,*, Yifan Ye2,*, Rongyu Zhang2, Xiaowei Chi3,‡, Qianpu Sun2,
Ying Li2, Yunfan Lou2, Yan Huang4, Zhihe Lu5, Meng Guo2, Shanghang Zhang2

1The University of Hong Kong    2Peking University    3Muka Robotics
4Institute of Automation, Chinese Academy of Sciences    5Nanjing University

* Equal contribution Project lead Corresponding author

[Figure: Overview of Efficient-WAM]

Efficient-WAM uses low-cost future imagination to capture task-relevant object and robot dynamics without photorealistic video generation.

Abstract

World-Action Models (WAMs) have emerged as a promising paradigm for embodied control by coupling future visual prediction with action generation. However, most existing WAMs rely on photorealistic future prediction, which incurs high inference latency and makes real-time robot deployment difficult. This motivates a more efficient WAM design that preserves the control benefits of future visual prediction while reducing its inference cost.

We introduce Efficient-WAM, a World-Action Model that reduces the cost of future imagination while preserving its control benefit. Efficient-WAM improves inference efficiency via a compact video expert transferred from WAN-2.2-5B, token-sparse video latents, and asymmetric video-action denoising that allocates fewer sampling steps to video than to actions. Instead of optimizing the future branch for visual fidelity, Efficient-WAM treats future video prediction as a compact guidance signal for action generation. Comprehensive experiments on RoboTwin 2.0 and real-world manipulation tasks show that Efficient-WAM maintains strong action performance despite visibly coarse future predictions. While maintaining competitive control capabilities, our 1B-parameter model reduces per-chunk latency to around 100 ms during physical deployment, roughly a 30x speedup over existing WAMs (in the Astribot S1 experiments, Efficient-WAM-RT averages 98 ms per chunk versus 3215 ms for Motus).

Action-centric future imagination

Efficient-WAM compresses the video branch along three axes: model size, future-token density, and denoising budget.

[Figure: Efficient-WAM architecture]

Compact video expert

A lightweight video expert is initialized from WAN-2.2-5B through structured layer slicing and teacher-guided distillation.
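
One way to read "structured layer slicing" is as keeping an evenly spaced subset of the teacher's transformer blocks and then training the smaller stack to match the teacher's outputs. The sketch below illustrates that reading only. The function names, the block counts (30 and 6), and the mean-squared-error objective are assumptions made for illustration and are not taken from the paper.

```python
def select_layer_indices(n_teacher: int, n_student: int) -> list[int]:
    """Pick n_student evenly spaced block indices out of n_teacher,
    always keeping the first and last block (hypothetical slicing rule)."""
    if not 1 <= n_student <= n_teacher:
        raise ValueError("need 1 <= n_student <= n_teacher")
    if n_student == 1:
        return [0]
    stride = (n_teacher - 1) / (n_student - 1)
    return [round(i * stride) for i in range(n_student)]


def distillation_loss(student_out: list[float], teacher_out: list[float]) -> float:
    """Mean squared error between student and teacher predictions
    (an assumed objective; the actual loss may differ)."""
    if len(student_out) != len(teacher_out):
        raise ValueError("outputs must have the same length")
    return sum((s - t) ** 2 for s, t in zip(student_out, teacher_out)) / len(student_out)


# Hypothetical sizes: slice a 30-block teacher down to a 6-block student.
kept = select_layer_indices(n_teacher=30, n_student=6)

# The loss is zero when the student reproduces the teacher exactly.
loss = distillation_loss([0.1, 0.2], [0.1, 0.4])
```

In this reading, the student blocks would be initialized from the teacher blocks at the `kept` indices, and the distillation term would then pull the sliced stack back toward the teacher's behavior.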

Token-sparse futures

Future frames are predicted as low-resolution latents while current observations remain high-resolution for action conditioning.
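
The saving from low-resolution future latents can be estimated by counting tokens. The sketch below is a back-of-the-envelope illustration, not the paper's configuration: the latent sizes, the number of future frames, and the 2x2 patch size are all hypothetical values chosen to make the arithmetic easy to follow.

```python
def latent_tokens(frames: int, height: int, width: int, patch: int = 2) -> int:
    """Number of transformer tokens for a latent video of shape
    (frames, height, width) split into patch x patch spatial patches."""
    if height % patch or width % patch:
        raise ValueError("latent height and width must be divisible by patch")
    return frames * (height // patch) * (width // patch)


# Hypothetical latent sizes: one current frame at full resolution,
# four future frames at either full or half spatial resolution.
current = latent_tokens(frames=1, height=32, width=32)
dense_future = latent_tokens(frames=4, height=32, width=32)
sparse_future = latent_tokens(frames=4, height=16, width=16)

dense_total = current + dense_future    # all frames at full resolution
sparse_total = current + sparse_future  # only the observation at full resolution
```

With these assumed sizes, halving the spatial resolution of the future frames cuts their token count by 4x (1024 to 256), so the full sequence shrinks from 1280 to 512 tokens while the current observation keeps all 256 of its tokens.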

Asymmetric denoising

The video branch receives fewer denoising updates than the action branch, reducing latency while preserving action-centric structure.
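
The idea of giving the video branch a smaller denoising budget can be sketched with a toy Euler sampler. Everything below is an assumption for illustration: the step counts (2 video updates, 8 action updates), the evenly spaced refresh schedule, and the straight-line toy velocity fields stand in for the real networks and sampler, which this page does not specify.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def asymmetric_denoise(
    video: Vector,
    action: Vector,
    video_velocity: Callable[[Vector, float], Vector],
    action_velocity: Callable[[Vector, Vector, float], Vector],
    video_steps: int = 2,
    action_steps: int = 8,
) -> Tuple[Vector, Vector, int]:
    """Euler-integrate both branches from t=0 (noise) to t=1 (sample),
    giving the video branch fewer, larger steps than the action branch."""
    if not 0 < video_steps <= action_steps:
        raise ValueError("need 0 < video_steps <= action_steps")
    # Action-step indices at which the video latent is also refreshed.
    refresh_at = {(k * action_steps) // video_steps for k in range(video_steps)}
    dt_video, dt_action = 1.0 / video_steps, 1.0 / action_steps
    t_video, video_calls = 0.0, 0
    for i in range(action_steps):
        if i in refresh_at:
            v = video_velocity(video, t_video)
            video = [x + dt_video * dx for x, dx in zip(video, v)]
            t_video += dt_video
            video_calls += 1
        # The action branch is updated at every step, conditioned on the
        # most recent (possibly still coarse) video latent.
        a = action_velocity(action, video, i * dt_action)
        action = [x + dt_action * dx for x, dx in zip(action, a)]
    return video, action, video_calls


# Toy straight-line velocity fields that pull each branch toward a fixed target.
VIDEO_TARGET, ACTION_TARGET = [1.0, -1.0], [0.5, 0.25, 0.0]


def toy_video_velocity(x: Vector, t: float) -> Vector:
    return [(g - xi) / (1.0 - t) for g, xi in zip(VIDEO_TARGET, x)]


def toy_action_velocity(x: Vector, video: Vector, t: float) -> Vector:
    # A real model would read the video latent; this toy field ignores it.
    return [(g - xi) / (1.0 - t) for g, xi in zip(ACTION_TARGET, x)]


video_out, action_out, video_calls = asymmetric_denoise(
    [0.0, 0.0], [0.0, 0.0, 0.0], toy_video_velocity, toy_action_velocity
)
```

In this toy run the video velocity function is called twice and the action velocity function eight times. If the video expert dominates per-call cost, which is plausible for a video backbone but is not measured here, most of the latency saving would come from the skipped video calls.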

Efficient control in simulation and the real world

Efficient-WAM stays within about 2 points of the strongest heavyweight baseline on RoboTwin 2.0 (Motus, 8B) with 1B parameters, while Efficient-WAM-RT provides a practical accuracy-latency trade-off and achieves the best average success with the lowest latency on real-world Astribot S1 tasks.

RoboTwin 2.0 Simulation

Average success rates over 50 tasks under clean and randomized settings.

Method              Clean (%)   Random (%)   Params

VLA-based methods
π0                  65.9        58.4         3.3B
StarVLA-α           76.8        79.1         2B
π0.5                82.7        76.8         3.3B
ABot-M0             86.1        85.1         4.2B
LingBot-VLA         86.5        85.3         4B

WAM-based methods
UWM                 81.7        78.6         5B
GigaWorld-Policy    86.4        85.0         5B
Motus               88.7        87.0         8B
Efficient-WAM       86.7        85.7         1B
Efficient-WAM-RT    83.1        82.0

Real-World Astribot S1

Success rates over four physical manipulation tasks.

Task                      π0.5     Motus    Efficient-WAM-RT
Pipette-tray grasping     100      85       95
Reagent-bottle transfer   75       80       75
LEGO color sorting        30       65       65
Pen uncapping             10       25       30
Avg. Success (%)          53.75    63.75    66.25
Avg. Chunk Lat. (ms)      113      3215     98
Avg. Step Lat. (ms)       7.1      200.9    6.1
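
The summary rows of the table above can be re-derived from the per-task numbers. The short check below uses only values copied from the table. The "implied steps per chunk" ratio is an inference from dividing chunk latency by step latency and is not stated on this page.

```python
from statistics import mean

# Per-task success rates (%) in table order: pipette-tray grasping,
# reagent-bottle transfer, LEGO color sorting, pen uncapping.
success = {
    "pi0.5": [100, 75, 30, 10],
    "Motus": [85, 80, 65, 25],
    "Efficient-WAM-RT": [95, 75, 65, 30],
}
chunk_ms = {"pi0.5": 113, "Motus": 3215, "Efficient-WAM-RT": 98}
step_ms = {"pi0.5": 7.1, "Motus": 200.9, "Efficient-WAM-RT": 6.1}

# Average success per method, matching the "Avg. Success (%)" row.
avg_success = {name: mean(rates) for name, rates in success.items()}

# Per-chunk speedup of Efficient-WAM-RT relative to each baseline.
speedup = {
    name: chunk_ms[name] / chunk_ms["Efficient-WAM-RT"]
    for name in ("pi0.5", "Motus")
}

# Chunk latency divided by step latency (inferred, not reported).
implied_steps = {name: chunk_ms[name] / step_ms[name] for name in chunk_ms}

for name in success:
    print(f"{name}: avg success {avg_success[name]:.2f}%, "
          f"about {implied_steps[name]:.1f} steps per chunk")
```

The averages reproduce 53.75, 63.75, and 66.25. The chunk-latency ratio against Motus comes out near 32.8x, in line with the roughly 30x speedup quoted in the abstract, and the chunk-to-step ratio is close to 16 for all three methods.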

Coarse futures remain useful for control

Efficient-WAM-RT produces blurrier and lower-fidelity futures than the uncompressed video expert, yet preserves action-centric structure, motion direction, and contact layout.

[Figure: Future prediction example, full WAN configuration]
[Figure: Future prediction example, Efficient-WAM-RT]

Real-world demos

We deploy Efficient-WAM-RT on the Astribot S1 across four physical manipulation tasks, covering precise tray grasping, gentle reagent-bottle transfer, long-horizon color sorting, and fine-grained bimanual pen uncapping.

Pipette-tray grasping
Reagent-bottle transfer
LEGO color sorting
Pen uncapping

BibTeX

@article{li2026efficientwam,
  title         = {Efficient-WAM: A 1B-Parameter World-Action Model with Low-Cost Future Imagination},
  author        = {Li, Jiajun and Guo, Tiecheng and Ye, Yifan and Zhang, Rongyu and Chi, Xiaowei and Sun, Qianpu and Li, Ying and Lou, Yunfan and Huang, Yan and Lu, Zhihe and Guo, Meng and Zhang, Shanghang},
  journal       = {arXiv preprint arXiv:2606.10040},
  year          = {2026},
  eprint        = {2606.10040},
  archivePrefix = {arXiv},
  primaryClass  = {cs.RO},
  url           = {https://arxiv.org/abs/2606.10040}
}