World Action Models / Robot Learning

SANTS

A State-Adaptive Scheduler for World Action Models

World Action Models do not always need a fully denoised future. SANTS selects the action-useful intermediate video state at inference time, preserving future reasoning while removing redundant video denoising.

94.4% RoboTwin 2.0 success
73.1% real-robot mean success
81.7% RoboTwin average latency reduction
79.0% real-robot average latency reduction

One-minute summary

Adaptive video denoising for action, not pixels.

Problem

Full video denoising is expensive, and the final clean video is not always the best condition for action generation.

Insight

Action utility along the video noise trajectory is state dependent: coarse motion often needs less refinement than contact-rich manipulation.

Solution

SANTS learns when to stop denoising and how far to jump if more future evolution remains useful.

Figure: Fixed denoising schedule versus the SANTS adaptive schedule. Fixed denoising spends the same budget everywhere. SANTS instead jumps through easy states, refines contact-rich states, and stops before redundant future-video updates.

Motivation

The best future condition is state dependent.

Controlled depth scans show that video refinement can reduce action error, but the gain saturates and sometimes reverses. Coarse phases often obtain enough action cues after shallow denoising, while fine contact and alignment phases benefit from deeper future evolution.

This turns WAM inference into a state-dependent selection problem: choose the intermediate video representation that helps action generation, rather than always waiting for the fully denoised endpoint.

Figure: Denoising-depth diagnostic curves comparing action error across coarse and fine phases. Action usefulness changes along the video trajectory and across task phases.
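The depth-scan idea can be illustrated with a small sketch. It assumes a scan has already produced one action-error value per denoising depth for each phase; the numbers below are made-up placeholders chosen only to mimic the described pattern (saturation and reversal for coarse phases, continued gains for fine phases) and are not results from the paper. The helper names `best_depth` and `shallowest_sufficient_depth` are hypothetical.

```python
from typing import Dict, List


def best_depth(errors: List[float]) -> int:
    """Index of the denoising depth with the lowest action error."""
    return min(range(len(errors)), key=errors.__getitem__)


def shallowest_sufficient_depth(errors: List[float], tolerance: float) -> int:
    """First depth whose error is within `tolerance` (relative) of the best error."""
    target = min(errors) * (1.0 + tolerance)
    for depth, err in enumerate(errors):
        if err <= target:
            return depth
    return len(errors) - 1  # unreachable in practice: the best depth always qualifies


# Placeholder action errors indexed by denoising depth (0 = shallowest).
# These values are illustrative only, not measurements.
toy_scan: Dict[str, List[float]] = {
    "coarse": [0.30, 0.12, 0.10, 0.11, 0.13],  # gain saturates, then reverses
    "fine": [0.50, 0.35, 0.24, 0.15, 0.14],    # deeper denoising keeps helping
}

for phase, errors in toy_scan.items():
    print(phase, best_depth(errors), shallowest_sufficient_depth(errors, 0.25))
```

In this toy scan the coarse phase reaches a sufficient depth much earlier than the fine phase, which is the kind of gap a state-dependent scheduler could exploit.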

Method

Stop or jump along the noise trajectory.

SANTS attaches a lightweight scheduler to a frozen video-action diffusion policy. At each video decision point, it reads the current video-state representation and noise level, then predicts both stopping evidence and a relative noise-progression ratio.

Figure: SANTS architecture, showing the frozen video-action policy and the adaptive scheduler. SANTS keeps the WAM backbone frozen and learns a scheduler optimized for downstream action quality and inference cost.

01 Read state: Use the pooled video representation and current noise level as the scheduler state.

02 Stop: Accumulate hazard evidence to decide whether the current intermediate video state is sufficient.

03 Jump: If continuing, predict a relative progression ratio instead of following a fixed denoising grid.

04 Act: Pass the selected terminal video representation to the frozen action branch.
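The four steps above can be sketched as a short inference loop. This is a minimal illustration under stated assumptions, not the paper's implementation: the page does not give the exact stopping rule or jump parameterization, so the sketch assumes that hazards accumulate through a survival product and that the ratio shrinks the noise level multiplicatively. `Decision`, `run_schedule`, `make_toy_scheduler`, `toy_denoise`, and all thresholds are hypothetical names and values; the video state is reduced to a single float so the example runs without a model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Decision:
    hazard: float  # stopping evidence at this decision point, in [0, 1]
    ratio: float   # relative noise-progression ratio, in (0, 1]


Scheduler = Callable[[float, float], Decision]     # (pooled_state, sigma) -> Decision
Denoiser = Callable[[float, float, float], float]  # (state, sigma, next_sigma) -> state


def run_schedule(scheduler: Scheduler, denoise: Denoiser, state: float,
                 sigma_start: float = 1.0, sigma_min: float = 0.02,
                 stop_threshold: float = 0.5,
                 max_decisions: int = 16) -> Tuple[float, List[float]]:
    """Stop-or-jump loop; returns the terminal state and the visited noise levels."""
    sigma, survival, trace = sigma_start, 1.0, [sigma_start]
    for _ in range(max_decisions):
        decision = scheduler(state, sigma)              # 01: read state
        survival *= 1.0 - decision.hazard               # 02: accumulate hazard (assumed rule)
        if 1.0 - survival >= stop_threshold or sigma <= sigma_min:
            break                                       # current state judged sufficient
        # 03: jump by a relative ratio instead of following a fixed grid (assumed form)
        next_sigma = max(sigma_min, sigma * (1.0 - decision.ratio))
        state = denoise(state, sigma, next_sigma)
        sigma = next_sigma
        trace.append(sigma)
    return state, trace                                 # 04: state goes to the action branch


def make_toy_scheduler(difficulty: float) -> Scheduler:
    """Stand-in for the learned scheduler: harder states stop later and jump less."""
    def scheduler(state: float, sigma: float) -> Decision:
        hazard = (1.0 - difficulty) * (1.0 - sigma)
        ratio = 0.8 - 0.5 * difficulty
        return Decision(hazard=hazard, ratio=ratio)
    return scheduler


def toy_denoise(state: float, sigma: float, next_sigma: float) -> float:
    """Stand-in for the frozen video branch: scales the state with the noise level."""
    return state * next_sigma / sigma


for name, difficulty in [("easy", 0.1), ("hard", 0.9)]:
    _, trace = run_schedule(make_toy_scheduler(difficulty), toy_denoise, 1.0)
    print(name, "denoising calls:", len(trace) - 1)
```

With these toy settings the easy state stops after far fewer denoising calls than the hard one. In a real system the scheduler would be a learned module reading pooled video features, and the returned state would condition the frozen action branch.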

Results

Better success-latency tradeoffs than fixed denoising.

Figure: Success-latency tradeoff plot highlighting SANTS.

SANTS keeps strong WAM-style future reasoning while removing much of the video-denoising cost.

RoboTwin 2.0

94.4% overall success at 523.7 ms average inference latency across RoboTwin tasks.

Ablations

Adaptive stop and adaptive jump are complementary; using both gives the strongest success-latency tradeoff.

Setting              Success   Avg. latency   Takeaway
Full-denoising WAM   92.2%     2868.4 ms      Accurate but slow
SANTS                94.4%     523.7 ms       Best success with large average latency reduction

Figure: Ablation plot comparing fixed schedules, stop-only, jump-only, and full SANTS. Terminal-state selection and relative noise progression work best together.
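As a quick arithmetic check, the headline RoboTwin latency reduction can be recomputed from the two average latencies in the table. The sketch assumes the reduction is defined as one minus the ratio of average latencies, which the page does not state explicitly; `latency_reduction` is a hypothetical helper.

```python
def latency_reduction(baseline_ms: float, method_ms: float) -> float:
    """Fraction of baseline latency removed (assumed definition: 1 - method / baseline)."""
    return 1.0 - method_ms / baseline_ms


full_ms, sants_ms = 2868.4, 523.7        # average latencies from the table
full_success, sants_success = 92.2, 94.4  # success rates from the table, in percent

reduction = latency_reduction(full_ms, sants_ms)
speedup = full_ms / sants_ms
success_gain = sants_success - full_success

print(f"latency reduction: {reduction:.1%}")       # 81.7%
print(f"speedup: {speedup:.2f}x")                  # 5.48x
print(f"success gain: {success_gain:.1f} points")  # 2.2 points
```

Under that definition the table values are consistent with the reported 81.7% average latency reduction.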

Robot evaluation

SANTS improves real-robot manipulation under tight average control latency.

Across bimanual manipulation and UR10 kitchen tasks, SANTS reaches 73.1% mean success at 581.3 ms average policy latency (computed over all seven tasks). It does so by selecting intermediate future-video states rather than running full video denoising at every decision.

AgileX dual-arm: Charger insertion, Backpack packing, Sock placement, Clothes folding

UR10 kitchen: Plate transfer, Fridge placement, Fruit sorting

What SANTS learns

SANTS spends denoising where actions need it.

The scheduler does not simply run fewer steps everywhere. It allocates larger video-denoising budgets to contact, alignment, and insertion states while stopping earlier when the action intent is already clear.

Figure: Budget trajectory traces showing state-dependent denoising depth over closed-loop rollouts. Terminal denoising depth varies across states.

Paper materials

Citation

The public preprint is available on arXiv as arXiv:2605.27947.

@misc{sun2026santsstateadaptiveschedulerworld,
  title={SANTS: A State-Adaptive Scheduler for World Action Models},
  author={Yirui Sun and Guangyu Zhuge and Keliang Liu and Jie Gu and Xinyu Bing and Zhongxue Gan and Chunxu Tian},
  year={2026},
  eprint={2605.27947},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2605.27947},
}