Kavli Affiliate: Jing Wang
| First 5 Authors: Jing Wang, Jing Wang, , ,
| Summary:
Existing imitation learning methods decouple perception and action, which
overlooks the causal reciprocity between sensory representations and action
execution that humans naturally leverage for adaptive behaviors. To bridge this
gap, we introduce Action-Guided Diffusion Policy (DP-AG), a unified
representation learning framework that explicitly models the dynamic interplay between
perception and action through probabilistic latent dynamics. DP-AG encodes
latent observations into a Gaussian posterior via variational inference and
evolves them using an action-guided SDE, where the Vector-Jacobian Product
(VJP) of the diffusion policy’s noise predictions serves as a structured
stochastic force driving latent updates. To promote bidirectional learning
between perception and action, we introduce a cycle-consistent contrastive loss
that organizes the gradient flow of the noise predictor into a coherent
perception-action loop, enforcing mutually consistent transitions in both
latent updates and action refinements. Theoretically, we derive a variational
lower bound for the action-guided SDE and prove that the contrastive objective
enhances continuity in both latent and action trajectories. Empirically, DP-AG
significantly outperforms state-of-the-art methods across simulation benchmarks
and real-world UR5 manipulation tasks. DP-AG thus offers a promising step
toward bridging biological adaptability and artificial policy learning.
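The two core mechanisms in the summary, a VJP-driven latent SDE step and a cycle-consistent contrastive objective, can be sketched in a minimal form. The sketch below is illustrative only: it uses a toy linear noise predictor (so the VJP is available in closed form instead of via autodiff), an Euler-Maruyama discretization, and a generic InfoNCE loss standing in for the paper's cycle-consistent objective; none of the shapes, coefficients, or the choice of the predicted noise as the cotangent vector come from the paper itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear "noise predictor" eps(z, a) = Wz @ z + Wa @ a.  For a linear
# map the Jacobian d(eps)/dz is just Wz, so its VJP with cotangent v is
# Wz.T @ v; a real implementation would obtain this via autodiff.
Wz = 0.1 * rng.normal(size=(4, 4))
Wa = 0.1 * rng.normal(size=(4, 2))

def noise_pred(z, a):
    return Wz @ z + Wa @ a

def vjp_wrt_z(v):
    # Vector-Jacobian product v^T (d eps / d z) for the linear predictor.
    return Wz.T @ v

def action_guided_step(z, a, dt=0.01, sigma=0.1):
    """One Euler-Maruyama step of an (assumed) action-guided SDE,
    dz = -VJP(eps) dt + sigma dW, using eps itself as the cotangent
    vector -- an illustrative choice, not necessarily the paper's form."""
    eps = noise_pred(z, a)
    drift = -vjp_wrt_z(eps)
    return z + dt * drift + sigma * np.sqrt(dt) * rng.normal(size=z.shape)

def cycle_contrastive_loss(z_pred, z_next, temperature=0.1):
    """InfoNCE-style batch loss (positives on the diagonal), standing in
    for the cycle-consistent contrastive objective."""
    def unit(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)
    logits = unit(z_pred) @ unit(z_next).T / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))
```

For example, `action_guided_step(np.zeros(4), np.ones(2))` advances a 4-dim latent under a 2-dim action, and `cycle_contrastive_loss` compares a batch of predicted next latents against the encoder's next latents.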
| Search Query: ArXiv Query: search_query=au:"Jing Wang"&id_list=&start=0&max_results=3