Kavli Affiliate: Zhuo Li
| First 5 Authors: Jiajun Cao
| Summary:
Vision-Language-Action (VLA) models have demonstrated significant potential
in complex scene understanding and action reasoning, leading to their
increasing adoption in end-to-end autonomous driving systems. However, the long
visual token sequences of VLA models greatly increase computational costs. Current
visual token pruning methods for Vision-Language Models (VLMs) rely on either
visual token similarity or visual-text attention, but both perform poorly in
autonomous driving scenarios. Since human drivers concentrate on task-relevant
foreground regions while driving, we argue that retaining the visual tokens
carrying this foreground information is essential for effective
decision-making. Inspired by this, we propose FastDriveVLA, a novel
reconstruction-based vision token pruning framework designed specifically for
autonomous driving. FastDriveVLA includes a plug-and-play visual token pruner
called ReconPruner, which prioritizes foreground information through MAE-style
pixel reconstruction. We design a novel adversarial foreground-background
reconstruction strategy to train ReconPruner on the visual encoder of VLA models.
Once trained, ReconPruner can be seamlessly applied to different VLA models
with the same visual encoder without retraining. To train ReconPruner, we also
introduce a large-scale dataset called nuScenes-FG, consisting of 241K
image-mask pairs with annotated foreground regions. Our approach achieves
state-of-the-art results on the nuScenes closed-loop planning benchmark across
different pruning ratios.
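
A minimal sketch of the general idea of a plug-and-play visual token pruner sitting between a frozen visual encoder and the language model: a small scoring head assigns each visual token a keep score and only the top fraction is forwarded. This is an illustrative assumption, not the paper's ReconPruner; the names `ScoredTokenPruner` and `keep_ratio` are hypothetical, and the MAE-style adversarial reconstruction training is not shown.

```python
import torch
import torch.nn as nn


class ScoredTokenPruner(nn.Module):
    """Scores visual tokens and keeps the top fraction (hypothetical sketch)."""

    def __init__(self, dim: int, keep_ratio: float = 0.25):
        super().__init__()
        self.score_head = nn.Linear(dim, 1)  # per-token keep score
        self.keep_ratio = keep_ratio

    def forward(self, vis_tokens: torch.Tensor) -> torch.Tensor:
        # vis_tokens: (batch, num_tokens, dim) from a frozen visual encoder
        b, n, d = vis_tokens.shape
        k = max(1, int(n * self.keep_ratio))
        scores = self.score_head(vis_tokens).squeeze(-1)   # (b, n)
        keep_idx = scores.topk(k, dim=1).indices           # (b, k)
        keep_idx, _ = keep_idx.sort(dim=1)                 # preserve token order
        gather_idx = keep_idx.unsqueeze(-1).expand(-1, -1, d)
        return vis_tokens.gather(1, gather_idx)            # (b, k, dim)


if __name__ == "__main__":
    pruner = ScoredTokenPruner(dim=1024, keep_ratio=0.25)
    tokens = torch.randn(2, 576, 1024)  # e.g., ViT patch tokens
    print(pruner(tokens).shape)         # torch.Size([2, 144, 1024])
```

In such a design, because the pruner only reads encoder outputs, it can in principle be reused across VLA models that share the same visual encoder, which matches the plug-and-play property described above.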
| Search Query: ArXiv Query: search_query=au:”Zhuo Li”&id_list=&start=0&max_results=3