Kavli Affiliate: Hsiaowen Chen

Summary: Vision-Language Models (VLMs) exhibit remarkable common-sense and semantic reasoning capabilities. However, they lack a grounded understanding of physical dynamics. This limitation arises from training VLMs on static internet-scale visual-language data that contain no causal interactions or action-conditioned changes. Consequently, it remains challenging to […]
Continue reading: SIMPACT: Simulation-Enabled Action Planning using Vision-Language Models