Kavli Affiliate: Wei Gao | Summary: Decades of cognitive science establish that humans navigate environments by forming cognitive maps, defined as allocentric and topology-preserving representations of 3D space. While modern Vision-Language Models (VLMs) demonstrate emergent spatial reasoning from 2D egocentric inputs, it remains unclear whether they construct an analogous 3D internal representation. In this paper, we demonstrate […]
Continue: Uncovering and Shaping the Latent Representation of 3D Scene Topology in Vision-Language Models