Exploring the Interplay Between Video Generation and World Models in Autonomous Driving: A Survey

Kavli Affiliate: Yi Zhou

| First 5 Authors: Ao Fu, Yi Zhou, Tao Zhou, Yi Yang, Bojun Gao

| Summary:

World models and video generation are pivotal technologies in the domain of
autonomous driving, each playing a critical role in enhancing the robustness
and reliability of autonomous systems. World models, which simulate the
dynamics of real-world environments, and video generation models, which produce
realistic video sequences, are increasingly being integrated to improve
situational awareness and decision-making capabilities in autonomous vehicles.
This paper investigates the relationship between these two technologies,
focusing on how their structural parallels, particularly in diffusion-based
models, contribute to more accurate and coherent simulations of driving
scenarios. We examine leading works such as JEPA, Genie, and Sora, which
exemplify different approaches to world model design, thereby highlighting the
lack of a universally accepted definition of world models. These diverse
interpretations underscore the field’s evolving understanding of how world
models can be optimized for various autonomous driving tasks. Furthermore, this
paper discusses the key evaluation metrics employed in this domain, such as
Chamfer distance for 3D scene reconstruction and Fréchet Inception Distance
(FID) for assessing the quality of generated video content. By analyzing the
interplay between video generation and world models, this survey identifies
critical challenges and future research directions, emphasizing the potential
of these technologies to jointly advance the performance of autonomous driving
systems. The findings presented in this paper aim to provide a comprehensive
understanding of how the integration of video generation and world models can
drive innovation in the development of safer and more reliable autonomous
vehicles.
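
As an illustration of the Chamfer distance mentioned above, the following is a minimal sketch, not the survey's own implementation. It assumes one common convention (mean squared nearest-neighbour distance, summed over both directions); other works use unsquared distances or sums instead of means. The function name `chamfer_distance` and the `(N, 3)` array layout are assumptions made for this example.

```python
import numpy as np


def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3).

    Assumed convention: mean squared nearest-neighbour distance from a to b,
    plus the same quantity from b to a.
    """
    # Pairwise squared Euclidean distances via broadcasting, shape (N, M).
    diff = a[:, None, :] - b[None, :, :]
    d2 = np.sum(diff ** 2, axis=-1)
    # For each point in a, the squared distance to its nearest point in b,
    # and vice versa; average each direction and add them.
    a_to_b = d2.min(axis=1).mean()
    b_to_a = d2.min(axis=0).mean()
    return float(a_to_b + b_to_a)


# Hypothetical example: a reconstructed scene compared against ground truth.
predicted = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
reference = np.array([[0.0, 0.0, 0.0]])
score = chamfer_distance(predicted, reference)
```

The dense `(N, M)` distance matrix keeps the sketch short but grows quadratically in memory; for large driving-scene point clouds a nearest-neighbour index (for example a k-d tree) would usually be a better fit.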

| Search Query: ArXiv Query: search_query=au:"Yi Zhou"&id_list=&start=0&max_results=3