Delving into Multi-modal Multi-task Foundation Models for Road Scene Understanding: From Learning Paradigm Perspectives

Kavli Affiliate: Yi Zhou

| First 5 Authors: Sheng Luo, Wei Chen, Wanxin Tian, Rui Liu, Luanxuan Hou

| Summary:

Foundation models have made a profound impact across many fields,
emerging as pivotal components that shape the capabilities of
intelligent systems. In the context of intelligent vehicles, leveraging
foundation models has proven transformative, offering notable
advancements in visual understanding. Equipped with multi-modal and multi-task
learning capabilities, multi-modal multi-task visual understanding foundation
models (MM-VUFMs) effectively process and fuse data from diverse modalities and
simultaneously handle various driving-related tasks with strong adaptability,
contributing to a more holistic understanding of the surrounding scene. In this
survey, we present a systematic analysis of MM-VUFMs specifically designed for
road scenes. Our objective is not only to provide a comprehensive overview of
common practices, covering task-specific models, unified multi-modal
models, unified multi-task models, and foundation model prompting techniques,
but also to highlight their advanced capabilities across diverse learning
paradigms, including open-world understanding, efficient transfer
for road scenes, continual learning, and interactive and generative capabilities.
Moreover, we provide insights into key challenges and future trends, such as
closed-loop driving systems, interpretability, embodied driving agents, and
world models. To help researchers stay abreast of the latest
developments in MM-VUFMs for road scenes, we maintain a continuously
updated repository at https://github.com/rolsheng/MM-VUFM4DS.
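
To make the architectural pattern described in the summary concrete, below is a minimal, hypothetical PyTorch sketch of a multi-modal multi-task model: per-modality encoders (camera and LiDAR), a shared fusion module, and multiple driving task heads. All class names, dimensions, and heads are illustrative assumptions and are not taken from the survey or any specific MM-VUFM.

```python
# Illustrative sketch only: encode camera and LiDAR inputs separately,
# fuse the resulting tokens, and feed the shared representation to
# several task heads. Names and dimensions are hypothetical placeholders.
import torch
import torch.nn as nn


class ToyMMVUFM(nn.Module):
    def __init__(self, embed_dim: int = 256, num_classes: int = 10):
        super().__init__()
        # Per-modality encoders project raw inputs into a shared token space.
        self.camera_encoder = nn.Sequential(
            nn.Conv2d(3, embed_dim, kernel_size=16, stride=16),  # patchify image
            nn.Flatten(2),                                       # (B, D, H*W)
        )
        self.lidar_encoder = nn.Linear(4, embed_dim)  # per-point (x, y, z, intensity)
        # A small transformer encoder serves as the cross-modal fusion module.
        self.fusion = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Task-specific heads share the fused representation.
        self.detection_head = nn.Linear(embed_dim, 4 + num_classes)  # box + class logits
        self.caption_head = nn.Linear(embed_dim, 32000)              # vocabulary logits

    def forward(self, image: torch.Tensor, points: torch.Tensor):
        cam_tokens = self.camera_encoder(image).transpose(1, 2)  # (B, N_img, D)
        lidar_tokens = self.lidar_encoder(points)                # (B, N_pts, D)
        fused = self.fusion(torch.cat([cam_tokens, lidar_tokens], dim=1))
        pooled = fused.mean(dim=1)  # global scene token
        return {
            "detection": self.detection_head(fused),  # per-token predictions
            "caption": self.caption_head(pooled),     # scene-level language logits
        }


# Example usage with dummy data: one 224x224 image and 1024 LiDAR points.
model = ToyMMVUFM()
out = model(torch.randn(1, 3, 224, 224), torch.randn(1, 1024, 4))
print({k: v.shape for k, v in out.items()})
```

The key design choice illustrated here is that the fusion module and encoders are shared across tasks, while only the lightweight heads are task-specific; actual MM-VUFMs surveyed in the paper vary in how they implement each of these components.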

| Search Query: ArXiv Query: search_query=au:"Yi Zhou"&id_list=&start=0&max_results=3
