SurroundDepth: Entangling Surrounding Views for Self-Supervised Multi-Camera Depth Estimation

Kavli Affiliate: Zheng Zhu

| First 5 Authors: Yi Wei, Linqing Zhao, Wenzhao Zheng, Zheng Zhu, Yongming Rao

| Summary:

Depth estimation from images is a fundamental step of 3D perception for
autonomous driving and an economical alternative to expensive depth sensors
such as LiDAR. Temporal photometric consistency enables self-supervised depth
estimation without labels, which further facilitates its application. However,
most existing methods predict depth from each monocular image independently
and ignore the correlations among the multiple surrounding cameras that are
typically available on modern self-driving vehicles. In this paper, we propose
SurroundDepth, a method that incorporates information from multiple
surrounding views to predict depth maps across cameras. Specifically, we
employ a joint network to process all surrounding views and propose a
cross-view transformer to effectively fuse information from multiple views. We
apply cross-view self-attention to efficiently enable global interactions
between multi-camera feature maps. Unlike self-supervised monocular depth
estimation, our method can predict real-world scale given the multi-camera
extrinsic matrices. To this end, we adopt structure-from-motion to extract
scale-aware pseudo depths for pretraining the models. Furthermore, instead of
predicting the ego-motion of each individual camera, we estimate a universal
ego-motion of the vehicle and transfer it to each view to enforce multi-view
consistency. In experiments, our method achieves state-of-the-art performance
on the challenging multi-camera depth estimation datasets DDAD and nuScenes.
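
As a rough illustration of the cross-view fusion described above, the sketch
below flattens the per-camera feature maps into a single token sequence and
runs standard multi-head self-attention over all cameras jointly, so every
spatial location can attend to features from every surrounding view. The
module name, camera count, and feature resolution are illustrative
assumptions, not the authors' exact implementation.

```python
# Minimal sketch (assumed shapes and names, not the paper's exact architecture):
# fuse multi-camera feature maps with joint self-attention so each camera's
# features can interact with all surrounding views.
import torch
import torch.nn as nn


class CrossViewAttention(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, N_cam, C, H, W) feature maps from a shared encoder
        b, n, c, h, w = feats.shape
        # Flatten all cameras and spatial locations into one token sequence.
        tokens = feats.permute(0, 1, 3, 4, 2).reshape(b, n * h * w, c)
        tokens = self.norm(tokens)
        # Global self-attention across every camera and every location.
        fused, _ = self.attn(tokens, tokens, tokens, need_weights=False)
        fused = tokens + fused  # residual connection
        return fused.reshape(b, n, h, w, c).permute(0, 1, 4, 2, 3)


# Usage: six surrounding cameras with downsampled 14x22 feature maps.
x = torch.randn(2, 6, 256, 14, 22)
y = CrossViewAttention()(x)
print(y.shape)  # torch.Size([2, 6, 256, 14, 22])
```

Attention over the joint sequence is quadratic in the number of cameras times
the number of spatial locations, which is why this kind of fusion is only
practical on downsampled feature maps rather than full-resolution images.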

| Search Query: ArXiv Query: search_query=au:"Zheng Zhu"&id_list=&start=0&max_results=10
