BetterDepth: Plug-and-Play Diffusion Refiner for Zero-Shot Monocular Depth Estimation

Kavli Affiliate: Xiang Zhang

| First 5 Authors: Xiang Zhang, Bingxin Ke, Hayko Riemenschneider, Nando Metzger, Anton Obukhov

| Summary:

By training on large-scale datasets, zero-shot monocular depth estimation
(MDE) methods show robust performance in the wild but often lack fine detail.
Although recent diffusion-based MDE approaches are notably better at
extracting details, they struggle in geometrically complex scenes that
challenge their geometry prior, which was trained on less diverse 3D data. To
leverage the complementary merits of both worlds, we propose BetterDepth,
which achieves geometrically correct affine-invariant MDE while capturing fine
details. Specifically, BetterDepth is a conditional diffusion-based refiner
that takes the predictions of pre-trained MDE models, in which the global
depth layout is well captured, as depth conditioning, and iteratively refines
details based on the input image. To train such a refiner, we propose global
pre-alignment and local patch masking methods that keep BetterDepth faithful
to the depth conditioning while it learns to add fine-grained scene details.
With efficient training on small-scale synthetic datasets, BetterDepth
achieves state-of-the-art zero-shot MDE performance on diverse public datasets
and on in-the-wild scenes. Moreover, BetterDepth can improve the performance of
other MDE models in a plug-and-play manner without re-training.
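
The two training-data preparation steps named in the summary lend themselves
to a short sketch. The snippet below is a minimal illustration, not the
authors' code: `global_prealign` solves the standard closed-form least-squares
scale and shift used for affine-invariant depth alignment, and
`patch_agreement_mask` is one plausible reading of local patch masking,
keeping supervision only on patches where the aligned conditioning already
agrees with the ground truth; the helper names, patch size, and threshold
`tau` are illustrative assumptions.

```python
import torch

def global_prealign(pred: torch.Tensor, gt: torch.Tensor,
                    valid: torch.Tensor) -> torch.Tensor:
    """Fit scale s and shift t minimizing ||s * pred + t - gt||^2 over valid
    pixels, then apply them to the whole prediction (standard affine-invariant
    depth alignment; hypothetical helper, not the authors' code)."""
    p, g = pred[valid], gt[valid]
    A = torch.stack([p, torch.ones_like(p)], dim=1)       # (N, 2) design matrix
    sol = torch.linalg.lstsq(A, g.unsqueeze(1)).solution  # (2, 1): [s, t]
    s, t = sol[0, 0], sol[1, 0]
    return s * pred + t

def patch_agreement_mask(cond: torch.Tensor, gt: torch.Tensor,
                         patch: int = 16, tau: float = 0.05) -> torch.Tensor:
    """One plausible form of local patch masking: mark patches where the
    pre-aligned conditioning stays within `tau` of the ground truth, so the
    refiner's training loss can be restricted to regions where following the
    conditioning and matching the ground truth do not conflict.
    `patch` and `tau` are illustrative values, not taken from the paper."""
    H, W = gt.shape
    keep = torch.zeros_like(gt, dtype=torch.bool)
    for y in range(0, H - patch + 1, patch):
        for x in range(0, W - patch + 1, patch):
            err = (cond[y:y+patch, x:x+patch] - gt[y:y+patch, x:x+patch]).abs().mean()
            keep[y:y+patch, x:x+patch] = bool(err < tau)
    return keep

# Usage sketch: align the frozen MDE prediction to ground truth, then build
# the per-patch supervision mask for the diffusion refiner's training loss.
pred = torch.rand(480, 640)     # conditioning depth from a base MDE model
gt = torch.rand(480, 640)       # synthetic ground-truth depth
valid = gt > 0                  # valid-depth mask
aligned = global_prealign(pred, gt, valid)
mask = patch_agreement_mask(aligned, gt)
```

The refiner itself (the conditional diffusion network and its iterative
denoising loop) is omitted here; the sketch only covers the alignment and
masking steps described in the summary.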

| Search Query: ArXiv Query: search_query=au:"Xiang Zhang"&id_list=&start=0&max_results=3
