Kavli Affiliate: Jing Wang
| First 5 Authors: Wentao Qu, Jing Wang, YongShun Gong, Xiaoshui Huang, Liang Xiao
| Summary:
Existing conditional Denoising Diffusion Probabilistic Models (DDPMs) with a
Noise-Conditional Framework (NCF) remain difficult to apply to 3D scene
understanding tasks, as the complex geometric details in scenes make it harder
to fit the gradients of the data distribution (the scores) from semantic
labels. This also results in longer training and inference times for DDPMs
compared to non-DDPMs. From a different perspective, we delve deeply into the
model paradigm dominated by the Conditional Network. In this paper, we propose
an end-to-end robust semantic Segmentation Network based on a
Conditional-Noise Framework (CNF) of DDPMs, named
CDSegNet. Specifically, CDSegNet models the Noise Network (NN) as a
learnable noise-feature generator. This enables the Conditional Network (CN) to
understand 3D scene semantics under multi-level feature perturbations,
improving generalization to unseen scenes. Meanwhile, benefiting from the
noise system of DDPMs, CDSegNet exhibits strong robustness to noise and sparsity
in experiments. Moreover, thanks to CNF, CDSegNet can generate semantic
labels in a single inference step, like non-DDPMs, because its dominant
network avoids directly fitting the scores from semantic labels. On
public indoor and outdoor benchmarks, CDSegNet significantly outperforms
existing methods, achieving state-of-the-art performance.
| Search Query: ArXiv Query: search_query=au:”Jing Wang”&id_list=&start=0&max_results=3