Learning Robust 3D Representation from CLIP via Dual Denoising

Kavli Affiliate: Wei Gao

| First 5 Authors: Shuqing Luo, Bowen Qu, Wei Gao

| Summary:

In this paper, we explore a critical yet under-investigated issue: how to
learn robust and well-generalized 3D representations from pre-trained
vision-language models such as CLIP. Previous works have demonstrated that cross-modal
distillation can provide rich and useful knowledge for 3D data. However, like
most deep learning models, the resultant 3D learning network is still
vulnerable to adversarial attacks, especially iterative attacks. In this
work, we propose Dual Denoising, a novel framework for learning robust and
well-generalized 3D representations from CLIP. It combines a denoising-based
proxy task with a novel feature denoising network for 3D pre-training.
Additionally, we propose utilizing parallel noise inference to enhance the
generalization of point cloud features under cross-domain settings. Experiments
show that our model can effectively improve the representation learning
performance and adversarial robustness of the 3D learning network under
zero-shot settings without adversarial training. Our code is available at
https://github.com/luoshuqing2001/Dual_Denoising.
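
For readers unfamiliar with the threat model, the following is a minimal sketch of what an iterative adversarial attack on a point cloud can look like. It is not taken from the paper or its repository: the toy classifier (mean pooling followed by a linear layer), the function names `loss_and_grad` and `iterative_attack`, and the values of `eps`, `alpha`, and `steps` are all assumptions chosen for illustration. The attack repeatedly takes a small signed-gradient step that increases the loss, then projects the perturbation back into an L-infinity ball around the clean points.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def loss_and_grad(points, W, b, label):
    """Cross-entropy of a toy classifier (hypothetical: mean-pool + linear)
    and its gradient with respect to the input points."""
    n = points.shape[0]
    feat = points.mean(axis=0)            # (3,) pooled feature
    probs = softmax(W @ feat + b)         # (C,) class probabilities
    loss = -np.log(probs[label])
    d_logits = probs.copy()
    d_logits[label] -= 1.0                # gradient of the loss w.r.t. the logits
    d_feat = W.T @ d_logits               # (3,) gradient w.r.t. the pooled feature
    grad = np.tile(d_feat / n, (n, 1))    # (N, 3): each point shares 1/n of it
    return loss, grad

def iterative_attack(points, W, b, label, eps=0.05, alpha=0.01, steps=10):
    """Signed-gradient ascent on the loss, projected onto an L-infinity ball
    of radius eps around the clean points (eps, alpha, steps are assumed values)."""
    adv = points.copy()
    for _ in range(steps):
        _, grad = loss_and_grad(adv, W, b, label)
        adv = adv + alpha * np.sign(grad)                # ascent step
        adv = points + np.clip(adv - points, -eps, eps)  # projection step
    return adv

rng = np.random.default_rng(0)
points = rng.normal(size=(128, 3))  # synthetic point cloud, 128 points
W = rng.normal(size=(4, 3))         # toy 4-class linear head
b = np.zeros(4)
label = 0

adv_points = iterative_attack(points, W, b, label)
clean_loss, _ = loss_and_grad(points, W, b, label)
adv_loss, _ = loss_and_grad(adv_points, W, b, label)
```

Because the attack refines the perturbation over several steps instead of one, it is generally harder to defend against than a single-step attack, which is consistent with the summary singling it out.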

| Search Query: ArXiv Query: search_query=au:"Wei Gao"&id_list=&start=0&max_results=3