Model X-ray: Detecting Backdoored Models via Decision Boundary

Kavli Affiliate: Ting Xu

| First 5 Authors: Yanghao Su, Jie Zhang, Ting Xu, Tianwei Zhang, Weiming Zhang

| Summary:

Backdoor attacks pose a significant security vulnerability for deep neural
networks (DNNs), enabling them to operate normally on clean inputs but
manipulate predictions when specific trigger patterns occur. Existing
post-training backdoor detection approaches often assume that the defender
knows details of the attack, can read the model's logit outputs, and has
access to the model parameters. In contrast, our approach
functions as a lightweight diagnostic scanning tool offering interpretability
and visualization. By querying the model only for hard labels, we construct
decision boundaries over the convex combinations of three samples. We observe
two phenomena in backdoored models: a noticeable shrinking of the areas
dominated by clean samples and a significant increase in the surrounding
areas dominated by target labels. Leveraging this observation, we
propose Model X-ray, a novel backdoor detection approach based on the analysis
of illustrated two-dimensional (2D) decision boundaries. Our approach includes
two strategies focused on the decision areas dominated by clean samples and the
concentration of label distribution, and it can not only identify whether the
target model is infected but also determine the target attacked label under the
all-to-one attack strategy. Importantly, it accomplishes this using only the
predicted hard labels of clean inputs, without any assumptions about the
attack or prior knowledge of the model's training details. Extensive
experiments demonstrate that Model X-ray is highly effective and efficient
across diverse backdoor attacks, datasets, and architectures. We also provide
ablation studies on hyperparameters, additional attack strategies, and
further discussion.
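
To make the idea of a hard-label decision boundary over three samples concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names (`barycentric_grid`, `boundary_labels`, `summarize`), the grid resolution `n`, and the two summary statistics are hypothetical stand-ins for the paper's strategies, and `predict_label` stands for any black-box model that returns one integer label per input.

```python
import numpy as np

def barycentric_grid(n=50):
    """All weight triples (a, b, c) with a + b + c = 1 on an n-step lattice."""
    pts = [(i / n, j / n, (n - i - j) / n)
           for i in range(n + 1) for j in range(n + 1 - i)]
    return np.array(pts)

def boundary_labels(predict_label, x1, x2, x3, n=50):
    """Query hard labels on convex combinations of three samples."""
    w = barycentric_grid(n)                        # shape (m, 3)
    stack = np.stack([x1, x2, x3])                 # shape (3, ...)
    mixed = np.tensordot(w, stack, axes=(1, 0))    # shape (m, ...)
    return np.asarray(predict_label(mixed)).astype(int)

def summarize(labels, clean_labels, num_classes):
    """Assumed statistics: share of the region carrying the clean samples'
    labels, plus the most frequent label and its share (concentration)."""
    counts = np.bincount(labels, minlength=num_classes)
    share = counts / counts.sum()
    clean_share = share[np.unique(clean_labels)].sum()
    return float(clean_share), int(share.argmax()), float(share.max())

# Hypothetical stand-in for a black-box model: 3 classes from mean intensity.
def toy_predict(batch):
    m = batch.reshape(len(batch), -1).mean(axis=1)
    return np.clip(np.floor(m * 3), 0, 2).astype(int)

x1, x2, x3 = np.full((2, 2), 0.1), np.full((2, 2), 0.5), np.full((2, 2), 0.9)
labels = boundary_labels(toy_predict, x1, x2, x3, n=20)
print(summarize(labels, clean_labels=[0, 1, 2], num_classes=3))
```

Under the observation described in the summary, a backdoored model would be expected to show a smaller clean-label share and a higher concentration on a single label than a clean model; any decision threshold on these quantities would have to be chosen separately and is not specified here.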

| Search Query: ArXiv Query: search_query=au:”Ting Xu”&id_list=&start=0&max_results=3
