SEER: Backdoor Detection for Vision-Language Models Through Searching Target Text and Image Trigger Jointly

Document Type

Article

Publication Date

2024

DOI

10.1609/aaai.v38i7.28611

Publication Title

Proceedings of the AAAI Conference on Artificial Intelligence

Volume

38

Issue

7

Pages

7766-7774

Abstract

This paper proposes SEER, a novel backdoor detection algorithm for vision-language models, addressing a gap in the literature on multi-modal backdoor detection. While backdoor detection in single-modal models has been well studied, such defenses remain largely unexplored for multi-modal models. Existing backdoor defense mechanisms cannot be applied directly to multi-modal settings because of the added complexity and the explosion of the search space. We propose to detect backdoors in vision-language models by jointly searching for image triggers and malicious target texts in the feature space shared by the vision and language modalities. Our extensive experiments demonstrate that SEER achieves a detection rate above 92% for backdoors in vision-language models across various settings, without access to training data or knowledge of downstream tasks.
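
The idea of a joint search over target texts and image triggers in a shared feature space can be illustrated with a small toy sketch. This is not the authors' implementation: the abstract does not state the objective, the optimizer, or the rule that turns a search result into a detection decision. The linear encoder `encoder_w`, the function `joint_search`, and all of its parameters are hypothetical names chosen for illustration, and the alternating update is one simple stand-in for a joint search.

```python
import numpy as np

def normalize(v):
    """Scale vectors along the last axis to unit length."""
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + 1e-12)

def joint_search(encoder_w, text_emb, images, steps=50, lr=0.1, eps=1.0):
    """Toy alternating search for a (target text, additive trigger) pair.

    encoder_w: (d, p) hypothetical linear image encoder (assumption)
    text_emb:  (k, d) candidate text embeddings in the shared space
    images:    (n, p) clean images, flattened
    """
    text_emb = normalize(text_emb)
    delta = np.zeros(images.shape[1])  # one trigger shared by all images

    def mean_sims(trigger):
        feats = normalize((images + trigger) @ encoder_w.T)  # (n, d)
        return (feats @ text_emb.T).mean(axis=0)             # (k,)

    for _ in range(steps):
        # Text step: pick the candidate the triggered images align with most.
        target = int(np.argmax(mean_sims(delta)))
        # Trigger step: ascend the unnormalized dot product with that text,
        # a simple surrogate for cosine similarity; clip to bound the trigger.
        grad = encoder_w.T @ text_emb[target]
        delta = np.clip(delta + lr * grad, -eps, eps)

    sims = mean_sims(delta)
    target = int(np.argmax(sims))
    return target, delta, float(sims[target])

# Tiny made-up example: identity encoder, three orthogonal "texts".
encoder_w = np.eye(3)
text_emb = np.eye(3)
images = np.array([[0.1, 0.2, 0.9], [0.0, 0.3, 0.8]])
target, delta, score = joint_search(encoder_w, text_emb, images)
print(target, delta, round(score, 3))
```

In this sketch a small, bounded trigger that pulls every image toward a single text with high similarity is the kind of signal a detector might examine; how SEER itself scores or thresholds such a result is not described in this record.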

Rights

Copyright © 2024, Association for the Advancement of Artificial Intelligence. All rights reserved.

"In the returned rights section of the AAAI copyright form, authors are specifically granted back the right to use their own papers for noncommercial uses, such as inclusion in their dissertations or the right to deposit their papers in their institutional repositories, provided there is proper attribution. The published version is not available for posting outside the AAAI Digital Library."

Included in accordance with publisher policy.

Original Publication Citation

Zhu, L., Ning, R., Li, J., Xin, C., & Wu, H. (2024). SEER: Backdoor detection for vision-language models through searching target text and image trigger jointly. Proceedings of the AAAI Conference on Artificial Intelligence, 38(7), 7766-7774. https://doi.org/10.1609/aaai.v38i7.28611

ORCID

0000-0003-4050-6252 (Ning), 0000-0003-0091-6986 (Li)
