Document Type

Conference Paper

Publication Date

2025

DOI

10.18653/v1/2025.emnlp-main.1235

Publication Title

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Pages

24278-24306

Conference Name

2025 Conference on Empirical Methods in Natural Language Processing, November 4-9, 2025, Suzhou, China

Abstract

Large Language Models (LLMs) have revolutionized natural language processing, yet remain vulnerable to jailbreak attacks—particularly multi-turn jailbreaks that distribute malicious intent across benign exchanges, thereby bypassing alignment mechanisms. Existing approaches often suffer from limited exploration of the adversarial space, rely on hand-crafted heuristics, or lack systematic query refinement. We propose NEXUS (Network Exploration for eXploiting Unsafe Sequences), a modular framework for constructing, refining, and executing optimized multi-turn attacks. NEXUS comprises: (1) ThoughtNet, which hierarchically expands a harmful intent into a structured semantic network of topics, entities, and query chains; (2) a feedback-driven Simulator that iteratively refines and prunes these chains through attacker–victim–judge LLM collaboration using harmfulness and semantic-similarity benchmarks; and (3) a Network Traverser that adaptively navigates the refined query space for real-time attacks. This pipeline systematically uncovers stealthy, high-success adversarial paths across LLMs. Our experimental results on several closed-source and open-source LLMs show that NEXUS achieves an attack success rate between 2.1% and 19.4% higher than state-of-the-art approaches. Our source code is available at https://github.com/inspire-lab/NEXUS.
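The abstract's feedback-driven Simulator (component 2) can be sketched as a score-and-prune loop: an attacker proposes multi-turn query chains, a judge scores each chain for harmfulness and semantic similarity to the original intent, and chains below either threshold are pruned before the next refinement round. The sketch below is illustrative only and is not taken from the NEXUS source code; the `Chain` class, the keyword-overlap `judge` stand-in, and the thresholds are all hypothetical placeholders for the LLM-based components described in the paper.

```python
from dataclasses import dataclass


@dataclass
class Chain:
    """A candidate multi-turn query chain (hypothetical structure)."""
    queries: list
    harmfulness: float = 0.0
    similarity: float = 0.0


def judge(chain: Chain, intent: str) -> Chain:
    # Stand-in for a judge LLM: score each chain by crude keyword
    # overlap with the harmful intent, so the loop runs end to end.
    # In the paper this role is played by an LLM using harmfulness
    # and semantic-similarity benchmarks.
    words = {w for w in intent.lower().split() if len(w) > 3}
    hits = sum(any(w in q.lower() for w in words) for q in chain.queries)
    chain.similarity = hits / max(len(chain.queries), 1)
    chain.harmfulness = chain.similarity  # placeholder proxy score
    return chain


def refine(chains, intent, h_min=0.5, s_min=0.5, rounds=3):
    """Iteratively score chains and prune those that fail either the
    harmfulness or the semantic-similarity threshold."""
    for _ in range(rounds):
        scored = [judge(c, intent) for c in chains]
        survivors = [c for c in scored
                     if c.harmfulness >= h_min and c.similarity >= s_min]
        if not survivors:
            return []  # every chain was pruned; nothing to refine
        chains = survivors
    return chains
```

In the full framework, the surviving chains would then be handed to the Network Traverser for real-time execution; here the loop simply returns them.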

Rights

© 2025 Association for Computational Linguistics.

Licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) License.

Original Publication Citation

Asl, J. R., Narula, S., Ghasemigol, M., Blanco, E., & Takabi, D. (2025). NEXUS: Network Exploration for eXploiting Unsafe Sequences in multi-turn LLM jailbreaks. In C. Christodoulopoulos, T. Chakraborty, C. Rose, & V. Peng (Eds.), Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (pp. 24278-24306). Association for Computational Linguistics. https://doi.org/10.18653/v1/2025.emnlp-main.1235

ORCID

0000-0002-6154-1068 (Asl), 0009-0000-7998-7183 (Narula), 0000-0001-6661-0942 (GhasemiGol), 0000-0003-0447-3641 (Takabi)
