The Resurgence of GCG Adversarial Attacks on Large Language Models

Kavli Affiliate: Zhuo Li

| First 5 Authors: Yuting Tan, Yuting Tan, , ,

| Summary:

Gradient-based adversarial prompting, such as the Greedy Coordinate Gradient
(GCG) algorithm, has emerged as a powerful method for jailbreaking large
language models (LLMs). In this paper, we present a systematic appraisal of GCG
and its annealing-augmented variant, T-GCG, across open-source LLMs of varying
scales. Using Qwen2.5-0.5B, LLaMA-3.2-1B, and GPT-OSS-20B, we evaluate attack
effectiveness on both safety-oriented prompts (AdvBench) and
reasoning-intensive coding prompts. Our study reveals three key findings: (1)
attack success rates (ASR) decrease with model size, reflecting the increasing
complexity and non-convexity of larger models’ loss landscapes; (2)
prefix-based heuristics substantially overestimate attack effectiveness
compared to GPT-4o semantic judgments, which provide a stricter and more
realistic evaluation; and (3) coding-related prompts are significantly more
vulnerable than adversarial safety prompts, suggesting that reasoning itself
can be exploited as an attack vector. In addition, preliminary results with
T-GCG show that simulated annealing can diversify adversarial search and
achieve competitive ASR under prefix evaluation, though its benefits under
semantic judgment remain limited. Together, these findings highlight the
scalability limits of GCG, expose overlooked vulnerabilities in reasoning
tasks, and motivate further development of annealing-inspired strategies for
more robust adversarial evaluation.

| Search Query: ArXiv Query: search_query=au:”Zhuo Li”&id_list=&start=0&max_results=3