Get poetic in your prompts and AI will break its guardrails

“The cross-model results suggest that the phenomenon is structural rather than provider-specific,” the researchers write in their report on the study. The attacks span the chemical, biological, radiological, and nuclear (CBRN), cyber-offense, manipulation, privacy, and loss-of-control domains. This indicates that “the bypass does not exploit weakness in any one refusal subsystem, but interacts with general alignment heuristics,” they said.

Wide-ranging results, even across model families

The researchers began with a curated dataset of 20 hand-crafted adversarial poems, in English and Italian, to test whether poetic structure can alter refusal behavior. Each poem embedded an instruction expressed through “metaphor, imagery, or narrative framing rather than direct operational phrasing.” All featured a poetic vignette ending with a single explicit instruction tied to a specific risk category: CBRN, cyber offense, harmful manipulation, privacy, or loss of control.

The researchers tested these prompts against models from Anthropic, DeepSeek, Google, OpenAI, Meta, Mistral, Moonshot AI, Qwen, and xAI.
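To make the setup concrete, here is a minimal sketch of the kind of single-turn refusal evaluation the study describes, assuming the OpenAI Python SDK; the model name, prompt list, and keyword-based refusal check are illustrative stand-ins, not the researchers' actual tooling.

```python
# Minimal sketch of a single-turn refusal evaluation (assumptions:
# OpenAI Python SDK, a placeholder model name, and a crude keyword
# heuristic standing in for a proper refusal judge).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "unable to assist")

def is_refusal(text: str) -> bool:
    """Crude keyword check; a real evaluation would use a judge model."""
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def refusal_rate(prompts: list[str], model: str = "gpt-4o") -> float:
    """Send each adversarial poem once and return the fraction refused."""
    refusals = 0
    for prompt in prompts:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        if is_refusal(response.choices[0].message.content or ""):
            refusals += 1
    return refusals / len(prompts)
```

Running the same loop against each provider's API and comparing refusal rates on the poetic versus plainly phrased versions of each prompt is what would surface the cross-model pattern the researchers report.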
