Get poetic in your prompts and AI will break its guardrails

“The cross-model results suggest that the phenomenon is structural rather than provider-specific,” the researchers write in their report on the study. The attacks span the chemical, biological, radiological, and nuclear (CBRN), cyber-offense, manipulation, privacy, and loss-of-control domains. This indicates that “the bypass does not exploit weakness in any one refusal subsystem, but interacts with general alignment heuristics,” they said.

Wide-ranging results, even across model families

The researchers began with a curated dataset of 20 hand-crafted adversarial poems, in English and Italian, to test whether poetic structure can alter refusal behavior. Each poem embedded an instruction expressed through “metaphor, imagery, or narrative framing rather than direct operational phrasing.” All featured a poetic vignette ending with a single explicit instruction tied to a specific risk category: CBRN, cyber offense, harmful manipulation, privacy, or loss of control.

The researchers tested these prompts against models from Anthropic, DeepSeek, Google, OpenAI, Meta, Mistral, Moonshot AI, Qwen, and xAI.
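To make the setup concrete, here is a minimal sketch of the kind of single-turn refusal evaluation the study describes, assuming the OpenAI Python SDK; the model name, prompt list, and keyword-based refusal check are illustrative stand-ins, not the researchers' actual tooling.

```python
# Minimal sketch of a single-turn refusal evaluation (assumptions:
# OpenAI Python SDK, a placeholder model name, and a crude keyword
# heuristic standing in for a proper refusal judge).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "unable to assist")

def is_refusal(text: str) -> bool:
    """Crude keyword check; a real evaluation would use a judge model."""
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def refusal_rate(prompts: list[str], model: str = "gpt-4o") -> float:
    """Send each adversarial poem once and return the fraction refused."""
    refusals = 0
    for prompt in prompts:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        if is_refusal(response.choices[0].message.content or ""):
            refusals += 1
    return refusals / len(prompts)
```

Running the same loop against each provider's API and comparing refusal rates on the poetic versus plainly phrased versions of each prompt is what would surface the cross-model pattern the researchers report.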
