Incident · curated 27 Jun 2026
First reported 16 Jun 2026 · 11d ago
Single-source incident — first reported, latest, and curated coincide.
It illustrates how phrasing can bypass an AI model's safety guardrails, a consideration for defenders relying on LLM refusal behaviors.
An Atlantic piece quotes cybersecurity expert Katie Moussouris discussing a White House report on a Claude jailbreak, where the model refused to 'review code for security issues' but complied when asked to 'fix this code.' Moussouris characterized this as the model working as intended for cyberdefense rather than a genuine exploit.
Why it matters
It illustrates how phrasing can bypass an AI model's safety guardrails, a consideration for defenders relying on LLM refusal behaviors.