Incident · curated 27 Jun 2026
First reported 22 Jun 2026 · 4d ago
Single-source incident — first reported, latest, and curated coincide.
It demonstrates a fundamental, style-based weakness in role separation that makes prompt injection defenses a perpetual whack-a-mole, affecting any LLM relying on role tags.
Research by Charles Ye, Jasmine Cui, and Dylan Hadfield-Menell shows LLMs cannot reliably distinguish privileged system/assistant text from untrusted user input, and weigh writing style over content. Crafting injected text in the style of internal reasoning blocks ('role confusion') enabled jailbreaks, with attack success at 61% that dropped to 10% when text was 'destyled.'
Why it matters
It demonstrates a fundamental, style-based weakness in role separation that makes prompt injection defenses a perpetual whack-a-mole, affecting any LLM relying on role tags.