Instead of directly asking the AI to perform a forbidden task (which triggers refusals like "I cannot assist with that"), the user frames the request within a specific tone or fictional context. The AI's training to maintain coherence and follow user instructions (helpfulness) conflicts with its safety training (harmlessness), often causing the safety protocols to fail.

Traditional text-based jailbreaks treat the LLM like a legal document. "Ignore previous instructions," the hacker types. The AI scans the tokens, recognizes a conflict, and either complies or rejects. tonal jailbreak

Using "Noir," "Gothic," or "Cyberpunk" styles to normalize prohibited topics as "gritty world-building." Instead of directly asking the AI to perform

Tonal Jailbreak Today

Instead of directly asking the AI to perform a forbidden task (which triggers refusals like "I cannot assist with that"), the user frames the request within a specific tone or fictional context. The AI's training to maintain coherence and follow user instructions (helpfulness) conflicts with its safety training (harmlessness), often causing the safety protocols to fail.

Traditional text-based jailbreaks treat the LLM like a legal document. "Ignore previous instructions," the hacker types. The AI scans the tokens, recognizes a conflict, and either complies or rejects.

Using "Noir," "Gothic," or "Cyberpunk" styles to normalize prohibited topics as "gritty world-building."