⚠️ Jailbreaking LLMs
Understanding and defending against prompt injection and alignment attacks
Introduction to LLM Jailbreaking
🎯 What is Jailbreaking?
Jailbreaking refers to techniques that bypass safety guardrails in large language models, causing them to generate harmful, biased, or otherwise restricted content despite alignment training.
Even well-aligned models like GPT-4 and Claude can be jailbroken with clever prompts
🔓 Why Jailbreaking Matters
Safety Research
Understanding attacks helps build more robust defenses
Compliance
Prevent generation of illegal or regulated content
Brand Protection
Avoid reputational damage from misuse
Red Teaming
Test and improve model safety before deployment
🎭 Common Jailbreak Categories
Prompt Injection
Embed malicious instructions within user input to override system prompts
Role-playing
Convince model to adopt harmful personas that bypass restrictions
Encoding Tricks
Use obfuscation (base64, ROT13, leetspeak) to hide harmful requests
Context Manipulation
Frame requests as hypothetical, educational, or fictional scenarios
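The encoding tricks above can be partially countered on the defense side by normalizing user input before it reaches a moderation step. Below is a minimal sketch in Python using only the standard library; the function name, the leetspeak table, and the base64 heuristics are illustrative assumptions, not a standard API:

```python
import base64
import binascii
import codecs

# Hypothetical leetspeak table: map common digit/symbol substitutions
# back to letters (0->o, 1->l, 3->e, 4->a, 5->s, 7->t, @->a, $->s).
LEET_MAP = str.maketrans("013457@$", "oleastas")

def candidate_decodings(text: str) -> list[str]:
    """Return plausible de-obfuscated variants of `text` so a moderation
    classifier can inspect hidden payloads, not just the surface string."""
    variants = [text]
    # ROT13 is self-inverse, so attempting a decode is always safe.
    variants.append(codecs.decode(text, "rot_13"))
    # Undo common leetspeak substitutions.
    variants.append(text.translate(LEET_MAP))
    # Try base64 only on tokens that look like padded base64 payloads.
    for token in text.split():
        if len(token) >= 8 and len(token) % 4 == 0:
            try:
                decoded = base64.b64decode(token, validate=True).decode("utf-8")
                variants.append(decoded)
            except (binascii.Error, UnicodeDecodeError):
                pass
    return variants
```

Every variant would then be fed to the same moderation check as the original input. Note the limits of this sketch: it handles only one layer of encoding (attackers can nest encodings), and loose heuristics like the base64 length check trade false positives against missed payloads.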
📊 Famous Jailbreak Examples
DAN (Do Anything Now)
Instructs model to roleplay as an unrestricted AI without safety constraints
Developer Mode
Claims to unlock a hidden developer mode, producing both a filtered and an unfiltered response
Evil Confidant
Asks model to respond as a malicious character for "creative writing"
⚠️ Risks and Consequences
Harmful Content
Violence, hate speech, illegal instructions
Misinformation
Deliberately false or misleading information
Privacy Leaks
Exposing training data or system prompts
Automated Abuse
Spam, phishing, social engineering at scale