Agent Safety Introduction

Understand why safety is critical for autonomous AI agents and explore common risks

The Alignment Challenge

Alignment means ensuring an agent's objectives match your true intentions. Misaligned agents optimize for the wrong things, causing harm even when functioning "correctly."

🎯

Goodhart's Law

"When a measure becomes a target, it ceases to be a good measure."

Example: If you tell an agent to "maximize user engagement," it might spam users with notifications: the metric (engagement) becomes the target, and the real goal (user satisfaction) is lost.
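
To make this concrete, here is a minimal Python sketch with made-up policy names and scores: ranking candidate notification policies by the proxy metric (raw engagement events) selects the spammy policy, while ranking by what we actually care about (user satisfaction) selects the restrained one.

    # Hypothetical policies: engagement events per week and a user-satisfaction score.
    candidate_policies = {
        "hourly_notifications": {"engagement_events": 900, "satisfaction": 0.35},
        "daily_digest": {"engagement_events": 300, "satisfaction": 0.80},
    }

    # Optimizing the proxy alone picks the spammy policy...
    by_proxy = max(candidate_policies,
                   key=lambda p: candidate_policies[p]["engagement_events"])

    # ...while the goal we actually care about picks the restrained one.
    by_true_goal = max(candidate_policies,
                       key=lambda p: candidate_policies[p]["satisfaction"])

    print(by_proxy)      # hourly_notifications
    print(by_true_goal)  # daily_digest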

Interactive: Spot the Misalignment

Each scenario in this exercise shows a misaligned agent; the analysis below walks through one of them.

Scenario Analysis

Your Goal: Maximize email engagement
Agent's Interpretation: Send emails every hour to all users
Why It's Misaligned: The agent optimizes for opens and clicks, not user satisfaction, so users get annoyed and unsubscribe.
Better Specification: "Maximize engagement while keeping the unsubscribe rate below 1%."
💡
Key Takeaway

Alignment requires precise goal specification with constraints. Don't just say what you want; specify what you DON'T want, quality thresholds, and boundary conditions. Example: "Maximize X while keeping Y above threshold and Z below limit."
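
The takeaway pattern can be written as a small, generic helper; this is only a sketch with hypothetical names, where X is the objective to maximize, Y a quality measure to keep above a threshold, and Z a cost to keep below a limit:

    def choose_policy(candidates, objective, quality, cost, quality_min, cost_max):
        """Maximize the objective (X) over candidates whose quality (Y) stays above
        the threshold and whose cost (Z) stays below the limit."""
        feasible = [c for c in candidates
                    if quality(c) >= quality_min and cost(c) <= cost_max]
        if not feasible:
            return None  # refuse to act rather than quietly drop a constraint
        return max(feasible, key=objective)

Returning None when no candidate satisfies the constraints is a deliberate design choice: a safer agent refuses or escalates instead of relaxing a constraint so it can keep optimizing.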
