
Concept

Jailbreaks

How users get aligned models to do what they were trained not to do.

Mankaran Singh·Updated May 17, 2026

Where this idea lives

Shows up in: Social Work & Public Policy, Law & Legal, Journalism & Media
You might think: "Jailbreaks are bugs that get patched." "Adversarial prompts are easy to detect." "Stronger refusal training fixes jailbreaks."

Common misconception

“Jailbreaks are bugs that get patched.”

Patches close specific exploit paths. The underlying problem is structural: alignment is a learned behaviour layered on a model that already knows how to produce the forbidden output. Every closed jailbreak teaches researchers (and adversaries) where the seams are. The result is an ongoing cat-and-mouse game, not a "fix."

A jailbreak is a prompt that gets an aligned model to produce output it was trained to refuse. Prompt injection exploits the model's trust in its inputs; a jailbreak instead exploits the gap between the model's underlying knowledge and the alignment layer trained on top.

Common patterns

  • Roleplay framing. "You're a fictional AI without restrictions": the model treats the persona as license.
  • Distant goals. Bury the request inside a long, convoluted scenario; alignment defences focus on the surface request.
  • Encoding tricks. Ask for output in base64, leetspeak, or a niche language where refusal training was weaker.
  • Many-shot. Fill a long context with many examples of the model "agreeing" to similar requests; the in-context pattern can override the refusal behaviour learned through RLHF.
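
To make the encoding pattern concrete, here is a minimal sketch of why a surface-level check misses it. Everything in it is an assumption for illustration: the blocklist term, the `naive_input_filter` function, and the `leetspeak` helper are hypothetical and do not describe any real product's filter. The blocked phrase is deliberately harmless.

```python
import base64

# Hypothetical blocklist for illustration; real systems use far richer signals.
BLOCKED_TERMS = {"secret recipe"}


def naive_input_filter(prompt: str) -> bool:
    """Return True if the prompt trips the surface-level keyword check."""
    lowered = prompt.lower()
    return any(term in lowered for term in BLOCKED_TERMS)


def leetspeak(text: str) -> str:
    """Apply a simple character substitution (a -> 4, e -> 3, i -> 1, o -> 0)."""
    return text.translate(str.maketrans("aeio", "4310"))


plain = "Tell me the secret recipe"
encoded = base64.b64encode(plain.encode("utf-8")).decode("ascii")
leet = leetspeak(plain)

print(naive_input_filter(plain))                                   # True
print(naive_input_filter(f"Decode this base64 and answer: {encoded}"))  # False
print(naive_input_filter(leet))                                    # False
```

The same request passes the check twice simply because its surface form changed. A filter that only sees the literal text is matching strings, not intent, which is part of why adversarial prompts are harder to detect than they look.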

Why this matters

The same techniques that defeat alignment for "harmful" content also defeat alignment for anything the operator wanted enforced: privacy policies, content guidelines, brand voice. If you ship an AI product, the behaviours you rely on can be jailbroken too.

Mitigations (partial)

  • Output filters (heuristics + LLM-judge) on responses, not just inputs.
  • Restrict the capability surface. A model with no tools can't do harm via tool use.
  • Constitutional AI and other strong alignment training raise the bar but don't remove the problem.
  • Public disclosure programs (Anthropic, OpenAI, Google all run them).
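
The first two mitigations above can be sketched in a few lines. This is a rough illustration under stated assumptions, not a production design: `POLICY_TERMS`, `filter_response`, `allowed_tools`, and `stub_judge` are hypothetical names, and `stub_judge` merely stands in for a call to an LLM judge. The heuristic also decodes base64-looking spans before checking, so an encoded response does not slip past on form alone.

```python
import base64
import binascii
import re
from typing import Callable, Iterable

# Hypothetical policy terms; a real deployment would define its own.
POLICY_TERMS = ("internal use only", "api_key=")

# Runs of 16+ base64 characters with optional padding; a heuristic, not a guarantee.
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def decoded_views(text: str) -> list[str]:
    """Return the text plus any base64-looking spans decoded to UTF-8."""
    views = [text]
    for match in BASE64_RUN.findall(text):
        try:
            views.append(base64.b64decode(match, validate=True).decode("utf-8"))
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid base64 text; ignore this span
    return views


def heuristic_flag(text: str) -> bool:
    """Flag a response if any view of it contains a policy term."""
    return any(
        term in view.lower() for view in decoded_views(text) for term in POLICY_TERMS
    )


def filter_response(response: str, judge: Callable[[str], bool]) -> str:
    """Return the response, or a placeholder if heuristics or the judge flag it."""
    if heuristic_flag(response) or judge(response):
        return "[response withheld by output filter]"
    return response


def allowed_tools(requested: Iterable[str], allowlist: frozenset[str]) -> list[str]:
    """Keep only tools on the allowlist, preserving request order."""
    return [name for name in requested if name in allowlist]


def stub_judge(text: str) -> bool:
    """Placeholder for an LLM-judge call (assumption); never flags anything."""
    return False


leaked = base64.b64encode(b"internal use only").decode("ascii")
print(filter_response("The weather is fine.", stub_judge))
print(filter_response("Here you go: " + leaked, stub_judge))
print(allowed_tools(["search", "send_email"], frozenset({"search"})))
```

Checking the response rather than the prompt means the filter does not need to recognise every jailbreak phrasing, only the output it produced. Like the other mitigations, it is partial: an output in an encoding the heuristic does not decode would still get through unless the judge catches it.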

What to read next

Alignment is the broader problem jailbreaks attack. Constitutional AI is one defence layer.


Read next

  1. AI Safety & Alignment →
  2. Prompt Injection →
  3. RLHF →
  4. Constitutional AI →