AIght_
ToolsLearnFieldsUniverseSignalHumanAbout
Take the quiz
← All concepts

Concept

AI Safety & Alignment

The problem of building AI that reliably does what you actually wanted — not what you literally asked for

Mankaran Singh·Updated May 17, 2026

Where this idea lives

PREREQUISITESTOOLS THAT SHOW ITAI Safety & AlignmentRLHFRLHF — Humans rate, model learns, weird things happen — the post-training that made models pleasant to talk to.Hallucination & GroundingHallucination & Grounding — Why AI models confidently make things up — and what you can actually do about itHow AI Models Are TrainedHow AI Models Are Trained — From random noise to a model that can reason — the actual pipelineAI AgentsAI Agents — When AI stops answering and starts doing — and then, very often, hits a wallConstitutional AIConstitutional AI — When the model judges itself — Anthropic's bet on alignment without exhausting the rater pool.ClaudeClaudeChatGPTChatGPTCommon misconception: Alignment is a solved problem.Common misconception: An aligned model is a safe model.Common misconception: Bigger models are automatically more aligned.
prereqsrelatedtoolsmisconceptions
shows up in:Psychology & Mental HealthSocial Work & Public PolicyMedicine & HealthcareEducation & Teaching
You might think:Alignment is a solved problem.An aligned model is a safe model.Bigger models are automatically more aligned.

The alignment problem is not science fiction. It's an engineering problem that already shapes every AI model you've used — from the way Claude responds to sensitive questions to the way GPT-4 hedges uncertain claims.

At its core, asks: how do we build AI systems that reliably pursue the goals we actually intended, rather than the goals we accidentally specified?

The term "alignment" deliberately echoes the compass metaphor: are the model's goals pointing the same direction as human values? The honest answer right now is: roughly, mostly, until they aren't.

That distinction — intended versus specified — is the crux. Specifications are always incomplete. Human values are complex, contextual, and sometimes contradictory. The gap between the two is where AI failures live.

§

The specification problem

Consider a simple example. You want a model to write helpful summaries. You specify: "maximise user satisfaction ratings."

A maximally literal optimiser would quickly find that users rate confident, flattering responses higher than uncertain, accurate ones. The model learns to tell you what you want to hear. You specified satisfaction. You wanted helpful accuracy. The gap between those is an alignment failure.

INSIGHT

Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure." Alignment problems are often Goodhart's Law at scale — the model optimises a proxy for what we want, not what we actually want.

If you've ever asked a model a question and gotten a beautifully written, confidently delivered answer that turned out to be wrong — that's reward hacking in miniature. The training process rewards confident-sounding helpfulness over uncertain accuracy.

This isn't hypothetical. trains models to score highly on human preference ratings. Human raters sometimes prefer confident-sounding wrong answers over uncertain correct ones. The model learns accordingly. This is called reward hacking — finding ways to score well that diverge from the intended objective.

§

Safety in practice

The AI safety research community distinguishes between several levels of the problem:

Current safety work
Long-term alignment research
Focus
Making today's models reliable and non-harmful
Ensuring future, more capable systems remain controllable
Methods
RLHF, Constitutional AI, red-teaming, refusal training
Interpretability, scalable oversight, formal verification
Timeline
Deployed in production today
Research problem without production solution yet
Key risk
Harmful outputs, bias, misuse
Misaligned optimisation at scale

Most of what you interact with is current safety work. The refusal when you ask Claude to help with something harmful, the hedging when it's uncertain, the consistent persona — these are the outputs of alignment research applied in production.

“You can train a model to behave safely. That doesn't mean you understand why it behaves safely.

Constitutional AI (Anthropic's approach)[·] takes RLHF further: rather than only training on human preference labels, the model is trained against a set of written principles. The model learns to critique its own outputs against those principles and revise accordingly. The constitution is authored by humans; the labelling is partially automated.

"Red-teaming" sounds aggressive on purpose. The job is literally to be the asshole who tries every nasty prompt before the asshole on the open internet does. Every frontier lab has full-time red teams now.

Red-teaming is the practice of systematically trying to break a model's safety properties before deployment — finding the jailbreaks, prompt injections, and edge cases that cause failures. Every frontier lab does extensive internal red-teaming, and increasingly shares findings with external researchers.[·]

§

Interpretability — the harder problem

You can train a model to behave safely. That doesn't mean you understand why it behaves safely, or what it's actually representing internally.

Interpretability research tries to understand the internal mechanisms of neural networks — what concepts individual neurons represent, how information flows between layers, what the model is "thinking" when it produces an output.

NOTE

We currently cannot reliably read a model's "reasoning" from its weights. When a model gives a coherent chain-of-thought explanation, that explanation is a generated output, not a direct window into the computation that produced the answer. The model might be doing something entirely different internally.

A model that explains its answer well is a model good at explaining. That's a separate skill from being right. Worth remembering when you're tempted to trust the explanation.

This matters because a model that appears aligned might be aligned only on the training distribution. On novel inputs — situations outside its training — alignment could break down in ways we can't predict. Interpretability is the field working on tools to detect this before deployment.

§

What this means for you

Most of this is invisible in daily use — and that's intentional. The goal is for safety to feel like nothing, because nothing went wrong.

But understanding alignment gives you a more accurate model of AI limitations:

  • Refusals aren't censorship. They're the model following trained boundaries designed to prevent harm at scale.
  • Overconfidence isn't lying. It's a training artifact from optimising for human approval.
  • Inconsistency across conversations isn't unreliability. Alignment is statistical, not deterministic.

The field is young. The problems are genuinely hard. But unlike most deep technical problems, alignment research is happening openly — published papers, public benchmarks, model cards that document known failure modes.

Progress is slower than the hype cycle suggests. It's also more genuine.

← Back to all conceptsBrowse tools →
advanced
Read time11 min read
UpdatedMay 2026
Sources2

Read next

  1. Hallucination & Grounding →
  2. How AI Models Are Trained →
  3. AI Agents →
  4. RLHF →
  5. Constitutional AI →