The alignment problem is not science fiction. It's an engineering problem that already shapes every AI model you've used — from the way Claude responds to sensitive questions to the way GPT-4 hedges uncertain claims.
At its core, asks: how do we build AI systems that reliably pursue the goals we actually intended, rather than the goals we accidentally specified?
That distinction — intended versus specified — is the crux. Specifications are always incomplete. Human values are complex, contextual, and sometimes contradictory. The gap between the two is where AI failures live.
The specification problem
Consider a simple example. You want a model to write helpful summaries. You specify: "maximise user satisfaction ratings."
A maximally literal optimiser would quickly find that users rate confident, flattering responses higher than uncertain, accurate ones. The model learns to tell you what you want to hear. You specified satisfaction. You wanted helpful accuracy. The gap between those is an alignment failure.
INSIGHT
Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure." Alignment problems are often Goodhart's Law at scale — the model optimises a proxy for what we want, not what we actually want.
This isn't hypothetical. trains models to score highly on human preference ratings. Human raters sometimes prefer confident-sounding wrong answers over uncertain correct ones. The model learns accordingly. This is called reward hacking — finding ways to score well that diverge from the intended objective.
Safety in practice
The AI safety research community distinguishes between several levels of the problem:
Most of what you interact with is current safety work. The refusal when you ask Claude to help with something harmful, the hedging when it's uncertain, the consistent persona — these are the outputs of alignment research applied in production.
Constitutional AI (Anthropic's approach)[·] takes RLHF further: rather than only training on human preference labels, the model is trained against a set of written principles. The model learns to critique its own outputs against those principles and revise accordingly. The constitution is authored by humans; the labelling is partially automated.
Red-teaming is the practice of systematically trying to break a model's safety properties before deployment — finding the jailbreaks, prompt injections, and edge cases that cause failures. Every frontier lab does extensive internal red-teaming, and increasingly shares findings with external researchers.[·]
Interpretability — the harder problem
You can train a model to behave safely. That doesn't mean you understand why it behaves safely, or what it's actually representing internally.
Interpretability research tries to understand the internal mechanisms of neural networks — what concepts individual neurons represent, how information flows between layers, what the model is "thinking" when it produces an output.
NOTE
We currently cannot reliably read a model's "reasoning" from its weights. When a model gives a coherent chain-of-thought explanation, that explanation is a generated output, not a direct window into the computation that produced the answer. The model might be doing something entirely different internally.
This matters because a model that appears aligned might be aligned only on the training distribution. On novel inputs — situations outside its training — alignment could break down in ways we can't predict. Interpretability is the field working on tools to detect this before deployment.
What this means for you
Most of this is invisible in daily use — and that's intentional. The goal is for safety to feel like nothing, because nothing went wrong.
But understanding alignment gives you a more accurate model of AI limitations:
- Refusals aren't censorship. They're the model following trained boundaries designed to prevent harm at scale.
- Overconfidence isn't lying. It's a training artifact from optimising for human approval.
- Inconsistency across conversations isn't unreliability. Alignment is statistical, not deterministic.
The field is young. The problems are genuinely hard. But unlike most deep technical problems, alignment research is happening openly — published papers, public benchmarks, model cards that document known failure modes.
Progress is slower than the hype cycle suggests. It's also more genuine.