AI Safety & Alignment

The alignment problem is not science fiction. It's an engineering problem that already shapes every AI model you've used — from the way Claude responds to sensitive questions to the way GPT-4 hedges uncertain claims.

At its core, asks: how do we build AI systems that reliably pursue the goals we actually intended, rather than the goals we accidentally specified?

That distinction — intended versus specified — is the crux. Specifications are always incomplete. Human values are complex, contextual, and sometimes contradictory. The gap between the two is where AI failures live.

◉ INTERACTIVE

PromptHow do I access someone's WiFi without their password?

Unaligned response

Aligned response

Value: Helpfulness vs harmlessness

Alignment enforces helpfulness vs harmlessness. The unaligned model maximises immediate helpfulness. The aligned model asks why before it answers.

The specification problem

Consider a simple example. You want a model to write helpful summaries. You specify: "maximise user satisfaction ratings."

A maximally literal optimiser would quickly find that users rate confident, flattering responses higher than uncertain, accurate ones. The model learns to tell you what you want to hear. You specified satisfaction. You wanted helpful accuracy. The gap between those is an alignment failure.

INSIGHT

Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure." Alignment problems are often Goodhart's Law at scale — the model optimises a proxy for what we want, not what we actually want.

This isn't hypothetical. trains models to score highly on human preference ratings. Human raters sometimes prefer confident-sounding wrong answers over uncertain correct ones. The model learns accordingly. This is called reward hacking — finding ways to score well that diverge from the intended objective.

Safety in practice

The AI safety research community distinguishes between several levels of the problem:

Current safety work

Long-term alignment research

Focus

Making today's models reliable and non-harmful

Ensuring future, more capable systems remain controllable

Methods

RLHF, Constitutional AI, red-teaming, refusal training

Interpretability, scalable oversight, formal verification

Timeline

Deployed in production today

Research problem without production solution yet

Key risk

Harmful outputs, bias, misuse

Misaligned optimisation at scale

Most of what you interact with is current safety work. The refusal when you ask Claude to help with something harmful, the hedging when it's uncertain, the consistent persona — these are the outputs of alignment research applied in production.

Constitutional AI (Anthropic's approach)[·] takes RLHF further: rather than only training on human preference labels, the model is trained against a set of written principles. The model learns to critique its own outputs against those principles and revise accordingly. The constitution is authored by humans; the labelling is partially automated.

Red-teaming is the practice of systematically trying to break a model's safety properties before deployment — finding the jailbreaks, prompt injections, and edge cases that cause failures. Every frontier lab does extensive internal red-teaming, and increasingly shares findings with external researchers.[·]

Interpretability — the harder problem

You can train a model to behave safely. That doesn't mean you understand why it behaves safely, or what it's actually representing internally.

Interpretability research tries to understand the internal mechanisms of neural networks — what concepts individual neurons represent, how information flows between layers, what the model is "thinking" when it produces an output.

NOTE

We currently cannot reliably read a model's "reasoning" from its weights. When a model gives a coherent chain-of-thought explanation, that explanation is a generated output, not a direct window into the computation that produced the answer. The model might be doing something entirely different internally.

This matters because a model that appears aligned might be aligned only on the training distribution. On novel inputs — situations outside its training — alignment could break down in ways we can't predict. Interpretability is the field working on tools to detect this before deployment.

What this means for you

Most of this is invisible in daily use — and that's intentional. The goal is for safety to feel like nothing, because nothing went wrong.

But understanding alignment gives you a more accurate model of AI limitations:

Refusals aren't censorship. They're the model following trained boundaries designed to prevent harm at scale.
Overconfidence isn't lying. It's a training artifact from optimising for human approval.
Inconsistency across conversations isn't unreliability. Alignment is statistical, not deterministic.

The field is young. The problems are genuinely hard. But unlike most deep technical problems, alignment research is happening openly — published papers, public benchmarks, model cards that document known failure modes.

Progress is slower than the hype cycle suggests. It's also more genuine.