DPO — AIght

In 2023 a Stanford paper (Rafailov et al.) showed something surprising: you don't need the full RLHF machinery — separate reward model, PPO, the works — to align a language model on preference data. You can derive a loss function that optimizes the language model directly on (preferred, rejected) pairs.

They called it Direct Preference Optimization. The math is short. The training code is short. The results, for typical alignment work, are competitive with or better than RLHF, at roughly a tenth of the cost (the exact factor depends on model size and setup).

This is why many labs since 2023 use either DPO or one of its many descendants (IPO, KTO, ORPO, SimPO, etc.).

Figure: RLHF vs. DPO pipelines. Both start from human preference data and end with a final aligned model.

RLHF: 4 steps (separate reward model, PPO required)

Collect preferences: (preferred, rejected) pairs
Train reward model: separate classifier on pairs
PPO loop: RL fine-tuning against reward model
Final aligned policy: the language model you deploy

DPO: 3 steps (no reward model, ~10× cheaper)

Collect preferences: (preferred, rejected) pairs
Direct optimization: derived loss on preference pairs
Final aligned policy: the language model you deploy

Same outcome, half the moving parts. DPO is what you'd build if you started from scratch.

What you need

DPO trains on preference pairs. Each pair is:

A prompt
A "chosen" answer (the better one)
A "rejected" answer (the worse one)

That's it. No separate reward model. No RL loop. No reward hacking (or at least, much less of it). You feed pairs through a loss that nudges the model toward the chosen answer and away from the rejected.
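
A preference pair as described above can be represented as a small record. The sketch below is only an illustration: the class name `PreferencePair`, its field names, and the sample text are assumptions, not a schema from the DPO paper or any particular library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One DPO training example (hypothetical schema, for illustration only)."""

    prompt: str
    chosen: str    # the answer annotators preferred
    rejected: str  # the answer annotators liked less

    def __post_init__(self) -> None:
        # Two identical answers express no preference, so reject such pairs early.
        if self.chosen == self.rejected:
            raise ValueError("chosen and rejected answers must differ")


# Made-up example data.
pairs = [
    PreferencePair(
        prompt="Explain what a hash map is in one sentence.",
        chosen="A hash map stores key-value pairs and uses a hash of the key to find values quickly.",
        rejected="It is a map.",
    ),
]
```

Validating at construction time is a design choice here: it keeps degenerate pairs out of the dataset before any training run starts.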

A reference model — usually the SFT-trained model from RLHF stage 1 — is kept frozen as an anchor, so the trained model doesn't drift too far from sensible language.
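
To make the role of the frozen reference concrete, here is a minimal sketch of the DPO loss for a single pair, following the form given in Rafailov et al.: the negative log-sigmoid of a scaled margin between how much the trained model has shifted toward the chosen answer and how much it has shifted toward the rejected one, both measured relative to the reference. The function name, argument names, and the default `beta = 0.1` are assumptions for illustration. The inputs are assumed to be summed token log-probabilities of each answer given the prompt, computed elsewhere.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative value; controls how strongly the reference anchors the policy
) -> float:
    """DPO loss for one preference pair, given sequence log-probabilities."""
    # How far the trained model has moved from the frozen reference on each answer.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(margin)) equals softplus(-margin); written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# At the start of training the policy equals the reference, so the margin is 0
# and the loss is log(2).
start = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# After the policy raises the chosen answer and lowers the rejected one, the loss drops.
later = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

Because only the differences from the reference enter the margin, the model gets no credit for raising the probability of both answers equally. That is the sense in which the reference acts as an anchor.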

Why labs switched

Cost. PPO requires keeping multiple copies of large models in memory and running RL rollouts that involve the language model generating samples mid-training. DPO is a standard supervised loss on pre-collected pairs. Memory and compute drop dramatically.
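
A back-of-envelope calculation shows the memory side of this. The sketch assumes a hypothetical 7B-parameter model stored in 16-bit precision, and assumes the reward and value models in a typical PPO setup are the same size as the policy. It counts weights only and ignores optimizer state, activations, and the cost of generating rollouts, so it understates the real gap.

```python
BYTES_PER_PARAM_BF16 = 2  # 16-bit weights


def weight_memory_gb(n_params: float, n_copies: int) -> float:
    """Memory for model weights only, in GB (1e9 bytes)."""
    return n_params * BYTES_PER_PARAM_BF16 * n_copies / 1e9


n_params = 7e9  # hypothetical 7B-parameter model

# Typical PPO-based RLHF: policy, frozen reference, reward model, value model.
ppo_gb = weight_memory_gb(n_params, n_copies=4)

# DPO: policy plus frozen reference.
dpo_gb = weight_memory_gb(n_params, n_copies=2)

print(f"PPO weights: {ppo_gb:.0f} GB, DPO weights: {dpo_gb:.0f} GB")
# prints "PPO weights: 56 GB, DPO weights: 28 GB"
```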

Stability. PPO training is notoriously finicky — KL coefficient, clip range, value loss balance. DPO has fewer knobs and they're more forgiving.

Quality. For most alignment tasks (instruction-following, safety behaviors, style), DPO matches or exceeds RLHF. The original paper showed this on sentiment control, summarization, and single-turn dialogue, the last using Anthropic's HH-RLHF dataset; subsequent reproductions confirmed it.

Where DPO doesn't help

For reasoning — the kind of multi-step problem solving that needs the model to learn from many trial-and-error attempts — RL-style training is making a comeback (RLAIF, RLVR, "thinking" models). DPO optimizes against fixed preferences; some problems need exploration the data didn't include.

Why this matters for your work

You probably won't run DPO yourself unless you're fine-tuning models. Where this surfaces practically: if you're picking between fine-tuning services in 2026, DPO-based ones are cheaper and easier to iterate on. Specifying preference pairs is also conceptually clearer than scoring single answers — easier for your team to produce good training data.

For evaluation: when a model card says "trained with DPO" or "trained with RLHF," the practical differences are small enough that you should evaluate on your own task, not on the training acronym.

What to read next

RLHF is the older, fuller method. Constitutional AI is a related approach where the model judges itself instead of needing human preference data. Fine-tuning is the broader topic; DPO, RLHF, and similar methods are specific techniques within it.
