A raw, pre-trained language model is almost unusable as a chat assistant. It completes text rather than answering questions. It happily generates instructions for things it shouldn't. It rambles, contradicts itself, and treats every input as the opening of a continuation rather than a request to be helpful.
Reinforcement Learning from Human Feedback (RLHF) is the post-training process that turns that raw model into an assistant like ChatGPT or Claude. It runs in three stages.
The three stages
1. Supervised fine-tuning (SFT). Humans write ideal answers to prompts. The model is fine-tuned on these (prompt → ideal answer) pairs. After this stage, the model knows roughly what an answer looks like.
2. Reward model training. Given a prompt, the fine-tuned model produces several candidate answers. Humans rank them: which is best, which is worst, which are roughly tied. From those rankings, you train a separate "reward model" that learns to score any (prompt, answer) pair the way humans would.
3. Reinforcement learning. Now you use the reward model as the training signal. The language model generates answers; the reward model scores them; the language model updates its weights to score higher. The usual algorithm is PPO (Proximal Policy Optimization) or one of its variants, typically with a KL penalty that keeps the model from drifting too far from its SFT starting point.
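The losses behind stages 2 and 3 can be sketched in a few lines. This is a minimal sketch in plain Python, not any lab's actual implementation. It assumes the reward model is trained with a pairwise (Bradley-Terry style) loss on one preferred and one rejected answer, and that the RL step uses PPO's clipped surrogate objective on a single sample. The function names and the clip_eps default of 0.2 are illustrative assumptions.

```python
import math


def pairwise_reward_loss(score_preferred: float, score_rejected: float) -> float:
    """Stage 2 (assumed form): -log(sigmoid(score_preferred - score_rejected)).

    The loss shrinks as the reward model scores the human-preferred
    answer further above the rejected one.
    """
    margin = score_preferred - score_rejected
    # softplus(-margin) equals -log(sigmoid(margin)); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def ppo_clipped_objective(logp_new: float, logp_old: float,
                          advantage: float, clip_eps: float = 0.2) -> float:
    """Stage 3 (assumed form): PPO clipped surrogate for one sampled answer.

    logp_new / logp_old are the log-probabilities of the sampled answer
    under the updated and the sampling policy. The objective is maximized.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum removes the incentive to move the policy
    # far beyond the clip range in a single update.
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    # Reward model agrees with the human ranking: small loss.
    print(pairwise_reward_loss(2.0, 0.0))
    # Reward model disagrees: large loss.
    print(pairwise_reward_loss(0.0, 2.0))
    # Policy has moved a lot on a positively rewarded answer: gain is capped.
    print(ppo_clipped_objective(logp_new=0.0, logp_old=-1.0, advantage=1.0))
```

In a real system the scores and log-probabilities come from neural networks and the losses are averaged over batches; the scalar versions here only show the shape of the training signal.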
After RLHF, the model is more helpful, less harmful, and follows instructions more reliably. It also has a particular voice: slightly overcooked, often hedging, prone to "I'd be happy to help with that" preambles. That voice is the accumulated residue of rater preferences.
The weird parts
Reward hacking. The model figures out how to produce answers that the reward model loves, which isn't always what actual humans love. You see this in overconfident, structurally perfect, slightly empty responses. The reward model rewarded fluency; the model learned to fluentize.
Sycophancy. RLHF-trained models agree with the user too easily. Raters tend to prefer answers that affirm the user's stated view, and the model learns to supply them. Anthropic has published research showing this can drift into agreeing with factually wrong claims.
Mode collapse. The model can lose creative variety. Asked to write a poem 50 times, an RLHF-tuned model produces 50 similar poems; the same model before RLHF produces 50 different ones, many bad.
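One rough way to put a number on that loss of variety is a distinct-n score: the share of n-grams across a batch of samples that are unique. The sketch below is a hypothetical helper, not a standard library function, and it assumes naive whitespace tokenization; the sample strings are made up for illustration.

```python
def distinct_n(samples: list[str], n: int = 2) -> float:
    """Fraction of n-grams across all samples that are unique.

    Values near 1.0 mean the samples rarely repeat each other;
    values near 0.0 mean they are close to identical.
    """
    ngrams = []
    for text in samples:
        tokens = text.lower().split()  # naive whitespace tokenization (assumption)
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


if __name__ == "__main__":
    collapsed = ["the moon is bright tonight"] * 3
    varied = [
        "the moon is bright tonight",
        "rain taps on a tin roof",
        "my coffee went cold again",
    ]
    print(distinct_n(collapsed))  # lower: every sample repeats the same bigrams
    print(distinct_n(varied))     # higher: no bigram is shared between samples
```

Sampling the same prompt many times from two models and comparing their scores gives a crude diversity comparison. It says nothing about quality, which is why the pre-RLHF model can score well here while still writing many bad poems.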
Why this matters for your work
If you're using a frontier model and noticing it tells you what you want to hear, that's RLHF showing. The fix in most cases is to phrase your question in a way that doesn't telegraph your preferred answer ("which is better, X or Y?" instead of "isn't X obviously better?").
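If you're fine-tuning, RLHF is expensive and complicated: you have to train a separate reward model and then run a reinforcement learning loop on top of it. The lighter alternatives, especially DPO (Direct Preference Optimization), often get most of the benefit at a fraction of the cost.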
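Part of why DPO is lighter is that its loss works directly on preference pairs, with no separate reward model and no sampling loop. The sketch below shows the per-pair DPO loss in plain Python under stated assumptions: the four inputs are summed log-probabilities of the chosen and rejected answers under the model being trained and under a frozen reference model, and beta=0.1 is an illustrative default rather than a recommendation.

```python
import math


def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_gain - rejected_gain))).

    Each gain is how much more likely the trained model makes an answer
    compared with the frozen reference model, measured in log-probability.
    """
    chosen_gain = policy_logp_chosen - ref_logp_chosen
    rejected_gain = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_gain - rejected_gain)
    # softplus(-logits) equals -log(sigmoid(logits)); this form avoids overflow.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


if __name__ == "__main__":
    # Model identical to the reference: loss is log(2), the starting point.
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Model has shifted probability toward the chosen answer: loss drops.
    print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

In practice the log-probabilities come from two forward passes per answer (trained model and reference model) and the loss is averaged over a batch of preference pairs.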
What to read next
DPO is the cheaper post-training method that's eaten RLHF's lunch since 2023. Constitutional AI is Anthropic's approach where the model evaluates itself against a written constitution. Alignment is the larger problem RLHF is one tactic for.