Concept

How AI Models Are Trained

From random noise to a model that can reason — the actual pipeline

Mankaran Singh·Updated May 17, 2026

Every AI model you've ever used started as random noise.

Billions of parameters (numerical weights inside a neural network), initialised to small random values. Then, through a process of repeated prediction and correction running for weeks on thousands of specialised chips, those weights gradually organised into something that can write, reason, and converse.

GPT-4 reportedly trained on roughly 25,000 A100 GPUs. At cloud pricing, the compute cost alone runs into tens of millions of dollars. And that's before salaries and electricity.
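To see how a figure like "tens of millions of dollars" can arise, here is a back-of-the-envelope sketch. Only the 25,000-GPU count comes from the text above. The run length (90 days) and the hourly price ($1.00 per GPU-hour) are hypothetical placeholders chosen for illustration, not reported numbers.

```python
def training_compute_cost(num_gpus, days, usd_per_gpu_hour):
    """Return (gpu_hours, cost_usd) for a run that keeps every GPU busy."""
    gpu_hours = num_gpus * days * 24
    return gpu_hours, gpu_hours * usd_per_gpu_hour


# 25,000 GPUs is the figure quoted above; 90 days and $1.00/GPU-hour are
# hypothetical placeholders, not reported numbers.
gpu_hours, cost = training_compute_cost(25_000, days=90, usd_per_gpu_hour=1.00)
print(f"{gpu_hours:,} GPU-hours, about ${cost / 1e6:.0f}M")
```

Swap in your own assumptions for duration and price; the point is only that GPU count times hours times price gets large quickly.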

Understanding how this happens doesn't require a PhD. It requires one key insight: a model gets better by being wrong.

§

Phase 1 — Pretraining

The first phase is pretraining. The model sees an enormous quantity of text (web pages, books, code, academic papers) and learns a single task: predict the next token.

  1. Feed text: The cat sat on the...
  2. Predict: Model guesses next token: "mat"
  3. Compare: Check against actual next token
  4. Update: Adjust weights to reduce error
  5. Repeat: Trillions of times across all training data

That's it. No labels, no human feedback, no explicit teaching of facts. Just: see text, predict next token, get corrected, update, repeat. At a massive enough scale, this simple task forces the model to learn grammar, facts, reasoning patterns, and world knowledge — because all of those are required to predict text well.
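The feed, predict, compare, update, repeat loop can be sketched in plain Python. This is a toy, not how production models are built: it assumes a "bigram" model that looks only at the previous token, and every name in it (`train_next_token_model`, the learning rate, the epoch count) is made up for illustration. Real models use transformers and far more context, but the correction step has the same shape.

```python
import math
import random


def train_next_token_model(tokens, vocab_size, epochs=200, lr=0.5, seed=0):
    """Train a toy bigram model: one row of logits per previous token."""
    rng = random.Random(seed)
    # Start from small random weights ("random noise").
    logits = [[rng.uniform(-0.01, 0.01) for _ in range(vocab_size)]
              for _ in range(vocab_size)]
    losses = []
    for _ in range(epochs):                                # 5. repeat
        total = 0.0
        for prev, actual in zip(tokens, tokens[1:]):       # 1. feed text
            row = logits[prev]
            m = max(row)
            exps = [math.exp(x - m) for x in row]
            z = sum(exps)
            probs = [e / z for e in exps]                   # 2. predict
            total += -math.log(probs[actual])               # 3. compare
            for j in range(vocab_size):                     # 4. update
                # Gradient of cross-entropy w.r.t. each logit.
                grad = probs[j] - (1.0 if j == actual else 0.0)
                row[j] -= lr * grad
        losses.append(total / (len(tokens) - 1))
    return logits, losses


text = "the cat sat on the mat".split()
vocab = sorted(set(text))
ids = [vocab.index(w) for w in text]
logits, losses = train_next_token_model(ids, len(vocab))
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

The loss never reaches zero here because "the" is followed by both "cat" and "mat" in the training text, so the best the model can do is split its guess between them.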

If "predict the next word" sounds too simple to produce something like GPT-4, you're in good company. Most of the field thought so too, until scale started doing things nobody expected.

A model that doesn't understand causality will predict poorly. A model that doesn't know common facts will predict poorly. Prediction accuracy is a proxy for understanding.

INSIGHT

Pretraining is self-supervised (often loosely called unsupervised): it requires no human labels. This is why models can be trained on internet-scale data. The "label" for every piece of text is just the next token.
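A minimal sketch of how raw text supplies its own labels, assuming whitespace-separated words stand in for tokens (real tokenizers split text differently). The helper name `next_token_pairs` is made up for this example.

```python
def next_token_pairs(tokens):
    """Pair every prefix of the sequence with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


for context, target in next_token_pairs(["The", "cat", "sat", "on", "the", "mat"]):
    print(" ".join(context), "->", target)
```

A six-token sentence yields five training examples, and nobody had to annotate any of them.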

How much compute and how much data? Scaling laws[·] give surprisingly clean answers. For a given compute budget, there's an optimal ratio of model size to training tokens: too big a model with too little data wastes compute; too small a model with too much data wastes compute the other way. The Chinchilla paper recalibrated the whole industry's intuition about this.
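As a rough illustration of that trade-off, the sketch below splits a compute budget between model size and data. It leans on two widely quoted approximations that are not stated in this article: training compute C is about 6 x N x D FLOPs for N parameters and D tokens, and the Chinchilla result is often summarised as roughly 20 training tokens per parameter. Treat both as rules of thumb, not exact laws.

```python
import math


def compute_optimal_split(compute_flops, tokens_per_param=20.0):
    """Split a FLOP budget into (parameters, training tokens).

    Assumes C ~= 6 * N * D and a fixed tokens-per-parameter ratio.
    """
    params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens


# 5.88e23 FLOPs is 6 * 70e9 * 1.4e12, i.e. a budget that fits a
# 70B-parameter model trained on 1.4T tokens under these assumptions.
params, tokens = compute_optimal_split(5.88e23)
print(f"{params / 1e9:.0f}B parameters, {tokens / 1e12:.1f}T tokens")
```

Doubling the budget under these assumptions grows both the model and the dataset by about 1.4x each, not the model alone.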

After pretraining, the model is a powerful but raw text predictor. It will complete your sentence — not necessarily in the way you intended, and not necessarily helpfully.

§

Phase 2 — Instruction tuning

Raw pretraining produces a model that can continue text. It doesn't produce a model that follows instructions. For that, a second phase is needed.

Instruction tuning (also called supervised fine-tuning, SFT) trains the model on curated examples of instruction-following. A dataset of prompts and ideal responses, written or reviewed by humans, teaches the model to behave as an assistant rather than a text completer.
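One way to picture an SFT training example is sketched below. The model still predicts the next token, but the loss is counted only on the response, so it learns to answer the prompt instead of imitating it. Masking the prompt is a common choice and an assumption here, not something this article specifies; the function names and token ids are illustrative.

```python
def build_sft_example(prompt_ids, response_ids):
    """Concatenate prompt and response; mark which positions count toward the loss."""
    input_ids = list(prompt_ids) + list(response_ids)
    # 0 = ignore (prompt token), 1 = train on this token (response token).
    loss_mask = [0] * len(prompt_ids) + [1] * len(response_ids)
    return input_ids, loss_mask


def masked_mean_loss(per_token_losses, loss_mask):
    """Average per-token losses over the response positions only."""
    kept = [loss for loss, keep in zip(per_token_losses, loss_mask) if keep]
    return sum(kept) / len(kept)


# Made-up token ids: three prompt tokens followed by two response tokens.
input_ids, loss_mask = build_sft_example([1, 2, 3], [4, 5])
print(input_ids, loss_mask)
```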

“A small, well-curated instruction dataset usually beats a huge noisy one. Quality of examples matters more than count.”
Stanford's Alpaca demonstrated this with ~52K examples: small by industry standards, but well-targeted. The resulting model punched well above its weight.

The result is a model that responds to "Summarise this article" by summarising — not by continuing the article.

§

Phase 3 — RLHF

The final phase is where modern models get their characteristic polish: reinforcement learning from human feedback (RLHF).[·]

The process works in two steps:

  1. Reward model training. Humans compare pairs of model outputs and choose which is better. These preferences train a separate "reward model" that learns to predict human preference scores for any given output.

  2. Policy optimisation. The main model is then fine-tuned using reinforcement learning: it generates outputs, the reward model scores them, and the main model's weights are updated to produce higher-scoring outputs over time.
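The two steps can be sketched as two small functions. The first is a pairwise (Bradley-Terry style) loss commonly used to train reward models on "A is better than B" comparisons. The second shows one common way to shape the reward during policy optimisation, subtracting a penalty when the policy drifts from a frozen reference model. Both are assumptions about typical practice, not a description of any specific lab's pipeline, and `beta=0.1` is an arbitrary illustrative value.

```python
import math


def pairwise_preference_loss(score_chosen, score_rejected):
    """Bradley-Terry style loss: -log(sigmoid(chosen - rejected))."""
    margin = score_chosen - score_rejected
    # log(1 + exp(-margin)), written to stay stable for large |margin|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def shaped_reward(reward_score, logp_policy, logp_reference, beta=0.1):
    """Reward-model score minus a penalty for drifting from the reference model."""
    return reward_score - beta * (logp_policy - logp_reference)


# The loss is small when the reward model ranks the human-preferred output higher.
print(pairwise_preference_loss(2.0, 0.0) < pairwise_preference_loss(0.0, 2.0))
```

The drift penalty is one reason the tuned model stays close to its pretrained behaviour instead of chasing whatever the reward model happens to score highly.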

Aspect | With RLHF | Without RLHF
Refusals | Calibrated: declines harmful requests, handles edge cases | None or naive: either refuses too much or too little
Tone | Consistently helpful, appropriately hedged | Can be terse, overconfident, or verbose randomly
Format | Structured when structure helps, conversational when not | Inconsistent
Safety | Trained on human judgments of harm | Only pretraining data distribution

RLHF is a large part of why Claude feels different from a raw pretrained model. The underlying capability comes from pretraining. The helpfulness and tone come from RLHF.

It's also imperfect: human raters introduce their own biases, and reward hacking (the model finding ways to score well that diverge from real helpfulness) remains a real problem.[·]

§

What training doesn't give you

Training gives the model compressed knowledge from its training data. It doesn't give the model:

  • Knowledge of events after the training cutoff
  • Access to real-time information
  • The ability to know what it doesn't know
  • Guaranteed factual accuracy on rare topics

These aren't flaws that will be engineered away. They're the nature of a statistical model trained on a fixed dataset. Knowing this shapes how you use AI well — you bring current information, verify critical facts, and treat the model's knowledge cutoff as a hard boundary.

The model is not a database. It's a compressed, reasoning-capable representation of what it was trained on. Use it accordingly.
