There's a version of this explanation that will make you want to fine-tune everything. And a version that will leave you thinking most people never need to. Both are true. The useful question is which one applies to you — and being honest about that requires understanding what fine-tuning actually does.
General-purpose models like GPT-4 or Claude are trained on enormous amounts of text, which gives them broad, general capabilities. Fine-tuning takes one of those models and continues training it on a smaller, specific dataset (your customer support transcripts, your legal documents, your company's writing style guide) to shift its behavior toward your particular context.
What changes isn't the model's intelligence. And — importantly — it isn't primarily the model's knowledge either. What fine-tuning changes is behavior: tone, format, consistency, the particular way the model responds.
What fine-tuning actually adjusts
Think of a base model as someone with a strong general education and no particular professional context. They can write in many styles, explain many topics, follow many kinds of instructions. Now put them through six months at a specific company with a specific house style, handling a specific set of customer interactions.
Their capabilities don't grow. Their habits change. They start to respond the way your context expects without being told each time.
That's the outcome: reliable style and format, without having to specify it in every prompt.
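from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload a JSONL file of training examples ("train.jsonl" is a placeholder path)
# Each line: {"messages": [{"role": "user", ...}, {"role": "assistant", ...}]}
with open("train.jsonl", "rb") as f:
    training_file = client.files.create(file=f, purpose="fine-tune")

# Start the fine-tuning job on the uploaded file
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini",
    hyperparameters={"n_epochs": 3},
)

# Once the job succeeds, the resulting model ID looks like:
# ft:gpt-4o-mini:your-org:custom-name:abc123
print(job.id)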
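The training file mentioned in the comments above is plain JSONL: one JSON object per line, each holding a short conversation. As a minimal sketch, assuming your examples are already available as prompt/response pairs, it could be assembled like this. The example pairs and the helper names `to_chat_record` and `to_jsonl` are hypothetical, chosen only for illustration.

```python
import json

# Hypothetical example pairs; in practice these would come from your own
# transcripts or documents.
examples = [
    ("Where is my order?",
     "Happy to check on that. Could you share your order number?"),
    ("How do I reset my password?",
     "No problem. Select 'Forgot password' on the sign-in page and follow the emailed link."),
]


def to_chat_record(user_text, assistant_text):
    """Wrap one prompt/response pair in the chat-style "messages" structure."""
    return {
        "messages": [
            {"role": "user", "content": user_text},
            {"role": "assistant", "content": assistant_text},
        ]
    }


def to_jsonl(pairs):
    """Serialize pairs as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(to_chat_record(u, a)) for u, a in pairs)


jsonl_text = to_jsonl(examples)

# Writing the file is left to the caller, for example:
# with open("train.jsonl", "w", encoding="utf-8") as f:
#     f.write(jsonl_text + "\n")
```

Two examples are far too few for real training; the point is only the shape of each line.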
Fine-tuning is for when you've genuinely exhausted prompting — when you need consistency at scale, across thousands of calls, that a prompt can't reliably deliver.
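One way to judge whether a prompt "can't reliably deliver" is to measure it. The sketch below assumes the requirement is a JSON object with certain keys and computes the share of collected outputs that meet it. The sample outputs, the key names, and the helpers `is_valid_response` and `format_pass_rate` are all hypothetical.

```python
import json


def is_valid_response(text, required_keys):
    """Return True if text parses as a JSON object containing every required key."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(key in data for key in required_keys)


def format_pass_rate(outputs, required_keys):
    """Fraction of outputs that satisfy the format check (0.0 for an empty list)."""
    if not outputs:
        return 0.0
    passed = sum(is_valid_response(text, required_keys) for text in outputs)
    return passed / len(outputs)


# Hypothetical model outputs collected from a prompt-only setup.
sample_outputs = [
    '{"intent": "refund", "priority": "high"}',
    'Sure! Here is the JSON: {"intent": "refund"}',
    '{"intent": "billing"}',
    '{"intent": "shipping", "priority": "low"}',
]

rate = format_pass_rate(sample_outputs, required_keys=("intent", "priority"))
print(f"{rate:.0%} of responses matched the required format")
```

If a figure like this stays low after several rounds of prompt iteration on a realistic sample, that is a stronger signal than intuition that prompting has been exhausted.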
Interactive comparison (base model vs. fine-tuned model): fine-tuning changes how a model responds, not what it fundamentally knows.
LoRA and the cheaper path
The original way to fine-tune a model was full fine-tuning: update every parameter. For a 70-billion-parameter model, that means moving 70 billion numbers — expensive, slow, and you need a copy of the whole model per task.
LoRA (Low-Rank Adaptation) changed that. Instead of training all the parameters, LoRA freezes the base model and trains a small number of additional parameters layered on top, often less than 1% of the original size. The result is often close to full fine-tuning in quality, costs a fraction as much, and lets you swap LoRA "adapters" in and out for different tasks without touching the base model.
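To see where the savings come from: in the standard LoRA formulation, a frozen weight matrix of shape d × k is paired with two small trainable matrices of shapes d × r and r × k, so the trainable count for that matrix is r(d + k) instead of d·k. The dimensions and rank below are illustrative assumptions, not figures from any particular model.

```python
def full_param_count(d, k):
    """Parameters updated when a d x k weight matrix is trained directly."""
    return d * k


def lora_param_count(d, k, r):
    """Parameters in the two low-rank factors (d x r and r x k) that LoRA trains."""
    return r * (d + k)


# Assumed, illustrative sizes: one 4096 x 4096 projection matrix, rank 8.
d, k, r = 4096, 4096, 8

full = full_param_count(d, k)
lora = lora_param_count(d, k, r)
fraction = lora / full

print(f"full: {full:,}  lora: {lora:,}  trainable fraction: {fraction:.2%}")
```

With these assumed sizes the adapter trains about 0.39% of what a full update of that matrix would touch. The same arithmetic explains why an adapter is cheap to store and swap.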
This is how a great deal of fine-tuning is done today. Hosted fine-tuning services rarely document their training internals, but when a provider offers fine-tuning, it is reasonable to assume you're getting some flavor of parameter-efficient fine-tuning rather than a full retrain.
When fine-tuning makes sense
You have a consistent format or structure requirement. If every response needs to be valid JSON with a specific schema, or formatted in a particular way that's hard to enforce through prompting alone, fine-tuning can make that format reliable.
You need a very specific tone. If your brand has a distinctive voice — particular vocabulary, specific warmth, a writing style that's hard to describe but easy to demonstrate — fine-tuning on examples transfers it more reliably than prompt instructions.
You're making thousands of calls. A fine-tuned model can often achieve good results with shorter prompts than a base model, because some of the context is baked in. At high volume, that's real cost savings.
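The cost argument in the last point is simple arithmetic, and it is worth running before committing. The sketch below compares monthly input-token cost for a long prompt on a base model against a short prompt on a fine-tuned one. Every number is a made-up placeholder, including the assumption that the fine-tuned model is billed at a higher per-token rate; substitute your provider's actual prices.

```python
def monthly_input_cost(calls_per_month, prompt_tokens, price_per_million_tokens):
    """Input-token cost in dollars for a month of calls."""
    return calls_per_month * prompt_tokens * price_per_million_tokens / 1_000_000


# Hypothetical volume and prices, for illustration only.
calls = 1_000_000

# Base model: long prompt carrying instructions and examples on every call.
base_cost = monthly_input_cost(calls, prompt_tokens=1_200, price_per_million_tokens=0.50)

# Fine-tuned model: shorter prompt, but assumed to cost more per token.
tuned_cost = monthly_input_cost(calls, prompt_tokens=300, price_per_million_tokens=1.00)

savings = base_cost - tuned_cost
print(f"base: ${base_cost:,.2f}  fine-tuned: ${tuned_cost:,.2f}  savings: ${savings:,.2f}")
```

This leaves out output tokens, the one-time training cost, and the effort of collecting data, all of which have to be recovered before the savings are real.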
When it doesn't
You want the model to learn facts. Fine-tuning is poor at reliably encoding new factual knowledge. The model may appear to learn facts during training but can still hallucinate around them in unpredictable ways. For factual grounding, use retrieval-augmented generation (RAG): give the model the information at inference time, don't try to bake it in.
Your prompt already gets you there. This is the majority of cases. If careful prompting gets you 90% of the way, fine-tuning is expensive and slow for marginal gains. The data collection and training process takes significant time; the prompt can be iterated in minutes.
Your use case is still changing. Fine-tuned models are snapshots. Every time your requirements shift, you need new training data and a new fine-tuning run. A prompt is cheaper to update.
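For the first point in this list, giving the model information at inference time means looking up relevant text and placing it in the prompt. The sketch below is a toy version under loud assumptions: the documents are invented, and ranking by word overlap stands in for whatever search a real system would use (typically embedding-based). The helper names `tokenize`, `retrieve`, and `build_prompt` are hypothetical.

```python
import re

# Hypothetical knowledge snippets; in practice these come from your own documents.
documents = [
    "Refunds are available within 30 days of purchase.",
    "Standard shipping takes 5 to 7 business days.",
    "Support is available by email on weekdays.",
]


def tokenize(text):
    """Lowercase a string and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, docs, top_k=1):
    """Rank docs by word overlap with the question (a toy stand-in for real search)."""
    question_words = tokenize(question)
    ranked = sorted(
        docs,
        key=lambda doc: len(question_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question, docs):
    """Place the retrieved text in the prompt so the model can answer from it."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, docs))
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )


prompt = build_prompt("How long does shipping take?", documents)
print(prompt)
```

When a fact changes, you edit the document store and the next call picks it up, with no retraining involved.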
What it doesn't do
Fine-tuning doesn't make a smaller model as capable as a larger one. If the base model can't perform a reasoning task, fine-tuning won't unlock that capability — it wasn't there to be taught.
It also doesn't give you complete control. The fine-tuned model's underlying tendencies come from its pretraining data, which is orders of magnitude larger than your dataset. You're nudging behavior within the space the base model defines, not rewriting it from scratch.
The analogy holds: fine-tuning shapes professional habits, not raw intelligence. Used for the right reasons — format, tone, consistent style at scale — it delivers. Used as a substitute for the harder work of designing good prompts and choosing the right model, it mostly costs money and time.
Know which situation you're in before you start collecting training data.