How fine-tuning works
You start with a base model (Llama 3, Mistral, GPT-4o, Claude Haiku) and a curated dataset of input-output pairs in your target format. The training process adjusts the model's weights using supervised learning so it produces outputs closer to your examples. Modern fine-tuning usually relies on parameter-efficient methods (LoRA, QLoRA) that train only a small fraction of the weights, which cuts training cost and the storage needed for each fine-tuned variant.
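To make "a small fraction of weights" concrete, here is a minimal sketch of the LoRA parameter arithmetic for a single weight matrix. LoRA freezes the original matrix and trains two low-rank factors instead. The 4096 x 4096 layer size and rank 8 are illustrative assumptions, not figures from any particular model.

```python
def full_matrix_params(d_in: int, d_out: int) -> int:
    """Weights in one dense layer that full fine-tuning would update."""
    return d_in * d_out


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Weights in the two LoRA factors: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


# Illustrative sizes only (assumed, not taken from a specific model).
d_in = d_out = 4096
rank = 8

full = full_matrix_params(d_in, d_out)
lora = lora_trainable_params(d_in, d_out, rank)
fraction = lora / full

print(f"full: {full:,}  lora: {lora:,}  trainable share: {fraction:.2%}")
# prints: full: 16,777,216  lora: 65,536  trainable share: 0.39%
```

Under these assumed sizes the adapter is under half a percent of the layer, which is why one base model can carry many small task-specific adapters.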
Fine-tuning vs prompting vs RAG
- Prompting: zero training cost, instant updates, higher cost per request because instructions and examples are resent on every call, limited by context window.
- RAG: cheap to update (just add documents), grounded in your data, requires retrieval infrastructure.
- Fine-tuning: high upfront training cost, fast inference from shorter prompts, embeds knowledge in weights, hard to update without retraining.
- Hybrid: fine-tune for style and structure, RAG for facts. The 2026 default for most production systems.
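The hybrid pattern in the last bullet can be sketched as follows: retrieval supplies the facts at request time, and the model (fine-tuned for style and structure) only has to phrase the answer. Everything here is hypothetical and simplified: `retrieve` is a toy word-overlap ranker standing in for a real vector search, and `model` is any callable you supply, not a specific vendor API.

```python
from typing import Callable, List


def retrieve(query: str, documents: List[str], k: int = 2) -> List[str]:
    """Toy retriever: rank documents by how many words they share with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, passages: List[str]) -> str:
    """Put retrieved facts in the prompt; style is left to the fine-tuned model."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {query}"


def answer(query: str, documents: List[str], model: Callable[[str], str]) -> str:
    """Retrieve, assemble the prompt, and hand it to the (hypothetical) model."""
    return model(build_prompt(query, retrieve(query, documents)))


docs = [
    "Refunds are processed within 14 days",
    "Our office is in Berlin",
    "Shipping takes 3 days",
]
# Stand-in for a fine-tuned model: it just echoes the prompt it receives.
print(answer("how long do refunds take", docs, model=lambda prompt: prompt))
```

Updating a fact means editing `docs`, with no retraining; changing tone or output format is what the fine-tuned model is for.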
When fine-tuning is worth it
- You have 1,000+ high-quality labeled examples (50K+ for serious domain transfer).
- The task is narrow and the output format is consistent across examples.
- Inference cost dominates — you serve millions of requests and per-token savings add up.
- Latency matters — a fine-tuned smaller model beats a prompted larger one on speed.
- You need to embed proprietary style or terminology the base model does not know.
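The inference-cost criterion above comes down to a break-even calculation: how many requests it takes for per-request savings to repay the one-time training spend. The sketch below shows that arithmetic. All dollar figures are made-up placeholders; substitute your own training quote and measured per-request costs.

```python
import math


def break_even_requests(
    training_cost: float,
    cost_per_request_prompted: float,
    cost_per_request_finetuned: float,
) -> float:
    """Requests needed before fine-tuning pays for itself.

    Returns math.inf when the fine-tuned model is not cheaper per request.
    """
    savings = cost_per_request_prompted - cost_per_request_finetuned
    if savings <= 0:
        return math.inf
    return training_cost / savings


# Hypothetical numbers for illustration only.
requests_needed = break_even_requests(
    training_cost=5_000.0,
    cost_per_request_prompted=0.004,
    cost_per_request_finetuned=0.001,
)
days_at_10k_per_day = requests_needed / 10_000

print(f"break-even after about {requests_needed:,.0f} requests")
print(f"about {days_at_10k_per_day:.0f} days at 10,000 requests per day")
```

With these placeholder inputs the payback takes well over a million requests, which illustrates why the math tends to favor fine-tuning only at high volume.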
When to skip fine-tuning
For most SMB use cases (under 10,000 requests per day, fewer than 500 hand-labeled examples), prompting plus RAG typically outperforms fine-tuning at roughly one-tenth the engineering cost. OpenAI and Anthropic both recommend starting with prompting, adding RAG when you need fresh data, and fine-tuning only when the cost or latency math forces the move. Reinforcement learning from human feedback (RLHF) and constitutional AI are specialized fine-tuning variants used mainly by frontier labs; application teams rarely need them.
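As a rough illustration, the rules of thumb in this article can be written as a small decision helper. The thresholds (10,000 requests per day, 500 and 1,000 labeled examples) come from the text. Treating the two SMB conditions as a joint "and" test, and the fallback for cases in between, are assumptions of this sketch and not a formal rule.

```python
def recommend_approach(
    requests_per_day: int,
    labeled_examples: int,
    cost_or_latency_bound: bool,
) -> str:
    """Heuristic sketch of the guidance above; not a substitute for measuring."""
    # Low volume and little labeled data: the SMB case described in the text.
    if requests_per_day < 10_000 and labeled_examples < 500:
        return "prompting + RAG"
    # Enough data, and cost or latency is the actual bottleneck.
    if labeled_examples >= 1_000 and cost_or_latency_bound:
        return "consider fine-tuning"
    # Everything in between (assumed default): follow the staged path.
    return "start with prompting, add RAG for fresh data"


print(recommend_approach(2_000, 100, cost_or_latency_bound=False))
print(recommend_approach(5_000_000, 20_000, cost_or_latency_bound=True))
print(recommend_approach(50_000, 800, cost_or_latency_bound=True))
```

The three calls cover a small team, a high-volume service with plenty of data, and a middle case where the safer move is still to start with prompting.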
What it means for your business
If a vendor pitches fine-tuning as the answer to your problem, ask how many labeled examples they have, what inference volume they expect, and why prompting plus RAG would not work. The answers separate serious teams from cargo-cult ones.
Related terms
- Large Language Model (LLM) — A Large Language Model is a transformer-based neural network trained on trillions of tokens to predict the next token. Definition, key models, and business use.
- Prompt Engineering — Prompt engineering is the practice of writing instructions to LLMs to get reliable, structured output. Definition, techniques, and when to stop optimizing.
- Retrieval-Augmented Generation (RAG) — RAG is the technique of fetching documents from a database and feeding them to an LLM before it answers. Definition, architecture, and SMB use cases.
- AI Evaluation — AI evaluation is how you measure whether an AI system actually works. Definition, methods, and why evals are the bottleneck in production AI.
- Constitutional AI — Constitutional AI is Anthropic's method for training models to be helpful, harmless, and honest using a written constitution and AI feedback. Definition explained.