A model trained purely to predict the next token on internet text will reproduce whatever patterns dominate the internet — including misinformation, harmful content, and persuasion techniques. The "alignment" challenge is shaping the model's behaviour to be helpful, honest, and harmless without breaking its underlying capabilities.
RLHF (Reinforcement Learning from Human Feedback) is the dominant technique. After base training, human raters compare pairs of model outputs and indicate which is better. A reward model learns from these preferences. The main model is then optimised to produce outputs the reward model rates highly. The result is a model that's better at following instructions, more helpful, and less likely to produce harmful content.
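The reward-model step can be sketched with the pairwise (Bradley-Terry) objective commonly used for preference learning. This is a minimal illustration, not any lab's actual implementation: the scalar scores stand in for reward-model outputs on the two candidate responses, and minimising the loss pushes the preferred response's score above the rejected one's.

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise preference loss: -log P(chosen beats rejected).

    P is modelled as sigmoid(r_chosen - r_rejected), so the loss is
    small when the reward model already scores the human-preferred
    output higher, and large when it ranks the pair the wrong way.
    """
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# Reward model agrees with the human rater: low loss.
agree = preference_loss(r_chosen=2.0, r_rejected=0.5)
# Reward model disagrees: high loss, driving a gradient update.
disagree = preference_loss(r_chosen=0.5, r_rejected=2.0)
```

Summed over millions of human comparisons, gradients of this loss train the reward model; the main model is then optimised (typically with a policy-gradient method such as PPO) to maximise that learned reward.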
Constitutional AI (CAI), developed by Anthropic, takes a different approach: rather than relying entirely on human feedback, it trains the model on a set of written principles (a "constitution") and has the model critique and revise its own outputs against those principles. This reduces reliance on expensive human labelling and makes the model's values more explicit and auditable.
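The critique-and-revise loop at the heart of CAI can be sketched as follows. Everything here is a toy stand-in: `ask_model` is a hypothetical placeholder for a real LLM call, and the two principles are illustrative, not Anthropic's actual constitution.

```python
# Illustrative principles only; a real constitution is longer and
# more carefully worded.
CONSTITUTION = [
    "Choose the response that is most helpful and honest.",
    "Avoid responses that could facilitate harm.",
]

def ask_model(prompt: str) -> str:
    # Hypothetical stub: a real system would call an LLM here.
    if prompt.startswith("Critique"):
        return "The draft includes unnecessary risky detail."
    return "Revised answer with the risky detail removed."

def critique_and_revise(draft: str) -> str:
    """Have the model critique its own draft against each principle,
    then revise the draft to address the critique."""
    for principle in CONSTITUTION:
        critique = ask_model(
            f"Critique this response against the principle "
            f"'{principle}':\n{draft}"
        )
        draft = ask_model(
            f"Revise the response to address this critique:\n"
            f"{critique}\nOriginal:\n{draft}"
        )
    return draft
```

The revised outputs then become training data, so the model internalises the principles rather than applying them at inference time.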
⚠ The Alignment Tax
There's an ongoing debate about whether safety training reduces raw capability: whether a model optimised to refuse harmful requests also becomes slightly worse at legitimate edge cases near those refusals. This trade-off is called the "alignment tax." Evidence is mixed, and the tax appears to shrink as training techniques improve, but it partly explains why some users seek out fine-tuned models with fewer restrictions for specific research applications.
🔬 Why This Matters for Trust
Understanding RLHF explains why Claude, GPT, and Gemini behave the way they do when you ask them to help with sensitive topics. They're not running a keyword filter; they've learned patterns of what constitutes helpful versus harmful responses from millions of human preference judgements. The model "wants" to be helpful, and its refusals represent a trained judgement that the request risks harm outweighing that helpfulness. Whether that judgement is calibrated correctly for your use case is a legitimate ongoing debate.