RLHF
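Reinforcement learning from human feedback (RLHF) is a training method in which humans rank model outputs, and those rankings are used to fine-tune the model toward the responses people prefer.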
Frequently asked questions
What kind of feedback does RLHF use?
Human raters compare pairs of model outputs and select the better one. This preference signal trains a reward model, which then guides reinforcement learning to make the base model produce outputs more like the preferred ones.
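To make the reward-model step more concrete, here is a minimal sketch of a pairwise preference loss. It assumes the commonly used Bradley-Terry formulation, which this page does not itself specify; the function names and the example scores are hypothetical and chosen only for illustration.

```python
import math


def preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """Probability that the chosen output beats the rejected one,
    modelled as sigmoid(reward_chosen - reward_rejected).
    (Assumed Bradley-Terry form; hypothetical helper.)"""
    margin = reward_chosen - reward_rejected
    # Branch on the sign of the margin so math.exp never overflows.
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    exp_margin = math.exp(margin)
    return exp_margin / (1.0 + exp_margin)


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss: -log(sigmoid(reward_chosen - reward_rejected)).
    The loss is small when the reward model scores the human-preferred
    output higher, and large when it scores the rejected output higher."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); use the overflow-safe form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Hypothetical reward-model scores for one rated pair.
    agrees = preference_loss(reward_chosen=2.0, reward_rejected=0.0)
    disagrees = preference_loss(reward_chosen=0.0, reward_rejected=2.0)
    print(f"loss when reward model agrees with the rater:    {agrees:.4f}")
    print(f"loss when reward model disagrees with the rater: {disagrees:.4f}")
```

In a real pipeline the two scores would come from a neural reward model and the loss would be averaged over many rated pairs; the plain floats here only show the shape of the signal.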
Is RLHF used in all major models?
Most frontier chat models, including GPT-4, Claude, and Gemini, use some form of alignment based on human feedback. The exact method varies, and newer techniques such as Direct Preference Optimization (DPO) and reinforcement learning from AI feedback (RLAIF) are also emerging.
What are the limitations of RLHF?
Collecting human preferences at scale is expensive, annotator disagreement introduces noise, and models can learn to game the reward model (often called reward hacking) rather than genuinely improving.