AI Glossary · Last reviewed May 2026

RLHF · Reinforcement Learning from Human Feedback

Hand-written by a real person. Reviewed against current practice in May 2026.

Definition

A training method in which human raters rank or compare model outputs, and those preferences are used to fine-tune the model toward the responses people prefer.

Full write-up coming soon

We are working on a detailed page for RLHF, covering why it matters, how it works, related terms, and the tools that use it.

Related terms

From the glossary

Fine-tuning

LLM

Frequently asked questions

What kind of feedback does RLHF use?

Human raters compare pairs of model outputs and select the better one. This preference signal trains a reward model, which then guides reinforcement learning to make the base model produce outputs more like the preferred ones.
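The reward model step can be sketched in a few lines. This is a toy illustration, not how any particular lab does it: it assumes each output is already summarized as a small feature vector (the two features and all the numbers are made up), uses a linear reward model, and trains it with the pairwise Bradley-Terry style loss that is commonly used for preference data. Real reward models are neural networks that score full text.

```python
import math


def sigmoid(x: float) -> float:
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def reward(weights, features):
    # Hypothetical linear reward model: score = w . x
    return sum(w * f for w, f in zip(weights, features))


def pairwise_loss(weights, chosen, rejected):
    # -log sigmoid(r(chosen) - r(rejected)), written as a stable softplus.
    margin = reward(weights, chosen) - reward(weights, rejected)
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def train_reward_model(pairs, dim, lr=0.1, epochs=200):
    # Plain stochastic gradient descent on the pairwise loss.
    weights = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(weights, chosen) - reward(weights, rejected)
            # Gradient of the loss w.r.t. w is -(1 - sigmoid(margin)) * (chosen - rejected).
            scale = 1.0 - sigmoid(margin)
            for i in range(dim):
                weights[i] += lr * scale * (chosen[i] - rejected[i])
    return weights


# Made-up preference data: (features of preferred output, features of the other output).
# The two features stand in for something like (helpfulness, verbosity).
pairs = [
    ([0.9, 0.2], [0.3, 0.8]),
    ([0.8, 0.5], [0.2, 0.4]),
    ([0.7, 0.1], [0.4, 0.9]),
]
weights = train_reward_model(pairs, dim=2)
```

After training, the reward model scores each preferred output above its alternative. In full RLHF, a score like this becomes the reward signal for the reinforcement learning stage.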

Is RLHF used in all major models?

Most frontier chat models, including GPT-4, Claude, and Gemini, use some form of alignment training based on human feedback. The exact method varies, and newer techniques such as DPO (Direct Preference Optimization) and RLAIF (Reinforcement Learning from AI Feedback) are also emerging.
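To show how DPO differs from the reward-model route, here is a minimal sketch of the per-pair DPO loss as it is usually stated: it works directly on the log-probabilities that the model being trained and a frozen reference model assign to the preferred and non-preferred responses, with no separate reward model. The function name, argument names, and example numbers are hypothetical, and beta=0.1 is only an illustrative setting.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # How much more the trained model favors each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log sigmoid(logits), written as a numerically stable softplus(-logits).
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


# Made-up log-probabilities for one preference pair.
same_as_reference = dpo_loss(-10.0, -10.0, -10.0, -10.0)
favors_chosen = dpo_loss(-8.0, -12.0, -10.0, -10.0)
favors_rejected = dpo_loss(-12.0, -8.0, -10.0, -10.0)
```

The loss falls as the trained model shifts probability toward the preferred response relative to the reference, and rises when it shifts the other way.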

What are the limitations of RLHF?

Collecting human preferences at scale is expensive, disagreement between annotators adds noise to the preference data, and the model can learn to exploit flaws in the reward model (often called reward hacking) rather than genuinely improve.

Explore other terms

From the glossary

AI Agents

A program that takes goals and figures out the steps to reac...

API

The way one piece of software talks to another.

Chain of Thought

A prompting technique where the model reasons out loud, step...

Context Window

How much text a model can read at once.

Embeddings

Numeric fingerprints of text or images that let computers me...

Few-shot Learning

Showing a model two to five examples in the prompt so it fol...
