RLHF (Reinforcement Learning from Human Feedback)

Level 3

Short Description

A training method that uses human preference judgments to shape a model's behavior, widely used to make LLMs more helpful and safer.

Friendly Description: RLHF is a way of teaching AI by asking real people which of its answers are better. The model produces a few responses to the same question, humans pick their favorite, and the AI learns to produce more answers like the favorites. Behind the scenes, those picks train a separate "reward model" that learns to score answers the way the annotators would, and the chatbot is then tuned with reinforcement learning to earn high scores from it. It's how many modern chatbots learned to be helpful, honest, and pleasant rather than just technically correct.
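
To make the reward-model stage concrete, here is a minimal sketch in PyTorch using the standard pairwise (Bradley-Terry) preference loss: push the preferred response's score above the rejected one's. The toy RewardModel class, the random stand-in embeddings, and all hyperparameters are illustrative assumptions, not any particular lab's setup.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a response embedding to a single scalar preference score."""
    def __init__(self, embed_dim: int = 16):
        super().__init__()
        self.score = nn.Linear(embed_dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

reward_model = RewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Stand-in embeddings for (chosen, rejected) response pairs; in a real
# pipeline these would come from the language model's own representations.
chosen = torch.randn(32, 16)    # responses the annotators preferred
rejected = torch.randn(32, 16)  # responses they passed over

for step in range(100):
    # Bradley-Terry pairwise loss: -log(sigmoid(r_chosen - r_rejected)),
    # which rises whenever a rejected response outscores a chosen one.
    margin = reward_model(chosen) - reward_model(rejected)
    loss = -torch.nn.functional.logsigmoid(margin).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Once trained, this scorer stands in for the human annotators, so a reinforcement learning step can optimize the chatbot against it far more cheaply than asking people about every response.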

Example: When training a chatbot, the team might show two AI-written replies side by side and ask annotators, "Which of these would you rather receive?" The chatbot uses thousands of those preference judgments to fine-tune its style, becoming friendlier, clearer, and more useful with each round.
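
Each of those side-by-side judgments boils down to a simple record: the prompt, the reply the annotator preferred, and the reply they passed over. The sketch below shows one plausible shape for such a record; the PreferencePair name and its fields are hypothetical, chosen for clarity rather than matching any real annotation tool.

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str    # the question shown to the annotator
    chosen: str    # the reply the annotator would rather receive
    rejected: str  # the reply they passed over

pairs = [
    PreferencePair(
        prompt="How do I reset my password?",
        chosen="Sure! Click 'Forgot password' on the login page, then...",
        rejected="Password resets are handled by the auth subsystem.",
    ),
]
# Thousands of records like this become the reward model's training set.
```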