A few years ago, it became possible to train incredibly large language models thanks to advances in model architecture (Transformers) and in parallel computing (GPUs). Once the world had models that could reliably predict the next several words (the GPT-2 and GPT-3 era), it became important to figure out how to actually make them useful for humans.
Going back to 2020, when GPT-3 first came out, it was a bit challenging to work with.
Consider a prompt such as
Can you translate from English to French: I want to go to the park.
GPT-3 would often respond with something along the lines of
Can you translate from English to French: I want to go to the park because it is a nice day today.
instead of doing the translation!
“Because it is a nice day today” has a higher probability of coming next than “Yes I can, your translated sentence is: je veux aller au parc.” This behavior extends to many other problem areas as well, such as safety and model hallucinations. We needed to train the model not just to predict the next word, but to actually follow our instructions.
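One rough way to see this behavior for yourself is to score both candidate continuations under a base language model and compare their total log-probabilities. The sketch below uses GPT-2 (via the Hugging Face transformers library) purely as a stand-in for the base models of that era; the exact numbers, and even the ordering, will depend on the model you load.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "Can you translate from English to French: I want to go to the park"

def continuation_logprob(prompt: str, continuation: str) -> float:
    """Total log-probability the model assigns to `continuation`, given `prompt`.

    Assumes the prompt's tokenization is a prefix of the full string's
    tokenization, which holds for these space-prefixed continuations.
    """
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits              # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    # The token at position i is predicted by the logits at position i - 1.
    for i in range(prompt_len, full_ids.shape[1]):
        total += log_probs[0, i - 1, full_ids[0, i]].item()
    return total

print(continuation_logprob(prompt, " because it is a nice day today"))
print(continuation_logprob(prompt, " Je veux aller au parc"))
```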
Enter Reinforcement Learning from Human Feedback, or RLHF. While there are many aspects to RLHF, the basic premise is that human feedback is used to teach the model which outputs people prefer and to reward responses that align with those preferences.
RLHF can be roughly boiled down to two key ingredients:
- Having humans rank multiple model outputs for the same prompt effectively teaches the model which of its responses humans prefer (a minimal sketch of this idea follows the list).
- Having humans write the “ideal” completion to create a fine-tuning dataset demonstrates to the model what a great response looks like.
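To make the first ingredient a bit more concrete, here is a minimal sketch of how human rankings are commonly turned into a training signal: a small reward model scores each (prompt, response) pair, and a pairwise loss pushes the score of the human-preferred response above the rejected one. Everything here (the tiny model, the random token ids) is a toy placeholder, not any lab's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy reward model: takes token ids for a (prompt, response) pair and
# outputs a single scalar "how good is this response" score.
class TinyRewardModel(nn.Module):
    def __init__(self, vocab_size=1000, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.head = nn.Linear(hidden, 1)

    def forward(self, token_ids):                    # (batch, seq_len)
        pooled = self.embed(token_ids).mean(dim=1)   # crude mean pooling
        return self.head(pooled).squeeze(-1)         # (batch,) scalar rewards

reward_model = TinyRewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Stand-in batch of labeled comparisons: for the same prompt, "chosen" is the
# response the human ranked higher, "rejected" is the one ranked lower.
chosen = torch.randint(0, 1000, (8, 32))
rejected = torch.randint(0, 1000, (8, 32))

# Pairwise (Bradley-Terry style) loss: increase the margin between the reward
# of the preferred response and the reward of the rejected one.
r_chosen = reward_model(chosen)
r_rejected = reward_model(rejected)
loss = -F.logsigmoid(r_chosen - r_rejected).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"pairwise reward loss: {loss.item():.4f}")
```

The second ingredient is more straightforward: the human-written “ideal” completions are typically used as (prompt, demonstration) pairs for ordinary supervised fine-tuning.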
RLHF is very effective at aligning the largest foundation models in the world today.