Once you know you need data, and specifically what type of data you need, you’ll have a choice between collecting it with humans or leveraging a synthetic data solution. This article provides a brief history of how humans got involved in the process in the first place and then discusses where humans and synthetic data each shine.

At a high level, you should be using humans when:

  • You want to push the overall frontier of model capabilities beyond best-in-class today
  • You want to improve or fine-tune specific capabilities beyond what a model can do today
  • You want to measure the performance of an AI system

You may not need humans when:

  • Improvements in prompt engineering can still make a difference
  • You want to catch up to frontier models to have a solid baseline

A brief history - making LLMs useful and intuitive

A few years ago, it became possible to train incredibly large language models thanks to advances in model architecture (Transformers) and in parallel computing (GPUs). Once the world had models that could reliably predict the next several words (the GPT-2 and GPT-3 era), it became important to figure out how to actually make them useful for humans.

Going back to 2020, when GPT-3 was first released, it was a bit challenging to work with.

Consider a prompt such as

Can you translate from English to French: I want to go to the park.

GPT-3 would often respond with something along the lines of

Can you translate from English to French: I want to go to the park because it is a nice day today.

instead of doing the translation!

“Because it is a nice day today” has a higher probability of coming next than “Yes I can, your translated sentence is: Je veux aller au parc.”
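
You can see this completion behavior for yourself with a small base model such as GPT-2. Here is a minimal sketch using the Hugging Face transformers library; the exact continuation will vary from run to run:

    # Minimal sketch (pip install transformers torch). GPT-2 is a base next-word
    # predictor with no instruction tuning, so it tends to continue the prompt
    # rather than answer it.
    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2")
    prompt = "Can you translate from English to French: I want to go to the park."
    result = generator(prompt, max_new_tokens=20, do_sample=True)
    print(result[0]["generated_text"])  # usually a continuation, not a translation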

This failure mode extends to many other undesirable behaviors as well, such as unsafe responses and model hallucinations.

We needed to train the model to not just predict the next word, but to actually follow our instructions.

Enter Reinforcement Learning from Human Feedback, or RLHF. While there are many aspects to RLHF, the basic premise is that human feedback is used to teach the model which outputs align with human preferences and to reward it for producing them.

RLHF can be roughly boiled down to two key ingredients:

  1. Having humans rank multiple model outputs for the same prompt effectively teaches the model how to rank its own responses (a minimal sketch of this step follows the list).
  2. Having humans write the “ideal” completion to create a fine-tuning dataset demonstrates to the model what a great response looks like.
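
To make the first ingredient concrete, here is a minimal sketch of how human rankings are typically turned into a training signal. It assumes a hypothetical reward_model that scores a (prompt, response) pair; the pairwise loss pushes the score of the human-preferred response above the rejected one (a standard Bradley-Terry style objective, not any particular lab’s exact recipe):

    import torch
    import torch.nn.functional as F

    def preference_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
        # Pairwise ranking loss: increase the margin between the score of the
        # response a human preferred and the score of the response they rejected.
        return -F.logsigmoid(score_chosen - score_rejected).mean()

    # Hypothetical usage, assuming reward_model(prompt, response) returns a scalar score:
    # score_chosen = reward_model(prompt, chosen_response)
    # score_rejected = reward_model(prompt, rejected_response)
    # loss = preference_loss(score_chosen, score_rejected)
    # loss.backward()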

RLHF is very effective at aligning the largest foundation models in the world today.

Fitting in Synthetic Data

In its purest form, the idea of synthetic data is to use an AI model to replace the parts of the process where humans contribute their insights. A term that is gaining popularity is RLAIF, Reinforcement Learning from AI Feedback (RLHF with the “H” for human swapped out for “AI”).
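
As a rough illustration of that swap, here is a minimal sketch of the RLAIF idea. It assumes a hypothetical ai_judge() helper that wraps whatever judge model you choose; the output is the same kind of preference pair a human ranker would produce, just without the human:

    # Hypothetical helper: ai_judge(prompt, response_a, response_b) asks a judge model
    # which response is better and returns "A" or "B". The judge model, its prompt,
    # and its reliability are all assumptions here, not a specific product's API.
    def build_preference_pair(prompt: str, response_a: str, response_b: str, ai_judge):
        verdict = ai_judge(prompt, response_a, response_b)
        chosen, rejected = (response_a, response_b) if verdict == "A" else (response_b, response_a)
        # The resulting (prompt, chosen, rejected) triple feeds the same reward-model
        # training step sketched earlier, with the AI judge standing in for the human.
        return {"prompt": prompt, "chosen": chosen, "rejected": rejected}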

Less helpful for building frontier models

While research is evolving quickly, one of the core principles of synthetic data is that it is generally not thought possible to exceed the current frontier models with synthetic data alone. This is because the model used to produce the synthetic data still makes errors and is not perfectly aligned. When you use synthetic data, all of the existing biases and quirks of the “teacher” model make their way into the model being trained.

Furthermore, getting a bit technical, models tend to prefer their own style and response types over those of other models or of human raters, so simply letting a model rank its own outputs can add significant undesired bias to your training data.

To push the frontier, you do need actual data from real human beings on topics, modalities, and styles that have not been covered before.

Less helpful for building novel capabilities

When you create synthetic data, you are distilling the abilities of the “teacher” model as they relate to the workflows you care about. If you are trying to build a better model than, say, ChatGPT for your specific use case, then using ChatGPT to generate the synthetic data wouldn’t be very helpful (not to mention, it would violate OpenAI’s acceptable use policy).

As an example, consider AI “Agents” that do things in the world - a common challenge today is their ability to break down a problem and generate a plan. Even the state-of-the-art frontier models are not yet sufficiently good at this task, and generating sample trajectories from them without further processing or human review is not a feasible way to collect this kind of data.

More helpful for catching up to frontier models

There are dozens of models today that have surpassed ChatGPT’s initial performance. A common technique in the industry has been to take a leading model and use it to generate the training data for smaller, more efficient models.

This famously happened when Meta released its initial Llama model, which was quickly outperformed by Alpaca and then Vicuna, both of which fine-tuned Llama on data derived from OpenAI models.
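
As a minimal sketch of that catch-up recipe (not the exact Alpaca or Vicuna pipeline), assume a hypothetical teacher_complete() helper that calls whichever stronger model you are permitted to distill from: the teacher answers a pool of prompts, and the results are saved as a supervised fine-tuning dataset for the smaller model.

    import json

    # Hypothetical helper: teacher_complete(prompt) returns the teacher model's response.
    # Which teacher you may use, and under what license or usage policy, is up to you.
    def build_distillation_dataset(prompts, teacher_complete, out_path="distill.jsonl"):
        with open(out_path, "w") as f:
            for prompt in prompts:
                response = teacher_complete(prompt)
                # Each line is one instruction/response pair that the smaller model
                # is later fine-tuned on with an ordinary supervised training loop.
                f.write(json.dumps({"prompt": prompt, "response": response}) + "\n")

    # Example usage (teacher_fn is whatever callable wraps your chosen teacher model):
    # build_distillation_dataset(["Summarize this article: ...", "Write a SQL query that ..."],
    #                            teacher_complete=teacher_fn)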

AI should still be part of your human process

It may seem like synthetic data is not that useful for model development given the caveats mentioned, but that doesn’t mean AI models have no place in helping create human data. In fact, quite the contrary!

Generative AI models are incredibly useful at triaging data, classifying types of errors, and providing sanity checks to their human counterparts.
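
For example, here is a minimal sketch of AI-assisted triage, assuming a hypothetical label_with_llm() helper: the model takes a first pass over incoming items, and only the low-confidence ones are routed to human annotators.

    # Hypothetical helper: label_with_llm(item) returns (label, confidence) from a model.
    def triage(items, label_with_llm, confidence_threshold=0.9):
        auto_labeled, needs_human_review = [], []
        for item in items:
            label, confidence = label_with_llm(item)
            if confidence >= confidence_threshold:
                # High-confidence items keep their AI label but can still be spot-checked.
                auto_labeled.append({"item": item, "label": label})
            else:
                # Ambiguous or low-confidence items go to human annotators.
                needs_human_review.append(item)
        return auto_labeled, needs_human_review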

A best-in-class data annotation system makes thoughtful use of AI and understands the nuances of human and AI interaction.