A term that is gaining popularity is RLAIF, or reinforcement learning from AI feedback (vs. RLHF, where the "AI" is swapped for an "H" for human feedback).

Less helpful for building frontier models

While research is evolving quickly, one of the core principles of synthetic data is that it is generally thought you cannot exceed the current frontier model with synthetic data alone. The model used to produce the synthetic data still makes errors and is not perfectly aligned, so all of the existing biases and quirks in the "teacher" model will make their way into the model being trained. Furthermore, getting a bit technical, models tend to prefer their own style and response types over those of other models or of human feedback, so simply letting a model rank its own outputs can add significant undesired bias to your training data. To push the frontier, you need actual data from real human beings on topics, modalities, and styles that have not been covered before.

Less helpful for building novel capabilities

When you create synthetic data, you are distilling the abilities of the "teacher" model as they relate to the workflows you care about. If you are trying to build a better model than, say, ChatGPT for your specific use case, then using ChatGPT to generate the synthetic data would not be very helpful (not to mention, it would violate OpenAI's acceptable use policy). As an example, consider AI "agents" that do things in the world: a common challenge today is their ability to break down a problem and generate a plan. Even the state-of-the-art frontier models are not yet sufficiently good at this task, and generating sample trajectories from them without further processing or human review is not a feasible way to collect this data.

More helpful for catching up to frontier models

There are dozens of models today that have surpassed ChatGPT's initial performance. A common technique in the industry has been to take a leading model and use it to generate training data for smaller, more efficient models. This famously happened when Meta released its initial Llama model, only to be quickly outperformed by Alpaca, and then Vicuna, both of which fine-tuned Llama's weights on data derived from OpenAI's models.
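The distillation recipe described above can be sketched as a simple loop: prompt the teacher model with seed instructions, collect its responses, and save the pairs as a fine-tuning dataset for the student. The `query_teacher` function here is a hypothetical stand-in for a real teacher-model API call, so the sketch runs without network access.

```python
import json

def query_teacher(instruction: str) -> str:
    """Hypothetical stand-in for querying a teacher model (in practice,
    an API call to a large model). Returns a canned placeholder."""
    return f"(teacher response to: {instruction})"

def build_distillation_set(instructions: list[str]) -> list[dict]:
    """Turn seed instructions into instruction/response training pairs,
    in the style of Alpaca-like fine-tuning datasets."""
    return [
        {"instruction": inst, "response": query_teacher(inst)}
        for inst in instructions
    ]

seeds = ["Summarize the water cycle.", "Write a haiku about autumn."]
dataset = build_distillation_set(seeds)
print(json.dumps(dataset[0], indent=2))
```

Note that the caveats above still apply: the student inherits the teacher's biases and errors, and the teacher's terms of service may restrict this use.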

AI should still be part of your human process

Given the caveats above, it may seem like synthetic data is not that useful for model development, but that doesn't mean AI models have no place in helping create human data. In fact, quite the contrary! Generative AI models are incredibly useful for triaging data, classifying types of errors, and providing sanity checks to their human counterparts. A best-in-class data annotation system makes thoughtful use of AI and understands the nuances of human-AI interaction.
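One way the triage idea plays out in practice is confidence-based routing: a model pre-labels every item, high-confidence items are accepted (and spot-checked), and uncertain items go to human annotators. The `model_prelabel` function below is a hypothetical stand-in for a real classifier or LLM call, with a toy keyword rule so the sketch is self-contained.

```python
def model_prelabel(text: str) -> tuple[str, float]:
    """Hypothetical pre-labeler; returns (label, confidence).
    Toy stand-in rule: flag an obvious spam keyword."""
    if "spam" in text.lower():
        return ("spam", 0.95)
    return ("unknown", 0.40)

def triage(items: list[str], threshold: float = 0.9):
    """Route each item: auto-accept confident model labels,
    send everything else to human annotators."""
    auto, needs_human = [], []
    for text in items:
        label, conf = model_prelabel(text)
        (auto if conf >= threshold else needs_human).append((text, label))
    return auto, needs_human

auto, needs_human = triage(["Buy spam now!!!", "Is this review genuine?"])
```

The design choice here is that the model never replaces the human pipeline; it concentrates human attention on the items where it matters most.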