Pre-training involves feeding the model vast amounts of unstructured data so it can learn the structure and nuances of the language(s) in the dataset. The model predicts parts of the data withheld from it, and its parameters are updated based on a loss function that measures prediction errors. This step typically draws on diverse sources such as web pages, books, conversations, scientific articles, and code repositories. Two commonly used sources of pre-training data are “Common Crawl” and “The Pile”. You generally won’t involve expert annotators directly in creating this data, though they can help with curation. There are exceptions, such as this paper, which explores using human preferences in pre-training.
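To make the "predict withheld data, then update from the loss" loop concrete, here is a minimal sketch in PyTorch of one pre-training step with a next-token prediction objective (the objective used by GPT-style models). The tiny model, random token IDs, and hyperparameters are placeholders for illustration, not the setup of any particular system.

```python
# Minimal sketch of a single pre-training step: the model predicts each next
# token (the part "withheld" from it), and its parameters are updated from the
# cross-entropy loss that measures those prediction errors.
import torch
import torch.nn as nn

vocab_size, d_model, seq_len, batch_size = 1000, 64, 32, 8

class TinyLM(nn.Module):
    """Toy causal language model: embed -> transformer layer -> vocab logits."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        # Causal mask so each position only sees earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.layer(self.embed(tokens), src_mask=mask)
        return self.head(h)

model = TinyLM()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

# Stand-in batch of token IDs; in real pre-training these would come from a
# large corpus such as Common Crawl or The Pile.
tokens = torch.randint(0, vocab_size, (batch_size, seq_len))

# Next-token prediction: inputs are all but the last token, targets are the
# same sequence shifted left by one position.
logits = model(tokens[:, :-1])
loss = loss_fn(logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1))

loss.backward()        # gradients of the prediction error w.r.t. parameters
optimizer.step()       # parameter update driven by the loss
optimizer.zero_grad()
print(f"training loss: {loss.item():.3f}")
```

Real pre-training repeats this step over trillions of tokens with a far larger model, but the core loop of predicting withheld tokens and updating on the resulting loss is the same.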
I