CPT Records
Continued pretraining (CPT) records are used for domain adaptation by training on documents without instruction-response structure.
Requirements
- No assistant or tool messages - Only system, user, and document roles allowed
- Documents are processed as raw text for language modeling
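The role restriction above can be checked before a record is submitted. The following is a minimal sketch, not an official validator: the function name `validate_cpt_record` and the error messages are hypothetical, and it only checks the role rule stated above against the record shape shown in the examples on this page.

```python
# Roles permitted in CPT records, per the requirements above.
ALLOWED_ROLES = {"system", "user", "document"}


def validate_cpt_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the role check passed.

    Hypothetical helper: it checks only the role rule, not the full schema.
    """
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["record must contain a non-empty 'messages' list"]

    errors = []
    for i, message in enumerate(messages):
        role = message.get("role")
        if role not in ALLOWED_ROLES:
            errors.append(f"messages[{i}]: role {role!r} is not allowed in CPT records")
    return errors
```

A record containing an assistant or tool message yields one error per offending message, so a whole file can be checked and every problem reported in a single pass.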
Basic Example
{
"id": "cpt-001",
"messages": [
{
"role": "document",
"content": [
{ "type": "text", "text": "Introduction to Machine Learning\n\nMachine learning is a subset of artificial intelligence that enables systems to learn from data. Rather than being explicitly programmed, these systems improve through experience..." }
]
}
]
}
Multiple Document Chunks
Split long documents into related chunks:
{
"id": "cpt-002",
"messages": [
{
"role": "document",
"content": [
{ "type": "text", "text": "Chapter 1: Data Preprocessing\n\nBefore training any model, data must be cleaned and prepared..." }
]
},
{
"role": "document",
"content": [
{ "type": "text", "text": "Chapter 2: Feature Engineering\n\nFeature engineering is the process of using domain knowledge to create inputs..." }
]
}
]
}
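One way to produce a multi-chunk record like the one above is to pack paragraphs greedily into chunks and emit one document message per chunk. This is a sketch under stated assumptions: `chunk_paragraphs`, `build_cpt_record`, and the `max_chars` limit are hypothetical names and values chosen for illustration, and paragraphs are assumed to be separated by blank lines.

```python
def chunk_paragraphs(text: str, max_chars: int = 2000) -> list[str]:
    """Greedily pack blank-line-separated paragraphs into chunks.

    Chunks stay at or under max_chars where possible. A single paragraph
    longer than max_chars is kept whole rather than cut mid-sentence.
    """
    chunks: list[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # Adding this paragraph would overflow: close the current chunk.
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def build_cpt_record(record_id: str, text: str, max_chars: int = 2000) -> dict:
    """Build a record with one document message per chunk (hypothetical helper)."""
    return {
        "id": record_id,
        "messages": [
            {"role": "document", "content": [{"type": "text", "text": chunk}]}
            for chunk in chunk_paragraphs(text, max_chars)
        ],
    }
```

Splitting on paragraph boundaries keeps each chunk readable on its own. A splitter that works on chapter or section headings would follow the same pattern with a different boundary rule.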
With Metadata Context
Include structured context alongside documents:
{
"id": "cpt-003",
"messages": [
{
"role": "document",
"content": [
{ "type": "json", "data": { "title": "Research Paper", "authors": ["J. Smith"], "year": 2024 } }
]
},
{
"role": "document",
"content": [
{ "type": "text", "text": "Abstract: This paper presents a novel approach to..." }
]
}
]
}
Multimodal Documents
Include images or other media with documents:
{
"id": "cpt-multimodal-001",
"messages": [
{
"role": "document",
"content": [
{ "type": "text", "text": "Product Manual\n\nThe device shown below is the Model X-100:" },
{ "type": "image", "ref": { "asset_id": "product-diagram-001" } },
{ "type": "text", "text": "To operate, first ensure the power supply is connected..." }
]
}
]
}
Use Cases
- Domain adaptation - Train on specialized corpora (legal, medical, scientific)
- Language expansion - Add training data for new languages
- Knowledge injection - Incorporate up-to-date information
- Style adaptation - Train on content with specific writing styles
Best Practices
- Clean text - Remove boilerplate, headers, footers
- Preserve structure - Keep meaningful formatting (paragraphs, lists)
- Chunk thoughtfully - Split at natural boundaries (sections, paragraphs)
- Deduplicate - Remove exact or near-duplicate documents
- Quality filter - Exclude low-quality or noisy content
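The deduplication and quality-filter practices above can be sketched with the Python standard library. This is an illustrative sketch only. It catches exact duplicates after whitespace and case normalization, not near-duplicates, which usually need a similarity technique such as MinHash. The function names and the thresholds `min_chars` and `min_alpha_ratio` are assumptions, not recommended values.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Collapse runs of whitespace and lowercase, so trivial variants hash alike."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(documents: list[str]) -> list[str]:
    """Keep the first occurrence of each document, compared after normalization."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in documents:
        key = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


def passes_quality(text: str, min_chars: int = 200, min_alpha_ratio: float = 0.6) -> bool:
    """Crude noise filter: reject very short or mostly non-alphabetic text.

    Both thresholds are hypothetical and should be tuned per corpus.
    """
    stripped = text.strip()
    if len(stripped) < min_chars:
        return False
    alpha = sum(ch.isalpha() for ch in stripped)
    return alpha / len(stripped) >= min_alpha_ratio
```

Running `deduplicate` before `passes_quality` avoids scoring the same text twice. The order does not change which documents survive, since the filter judges each document independently.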