CPT Records

Continued pretraining (CPT) records adapt a model to a domain by training on documents that have no instruction-response structure.

Requirements

No assistant or tool messages - only the system, user, and document roles are allowed
Raw text - documents are processed as raw text for language modeling
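
The role requirement can be checked before a record enters a dataset. The following is a minimal Python sketch, assuming a record is a parsed dict shaped like the examples in this section; the names ALLOWED_ROLES and validate_cpt_record are hypothetical helpers, not part of any documented API.

```python
# Roles permitted in a CPT record, per the requirements above.
ALLOWED_ROLES = {"system", "user", "document"}


def validate_cpt_record(record: dict) -> list[str]:
    """Return a list of role problems; an empty list means none were found.

    Hypothetical helper: it checks message roles only and does not
    validate the content parts.
    """
    errors = []
    for index, message in enumerate(record.get("messages", [])):
        role = message.get("role")
        if role not in ALLOWED_ROLES:
            errors.append(
                f"message {index}: role {role!r} is not allowed in a CPT record"
            )
    return errors
```

Returning a list of problems instead of raising on the first one makes it easier to report every offending message in a single pass over a large dataset.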

Basic Example

{
  "id": "cpt-001",
  "messages": [
    {
      "role": "document",
      "content": [
        {
          "type": "text",
          "text": "Introduction to Machine Learning\n\nMachine learning is a subset of artificial intelligence that enables systems to learn from data. Rather than being explicitly programmed, these systems improve through experience..."
        }
      ]
    }
  ]
}

Multiple Document Chunks

Split a long document into related chunks, one document message per chunk:

{
  "id": "cpt-002",
  "messages": [
    {
      "role": "document",
      "content": [
        {
          "type": "text",
          "text": "Chapter 1: Data Preprocessing\n\nBefore training any model, data must be cleaned and prepared..."
        }
      ]
    },
    {
      "role": "document",
      "content": [
        {
          "type": "text",
          "text": "Chapter 2: Feature Engineering\n\nFeature engineering is the process of using domain knowledge to create inputs..."
        }
      ]
    }
  ]
}
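
A record like the one above could be produced by grouping paragraphs into chunks. This is a sketch under stated assumptions: the functions chunk_paragraphs and build_cpt_record, the max_chars parameter, and its 2000-character default are all illustrative choices, and paragraphs are assumed to be separated by blank lines.

```python
def chunk_paragraphs(text: str, max_chars: int = 2000) -> list[str]:
    """Group blank-line-separated paragraphs into chunks.

    A chunk is closed when adding the next paragraph would exceed
    max_chars. A single paragraph longer than max_chars is kept whole
    rather than being cut mid-paragraph.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    current = ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def build_cpt_record(record_id: str, text: str, max_chars: int = 2000) -> dict:
    """Wrap each chunk in its own document message (hypothetical helper)."""
    return {
        "id": record_id,
        "messages": [
            {"role": "document", "content": [{"type": "text", "text": chunk}]}
            for chunk in chunk_paragraphs(text, max_chars)
        ],
    }
```

Splitting only at paragraph boundaries keeps each chunk readable on its own, at the cost of uneven chunk sizes.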

With Metadata Context

Include structured context alongside documents by adding a document message with a json content part:

{
  "id": "cpt-003",
  "messages": [
    {
      "role": "document",
      "content": [{ "type": "json", "data": { "title": "Research Paper", "authors": ["J. Smith"], "year": 2024 } }]
    },
    {
      "role": "document",
      "content": [{ "type": "text", "text": "Abstract: This paper presents a novel approach to..." }]
    }
  ]
}

Multimodal Documents

Include images or other media with documents:

{
  "id": "cpt-multimodal-001",
  "messages": [
    {
      "role": "document",
      "content": [
        { "type": "text", "text": "Product Manual\n\nThe device shown below is the Model X-100:" },
        { "type": "image", "ref": { "asset_id": "product-diagram-001" } },
        { "type": "text", "text": "To operate, first ensure the power supply is connected..." }
      ]
    }
  ]
}

Use Cases

Domain adaptation - Train on specialized corpora (legal, medical, scientific)
Language expansion - Add training data for new languages
Knowledge injection - Incorporate up-to-date information
Style adaptation - Train on content with specific writing styles

Best Practices

Clean text - Remove boilerplate, headers, footers
Preserve structure - Keep meaningful formatting (paragraphs, lists)
Chunk thoughtfully - Split at natural boundaries (sections, paragraphs)
Deduplicate - Remove exact or near-duplicate documents
Quality filter - Exclude low-quality or noisy content
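
The deduplication and quality-filter steps can be sketched in Python as follows. This is an illustration under assumptions, not a prescribed pipeline: it removes only exact duplicates after whitespace and case normalization (near-duplicate detection would need a different technique), and it uses a minimum length as a stand-in for a real quality filter. The function names and the min_chars default are hypothetical.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase the text, for comparison only."""
    return re.sub(r"\s+", " ", text).strip().lower()


def dedupe_and_filter(documents: list[str], min_chars: int = 200) -> list[str]:
    """Drop very short documents and exact duplicates, keeping first copies.

    The original text of each kept document is returned unchanged; the
    normalized form is used only to build the comparison key.
    """
    seen = set()
    kept = []
    for doc in documents:
        key_text = normalize(doc)
        if len(key_text) < min_chars:
            continue  # crude quality filter: too short to be useful
        digest = hashlib.sha256(key_text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # same content already kept
        seen.add(digest)
        kept.append(doc)
    return kept
```

Hashing the normalized text keeps memory use proportional to the number of unique documents, since only digests are stored and not the documents themselves.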