CPT Records

Continued pretraining (CPT) records adapt a model to a domain by training on raw documents that have no instruction-response structure.

Requirements

  • No assistant or tool messages - only the system, user, and document roles are allowed
  • Documents are processed as raw text for language modeling
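
The role rule above can be checked before a record is submitted. The following is a minimal sketch, assuming records are already parsed into Python dictionaries shaped like the examples on this page. The function name validate_cpt_record and the wording of its error messages are hypothetical and not part of any documented API.

```python
# Roles permitted in a CPT record, per the requirements above.
ALLOWED_ROLES = {"system", "user", "document"}


def validate_cpt_record(record: dict) -> list[str]:
    """Return a list of problems found; an empty list means the record passed.

    Hypothetical helper: only the role rule comes from the requirements above.
    """
    problems = []
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["record has no messages"]
    for index, message in enumerate(messages):
        role = message.get("role")
        if role not in ALLOWED_ROLES:
            problems.append(f"message {index}: role {role!r} is not allowed")
    return problems
```

A record that contains an assistant or tool message would come back with one entry per offending message, which makes it easy to report every problem at once instead of stopping at the first.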

Basic Example

{
  "id": "cpt-001",
  "messages": [
    {
      "role": "document",
      "content": [
        { "type": "text", "text": "Introduction to Machine Learning\n\nMachine learning is a subset of artificial intelligence that enables systems to learn from data. Rather than being explicitly programmed, these systems improve through experience..." }
      ]
    }
  ]
}

Multiple Document Chunks

Split a long document into related chunks, with one document message per chunk:

{
  "id": "cpt-002",
  "messages": [
    {
      "role": "document",
      "content": [
        { "type": "text", "text": "Chapter 1: Data Preprocessing\n\nBefore training any model, data must be cleaned and prepared..." }
      ]
    },
    {
      "role": "document",
      "content": [
        { "type": "text", "text": "Chapter 2: Feature Engineering\n\nFeature engineering is the process of using domain knowledge to create inputs..." }
      ]
    }
  ]
}
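
A record like the one above could be produced programmatically. The sketch below is one possible approach, not a prescribed method: it groups paragraphs (separated by double newlines) into chunks under a size limit and wraps each chunk in its own document message. The helper names chunk_paragraphs and build_cpt_record and the max_chars default are assumptions chosen for illustration.

```python
def chunk_paragraphs(text: str, max_chars: int = 2000) -> list[str]:
    """Group paragraphs (split on double newlines) into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole rather than cut mid-sentence.
    """
    chunks, current = [], ""
    for paragraph in (p.strip() for p in text.split("\n\n")):
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            # Adding this paragraph would exceed the limit: close the current chunk.
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def build_cpt_record(record_id: str, chunks: list[str]) -> dict:
    """Wrap each chunk in its own document message, mirroring the example above."""
    return {
        "id": record_id,
        "messages": [
            {"role": "document", "content": [{"type": "text", "text": chunk}]}
            for chunk in chunks
        ],
    }
```

Splitting on paragraph boundaries keeps each chunk readable on its own. Splitting at section or chapter headings, as in the example, would need a document-specific rule that is not shown here.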

With Metadata Context

Include structured context, such as bibliographic metadata, as a json content part alongside the document text:

{
  "id": "cpt-003",
  "messages": [
    {
      "role": "document",
      "content": [
        { "type": "json", "data": { "title": "Research Paper", "authors": ["J. Smith"], "year": 2024 } }
      ]
    },
    {
      "role": "document",
      "content": [
        { "type": "text", "text": "Abstract: This paper presents a novel approach to..." }
      ]
    }
  ]
}

Multimodal Documents

Interleave images or other media with text inside a document message:

{
  "id": "cpt-multimodal-001",
  "messages": [
    {
      "role": "document",
      "content": [
        { "type": "text", "text": "Product Manual\n\nThe device shown below is the Model X-100:" },
        { "type": "image", "ref": { "asset_id": "product-diagram-001" } },
        { "type": "text", "text": "To operate, first ensure the power supply is connected..." }
      ]
    }
  ]
}
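
Because image parts point to assets by reference, it can be useful to list the referenced assets before training, for example to confirm they have been uploaded. The sketch below assumes only the part shape shown in the example above ("type": "image" with a "ref" object holding "asset_id"). The function name referenced_asset_ids is hypothetical.

```python
def referenced_asset_ids(record: dict) -> list[str]:
    """Collect asset_id values from image parts, in order of appearance."""
    asset_ids = []
    for message in record.get("messages", []):
        for part in message.get("content", []):
            if part.get("type") == "image":
                asset_id = part.get("ref", {}).get("asset_id")
                if asset_id is not None:
                    asset_ids.append(asset_id)
    return asset_ids
```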

Use Cases

  • Domain adaptation - Train on specialized corpora (legal, medical, scientific)
  • Language expansion - Add training data for new languages
  • Knowledge injection - Incorporate up-to-date information
  • Style adaptation - Train on content with specific writing styles

Best Practices

  1. Clean text - Remove boilerplate, headers, footers
  2. Preserve structure - Keep meaningful formatting (paragraphs, lists)
  3. Chunk thoughtfully - Split at natural boundaries (sections, paragraphs)
  4. Deduplicate - Remove exact or near-duplicate documents
  5. Quality filter - Exclude low-quality or noisy content
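
Two of the practices above, deduplication and quality filtering, can be sketched in a few lines. This is a deliberately simple illustration under stated assumptions: it removes only exact duplicates after whitespace and case normalization, and it uses a minimum length as a crude stand-in for a quality filter. Near-duplicate detection and real quality scoring are not shown. The names normalize and deduplicate and the min_chars default are hypothetical.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies hash the same."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(documents: list[str], min_chars: int = 200) -> list[str]:
    """Drop very short documents and repeats, keeping the first copy of each."""
    kept, seen = [], set()
    for doc in documents:
        # Crude quality filter (an assumption): skip documents that are too short.
        if len(doc.strip()) < min_chars:
            continue
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(doc)
    return kept
```

Hashing the normalized text rather than storing it keeps memory use bounded by the number of distinct documents, not their total size.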