Specification Overview

LLM Training Data v1 is designed for practical, large-scale fine-tuning workflows. This document explains the design philosophy and architectural decisions.

Design Goals

1. Massive Scale Support

Modern fine-tuning datasets can contain millions or billions of examples. The format supports:

  • JSONL sharding - Split data across multiple files for parallel processing
  • Compression - Native support for gzip and zstd compression
  • Asset deduplication - Centralized asset index to avoid repeating large media URIs
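As a minimal sketch of how a consumer might stream one compressed shard, assuming gzip compression and one JSON record per line (the helper name `iter_records` is hypothetical, not part of the spec):

```python
import gzip
import json

def iter_records(shard_path):
    """Yield parsed records, one per line, from a gzip-compressed JSONL shard."""
    with gzip.open(shard_path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate a trailing blank line
                yield json.loads(line)
```

Because each shard is independent, many shards can be consumed in parallel by separate workers.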

2. Multiple Training Objectives

Different training approaches require different data structures:

Objective    Use Case                                     Schema
SFT          Teaching the model to follow instructions    sft_record
Preference   DPO, RLHF, reward modeling                   preference_record
RFT          Reinforcement fine-tuning with verifiers     rft_record
CPT          Continued pretraining on documents           cpt_record
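For orientation, a minimal sft_record might look like the following sketch; the exact field values are illustrative, and only the fields named in this document (id, messages, role, content, type) are assumed:

```json
{
  "id": "sft-000001",
  "messages": [
    {"role": "user", "content": [{"type": "text", "text": "Summarize this article."}]},
    {"role": "assistant", "content": [{"type": "text", "text": "The article argues that..."}]}
  ]
}
```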

3. Rich Content Model

Training data increasingly includes:

  • Multimodal content - Images, audio, video, documents
  • Tool use - Function calls and results
  • Reasoning traces - Chain-of-thought, scratchpads
  • Structured data - JSON objects, code

The content part model supports all of these uniformly.
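A single message can therefore mix part types. The sketch below is illustrative only; the specific part types and field names (asset_id, tool_call, name, arguments) are assumptions for the example, not normative:

```json
{
  "role": "assistant",
  "content": [
    {"type": "text", "text": "Here is the chart you asked about."},
    {"type": "image", "asset_id": "img-0042"},
    {"type": "tool_call", "name": "lookup_weather", "arguments": {"city": "Oslo"}}
  ]
}
```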

4. Reproducibility

Every artifact can be verified:

  • SHA-256 checksums on shards and assets
  • Explicit record counts
  • Deterministic file paths
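A verification pass over a shard can be sketched as follows, assuming an uncompressed JSONL shard whose record_count equals its number of non-blank lines (the function name `verify_shard` is hypothetical):

```python
import hashlib

def verify_shard(path, expected_sha256, expected_count):
    """Check a shard's SHA-256 checksum and record count against its metadata entry."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # stream in 1 MiB chunks
            digest.update(chunk)
    with open(path, "rb") as f:
        count = sum(1 for line in f if line.strip())  # count non-blank lines
    return digest.hexdigest() == expected_sha256 and count == expected_count
```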

Architecture

Metadata Layer

The metadata.json file serves as the entry point:

metadata.json
  ├── dataset (identity)
  ├── data_card (optional link to LLM Data Card)
  ├── defaults (system message, toolset, reasoning policy)
  ├── toolsets (tool definitions)
  ├── asset_index (pointer to assets.jsonl)
  └── files (shard listings)
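A minimal metadata.json following this layout might look like the sketch below; the concrete names, paths, and values are illustrative assumptions, and optional keys (data_card, toolsets, asset_index) are omitted:

```json
{
  "dataset": {"name": "support-chats", "version": "1.0.0"},
  "defaults": {"system_message": "You are a helpful assistant."},
  "files": [
    {
      "split": "train",
      "objective": "sft",
      "shards": [
        {
          "path": "shards/train-00000.jsonl.gz",
          "compression": "gzip",
          "record_count": 50000,
          "sha256": "..."
        }
      ]
    }
  ]
}
```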

Data Layer

Each file entry points to JSONL shards:

files[]
  ├── split (train/validation/test)
  ├── objective (sft/preference/rft/cpt)
  └── shards[]
        ├── path
        ├── compression
        ├── record_count
        └── sha256

Record Layer

Each JSONL line is a record with messages:

record
  ├── id
  ├── messages[]
  │     ├── role
  │     └── content[]
  │           ├── type
  │           └── (type-specific fields)
  ├── toolset_id (optional override)
  ├── candidates (preference only)
  ├── reference (rft only)
  └── metadata
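As one concrete sketch, a preference_record with candidates might look like this; the internal shape of each candidate (and the rank field) is an assumption for illustration, not defined by this overview:

```json
{
  "id": "pref-000001",
  "messages": [
    {"role": "user", "content": [{"type": "text", "text": "Explain recursion briefly."}]}
  ],
  "candidates": [
    {"messages": [{"role": "assistant", "content": [{"type": "text", "text": "Recursion is a function calling itself..."}]}], "rank": 1},
    {"messages": [{"role": "assistant", "content": [{"type": "text", "text": "It is when code repeats..."}]}], "rank": 2}
  ],
  "metadata": {"source": "annotator-pool-a"}
}
```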

Schema Validation

The schema uses JSON Schema 2020-12 with:

  • Strict mode - additionalProperties: false ensures no unknown fields
  • Conditional validation - Type-specific requirements via if/then
  • Separate record schemas - Each objective has its own definition under $defs
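These three mechanisms combine as in the following illustrative content-part schema; this is a sketch of the pattern, not the normative schema:

```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "additionalProperties": false,
  "properties": {
    "type": {"enum": ["text", "image"]},
    "text": {"type": "string"},
    "asset_id": {"type": "string"}
  },
  "required": ["type"],
  "if": {"properties": {"type": {"const": "text"}}},
  "then": {"required": ["text"]}
}
```

Here additionalProperties: false rejects unknown fields, while the if/then clause makes text required only when type is "text".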

Relationship to LLM Data Card

LLM Training Data focuses on format, while LLM Data Card focuses on documentation:

Aspect     LLM Training Data           LLM Data Card
Purpose    Data structure              Dataset documentation
Content    Records, messages, tools    Rights, provenance, safety
Files      metadata.json + JSONL       card.json

Use both together for complete dataset publishing.

Version History

  • v1 (December 2024) - Initial release with SFT, preference, RFT, CPT support