Specification Overview
LLM Training Data v1 is designed for practical, large-scale fine-tuning workflows. This document explains the design philosophy and architectural decisions.
Design Goals
1. Massive Scale Support
Modern fine-tuning datasets can contain millions or billions of examples. The format supports:
- JSONL sharding - Split data across multiple files for parallel processing
- Compression - Native support for gzip and zstd compression
- Asset deduplication - Centralized asset index to avoid repeating large media URIs
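The sketch below shows one way a consumer might stream records out of a compressed shard. It is illustrative only: the helper name iter_records is hypothetical, the literal compression identifiers "gzip" and "zstd" are assumed to be the values of the shard's compression field, and zstd decoding is assumed to go through the third-party zstandard package.

```python
# Minimal shard-reading sketch; names and compression identifiers are assumptions.
import gzip
import io
import json

import zstandard  # third-party: pip install zstandard


def iter_records(shard_path: str, compression: str | None):
    """Yield one decoded JSON record per non-empty line of a JSONL shard."""
    if compression == "gzip":
        stream = gzip.open(shard_path, "rt", encoding="utf-8")
    elif compression == "zstd":
        raw = open(shard_path, "rb")
        reader = zstandard.ZstdDecompressor().stream_reader(raw)
        stream = io.TextIOWrapper(reader, encoding="utf-8")
    else:
        stream = open(shard_path, "r", encoding="utf-8")
    with stream:
        for line in stream:
            if line.strip():
                yield json.loads(line)
```

Because each shard is self-contained, many shards can be read this way in parallel worker processes.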
2. Multiple Training Objectives
Different training approaches require different data structures:
| Objective | Use Case | Schema |
|---|---|---|
| SFT | Teaching the model to follow instructions | sft_record |
| Preference | DPO, RLHF, reward modeling | preference_record |
| RFT | Reinforcement fine-tuning with verifiers | rft_record |
| CPT | Continued pretraining on documents | cpt_record |
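To make the table concrete, here is a rough sketch of how an sft_record and a preference_record might look once parsed. The top-level field names (id, messages, candidates) follow the record layer described later in this document; the role names, the "text" content part type, and the rank field on candidates are assumptions for illustration, not normative.

```python
# Illustrative record shapes only; several field names below are assumptions.
sft_record = {
    "id": "sft-0001",
    "messages": [
        {"role": "user", "content": [{"type": "text", "text": "Summarize this article."}]},
        {"role": "assistant", "content": [{"type": "text", "text": "The article argues that..."}]},
    ],
}

preference_record = {
    "id": "pref-0001",
    "messages": [
        {"role": "user", "content": [{"type": "text", "text": "Explain recursion."}]},
    ],
    # Candidate completions to be compared or ranked (structure assumed).
    "candidates": [
        {"messages": [{"role": "assistant",
                       "content": [{"type": "text", "text": "Recursion is a function calling itself..."}]}],
         "rank": 1},
        {"messages": [{"role": "assistant",
                       "content": [{"type": "text", "text": "It is just a loop."}]}],
         "rank": 2},
    ],
}
```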
3. Rich Content Model
Training data increasingly includes:
- Multimodal content - Images, audio, video, documents
- Tool use - Function calls and results
- Reasoning traces - Chain-of-thought, scratchpads
- Structured data - JSON objects, code
The content part model supports all of these uniformly.
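As a sketch of that uniformity, the message below mixes several part kinds behind the same type discriminator. The discriminator itself comes from the content part model; the specific type names and fields ("reasoning", "text", "image", "tool_call", "asset_id", "name", "arguments") are assumptions chosen for illustration.

```python
# One assistant message combining reasoning, text, media, and tool use.
# Part type names and field names are illustrative assumptions.
message = {
    "role": "assistant",
    "content": [
        {"type": "reasoning", "text": "The user wants a chart, so call the plotting tool first."},
        {"type": "text", "text": "Here is the chart you asked for."},
        {"type": "image", "asset_id": "asset-42"},  # resolved through the centralized asset index
        {"type": "tool_call", "name": "render_chart", "arguments": {"metric": "revenue"}},
    ],
}
```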
4. Reproducibility
Every artifact can be verified:
- SHA-256 checksums on shards and assets
- Explicit record counts
- Deterministic file paths
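A verification pass might look like the sketch below, which reuses the hypothetical iter_records helper from above. It assumes the checksum covers the shard file exactly as stored on disk and that shard entries expose path, sha256, record_count, and compression as described in the data layer.

```python
# Sketch of verifying a shard against its published checksum and record count.
import hashlib


def verify_shard(shard: dict, base_dir: str) -> None:
    path = f"{base_dir}/{shard['path']}"

    # Hash the shard file as stored (compressed or not).
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    if digest.hexdigest() != shard["sha256"]:
        raise ValueError(f"checksum mismatch for {shard['path']}")

    # Re-count records after decompression and compare to the declared count.
    records = sum(1 for _ in iter_records(path, shard.get("compression")))
    if records != shard["record_count"]:
        raise ValueError(f"record count mismatch for {shard['path']}")
```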
Architecture
Metadata Layer
The metadata.json file serves as the entry point:
metadata.json
├── dataset (identity)
├── data_card (optional link to LLM Data Card)
├── defaults (system message, toolset, reasoning policy)
├── toolsets (tool definitions)
├── asset_index (pointer to assets.jsonl)
└── files (shard listings)
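The dict literal below sketches what that top level might look like once metadata.json is loaded. The key names follow the tree above; the nested field names and concrete values (system_message, toolset_id, the dataset name, and so on) are assumptions, not part of the spec.

```python
# Illustrative top-level shape of metadata.json; nested fields are assumptions.
metadata = {
    "dataset": {"name": "support-chats", "version": "1.0.0"},
    "data_card": "card.json",                # optional link to an LLM Data Card
    "defaults": {
        "system_message": "You are a helpful assistant.",
        "toolset_id": "default-tools",
    },
    "toolsets": [{"id": "default-tools", "tools": [...]}],
    "asset_index": {"path": "assets.jsonl"},
    "files": [...],                          # expanded in the data layer below
}
```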
Data Layer
Each file entry points to JSONL shards:
files[]
├── split (train/validation/test)
├── objective (sft/preference/rft/cpt)
└── shards[]
├── path
├── compression
├── record_count
└── sha256
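Selecting shards for a run is then a matter of filtering the listing. The sketch below reuses the metadata and iter_records sketches from earlier and assumes files[] entries carry split, objective, and shards exactly as in the tree; the select_shards helper is hypothetical.

```python
# Sketch of picking the training SFT shards out of the file listing.
def select_shards(metadata: dict, split: str, objective: str):
    for entry in metadata.get("files", []):
        if entry["split"] == split and entry["objective"] == objective:
            yield from entry["shards"]


for shard in select_shards(metadata, split="train", objective="sft"):
    for record in iter_records(shard["path"], shard.get("compression")):
        ...  # hand the record to the training pipeline
```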
Record Layer
Each JSONL line is a record with messages:
record
├── id
├── messages[]
│ ├── role
│ └── content[]
│ ├── type
│ └── (type-specific fields)
├── toolset_id (optional override)
├── candidates (preference only)
├── reference (rft only)
└── metadata
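Because every objective shares the same messages[] and content[] nesting, generic tooling can walk records uniformly. The helper below is a hypothetical example of such a traversal, e.g. for token counting; its name and the "text" part type are assumptions.

```python
# Hypothetical helper that flattens a record's textual content.
def text_of(record: dict) -> str:
    pieces = []
    for message in record.get("messages", []):
        for part in message.get("content", []):
            if part.get("type") == "text":
                pieces.append(part["text"])
    return "\n".join(pieces)
```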
Schema Validation
The schema uses JSON Schema 2020-12 with:
- Strict mode - additionalProperties: false ensures no unknown fields
- Conditional validation - Type-specific requirements are enforced via if/then
- Separate record schemas - Each objective has its own $defs entry
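A record can be checked in a few lines with the jsonschema package, which supports Draft 2020-12. The toy schema below is illustrative only, not the published one, and the example validates the hypothetical sft_record sketched earlier; the spec's conditional if/then requirements would be expressed in the same schema dialect.

```python
# Minimal validation sketch against a toy, non-normative schema.
import jsonschema

toy_sft_schema = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "properties": {
        "id": {"type": "string"},
        "messages": {"type": "array"},
        "metadata": {"type": "object"},
    },
    "required": ["id", "messages"],
    "additionalProperties": False,  # strict mode: unknown fields are rejected
}

jsonschema.Draft202012Validator(toy_sft_schema).validate(sft_record)
```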
Relationship to LLM Data Card
LLM Training Data focuses on the data format, while LLM Data Card focuses on documentation:
| Aspect | LLM Training Data | LLM Data Card |
|---|---|---|
| Purpose | Data structure | Dataset documentation |
| Content | Records, messages, tools | Rights, provenance, safety |
| Files | metadata.json + JSONL | card.json |
Use both together for complete dataset publishing.
Version History
- v1 (December 2024) - Initial release with SFT, preference, RFT, CPT support