Specification Overview

LLM Training Data v1 is designed for practical, large-scale fine-tuning workflows. This document explains the design philosophy and architectural decisions.

Design Goals

1. Massive Scale Support

Modern fine-tuning datasets can contain millions or billions of examples. The format supports:

  • JSONL sharding - Split data across multiple files for parallel processing
  • Compression - Native support for gzip and zstd compression
  • Asset deduplication - Centralized asset index to avoid repeating large media URIs
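As a minimal sketch of how a consumer might stream one compressed shard, assuming gzip compression and one JSON record per line (the helper name `iter_records` is hypothetical, not part of the spec):

```python
import gzip
import json

def iter_records(shard_path):
    """Yield parsed records, one per line, from a gzip-compressed JSONL shard."""
    with gzip.open(shard_path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate a trailing blank line
                yield json.loads(line)
```

Because each shard is independent, many shards can be consumed in parallel by separate workers.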

2. Multiple Training Objectives

Different training approaches require different data structures:

Objective    Use Case                                     Schema
SFT          Teaching the model to follow instructions    sft_record
Preference   DPO, RLHF, reward modeling                   preference_record
RFT          Reinforcement fine-tuning with verifiers     rft_record
CPT          Continued pretraining on documents           cpt_record
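For orientation, a minimal sft_record might look like the following sketch; the exact field values are illustrative, and only the fields named in this document (id, messages, role, content, type) are assumed:

```json
{
  "id": "sft-000001",
  "messages": [
    {"role": "user", "content": [{"type": "text", "text": "Summarize this article."}]},
    {"role": "assistant", "content": [{"type": "text", "text": "The article argues that..."}]}
  ]
}
```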

3. Rich Content Model

Training data increasingly includes:

  • Multimodal content - Images, audio, video, documents
  • Tool use - Function calls and results
  • Reasoning traces - Chain-of-thought, scratchpads
  • Structured data - JSON objects, code

The content part model supports all of these uniformly.
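A single message can therefore mix part types. The sketch below is illustrative only; the specific part types and field names (asset_id, tool_call, name, arguments) are assumptions for the example, not normative:

```json
{
  "role": "assistant",
  "content": [
    {"type": "text", "text": "Here is the chart you asked about."},
    {"type": "image", "asset_id": "img-0042"},
    {"type": "tool_call", "name": "lookup_weather", "arguments": {"city": "Oslo"}}
  ]
}
```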

4. Reproducibility

Every artifact can be verified:

  • SHA-256 checksums on shards and assets
  • Explicit record counts
  • Deterministic file paths
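A verification pass over a shard can be sketched as follows, assuming an uncompressed JSONL shard whose record_count equals its number of non-blank lines (the function name `verify_shard` is hypothetical):

```python
import hashlib

def verify_shard(path, expected_sha256, expected_count):
    """Check a shard's SHA-256 checksum and record count against its metadata entry."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # stream in 1 MiB chunks
            digest.update(chunk)
    with open(path, "rb") as f:
        count = sum(1 for line in f if line.strip())  # count non-blank lines
    return digest.hexdigest() == expected_sha256 and count == expected_count
```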

Architecture

Metadata Layer

The metadata.json file serves as the entry point:

metadata.json
  ├── dataset (identity)
  ├── data_card (optional link to LLM Data Card)
  ├── defaults (system message, toolset, reasoning policy)
  ├── toolsets (tool definitions)
  ├── asset_index (pointer to assets.jsonl)
  └── files (shard listings)
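A minimal metadata.json following this layout might look like the sketch below; the concrete names, paths, and values are illustrative assumptions, and optional keys (data_card, toolsets, asset_index) are omitted:

```json
{
  "dataset": {"name": "support-chats", "version": "1.0.0"},
  "defaults": {"system_message": "You are a helpful assistant."},
  "files": [
    {
      "split": "train",
      "objective": "sft",
      "shards": [
        {
          "path": "shards/train-00000.jsonl.gz",
          "compression": "gzip",
          "record_count": 50000,
          "sha256": "..."
        }
      ]
    }
  ]
}
```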

Data Layer

Each file entry points to JSONL shards:

files[]
  ├── split (train/validation/test)
  ├── objective (sft/preference/rft/cpt)
  └── shards[]
        ├── path
        ├── compression
        ├── record_count
        └── sha256

Record Layer

Each JSONL line is a record with messages:

record
  ├── id
  ├── messages[]
  │     ├── role
  │     └── content[]
  │           ├── type
  │           └── (type-specific fields)
  ├── toolset_id (optional override)
  ├── candidates (preference only)
  ├── reference (rft only)
  └── metadata
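As one concrete sketch, a preference_record with candidates might look like this; the internal shape of each candidate (and the rank field) is an assumption for illustration, not defined by this overview:

```json
{
  "id": "pref-000001",
  "messages": [
    {"role": "user", "content": [{"type": "text", "text": "Explain recursion briefly."}]}
  ],
  "candidates": [
    {"messages": [{"role": "assistant", "content": [{"type": "text", "text": "Recursion is a function calling itself..."}]}], "rank": 1},
    {"messages": [{"role": "assistant", "content": [{"type": "text", "text": "It is when code repeats..."}]}], "rank": 2}
  ],
  "metadata": {"source": "annotator-pool-a"}
}
```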

Schema Validation

The schema uses JSON Schema 2020-12 with:

  • Strict mode - additionalProperties: false ensures no unknown fields
  • Conditional validation - Type-specific requirements via if/then
  • Separate record schemas - Each objective has its own definition under $defs
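These three mechanisms combine as in the following illustrative content-part schema; this is a sketch of the pattern, not the normative schema:

```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "additionalProperties": false,
  "properties": {
    "type": {"enum": ["text", "image"]},
    "text": {"type": "string"},
    "asset_id": {"type": "string"}
  },
  "required": ["type"],
  "if": {"properties": {"type": {"const": "text"}}},
  "then": {"required": ["text"]}
}
```

Here additionalProperties: false rejects unknown fields, while the if/then clause makes text required only when type is "text".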

Relationship to LLM Data Card

LLM Training Data focuses on format, while LLM Data Card focuses on documentation:

Aspect     LLM Training Data           LLM Data Card
Purpose    Data structure              Dataset documentation
Content    Records, messages, tools    Rights, provenance, safety
Files      metadata.json + JSONL       card.json

Use both together for complete dataset publishing.

Version History

  • v1 (December 2024) - Initial release with SFT, preference, RFT, CPT support