Changelog

All notable changes to the LLM Training Data specification are documented here.

v1 (December 2025)

Initial release of the LLM Training Data specification.

Core format
- JSONL-based record format with metadata.json manifest
- Support for gzip and zstd compression
- SHA-256 checksums for integrity verification

Training objectives
- Supervised fine-tuning (SFT)
- Preference training (DPO/RLHF)
- Reinforcement fine-tuning (RFT)
- Continued pretraining (CPT)

Content types
- Text and reasoning traces
- JSON structured data
- Images, audio, video, documents
- Tool calls and tool results

Dataset features
- Toolset definitions at dataset level
- Default system messages
- Reasoning policy configuration
- Asset deduplication via assets.jsonl

Integration
- LLM Data Card cross-referencing
- Support for embedded or referenced data cards