Changelog
All notable changes to the LLM Training Data specification.
v1 (December 2024)
Initial release of the LLM Training Data specification.
Features
- Core format
  - JSONL-based record format with metadata.json manifest
  - Support for gzip and zstd compression
  - SHA-256 checksums for integrity verification
- Training objectives
  - Supervised fine-tuning (SFT)
  - Preference training (DPO/RLHF)
  - Reinforcement fine-tuning (RFT)
  - Continued pretraining (CPT)
- Content types
  - Text and reasoning traces
  - JSON structured data
  - Images, audio, video, documents
  - Tool calls and tool results
- Dataset features
  - Toolset definitions at dataset level
  - Default system messages
  - Reasoning policy configuration
  - Asset deduplication via assets.jsonl
- Integration
  - LLM Data Card cross-referencing
  - Support for embedded or referenced data cards
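
As a rough illustration of the core-format items above (JSONL records, gzip compression, SHA-256 checksums), the following sketch uses only the Python standard library. The file name `train.jsonl.gz`, the record fields, and the helper names are assumptions made for this example; the specification itself defines the manifest layout and which bytes a checksum covers. zstd is left out because it is not handled by the modules used here.

```python
import gzip
import hashlib
import json
import os
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file's bytes as stored on disk."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large shards do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_jsonl_gz(path, records):
    """Write one JSON object per line to a gzip-compressed file."""
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")


def read_jsonl_gz(path):
    """Yield one parsed JSON object per non-empty line."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def demo():
    # Hypothetical records; real field names come from the specification.
    records = [{"id": 1, "text": "hello"}, {"id": 2, "text": "world"}]
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "train.jsonl.gz")
        write_jsonl_gz(path, records)
        # Stand-in for a digest that would have been recorded at write time.
        expected = sha256_of_file(path)
        loaded = list(read_jsonl_gz(path))
        verified = sha256_of_file(path) == expected
    return loaded, verified
```

Hashing the compressed bytes rather than the decoded text is a choice made for this sketch, since it lets a consumer verify a file before decompressing it.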
Schema
- JSON Schema 2020-12 compliant
- Strict validation with additionalProperties: false
- Separate record schemas for each training objective
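
To show what strict validation means in practice, here is a minimal sketch of the effect of `additionalProperties: false` on a flat object. The schema and its property names (`prompt`, `response`) are hypothetical and not taken from the specification, and `check_top_level` is a hand-written illustration that only inspects top-level keys; it is not a JSON Schema validator.

```python
# Hypothetical schema fragment; property names are placeholders for illustration.
EXAMPLE_SCHEMA = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "properties": {
        "prompt": {"type": "string"},
        "response": {"type": "string"},
    },
    "required": ["prompt", "response"],
    "additionalProperties": False,
}


def check_top_level(schema, record):
    """Return a list of problems for the record's top-level keys.

    Covers only two keywords: `required`, and `additionalProperties`
    when it is set to False. Types and nested objects are not checked.
    """
    allowed = set(schema.get("properties", {}))
    required = set(schema.get("required", []))
    present = set(record)

    problems = []
    for key in sorted(required - present):
        problems.append(f"missing required property: {key}")
    if schema.get("additionalProperties") is False:
        for key in sorted(present - allowed):
            problems.append(f"unexpected property: {key}")
    return problems
```

With a strict schema, a misspelled key such as `respnse` is reported as an unexpected property instead of being silently ignored, which is the main practical benefit of this setting.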