Changelog

All notable changes to the LLM Training Data specification.

v1 (December 2024)

Initial release of the LLM Training Data specification.

Features

  • Core format

    • JSONL-based record format with metadata.json manifest
    • Support for gzip and zstd compression
    • SHA-256 checksums for integrity verification
  • Training objectives

    • Supervised fine-tuning (SFT)
    • Preference training (DPO/RLHF)
    • Reinforcement fine-tuning (RFT)
    • Continued pretraining (CPT)
  • Content types

    • Text and reasoning traces
    • JSON structured data
    • Images, audio, video, documents
    • Tool calls and tool results
  • Dataset features

    • Toolset definitions at dataset level
    • Default system messages
    • Reasoning policy configuration
    • Asset deduplication via assets.jsonl
  • Integration

    • LLM Data Card cross-referencing
    • Support for embedded or referenced data cards
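
As an illustration of the core-format items above (JSONL records, gzip compression, SHA-256 checksums), the following is a minimal sketch of how a consumer might verify and read one data file. It assumes that the checksum covers the compressed bytes as stored on disk and that the expected digest has already been read from the manifest; this changelog does not specify either detail, and the function names are hypothetical.

```python
import gzip
import hashlib
import json


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file's bytes as stored on disk."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large data files do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def read_jsonl_gz(path):
    """Yield one parsed record per non-empty line of a gzip-compressed JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def load_verified(path, expected_sha256):
    """Check the digest first, then parse the records (hypothetical helper)."""
    actual = sha256_of_file(path)
    if actual != expected_sha256.lower():
        raise ValueError(f"checksum mismatch for {path}: got {actual}")
    return list(read_jsonl_gz(path))
```

Checking the digest before parsing means a corrupted or truncated file is rejected before any record reaches the training pipeline. A zstd-compressed file would need a different decompressor in place of `gzip.open`.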

Schema

  • JSON Schema 2020-12 compliant
  • Strict validation with additionalProperties: false
  • Separate record schemas for each training objective