LLM Training Data v1

LLM Training Data is a machine-readable format for fine-tuning datasets. It provides a standardized structure for supervised fine-tuning (SFT), preference training (DPO/RLHF), reinforcement fine-tuning (RFT), and continued pretraining (CPT) data.

Key Features

  • Multiple training objectives - SFT, preference, RFT, and CPT with objective-specific record schemas
  • Multimodal support - Text, images, audio, video, and documents with asset deduplication
  • Tool use - First-class support for tool calls and results in training data
  • Reasoning traces - Native support for chain-of-thought and reasoning content
  • Massive scale - JSONL sharding with compression (gzip, zstd), so a dataset grows by adding shard files rather than by enlarging a single file
  • Data Card integration - Bidirectional linking with LLM Data Card for provenance and rights

File Structure

A typical LLM Training Data dataset consists of:

my-dataset/
  metadata.json          # Dataset configuration (this spec)
  assets.jsonl           # Optional asset index for media deduplication
  data/
    sft-train-00000.jsonl.gz
    sft-train-00001.jsonl.gz
    pref-train-00000.jsonl.gz
    ...
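
The layout above can be produced with a few lines of Python. The following is a minimal sketch, not part of the spec: the function name `write_shards` and the `shard_size` parameter are hypothetical, and the file naming simply mirrors the `sft-train-00000.jsonl.gz` pattern shown in the tree.

```python
import gzip
import json
from pathlib import Path


def write_shards(records, out_dir, prefix="sft-train", shard_size=1000):
    """Write records as gzip-compressed JSONL shards and return the shard paths.

    Hypothetical helper: shard_size (records per shard) is an illustrative
    choice, not something the format prescribes.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    handle = None
    for i, record in enumerate(records):
        # Start a new shard every shard_size records.
        if i % shard_size == 0:
            if handle is not None:
                handle.close()
            path = out_dir / f"{prefix}-{len(paths):05d}.jsonl.gz"
            handle = gzip.open(path, "wt", encoding="utf-8")
            paths.append(path)
        # One JSON object per line, as JSONL requires.
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    if handle is not None:
        handle.close()
    return paths
```

Calling `write_shards(records, "my-dataset/data", shard_size=2)` with five records would yield three shard files numbered `00000` through `00002`.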

Quick Example

metadata.json:

{
  "schema_version": "llm-training-data/v1",
  "dataset": {
    "name": "my-sft-dataset",
    "description": "Example SFT dataset"
  },
  "files": [
    {
      "split": "train",
      "objective": "sft",
      "shards": [
        { "path": "data/sft-train-00000.jsonl.gz", "compression": "gzip" }
      ]
    }
  ]
}
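
A consumer can use this metadata to locate and read the shards. The sketch below assumes only the fields shown in the example (`files`, `split`, `objective`, `shards`, `path`, `compression`); the function name `iter_records` is hypothetical, and treating a missing `compression` key as "uncompressed" is an assumption rather than documented behavior. zstd is left unimplemented because it needs a third-party package.

```python
import gzip
import json
from pathlib import Path


def iter_records(dataset_dir, split="train", objective="sft"):
    """Yield records from every shard matching the given split and objective."""
    dataset_dir = Path(dataset_dir)
    metadata = json.loads((dataset_dir / "metadata.json").read_text(encoding="utf-8"))
    for entry in metadata["files"]:
        if entry["split"] != split or entry["objective"] != objective:
            continue
        for shard in entry["shards"]:
            path = dataset_dir / shard["path"]
            compression = shard.get("compression")
            if compression == "gzip":
                handle = gzip.open(path, "rt", encoding="utf-8")
            elif compression is None:
                # Assumption: no "compression" key means a plain JSONL file.
                handle = open(path, "rt", encoding="utf-8")
            else:
                raise NotImplementedError(f"unsupported compression: {compression}")
            with handle:
                for line in handle:
                    if line.strip():  # skip blank lines
                        yield json.loads(line)
```

Because the function is a generator, records are streamed one line at a time and a shard is never loaded into memory as a whole.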

SFT record (one line in the JSONL file):

{
  "id": "example-001",
  "messages": [
    { "role": "user", "content": [{ "type": "text", "text": "What is 2+2?" }] },
    { "role": "assistant", "content": [{ "type": "text", "text": "2+2 equals 4." }] }
  ]
}
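
The record above is pretty-printed for readability; in a shard it occupies a single line. As an illustrative sketch (the helper names `text_message` and `sft_line` are hypothetical, and only the text content type from the example is covered), a record could be built and serialized like this:

```python
import json


def text_message(role, text):
    """Build a message with a single text content part, as in the example record."""
    return {"role": role, "content": [{"type": "text", "text": text}]}


def sft_line(record_id, turns):
    """Serialize an SFT record to one JSONL line (without the trailing newline).

    turns is a list of (role, text) pairs in conversation order.
    """
    record = {"id": record_id, "messages": [text_message(r, t) for r, t in turns]}
    # json.dumps escapes newlines inside strings, so the result is one line.
    return json.dumps(record, ensure_ascii=False)


line = sft_line("example-001", [("user", "What is 2+2?"), ("assistant", "2+2 equals 4.")])
```

Appending `line + "\n"` to an open shard file adds the record to the dataset.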

Training Objectives

Objective    Description                  Key Requirement
sft          Supervised fine-tuning       At least one assistant message
preference   DPO/RLHF preference data     Chosen and rejected candidates
rft          Reinforcement fine-tuning    Reference object with grading criteria
cpt          Continued pretraining        Document-only, no assistant messages
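
Two of these requirements can be checked from message roles alone. The sketch below is hedged accordingly: it covers only `sft` and `cpt`, it assumes a `cpt` record that carries messages uses the same `messages` list as the SFT example, and it skips `preference` and `rft` because their field names are not shown on this page. The function names are hypothetical.

```python
def assistant_count(record):
    """Count messages whose role is "assistant"."""
    return sum(1 for m in record.get("messages", []) if m.get("role") == "assistant")


def check_key_requirement(objective, record):
    """Return a list of problems; an empty list means no problem was found.

    Only sft and cpt are checked. preference and rft always return an empty
    list because their record fields are not assumed here.
    """
    n = assistant_count(record)
    if objective == "sft" and n == 0:
        return ["sft record has no assistant message"]
    if objective == "cpt" and n > 0:
        return ["cpt record contains assistant messages"]
    return []
```

Returning a list of problems rather than raising lets a caller scan a whole shard and report every failing record in one pass.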

Integration with LLM Data Card

LLM Training Data is designed to work alongside LLM Data Card for complete dataset documentation:

  • LLM Data Card documents rights, provenance, safety, and access information
  • LLM Training Data specifies the actual data format and structure

You can reference a Data Card from your dataset's metadata.json:

{
  "schema_version": "llm-training-data/v1",
  "data_card": {
    "uri": "https://datapass.meetkai.ai/registry/my-dataset/1.0.0/card.json"
  },
  "dataset": { ... },
  "files": [ ... ]
}

Next Steps