LLM Training Data v1
LLM Training Data is a machine-readable format for fine-tuning datasets. It provides a standardized structure for supervised fine-tuning (SFT), preference training (DPO/RLHF), reinforcement fine-tuning (RFT), and continued pretraining (CPT) data.
Key Features
- Multiple training objectives - SFT, preference, RFT, and CPT with objective-specific record schemas
- Multimodal support - Text, images, audio, video, and documents with asset deduplication
- Tool use - First-class support for tool calls and results in training data
- Reasoning traces - Native support for chain-of-thought and reasoning content
- Massive scale - JSONL sharding with compression (gzip, zstd), so a dataset can grow by adding shard files
- Data Card integration - Bidirectional linking with LLM Data Card for provenance and rights
File Structure
A typical LLM Training Data dataset consists of:
my-dataset/
  metadata.json      # Dataset configuration (this spec)
  assets.jsonl       # Optional asset index for media deduplication
  data/
    sft-train-00000.jsonl.gz
    sft-train-00001.jsonl.gz
    pref-train-00000.jsonl.gz
    ...
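The layout above can be produced with a few lines of Python. This is a minimal sketch, not part of the specification: `write_shards`, its `prefix` and `shard_size` parameters, and the five-digit zero-padded shard index are assumptions inferred from the example file names (note that the preference shard above uses the prefix `pref`, so the prefix is passed explicitly rather than derived from the objective name).

```python
import gzip
import json
from pathlib import Path


def write_shards(records, out_dir, prefix="sft", split="train", shard_size=1000):
    """Write records as gzip-compressed JSONL shards under out_dir/data.

    Returns the shard paths relative to out_dir, in write order.
    The naming pattern is an assumption based on the example layout.
    """
    data_dir = Path(out_dir) / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for start in range(0, len(records), shard_size):
        index = start // shard_size
        name = f"{prefix}-{split}-{index:05d}.jsonl.gz"
        # "wt" opens the gzip stream in text mode so we can write str lines.
        with gzip.open(data_dir / name, "wt", encoding="utf-8") as fh:
            for record in records[start:start + shard_size]:
                # One JSON object per line, as JSONL requires.
                fh.write(json.dumps(record, ensure_ascii=False) + "\n")
        paths.append(f"data/{name}")
    return paths
```

The returned relative paths are in the same form as the `path` values used in the `shards` list of metadata.json.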
Quick Example
metadata.json:
{
  "schema_version": "llm-training-data/v1",
  "dataset": {
    "name": "my-sft-dataset",
    "description": "Example SFT dataset"
  },
  "files": [
    {
      "split": "train",
      "objective": "sft",
      "shards": [
        { "path": "data/sft-train-00000.jsonl.gz", "compression": "gzip" }
      ]
    }
  ]
}
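To show how a consumer might walk this metadata, here is a minimal sketch that parses the example and lists every shard with its split and objective. The helper name `list_shards` is hypothetical; it relies only on the fields shown in the example, and it treats `compression` as possibly absent, which is an assumption.

```python
import json

# The metadata.json example from above, embedded so the sketch is self-contained.
METADATA_JSON = """
{
  "schema_version": "llm-training-data/v1",
  "dataset": {"name": "my-sft-dataset", "description": "Example SFT dataset"},
  "files": [
    {
      "split": "train",
      "objective": "sft",
      "shards": [
        {"path": "data/sft-train-00000.jsonl.gz", "compression": "gzip"}
      ]
    }
  ]
}
"""


def list_shards(metadata):
    """Return (split, objective, path, compression) for each listed shard."""
    rows = []
    for entry in metadata["files"]:
        for shard in entry["shards"]:
            rows.append(
                (
                    entry["split"],
                    entry["objective"],
                    shard["path"],
                    shard.get("compression"),  # None if the key is missing
                )
            )
    return rows


metadata = json.loads(METADATA_JSON)
for row in list_shards(metadata):
    print(row)
```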
SFT record (one line in the JSONL file):
{
  "id": "example-001",
  "messages": [
    { "role": "user", "content": [{ "type": "text", "text": "What is 2+2?" }] },
    { "role": "assistant", "content": [{ "type": "text", "text": "2+2 equals 4." }] }
  ]
}
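The record above is pretty-printed for readability; in a shard it occupies a single line. A minimal sketch of that serialization with Python's standard `json` module follows. The helper names `to_jsonl_line` and `from_jsonl_line` are hypothetical.

```python
import json

# The SFT record from the example above.
record = {
    "id": "example-001",
    "messages": [
        {"role": "user", "content": [{"type": "text", "text": "What is 2+2?"}]},
        {"role": "assistant", "content": [{"type": "text", "text": "2+2 equals 4."}]},
    ],
}


def to_jsonl_line(rec):
    """Serialize one record to a single JSON line (no trailing newline)."""
    # Without an indent argument, json.dumps emits no line breaks; newlines
    # inside string values are escaped as \n, so the output stays on one line.
    return json.dumps(rec, ensure_ascii=False)


def from_jsonl_line(line):
    """Parse one JSONL line back into a record."""
    return json.loads(line)


line = to_jsonl_line(record)
print(line)
```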
Training Objectives
| Objective | Description | Key Requirement |
|---|---|---|
| sft | Supervised fine-tuning | At least one assistant message |
| preference | DPO/RLHF preference data | Chosen and rejected candidates |
| rft | Reinforcement fine-tuning | Reference object with grading criteria |
| cpt | Continued pretraining | Document-only, no assistant messages |
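As an illustration of how a key requirement could be checked, the sketch below tests the SFT rule (at least one assistant message) against the record shape shown in the Quick Example. The function name `has_assistant_message` is hypothetical, and only the SFT rule is covered because this page does not show the field names used by preference, RFT, or CPT records.

```python
def has_assistant_message(record):
    """Return True if the record contains at least one assistant message."""
    messages = record.get("messages", [])
    return any(message.get("role") == "assistant" for message in messages)


valid = {
    "id": "example-001",
    "messages": [
        {"role": "user", "content": [{"type": "text", "text": "What is 2+2?"}]},
        {"role": "assistant", "content": [{"type": "text", "text": "2+2 equals 4."}]},
    ],
}
# A record with only a user turn does not satisfy the SFT requirement.
invalid = {
    "id": "example-002",
    "messages": [
        {"role": "user", "content": [{"type": "text", "text": "What is 2+2?"}]},
    ],
}

print(has_assistant_message(valid), has_assistant_message(invalid))
```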
Integration with LLM Data Card
LLM Training Data is designed to work alongside LLM Data Card for complete dataset documentation:
- LLM Data Card documents rights, provenance, safety, and access information
- LLM Training Data specifies the actual data format and structure
You can reference a Data Card from your training data:
{
  "schema_version": "llm-training-data/v1",
  "data_card": {
    "uri": "https://datapass.meetkai.ai/registry/my-dataset/1.0.0/card.json"
  },
  "dataset": { ... },
  "files": [ ... ]
}
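If the metadata is generated programmatically, the reference can be attached before writing metadata.json. This is a minimal sketch: `with_data_card` is a hypothetical helper, and it sets only the `uri` field shown in the example above.

```python
import json


def with_data_card(metadata, uri):
    """Return a copy of the metadata dict with a data_card reference added."""
    linked = dict(metadata)  # shallow copy; the input dict is left unchanged
    linked["data_card"] = {"uri": uri}
    return linked


base = {
    "schema_version": "llm-training-data/v1",
    "dataset": {"name": "my-sft-dataset", "description": "Example SFT dataset"},
    "files": [],
}
linked = with_data_card(
    base,
    "https://datapass.meetkai.ai/registry/my-dataset/1.0.0/card.json",
)
print(json.dumps(linked, indent=2))
```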
Next Steps
- Quickstart - Create your first dataset in minutes
- Metadata Reference - Full metadata.json specification
- SFT Records - Supervised fine-tuning format
- Examples - Complete working examples