Quickstart
This guide walks you through creating a minimal LLM Training Data dataset for supervised fine-tuning (SFT).
Prerequisites
- A text editor
- Basic familiarity with JSON/JSONL formats
Step 1: Create the Directory Structure
mkdir -p my-dataset/data
cd my-dataset
Step 2: Create metadata.json
Create a metadata.json file in your dataset root:
{
  "schema_version": "llm-training-data/v1",
  "dataset": {
    "name": "my-first-dataset",
    "description": "A minimal SFT dataset for demonstration",
    "license": "MIT"
  },
  "files": [
    {
      "split": "train",
      "objective": "sft",
      "shards": [
        { "path": "data/train.jsonl" }
      ]
    }
  ]
}
Step 3: Create Training Data
Create data/train.jsonl with one record per line:
{"id":"001","messages":[{"role":"user","content":[{"type":"text","text":"What is the capital of France?"}]},{"role":"assistant","content":[{"type":"text","text":"The capital of France is Paris."}]}]}
{"id":"002","messages":[{"role":"user","content":[{"type":"text","text":"Explain photosynthesis briefly."}]},{"role":"assistant","content":[{"type":"text","text":"Photosynthesis is the process by which plants convert sunlight, water, and carbon dioxide into glucose and oxygen."}]}]}
{"id":"003","messages":[{"role":"user","content":[{"type":"text","text":"Write a haiku about coding."}]},{"role":"assistant","content":[{"type":"text","text":"Lines of logic flow\nBugs emerge from the shadows\nDebug, compile, run"}]}]}
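If you would rather generate the file than type it by hand, the records above can be produced with a short script. This is a minimal sketch, not part of the specification: the helper names `make_record` and `write_jsonl` are hypothetical, and the record shape is copied from the three example lines above.

```python
import json
from pathlib import Path


def make_record(record_id, user_text, assistant_text):
    """Build one single-turn record in the shape shown in Step 3."""
    return {
        "id": record_id,
        "messages": [
            {"role": "user", "content": [{"type": "text", "text": user_text}]},
            {"role": "assistant", "content": [{"type": "text", "text": assistant_text}]},
        ],
    }


def write_jsonl(path, records):
    """Write one compact JSON object per line.

    json.dumps escapes newlines inside strings as \\n, so a multi-line
    answer (such as the haiku) still occupies a single line of the file.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            line = json.dumps(record, ensure_ascii=False, separators=(",", ":"))
            f.write(line + "\n")
```

Calling `write_jsonl("data/train.jsonl", [make_record("001", "What is the capital of France?", "The capital of France is Paris.")])` from the dataset root writes the first example record.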
Step 4: Validate Your Dataset
Use the DataPass Validator to validate your metadata.json against the LLM Training Data schema.
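Before running the validator, a quick local check of the JSONL file can catch obvious mistakes. The sketch below is not the DataPass Validator and does not implement the schema. It only verifies that each line parses as a JSON object with an `id` and a non-empty `messages` list, which are assumptions drawn from the example records in Step 3. The function name `check_jsonl` is hypothetical.

```python
import json


def check_jsonl(path):
    """Return a list of human-readable problems found in a JSONL file.

    The checks are assumptions based on the example records, not the
    official schema: every line must be a JSON object with an "id" and
    a non-empty "messages" list.
    """
    problems = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                problems.append(f"line {line_no}: blank line")
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append(f"line {line_no}: invalid JSON ({exc.msg})")
                continue
            if not isinstance(record, dict):
                problems.append(f"line {line_no}: not a JSON object")
                continue
            if "id" not in record:
                problems.append(f"line {line_no}: missing 'id'")
            messages = record.get("messages")
            if not isinstance(messages, list) or not messages:
                problems.append(f"line {line_no}: missing or empty 'messages'")
    return problems
```

An empty list from `check_jsonl("data/train.jsonl")` means none of these basic problems were found. It does not mean the dataset is valid against the schema.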
Final Structure
Your dataset should look like this:
my-dataset/
  metadata.json
  data/
    train.jsonl
Next Steps
- Add a system message in defaults
- Add multi-turn conversations
- Add tool use
- Create a preference dataset
- Link an LLM Data Card
Common Patterns
Adding a Default System Message
{
  "schema_version": "llm-training-data/v1",
  "dataset": { ... },
  "defaults": {
    "system": [
      { "type": "text", "text": "You are a helpful assistant." }
    ]
  },
  "files": [ ... ]
}
Using Compression
For large datasets, use gzip or zstd compression:
{
  "files": [
    {
      "split": "train",
      "objective": "sft",
      "shards": [
        { "path": "data/train-00000.jsonl.gz", "compression": "gzip" },
        { "path": "data/train-00001.jsonl.gz", "compression": "gzip" }
      ]
    }
  ]
}
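Gzip-compressed shards like the ones listed above can be written with Python's standard library. This is a rough sketch under a few assumptions: the function name `write_gzip_shards` and the `records_per_shard` parameter are hypothetical, the shard file names follow the `train-00000.jsonl.gz` pattern from the example, and the output directory is assumed to sit directly under the dataset root so that its name can be used as the path prefix.

```python
import gzip
import json
from pathlib import Path


def write_gzip_shards(records, out_dir, records_per_shard, split="train"):
    """Write records to gzip-compressed JSONL shards.

    Returns a list of shard entries that can be pasted under "shards"
    in metadata.json. Assumes out_dir is directly under the dataset root.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    entries = []
    starts = range(0, len(records), records_per_shard)
    for shard_index, start in enumerate(starts):
        name = f"{split}-{shard_index:05d}.jsonl.gz"
        # "wt" opens the gzip stream in text mode so str lines can be written.
        with gzip.open(out_dir / name, "wt", encoding="utf-8") as f:
            for record in records[start:start + records_per_shard]:
                f.write(json.dumps(record, ensure_ascii=False) + "\n")
        entries.append({"path": f"{out_dir.name}/{name}", "compression": "gzip"})
    return entries
```

For example, `write_gzip_shards(records, "data", records_per_shard=10000)` run from the dataset root returns entries whose `path` values start with `data/`.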
Adding Record Counts
Include a record count for each shard to support validation and progress tracking, and optionally a SHA-256 checksum for integrity checks:
{
  "shards": [
    {
      "path": "data/train-00000.jsonl.gz",
      "compression": "gzip",
      "record_count": 10000,
      "sha256": "abc123..."
    }
  ]
}
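The `record_count` and `sha256` values can be computed rather than entered by hand. The sketch below makes two assumptions that this guide does not confirm: the checksum is taken over the shard file's bytes as stored on disk (that is, the compressed bytes for a `.gz` shard), and a record is any non-blank line. Check the schema documentation before relying on either. The function name `describe_shard` is hypothetical.

```python
import gzip
import hashlib


def describe_shard(path, compression=None):
    """Compute record_count and sha256 for one shard file.

    Assumptions (not confirmed by the guide): the hash covers the file
    bytes as stored on disk, and each non-blank line is one record.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large shards do not need to fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)

    # Only gzip is handled here; zstd would need a third-party library.
    opener = gzip.open if compression == "gzip" else open
    with opener(path, "rt", encoding="utf-8") as f:
        record_count = sum(1 for line in f if line.strip())

    return {"record_count": record_count, "sha256": digest.hexdigest()}
```

The returned dictionary can be merged into the shard entry alongside `path` and `compression`.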