Minimal SFT Example
Minimal SFT Example
This example shows the simplest valid SFT dataset.
File Structure
minimal-sft/
metadata.json
data/
train.jsonl
metadata.json
{
"schema_version": "llm-training-data/v1",
"dataset": {
"name": "minimal-sft-example",
"description": "A minimal example dataset for supervised fine-tuning"
},
"files": [
{
"split": "train",
"objective": "sft",
"shards": [
{ "path": "data/train.jsonl" }
]
}
]
}
data/train.jsonl
Each line is one record:
{"id":"001","messages":[{"role":"user","content":[{"type":"text","text":"What is the capital of France?"}]},{"role":"assistant","content":[{"type":"text","text":"The capital of France is Paris."}]}]}
{"id":"002","messages":[{"role":"user","content":[{"type":"text","text":"What is 2 + 2?"}]},{"role":"assistant","content":[{"type":"text","text":"2 + 2 equals 4."}]}]}
{"id":"003","messages":[{"role":"user","content":[{"type":"text","text":"Name three colors."}]},{"role":"assistant","content":[{"type":"text","text":"Three colors are red, blue, and green."}]}]}
Pretty-Printed Record
For readability, here’s one record expanded:
{
"id": "001",
"messages": [
{
"role": "user",
"content": [
{ "type": "text", "text": "What is the capital of France?" }
]
},
{
"role": "assistant",
"content": [
{ "type": "text", "text": "The capital of France is Paris." }
]
}
]
}
Adding a System Message
With a default system message:
{
"schema_version": "llm-training-data/v1",
"dataset": {
"name": "minimal-sft-with-system"
},
"defaults": {
"system": [
{ "type": "text", "text": "You are a helpful assistant. Be concise and accurate." }
]
},
"files": [
{
"split": "train",
"objective": "sft",
"shards": [
{ "path": "data/train.jsonl" }
]
}
]
}
With Compression
Using gzip compression:
{
"schema_version": "llm-training-data/v1",
"dataset": {
"name": "minimal-sft-compressed"
},
"files": [
{
"split": "train",
"objective": "sft",
"shards": [
{
"path": "data/train.jsonl.gz",
"compression": "gzip",
"record_count": 3
}
]
}
]
}
Validation
Validate with the DataPass Validator:
- Select “LLM Training Data” schema
- Paste your metadata.json
- Check for any errors
Next Steps
- Add multi-turn conversations
- Include reasoning traces
- Add tool use
- Link an LLM Data Card