Minimal SFT Example

This example shows the simplest valid supervised fine-tuning (SFT) dataset: one metadata.json file and a single JSONL shard.

File Structure

minimal-sft/
  metadata.json
  data/
    train.jsonl

metadata.json

{
  "schema_version": "llm-training-data/v1",
  "dataset": {
    "name": "minimal-sft-example",
    "description": "A minimal example dataset for supervised fine-tuning"
  },
  "files": [
    {
      "split": "train",
      "objective": "sft",
      "shards": [
        { "path": "data/train.jsonl" }
      ]
    }
  ]
}
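
As a rough illustration of how a consumer might read this file, the sketch below walks the "files" and "shards" arrays and lists each shard path with its split and objective. The helper name shard_paths is hypothetical, and the sketch relies only on the fields visible in this example, not on the full schema.

```python
import json

# The metadata.json from this example, inlined so the sketch is self-contained.
METADATA = json.loads("""
{
  "schema_version": "llm-training-data/v1",
  "dataset": {
    "name": "minimal-sft-example",
    "description": "A minimal example dataset for supervised fine-tuning"
  },
  "files": [
    {
      "split": "train",
      "objective": "sft",
      "shards": [
        { "path": "data/train.jsonl" }
      ]
    }
  ]
}
""")


def shard_paths(metadata):
    """Return (split, objective, path) for every shard listed in the metadata.

    Hypothetical helper: it assumes each "files" entry has "split",
    "objective", and "shards" keys, as in the example above.
    """
    return [
        (entry["split"], entry["objective"], shard["path"])
        for entry in metadata["files"]
        for shard in entry["shards"]
    ]


print(shard_paths(METADATA))
```

Paths are relative to the dataset root (minimal-sft/ here), so a loader would join them with the directory that contains metadata.json.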

data/train.jsonl

Each line is one record:

{"id":"001","messages":[{"role":"user","content":[{"type":"text","text":"What is the capital of France?"}]},{"role":"assistant","content":[{"type":"text","text":"The capital of France is Paris."}]}]}
{"id":"002","messages":[{"role":"user","content":[{"type":"text","text":"What is 2 + 2?"}]},{"role":"assistant","content":[{"type":"text","text":"2 + 2 equals 4."}]}]}
{"id":"003","messages":[{"role":"user","content":[{"type":"text","text":"Name three colors."}]},{"role":"assistant","content":[{"type":"text","text":"Three colors are red, blue, and green."}]}]}
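
Since each line is an independent JSON document, a shard can be parsed line by line. The following is a minimal sketch, assuming only the "id" and "messages" fields seen in these records; load_records is a hypothetical helper and is not a substitute for schema validation.

```python
import json


def load_records(lines):
    """Parse JSONL lines into a list of records, skipping blank lines.

    Hypothetical helper: the checks below mirror only the fields visible
    in this example, not the full schema.
    """
    records = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if "id" not in record or "messages" not in record:
            raise ValueError(f"line {line_number}: missing 'id' or 'messages'")
        records.append(record)
    return records


# Two of the records from data/train.jsonl above.
sample_lines = [
    '{"id":"001","messages":[{"role":"user","content":[{"type":"text","text":"What is the capital of France?"}]},{"role":"assistant","content":[{"type":"text","text":"The capital of France is Paris."}]}]}',
    '{"id":"002","messages":[{"role":"user","content":[{"type":"text","text":"What is 2 + 2?"}]},{"role":"assistant","content":[{"type":"text","text":"2 + 2 equals 4."}]}]}',
]

records = load_records(sample_lines)
print([record["id"] for record in records])
```

With a real file, the same function can take the open file object directly, for example load_records(open("data/train.jsonl", encoding="utf-8")).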

Pretty-Printed Record

For readability, here’s one record expanded:

{
  "id": "001",
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "What is the capital of France?" }
      ]
    },
    {
      "role": "assistant",
      "content": [
        { "type": "text", "text": "The capital of France is Paris." }
      ]
    }
  ]
}
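
The nesting above (a record holds messages, and each message holds a list of typed content parts) can be flattened into plain (role, text) pairs. This is a sketch under the assumption that every part is of type "text", as in this example; message_texts is a hypothetical helper.

```python
def message_texts(record):
    """Return (role, text) pairs, joining the text parts of each message.

    Hypothetical helper: parts whose "type" is not "text" are ignored.
    """
    pairs = []
    for message in record["messages"]:
        text = "".join(
            part["text"] for part in message["content"] if part["type"] == "text"
        )
        pairs.append((message["role"], text))
    return pairs


record = {
    "id": "001",
    "messages": [
        {
            "role": "user",
            "content": [{"type": "text", "text": "What is the capital of France?"}],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": "The capital of France is Paris."}],
        },
    ],
}

for role, text in message_texts(record):
    print(f"{role}: {text}")
```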

Adding a System Message

To set a default system message for the dataset, add a "defaults" object with a "system" entry to metadata.json:

{
  "schema_version": "llm-training-data/v1",
  "dataset": {
    "name": "minimal-sft-with-system"
  },
  "defaults": {
    "system": [
      { "type": "text", "text": "You are a helpful assistant. Be concise and accurate." }
    ]
  },
  "files": [
    {
      "split": "train",
      "objective": "sft",
      "shards": [
        { "path": "data/train.jsonl" }
      ]
    }
  ]
}
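
This page does not specify how a loader applies "defaults.system". One plausible reading, sketched below purely as an assumption, is that the default is prepended as a "system" message to any record that does not already start with one. The function name with_default_system and the "system" role are hypothetical.

```python
import copy


def with_default_system(record, defaults):
    """Return a copy of record with the default system message prepended.

    Assumption (not confirmed by the schema): a record that already starts
    with a "system" message keeps its own and ignores the default.
    """
    merged = copy.deepcopy(record)
    system = defaults.get("system")
    messages = merged["messages"]
    if not system or (messages and messages[0]["role"] == "system"):
        return merged
    messages.insert(0, {"role": "system", "content": copy.deepcopy(system)})
    return merged


defaults = {
    "system": [
        {"type": "text", "text": "You are a helpful assistant. Be concise and accurate."}
    ]
}
record = {
    "id": "002",
    "messages": [
        {"role": "user", "content": [{"type": "text", "text": "What is 2 + 2?"}]},
        {"role": "assistant", "content": [{"type": "text", "text": "2 + 2 equals 4."}]},
    ],
}

merged = with_default_system(record, defaults)
print([message["role"] for message in merged["messages"]])
```

The original record is left unmodified, so the same defaults can be applied to many records without side effects.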

With Compression

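To ship a gzip-compressed shard, point "path" at the .gz file and set "compression" to "gzip". This example also declares "record_count", here 3 to match the three records shown above: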

{
  "schema_version": "llm-training-data/v1",
  "dataset": {
    "name": "minimal-sft-compressed"
  },
  "files": [
    {
      "split": "train",
      "objective": "sft",
      "shards": [
        {
          "path": "data/train.jsonl.gz",
          "compression": "gzip",
          "record_count": 3
        }
      ]
    }
  ]
}
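
A compressed shard like this can be produced and checked with Python's standard gzip module. The sketch below writes three short records to a temporary train.jsonl.gz and counts them back, which is one way to confirm that the declared "record_count" matches the file. The helper names are hypothetical.

```python
import gzip
import json
import os
import tempfile


def write_gzip_jsonl(path, records):
    """Write one JSON document per line to a gzip-compressed file."""
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def count_records(path):
    """Count the non-blank lines in a gzip-compressed JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


def make_record(record_id, question, answer):
    """Build a two-message record in the shape used by this example."""
    return {
        "id": record_id,
        "messages": [
            {"role": "user", "content": [{"type": "text", "text": question}]},
            {"role": "assistant", "content": [{"type": "text", "text": answer}]},
        ],
    }


records = [
    make_record("001", "What is the capital of France?", "The capital of France is Paris."),
    make_record("002", "What is 2 + 2?", "2 + 2 equals 4."),
    make_record("003", "Name three colors.", "Three colors are red, blue, and green."),
]

# A temporary directory keeps the sketch from touching a real dataset.
with tempfile.TemporaryDirectory() as tmp:
    shard_path = os.path.join(tmp, "train.jsonl.gz")
    write_gzip_jsonl(shard_path, records)
    counted = count_records(shard_path)

print(counted)
```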

Validation

Validate with the DataPass Validator:

  1. Select “LLM Training Data” schema
  2. Paste your metadata.json
  3. Check for any errors

Next Steps