Quickstart

This guide walks you through creating a minimal LLM Training Data dataset for supervised fine-tuning (SFT).

Prerequisites

  • A text editor
  • Basic familiarity with JSON/JSONL formats

Step 1: Create the Directory Structure

mkdir -p my-dataset/data
cd my-dataset

Step 2: Create metadata.json

Create a metadata.json file in your dataset root:

{
  "schema_version": "llm-training-data/v1",
  "dataset": {
    "name": "my-first-dataset",
    "description": "A minimal SFT dataset for demonstration",
    "license": "MIT"
  },
  "files": [
    {
      "split": "train",
      "objective": "sft",
      "shards": [
        { "path": "data/train.jsonl" }
      ]
    }
  ]
}
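
The files array maps each split and training objective to one or more shards. As a quick illustration of how a consumer might walk this manifest, here is a minimal Python sketch (field names follow the example metadata.json above) that loads the file and lists every declared shard:

import json

# Load the manifest and print every declared shard with its split and
# objective. Field names follow the example metadata.json above.
with open("metadata.json", encoding="utf-8") as f:
    metadata = json.load(f)

for entry in metadata["files"]:
    for shard in entry["shards"]:
        print(entry["split"], entry["objective"], shard["path"])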

Step 3: Create Training Data

Create data/train.jsonl with one record per line:

{"id":"001","messages":[{"role":"user","content":[{"type":"text","text":"What is the capital of France?"}]},{"role":"assistant","content":[{"type":"text","text":"The capital of France is Paris."}]}]}
{"id":"002","messages":[{"role":"user","content":[{"type":"text","text":"Explain photosynthesis briefly."}]},{"role":"assistant","content":[{"type":"text","text":"Photosynthesis is the process by which plants convert sunlight, water, and carbon dioxide into glucose and oxygen."}]}]}
{"id":"003","messages":[{"role":"user","content":[{"type":"text","text":"Write a haiku about coding."}]},{"role":"assistant","content":[{"type":"text","text":"Lines of logic flow\nBugs emerge from the shadows\nDebug, compile, run"}]}]}

Step 4: Validate Your Dataset

Use the DataPass Validator to validate your metadata.json against the LLM Training Data schema.
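
Before running the validator, you can also do a quick local sanity check that every line parses as JSON and carries the fields used in this guide (the role set checked here, system/user/assistant, is just what appears in these examples). This is only an illustrative pre-flight check, not a substitute for schema validation:

import json

# Pre-flight check: every line must parse and contain the fields used in
# this guide. Full schema validation is still done by the validator.
with open("data/train.jsonl", encoding="utf-8") as f:
    for line_number, line in enumerate(f, start=1):
        record = json.loads(line)
        assert "id" in record and "messages" in record, f"line {line_number}: missing field"
        for message in record["messages"]:
            assert message["role"] in {"system", "user", "assistant"}, f"line {line_number}: unexpected role"

print("data/train.jsonl parsed cleanly")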

Final Structure

Your dataset should look like this:

my-dataset/
  metadata.json
  data/
    train.jsonl

Common Patterns

Adding a Default System Message

{
  "schema_version": "llm-training-data/v1",
  "dataset": { ... },
  "defaults": {
    "system": [
      { "type": "text", "text": "You are a helpful assistant." }
    ]
  },
  "files": [ ... ]
}
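
The exact merge semantics are defined by the schema, but as a rough mental model (an assumption for illustration, not normative behavior), a loader could prepend the default to any record that does not carry its own system message:

# Illustrative loader behavior (assumption): prepend the dataset-level
# default system message to records that lack one.
def apply_default_system(record, defaults):
    default_system = defaults.get("system")
    messages = record["messages"]
    if default_system and not any(m["role"] == "system" for m in messages):
        messages.insert(0, {"role": "system", "content": default_system})
    return record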

Using Compression

For large datasets, split the data into multiple shards and compress each one with gzip or zstd:

{
  "files": [
    {
      "split": "train",
      "objective": "sft",
      "shards": [
        { "path": "data/train-00000.jsonl.gz", "compression": "gzip" },
        { "path": "data/train-00001.jsonl.gz", "compression": "gzip" }
      ]
    }
  ]
}
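
Producing a gzip-compressed shard needs no special tooling; Python's standard library can write the same one-object-per-line layout into a gzip stream (the path below matches the example manifest):

import gzip
import json

records = [
    {"id": "001", "messages": []},  # placeholder; use your real records
]

# Same JSONL layout as before, just wrapped in a gzip stream.
with gzip.open("data/train-00000.jsonl.gz", "wt", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")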

Adding Record Counts

Include per-shard record counts (and, optionally, a SHA-256 checksum) for validation and progress tracking:

{
  "shards": [
    {
      "path": "data/train-00000.jsonl.gz",
      "compression": "gzip",
      "record_count": 10000,
      "sha256": "abc123..."
    }
  ]
}
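
Both values can be computed from the shard itself. Here is a minimal sketch, assuming record_count is the number of JSONL lines after decompression and sha256 is the digest of the shard file as stored on disk (check the schema reference for the exact definitions):

import gzip
import hashlib

path = "data/train-00000.jsonl.gz"

# record_count: number of JSONL lines after decompression (assumption).
with gzip.open(path, "rt", encoding="utf-8") as f:
    record_count = sum(1 for _ in f)

# sha256: digest of the compressed file's bytes as stored (assumption).
with open(path, "rb") as f:
    sha256 = hashlib.sha256(f.read()).hexdigest()

print(record_count, sha256)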