Shards

Shards are the JSONL files containing your training records. The files array in metadata.json organizes shards by split and training objective.

File Structure

{
  "files": [
    {
      "split": "train",
      "objective": "sft",
      "shards": [
        { "path": "data/sft-train-00000.jsonl.gz", "compression": "gzip" },
        { "path": "data/sft-train-00001.jsonl.gz", "compression": "gzip" }
      ]
    }
  ]
}
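As a minimal sketch of how a consumer might walk this structure, the Python below flattens the files array into (split, objective, path) rows. The helper name list_shards is hypothetical, and the metadata is inlined here rather than loaded from metadata.json.

```python
import json

def list_shards(metadata: dict) -> list[tuple[str, str, str]]:
    """Return one (split, objective, path) row per shard in the files array."""
    rows = []
    for entry in metadata.get("files", []):
        for shard in entry.get("shards", []):
            rows.append((entry["split"], entry["objective"], shard["path"]))
    return rows

# Inline copy of the example above; in practice you would json.load() metadata.json.
metadata = json.loads("""
{
  "files": [
    {
      "split": "train",
      "objective": "sft",
      "shards": [
        { "path": "data/sft-train-00000.jsonl.gz", "compression": "gzip" },
        { "path": "data/sft-train-00001.jsonl.gz", "compression": "gzip" }
      ]
    }
  ]
}
""")

for split, objective, path in list_shards(metadata):
    print(split, objective, path)
```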

File Entry Fields

Field       Required  Description
split       Yes       train, validation, or test
objective   Yes       sft, preference, rft, or cpt
format      No        Always jsonl
toolset_id  No        Toolset for this file
shards      Yes       Array of shard configurations
notes       No        Additional notes

Shard Fields

Field         Required  Description
path          Yes       Relative path to the shard file
compression   No        none, gzip, or zstd
record_count  No        Number of records in shard
sha256        No        SHA-256 hash of the file
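The two field tables can be turned into a quick sanity check. The sketch below is not an official validator: the function name validate_file_entry is hypothetical, it only checks the required fields and allowed values listed above, and it assumes an omitted compression field means none (as in the uncompressed example in the Compression section).

```python
SPLITS = {"train", "validation", "test"}
OBJECTIVES = {"sft", "preference", "rft", "cpt"}
COMPRESSIONS = {"none", "gzip", "zstd"}

def validate_file_entry(entry: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issues found."""
    errors = []
    if entry.get("split") not in SPLITS:
        errors.append(f"split must be one of {sorted(SPLITS)}")
    if entry.get("objective") not in OBJECTIVES:
        errors.append(f"objective must be one of {sorted(OBJECTIVES)}")
    if entry.get("format", "jsonl") != "jsonl":
        errors.append("format, when present, must be jsonl")

    shards = entry.get("shards")
    if not isinstance(shards, list):
        errors.append("shards is required and must be an array")
        return errors

    for i, shard in enumerate(shards):
        path = shard.get("path")
        if not isinstance(path, str) or not path:
            errors.append(f"shards[{i}].path is required")
        # Assumption: a missing compression field is treated as "none".
        if shard.get("compression", "none") not in COMPRESSIONS:
            errors.append(f"shards[{i}].compression must be one of {sorted(COMPRESSIONS)}")
        count = shard.get("record_count")
        if count is not None and (not isinstance(count, int) or count < 0):
            errors.append(f"shards[{i}].record_count must be a non-negative integer")
    return errors
```

Returning a list of messages instead of raising on the first problem lets you report every issue in a metadata file in one pass.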

Compression

None (Uncompressed)

{ "path": "data/train.jsonl" }

Gzip

{ "path": "data/train.jsonl.gz", "compression": "gzip" }

Zstandard (zstd)

{ "path": "data/train.jsonl.zst", "compression": "zstd" }
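A reader can dispatch on the compression field when opening a shard. The following is a sketch under a few assumptions: the helper names open_shard and read_records are hypothetical, a missing compression field is treated as none, and zstd support relies on the third-party zstandard package (imported lazily so the gzip and uncompressed paths need only the standard library).

```python
import gzip
import json
from typing import Iterator

def open_shard(path: str, compression: str = "none"):
    """Open a shard for reading as UTF-8 text, based on its compression field."""
    if compression == "none":
        return open(path, "rt", encoding="utf-8")
    if compression == "gzip":
        return gzip.open(path, "rt", encoding="utf-8")
    if compression == "zstd":
        # Assumption: the third-party `zstandard` package is installed.
        import zstandard
        return zstandard.open(path, "rt", encoding="utf-8")
    raise ValueError(f"unsupported compression: {compression!r}")

def read_records(shard: dict) -> Iterator[dict]:
    """Yield one parsed JSON object per non-blank line of the shard."""
    with open_shard(shard["path"], shard.get("compression", "none")) as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)
```

For example, list(read_records({"path": "data/train.jsonl.gz", "compression": "gzip"})) would load every record from that shard into memory; iterate lazily instead for large shards.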

Sharding Strategies

By Record Count

Split at fixed record counts (e.g., 50,000 records per shard):

data/train-00000.jsonl.gz  # records 0-49,999
data/train-00001.jsonl.gz  # records 50,000-99,999
data/train-00002.jsonl.gz  # records 100,000-149,999
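One way to produce shards like these is to roll over to a new gzip file every N records. This is a minimal sketch, not a reference tool: write_shards_by_count and its parameters are hypothetical names, and the returned dictionaries simply mirror the shard fields described earlier.

```python
import gzip
import json
from pathlib import Path
from typing import Iterable

def write_shards_by_count(
    records: Iterable[dict],
    out_dir: str,
    prefix: str = "train",
    records_per_shard: int = 50_000,
) -> list[dict]:
    """Write records to gzip JSONL shards and return one shard entry per file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    shards: list[dict] = []
    fh = None
    try:
        for i, record in enumerate(records):
            # Start a new shard at every multiple of records_per_shard.
            if i % records_per_shard == 0:
                if fh is not None:
                    fh.close()
                path = out / f"{prefix}-{len(shards):05d}.jsonl.gz"
                fh = gzip.open(path, "wt", encoding="utf-8")
                shards.append(
                    {"path": path.as_posix(), "compression": "gzip", "record_count": 0}
                )
            fh.write(json.dumps(record) + "\n")
            shards[-1]["record_count"] += 1
    finally:
        if fh is not None:
            fh.close()
    return shards
```

The returned paths include out_dir, so adjust them to be relative to the dataset root before copying the entries into metadata.json.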

By File Size

Split at target file sizes (e.g., 100MB per shard):

data/train-00000.jsonl.gz  # ~100MB
data/train-00001.jsonl.gz  # ~100MB
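Size-based splitting can be sketched by accumulating serialized lines until a byte budget is reached. Note the assumptions: group_by_size is a hypothetical helper, it measures uncompressed UTF-8 bytes (the compressed files will be smaller, so tune the target for your data), and it holds one shard's worth of lines in memory at a time.

```python
import json
from typing import Iterable, Iterator

def group_by_size(
    records: Iterable[dict],
    target_bytes: int = 100 * 1024 * 1024,
) -> Iterator[list[str]]:
    """Yield lists of JSONL lines whose total uncompressed size stays near target_bytes."""
    batch: list[str] = []
    size = 0
    for record in records:
        line = json.dumps(record) + "\n"
        line_bytes = len(line.encode("utf-8"))
        # Close the current shard before it would exceed the target,
        # but never emit an empty shard (an oversized record gets its own).
        if batch and size + line_bytes > target_bytes:
            yield batch
            batch, size = [], 0
        batch.append(line)
        size += line_bytes
    if batch:
        yield batch
```

Each yielded batch can then be written to its own numbered shard file.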

By Content Type

Split by data characteristics:

data/train-code-00000.jsonl.gz
data/train-math-00000.jsonl.gz
data/train-general-00000.jsonl.gz
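Content-based splitting needs some way to tell record types apart. The sketch below assumes each record carries a category field, which is a hypothetical name for illustration and not part of the format described here; records without it fall into a general bucket.

```python
from collections import defaultdict
from typing import Iterable

def bucket_by_category(
    records: Iterable[dict],
    key: str = "category",      # hypothetical field name
    default: str = "general",
) -> dict[str, list[dict]]:
    """Group records by the value of `key`, using `default` when it is absent."""
    buckets: dict[str, list[dict]] = defaultdict(list)
    for record in records:
        buckets[record.get(key, default)].append(record)
    return dict(buckets)

def shard_name(category: str, index: int, split: str = "train") -> str:
    """Build a filename in the style of the examples above."""
    return f"data/{split}-{category}-{index:05d}.jsonl.gz"
```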

Multiple Objectives

Use a separate file entry for each combination of split and training objective:

{
  "files": [
    {
      "split": "train",
      "objective": "sft",
      "shards": [
        { "path": "data/sft-train.jsonl.gz", "compression": "gzip" }
      ]
    },
    {
      "split": "train",
      "objective": "preference",
      "shards": [
        { "path": "data/pref-train.jsonl.gz", "compression": "gzip" }
      ]
    },
    {
      "split": "validation",
      "objective": "sft",
      "shards": [
        { "path": "data/sft-val.jsonl.gz", "compression": "gzip" }
      ]
    }
  ]
}
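With several entries in the files array, a loader typically selects shards by split and objective. A minimal sketch, with shards_for as a hypothetical helper and the metadata above inlined as a Python dict:

```python
def shards_for(metadata: dict, split: str, objective: str) -> list[dict]:
    """Return every shard whose file entry matches the given split and objective."""
    return [
        shard
        for entry in metadata.get("files", [])
        if entry.get("split") == split and entry.get("objective") == objective
        for shard in entry.get("shards", [])
    ]

# Same content as the JSON example above.
metadata = {
    "files": [
        {"split": "train", "objective": "sft",
         "shards": [{"path": "data/sft-train.jsonl.gz", "compression": "gzip"}]},
        {"split": "train", "objective": "preference",
         "shards": [{"path": "data/pref-train.jsonl.gz", "compression": "gzip"}]},
        {"split": "validation", "objective": "sft",
         "shards": [{"path": "data/sft-val.jsonl.gz", "compression": "gzip"}]},
    ]
}

for shard in shards_for(metadata, "train", "preference"):
    print(shard["path"])
```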

Checksums

Include a SHA-256 hash for each shard so consumers can verify file integrity:

{
  "shards": [
    {
      "path": "data/train-00000.jsonl.gz",
      "compression": "gzip",
      "record_count": 50000,
      "sha256": "a1b2c3d4e5f6..."
    }
  ]
}

Generate checksums:

sha256sum data/train-00000.jsonl.gz
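The same digest can be computed and checked from Python. This sketch hashes the file exactly as stored on disk (that is, the compressed bytes, matching what sha256sum reports); the helper names sha256_of and verify_shard are hypothetical, and a shard without a sha256 field is treated as passing.

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_shard(shard: dict) -> bool:
    """Compare the file's digest with the shard's sha256 field, if one is present."""
    expected = shard.get("sha256")
    if expected is None:
        return True  # nothing to verify against
    return sha256_of(shard["path"]) == expected.lower()
```

On systems without sha256sum (macOS, for instance), shasum -a 256 prints the same digest.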

Best Practices

  1. Consistent shard sizes - Keep shards roughly equal for parallel processing
  2. Meaningful names - Include split and sequence in filenames
  3. Use compression - Save storage and bandwidth
  4. Include record counts - Enables progress tracking
  5. Add checksums - Verify integrity after transfer