Large Scale Datasets

This guide covers strategies for organizing datasets with millions or billions of records.

Sharding Strategies

Fixed Record Count

Split into shards with a fixed number of records each:

{
  "files": [
    {
      "split": "train",
      "objective": "sft",
      "shards": [
        { "path": "data/train-00000.jsonl.zst", "compression": "zstd", "record_count": 100000 },
        { "path": "data/train-00001.jsonl.zst", "compression": "zstd", "record_count": 100000 },
        { "path": "data/train-00002.jsonl.zst", "compression": "zstd", "record_count": 100000 }
      ]
    }
  ]
}

Benefits:

  • Predictable shard sizes
  • Easy parallel processing
  • Simple progress tracking
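
A minimal writer sketch in Python, assuming the third-party zstandard package; the function name, output directory, and 100,000-record target are illustrative, not part of the format:

import json
import zstandard as zstd

RECORDS_PER_SHARD = 100_000  # illustrative target

def write_fixed_count_shards(records, out_dir="data"):
    """Write records into fixed-count shards and return their metadata entries."""
    shards, buffer, index = [], [], 0
    cctx = zstd.ZstdCompressor()

    def flush():
        nonlocal index
        path = f"{out_dir}/train-{index:05d}.jsonl.zst"
        payload = "".join(json.dumps(r) + "\n" for r in buffer).encode("utf-8")
        with open(path, "wb") as fh:
            fh.write(cctx.compress(payload))
        shards.append({"path": path, "compression": "zstd", "record_count": len(buffer)})
        buffer.clear()
        index += 1

    for record in records:
        buffer.append(record)
        if len(buffer) == RECORDS_PER_SHARD:
            flush()
    if buffer:
        flush()  # final, smaller shard
    return shards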

Fixed File Size

Split when each shard reaches a target file size; the record count per shard then varies:

{
  "shards": [
    { "path": "data/train-00000.jsonl.zst", "compression": "zstd" },
    { "path": "data/train-00001.jsonl.zst", "compression": "zstd" }
  ]
}

Benefits:

  • Consistent I/O patterns
  • Better storage planning
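
A roll-over sketch under the same assumptions (zstandard installed; the ~250 MB target and function name are illustrative), which starts a new shard once the raw bytes buffered for the current one reach the target:

import json
import zstandard as zstd

TARGET_BYTES = 250 * 1024 * 1024  # illustrative: ~250 MB of raw JSONL per shard

def write_fixed_size_shards(records, out_dir="data"):
    """Write size-bounded shards and return their metadata entries."""
    shards, buffer, size, index = [], [], 0, 0
    cctx = zstd.ZstdCompressor()

    def flush():
        nonlocal size, index
        path = f"{out_dir}/train-{index:05d}.jsonl.zst"
        with open(path, "wb") as fh:
            fh.write(cctx.compress(b"".join(buffer)))
        shards.append({"path": path, "compression": "zstd"})
        buffer.clear()
        size, index = 0, index + 1

    for record in records:
        line = (json.dumps(record) + "\n").encode("utf-8")
        buffer.append(line)
        size += len(line)
        if size >= TARGET_BYTES:
            flush()
    if buffer:
        flush()
    return shards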

By Domain/Topic

Split by content type:

{
  "files": [
    {
      "split": "train",
      "objective": "sft",
      "shards": [
        { "path": "data/train-code.jsonl.zst", "compression": "zstd" },
        { "path": "data/train-math.jsonl.zst", "compression": "zstd" },
        { "path": "data/train-writing.jsonl.zst", "compression": "zstd" }
      ],
      "notes": "Shards organized by domain for sampling control"
    }
  ]
}

Benefits:

  • Control domain mixing
  • Easy subset selection
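
Domain-named shards make weighted mixing straightforward. A minimal sketch; the weights and the sample_domain helper are illustrative:

import random

# Illustrative mixing weights; tune per training run.
DOMAIN_WEIGHTS = {
    "data/train-code.jsonl.zst": 0.5,
    "data/train-math.jsonl.zst": 0.3,
    "data/train-writing.jsonl.zst": 0.2,
}

def sample_domain():
    """Pick which domain shard the next record should come from."""
    paths = list(DOMAIN_WEIGHTS)
    weights = [DOMAIN_WEIGHTS[p] for p in paths]
    return random.choices(paths, weights=weights, k=1)[0]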

Compression

Zstd (recommended)

{ "path": "data/train.jsonl.zst", "compression": "zstd" }
  • Better compression ratio than gzip
  • Faster decompression
  • Supports streaming

Gzip

{ "path": "data/train.jsonl.gz", "compression": "gzip" }
  • Universal compatibility
  • Good compression
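
A streaming reader sketch that handles either compression; the read_jsonl helper is illustrative, and the .zst branch assumes the third-party zstandard package:

import gzip
import io
import json
import zstandard as zstd

def read_jsonl(path):
    """Yield records from a .jsonl, .jsonl.gz, or .jsonl.zst shard without loading it fully."""
    if path.endswith(".zst"):
        with open(path, "rb") as fh:
            reader = zstd.ZstdDecompressor().stream_reader(fh)
            for line in io.TextIOWrapper(reader, encoding="utf-8"):
                yield json.loads(line)
    elif path.endswith(".gz"):
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            for line in fh:
                yield json.loads(line)
    else:
        with open(path, "r", encoding="utf-8") as fh:
            for line in fh:
                yield json.loads(line)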

Parallel Processing

Design for parallel reads:

# Good: Independent shards
data/train-00000.jsonl.zst
data/train-00001.jsonl.zst
data/train-00002.jsonl.zst

# Process in parallel
worker-0 → train-00000.jsonl.zst
worker-1 → train-00001.jsonl.zst
worker-2 → train-00002.jsonl.zst
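
A minimal one-shard-per-worker sketch using the standard library; process_shard is a placeholder for whatever per-record work you do, and the .zst reading assumes the third-party zstandard package:

import io
import json
from concurrent.futures import ProcessPoolExecutor
import zstandard as zstd

def process_shard(path):
    """Placeholder per-shard work: here it just counts records."""
    count = 0
    with open(path, "rb") as fh:
        reader = io.TextIOWrapper(zstd.ZstdDecompressor().stream_reader(fh), encoding="utf-8")
        for line in reader:
            json.loads(line)  # replace with real per-record processing
            count += 1
    return path, count

shard_paths = [
    "data/train-00000.jsonl.zst",
    "data/train-00001.jsonl.zst",
    "data/train-00002.jsonl.zst",
]

if __name__ == "__main__":
    # Shards are independent files, so workers need no coordination.
    with ProcessPoolExecutor(max_workers=3) as pool:
        for path, count in pool.map(process_shard, shard_paths):
            print(f"{path}: {count} records")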

Checksums

Always include checksums for large datasets:

{
  "shards": [
    {
      "path": "data/train-00000.jsonl.zst",
      "compression": "zstd",
      "record_count": 100000,
      "sha256": "a1b2c3d4..."
    }
  ]
}

Verify after transfer:

sha256sum -c checksums.txt
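
You can also verify directly against the sha256 fields in the metadata. A minimal sketch, assuming the metadata layout shown above; verify_shards is an illustrative name:

import hashlib
import json

def verify_shards(metadata_path="metadata.json"):
    """Recompute each shard's SHA-256 and compare it with the value recorded in the metadata."""
    with open(metadata_path) as fh:
        metadata = json.load(fh)
    for group in metadata["files"]:
        for shard in group["shards"]:
            digest = hashlib.sha256()
            with open(shard["path"], "rb") as data:
                for chunk in iter(lambda: data.read(1 << 20), b""):
                    digest.update(chunk)
            if digest.hexdigest() != shard["sha256"]:
                raise ValueError(f"checksum mismatch: {shard['path']}")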

Asset Deduplication

For multimodal datasets, deduplicate media:

Without deduplication:

  • 1M records × 3 images each = 3M image URIs repeated

With assets.jsonl:

  • 500K unique images in assets.jsonl
  • Records reference media by asset_id

{
  "asset_index": { "path": "assets.jsonl", "format": "jsonl" }
}
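
A deduplication sketch that hashes media content and writes one assets.jsonl entry per unique file; the id scheme and the uri/sha256 field names are illustrative:

import hashlib
import json

def build_asset_index(media_paths, out_path="assets.jsonl"):
    """Write one assets.jsonl entry per unique file; return a content-hash -> asset_id map."""
    seen = {}
    with open(out_path, "w", encoding="utf-8") as out:
        for path in media_paths:
            with open(path, "rb") as fh:
                digest = hashlib.sha256(fh.read()).hexdigest()
            if digest not in seen:
                asset_id = f"img-{len(seen):07d}"  # illustrative id scheme
                seen[digest] = asset_id
                out.write(json.dumps({"asset_id": asset_id, "uri": path, "sha256": digest}) + "\n")
    return seen  # use this map to rewrite records to reference asset_id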

Storage Estimation

Rough estimates for planning:

Content Type         Size per Record   1M Records
Text-only SFT        ~2-5 KB           2-5 GB
Multimodal (refs)    ~3-8 KB           3-8 GB
Tool use             ~4-10 KB          4-10 GB

Compressed sizes are typically 30-50% of raw.
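
As a quick worked example using the figures above (the helper and the 0.4 compression ratio are illustrative):

def estimate_storage_gb(record_count, kb_per_record, compressed_ratio=0.4):
    """Rough raw and compressed size in GB."""
    raw_gb = record_count * kb_per_record / 1_000_000
    return raw_gb, raw_gb * compressed_ratio

# 1M text-only SFT records at ~3 KB each: ~3 GB raw, ~1.2 GB compressed
print(estimate_storage_gb(1_000_000, 3))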

Directory Structure

my-large-dataset/
  metadata.json
  assets.jsonl
  data/
    sft/
      train-00000.jsonl.zst
      train-00001.jsonl.zst
      ...
      train-00999.jsonl.zst
      val-00000.jsonl.zst
    preference/
      train-00000.jsonl.zst
      ...

Best Practices

  1. Use zstd - Better compression and speed
  2. Consistent shard sizes - 50-100K records or 100-500MB
  3. Include checksums - Essential for integrity
  4. Track record counts - Enable progress monitoring
  5. Deduplicate assets - Save storage for multimodal
  6. Organize by objective - Separate SFT, preference, etc.
  7. Document everything - Clear notes in metadata