# Large Scale Datasets
This guide covers strategies for organizing datasets with millions or billions of records.
## Sharding Strategies

### Fixed Record Count
Split at consistent record counts:
```json
{
  "files": [
    {
      "split": "train",
      "objective": "sft",
      "shards": [
        { "path": "data/train-00000.jsonl.zst", "compression": "zstd", "record_count": 100000 },
        { "path": "data/train-00001.jsonl.zst", "compression": "zstd", "record_count": 100000 },
        { "path": "data/train-00002.jsonl.zst", "compression": "zstd", "record_count": 100000 }
      ]
    }
  ]
}
```
Benefits:
- Predictable shard sizes
- Easy parallel processing
- Simple progress tracking
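As a concrete illustration, here is a minimal Python sketch of writing fixed-count shards with the `zstandard` package. The 100,000-record target and file naming mirror the example above but are otherwise assumptions, not requirements:

```python
import json
import os

import zstandard as zstd

RECORDS_PER_SHARD = 100_000  # assumed target; match it to your pipeline


def write_fixed_count_shards(records, out_dir="data", prefix="train"):
    """Write an iterable of dict records into zstd-compressed JSONL shards."""
    os.makedirs(out_dir, exist_ok=True)
    shard_meta, buffer, shard_idx = [], [], 0

    def flush(buf, idx):
        path = os.path.join(out_dir, f"{prefix}-{idx:05d}.jsonl.zst")
        with open(path, "wb") as fh, zstd.ZstdCompressor().stream_writer(fh) as writer:
            for rec in buf:
                writer.write((json.dumps(rec) + "\n").encode("utf-8"))
        # Record the shard entry as it would appear in metadata.json.
        shard_meta.append({"path": path, "compression": "zstd", "record_count": len(buf)})

    for rec in records:
        buffer.append(rec)
        if len(buffer) == RECORDS_PER_SHARD:
            flush(buffer, shard_idx)
            buffer, shard_idx = [], shard_idx + 1
    if buffer:  # final, possibly smaller shard
        flush(buffer, shard_idx)
    return shard_meta
```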
### Fixed File Size
Split at target file sizes:
```json
{
  "shards": [
    { "path": "data/train-00000.jsonl.zst", "compression": "zstd" },
    { "path": "data/train-00001.jsonl.zst", "compression": "zstd" }
  ]
}
```
Benefits:
- Consistent I/O patterns
- Better storage planning
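A similar sketch for size-based rotation. Because the compressed size is only known after writing, this version caps the uncompressed bytes per shard; the 256 MB figure is an assumption, which at a 30-50% compression ratio lands in the low hundreds of MB on disk:

```python
import json
import os

import zstandard as zstd

TARGET_UNCOMPRESSED_BYTES = 256 * 1024 * 1024  # assumed cap per shard before compression


def write_fixed_size_shards(records, out_dir="data", prefix="train"):
    """Rotate to a new zstd-compressed JSONL shard once the size cap is reached."""
    os.makedirs(out_dir, exist_ok=True)

    def open_shard(idx):
        fh = open(os.path.join(out_dir, f"{prefix}-{idx:05d}.jsonl.zst"), "wb")
        return fh, zstd.ZstdCompressor().stream_writer(fh)

    shard_idx, bytes_written = 0, 0
    fh, writer = open_shard(shard_idx)
    for rec in records:
        line = (json.dumps(rec) + "\n").encode("utf-8")
        if bytes_written and bytes_written + len(line) > TARGET_UNCOMPRESSED_BYTES:
            writer.close()  # flush the zstd frame
            fh.close()
            shard_idx, bytes_written = shard_idx + 1, 0
            fh, writer = open_shard(shard_idx)
        writer.write(line)
        bytes_written += len(line)
    writer.close()
    fh.close()
```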
### By Domain/Topic
Split by content type:
```json
{
  "files": [
    {
      "split": "train",
      "objective": "sft",
      "shards": [
        { "path": "data/train-code.jsonl.zst", "compression": "zstd" },
        { "path": "data/train-math.jsonl.zst", "compression": "zstd" },
        { "path": "data/train-writing.jsonl.zst", "compression": "zstd" }
      ],
      "notes": "Shards organized by domain for sampling control"
    }
  ]
}
```
Benefits:
- Control domain mixing
- Easy subset selection
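The payoff of domain shards is that mixing ratios become a sampling decision at read time. A rough sketch of weighted interleaving; the domain names and weights are placeholders:

```python
import random


def mixed_stream(streams, weights, seed=0):
    """Interleave records from per-domain iterators according to mixing weights."""
    rng = random.Random(seed)
    iterators = {name: iter(s) for name, s in streams.items()}
    while iterators:
        names = list(iterators)
        name = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
        try:
            yield next(iterators[name])
        except StopIteration:
            iterators.pop(name)  # domain exhausted; keep sampling from the rest


# Toy usage with in-memory stand-ins for the code/math/writing shards above:
mixed = mixed_stream(
    {"code": [{"domain": "code"}] * 3, "math": [{"domain": "math"}] * 2},
    weights={"code": 0.7, "math": 0.3},
)
print([r["domain"] for r in mixed])
```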
## Compression

### Zstd (Recommended)

```json
{ "path": "data/train.jsonl.zst", "compression": "zstd" }
```
- Better compression ratio than gzip
- Faster decompression
- Supports streaming
### Gzip

```json
{ "path": "data/train.jsonl.gz", "compression": "gzip" }
```
- Universal compatibility
- Good compression
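Either format can be read as a stream rather than decompressed to disk first. A minimal sketch; the zstd branch assumes the `zstandard` package is installed:

```python
import gzip
import io
import json

import zstandard as zstd


def iter_records(path):
    """Yield JSON records from a .jsonl.zst or .jsonl.gz shard, one line at a time."""
    if path.endswith(".zst"):
        dctx = zstd.ZstdDecompressor()
        with open(path, "rb") as fh, dctx.stream_reader(fh) as reader:
            for line in io.TextIOWrapper(reader, encoding="utf-8"):
                yield json.loads(line)
    elif path.endswith(".gz"):
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            for line in fh:
                yield json.loads(line)
    else:
        raise ValueError(f"unrecognized compression suffix: {path}")
```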
## Parallel Processing
Design for parallel reads:
```text
# Good: Independent shards
data/train-00000.jsonl.zst
data/train-00001.jsonl.zst
data/train-00002.jsonl.zst

# Process in parallel
worker-0 → train-00000.jsonl.zst
worker-1 → train-00001.jsonl.zst
worker-2 → train-00002.jsonl.zst
```
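Because each shard is self-contained, a pool of workers can fan out over the shard list with no coordination. A sketch using a process pool; the per-shard work here is a placeholder that just counts records:

```python
import io
from concurrent.futures import ProcessPoolExecutor

import zstandard as zstd

SHARDS = [  # assumed paths matching the layout above
    "data/train-00000.jsonl.zst",
    "data/train-00001.jsonl.zst",
    "data/train-00002.jsonl.zst",
]


def process_shard(path):
    """Placeholder per-shard work: stream the shard and count its records."""
    dctx = zstd.ZstdDecompressor()
    with open(path, "rb") as fh, dctx.stream_reader(fh) as reader:
        count = sum(1 for _ in io.TextIOWrapper(reader, encoding="utf-8"))
    return path, count


if __name__ == "__main__":
    # One task per shard; no locking needed because shards are independent.
    with ProcessPoolExecutor(max_workers=3) as pool:
        for path, count in pool.map(process_shard, SHARDS):
            print(f"{path}: {count} records")
```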
## Checksums
Always include checksums for large datasets:
```json
{
  "shards": [
    {
      "path": "data/train-00000.jsonl.zst",
      "compression": "zstd",
      "record_count": 100000,
      "sha256": "a1b2c3d4..."
    }
  ]
}
```
Verify after transfer:
```bash
sha256sum -c checksums.txt
```
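A sketch of producing a `checksums.txt` that `sha256sum -c` understands, hashing in chunks so large shards never sit fully in memory (the shard paths are placeholders):

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


shards = ["data/train-00000.jsonl.zst", "data/train-00001.jsonl.zst"]  # assumed paths
with open("checksums.txt", "w") as out:
    for path in shards:
        # "<hash>  <path>" (two spaces) is the format `sha256sum -c` expects.
        out.write(f"{sha256_of(path)}  {path}\n")
```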
## Asset Deduplication
For multimodal datasets, deduplicate media:
Without deduplication:
- 1M records × 3 images each = 3M image URIs repeated
With assets.jsonl:
- 500K unique images in assets.jsonl
- Records reference images by asset_id
```json
{
  "asset_index": { "path": "assets.jsonl", "format": "jsonl" }
}
```
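A sketch of how such an index might be built, deduplicating by URI and rewriting records to carry only an `asset_id`. The `images`/`uri` field names are assumptions; hashing the image bytes instead of the URI would also catch identical files stored under different URIs:

```python
import hashlib
import json


def deduplicate_assets(records, assets_path="assets.jsonl"):
    """Yield records with media URIs replaced by asset_id references.

    Each unique URI is written once to assets.jsonl.
    """
    seen = {}  # uri -> asset_id
    with open(assets_path, "w", encoding="utf-8") as out:
        for rec in records:
            for img in rec.get("images", []):  # assumed per-record media field
                uri = img.pop("uri", None)
                if uri is None:
                    continue
                if uri not in seen:
                    seen[uri] = hashlib.sha256(uri.encode("utf-8")).hexdigest()[:16]
                    out.write(json.dumps({"asset_id": seen[uri], "uri": uri}) + "\n")
                img["asset_id"] = seen[uri]
            yield rec
```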
## Storage Estimation
Rough estimates for planning:
| Content Type | Raw Size per Record | 1M Records (raw) |
|---|---|---|
| Text-only SFT | ~2-5 KB | 2-5 GB |
| Multimodal (refs) | ~3-8 KB | 3-8 GB |
| Tool use | ~4-10 KB | 4-10 GB |
Compressed sizes are typically 30-50% of raw.
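The planning arithmetic is easy to script. A rough helper; the 0.4 compression ratio is an assumed midpoint of the 30-50% range above:

```python
def estimate_storage_gb(num_records, kb_per_record, compression_ratio=0.4):
    """Return (raw_gb, compressed_gb) rough estimates."""
    raw_gb = num_records * kb_per_record / 1_000_000  # 1M KB is roughly 1 GB
    return raw_gb, raw_gb * compression_ratio


raw, compressed = estimate_storage_gb(1_000_000, kb_per_record=4)
print(f"~{raw:.1f} GB raw, ~{compressed:.1f} GB compressed")  # ~4.0 GB raw, ~1.6 GB compressed
```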
## Directory Structure
```text
my-large-dataset/
  metadata.json
  assets.jsonl
  data/
    sft/
      train-00000.jsonl.zst
      train-00001.jsonl.zst
      ...
      train-00999.jsonl.zst
      val-00000.jsonl.zst
    preference/
      train-00000.jsonl.zst
      ...
```
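A small sketch that cross-checks metadata.json against the files on disk, assuming the `files`/`shards` layout shown earlier:

```python
import json
from pathlib import Path

root = Path("my-large-dataset")
metadata = json.loads((root / "metadata.json").read_text())

# Collect any shard paths declared in metadata.json that are missing on disk.
missing = [
    shard["path"]
    for entry in metadata.get("files", [])
    for shard in entry.get("shards", [])
    if not (root / shard["path"]).exists()
]
print("all shards present" if not missing else f"missing shards: {missing}")
```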
## Best Practices
- Use zstd - Better compression ratio and faster decompression than gzip
- Consistent shard sizes - 50-100K records or 100-500 MB per shard
- Include checksums - Essential for integrity
- Track record counts - Enable progress monitoring
- Deduplicate assets - Save storage for multimodal
- Organize by objective - Separate SFT, preference, etc.
- Document everything - Clear notes in metadata