Shards
Shards are the JSONL files containing your training records. The files array in metadata.json organizes shards by split and training objective.
File Structure
{
  "files": [
    {
      "split": "train",
      "objective": "sft",
      "shards": [
        { "path": "data/sft-train-00000.jsonl.gz", "compression": "gzip" },
        { "path": "data/sft-train-00001.jsonl.gz", "compression": "gzip" }
      ]
    }
  ]
}
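As a rough illustration of how a consumer might walk this structure, the sketch below loads metadata.json and yields every shard along with the split and objective of its parent file entry. The function names `load_metadata` and `iter_shards` are hypothetical helpers, not part of any published tooling.

```python
import json
from pathlib import Path


def load_metadata(path):
    """Parse metadata.json into a plain dict."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def iter_shards(metadata):
    """Yield (split, objective, shard) for every shard in the files array."""
    for entry in metadata.get("files", []):
        for shard in entry.get("shards", []):
            yield entry["split"], entry["objective"], shard
```

Calling `list(iter_shards(load_metadata("metadata.json")))` on the example above would produce two tuples, one per shard, each tagged with `"train"` and `"sft"`.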
File Entry Fields
| Field | Required | Description |
|---|---|---|
| split | Yes | train, validation, or test |
| objective | Yes | sft, preference, rft, or cpt |
| format | No | Always jsonl |
| toolset_id | No | Toolset for this file |
| shards | Yes | Array of shard configurations |
| notes | No | Additional notes |
Shard Fields
| Field | Required | Description |
|---|---|---|
| path | Yes | Relative path to the shard file |
| compression | No | none, gzip, or zstd |
| record_count | No | Number of records in shard |
| sha256 | No | SHA-256 hash of the file |
Compression
None (Uncompressed)
{ "path": "data/train.jsonl" }
Gzip
{ "path": "data/train.jsonl.gz", "compression": "gzip" }
Zstandard (zstd)
{ "path": "data/train.jsonl.zst", "compression": "zstd" }
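A reader for these three cases could look roughly like the following sketch. It assumes that an omitted `compression` field means an uncompressed file, as in the first example, and that zstd support comes from the third-party `zstandard` package, which is imported only when needed. `open_shard` and `read_records` are hypothetical names.

```python
import gzip
import json


def open_shard(path, compression="none"):
    """Open a shard as a text stream according to its compression value."""
    if compression in (None, "none"):
        return open(path, "rt", encoding="utf-8")
    if compression == "gzip":
        return gzip.open(path, "rt", encoding="utf-8")
    if compression == "zstd":
        # Assumption: the third-party `zstandard` package is installed.
        import zstandard
        return zstandard.open(path, "rt", encoding="utf-8")
    raise ValueError(f"unsupported compression: {compression!r}")


def read_records(shard):
    """Yield one parsed JSON object per non-empty line of a shard."""
    with open_shard(shard["path"], shard.get("compression", "none")) as f:
        for line in f:
            if line.strip():
                yield json.loads(line)
```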
Sharding Strategies
By Record Count
Split at fixed record counts (e.g., 50,000 records per shard):
data/train-00000.jsonl.gz # records 0-49,999
data/train-00001.jsonl.gz # records 50,000-99,999
data/train-00002.jsonl.gz # records 100,000-149,999
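One way to produce shards like these is sketched below: records are streamed into gzip files, and a new file is started every `records_per_shard` records. The helper name `write_shards_by_count` and its parameters are hypothetical. The returned entries use the output path as written, so it may need to be made relative to the dataset root before going into metadata.json.

```python
import gzip
import json
from pathlib import Path


def write_shards_by_count(records, out_dir, prefix="train", records_per_shard=50_000):
    """Write records to numbered gzip shards and return their shard entries."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    shards = []
    handle = None
    for i, record in enumerate(records):
        if i % records_per_shard == 0:
            # Close the previous shard and start the next numbered file.
            if handle is not None:
                handle.close()
            path = out_dir / f"{prefix}-{len(shards):05d}.jsonl.gz"
            handle = gzip.open(path, "wt", encoding="utf-8")
            shards.append({"path": path.as_posix(), "compression": "gzip", "record_count": 0})
        handle.write(json.dumps(record) + "\n")
        shards[-1]["record_count"] += 1
    if handle is not None:
        handle.close()
    return shards
```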
By File Size
Split at target file sizes (e.g., 100 MB per shard):
data/train-00000.jsonl.gz # ~100 MB
data/train-00001.jsonl.gz # ~100 MB
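Size-based splitting can be approximated as in the sketch below. Because the compressed size of a gzip file is only known once it has been written, this version budgets on uncompressed bytes instead, so the resulting files will usually be smaller than `max_bytes` on disk. The helper name and the default budget are assumptions for illustration.

```python
import gzip
import json
from pathlib import Path


def write_shards_by_size(records, out_dir, prefix="train", max_bytes=100 * 1024 * 1024):
    """Start a new gzip shard whenever the uncompressed byte budget would be exceeded."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    handle = None
    written = 0  # uncompressed bytes written to the current shard
    for record in records:
        line = (json.dumps(record) + "\n").encode("utf-8")
        if handle is None or (written > 0 and written + len(line) > max_bytes):
            if handle is not None:
                handle.close()
            path = out_dir / f"{prefix}-{len(paths):05d}.jsonl.gz"
            handle = gzip.open(path, "wb")
            paths.append(path)
            written = 0
        handle.write(line)
        written += len(line)
    if handle is not None:
        handle.close()
    return paths
```

A record larger than the budget still gets written, alone in its own shard, rather than being dropped or split.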
By Content Type
Split by data characteristics:
data/train-code-00000.jsonl.gz
data/train-math-00000.jsonl.gz
data/train-general-00000.jsonl.gz
Multiple Objectives
Separate files for different training objectives:
{
  "files": [
    {
      "split": "train",
      "objective": "sft",
      "shards": [
        { "path": "data/sft-train.jsonl.gz", "compression": "gzip" }
      ]
    },
    {
      "split": "train",
      "objective": "preference",
      "shards": [
        { "path": "data/pref-train.jsonl.gz", "compression": "gzip" }
      ]
    },
    {
      "split": "validation",
      "objective": "sft",
      "shards": [
        { "path": "data/sft-val.jsonl.gz", "compression": "gzip" }
      ]
    }
  ]
}
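With several file entries in one metadata.json, a training job typically needs only the shards for one split and objective. A minimal sketch of that lookup, using a hypothetical `select_shards` helper, might be:

```python
def select_shards(metadata, split, objective):
    """Return the shard paths whose file entry matches the given split and objective."""
    return [
        shard["path"]
        for entry in metadata.get("files", [])
        if entry.get("split") == split and entry.get("objective") == objective
        for shard in entry.get("shards", [])
    ]
```

For the example above, `select_shards(metadata, "train", "preference")` would return only `data/pref-train.jsonl.gz`.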
Checksums
Include SHA-256 hashes for integrity:
{
  "shards": [
    {
      "path": "data/train-00000.jsonl.gz",
      "compression": "gzip",
      "record_count": 50000,
      "sha256": "a1b2c3d4e5f6..."
    }
  ]
}
Generate checksums:
sha256sum data/train-00000.jsonl.gz
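The same digest can be computed and checked in Python with the standard `hashlib` module. The sketch below hashes the file exactly as stored on disk (the compressed bytes, matching what `sha256sum` sees) and compares the result with the `sha256` field. `sha256_of_file` and `verify_shard` are hypothetical helper names, and treating a missing `sha256` as "nothing to verify" is an assumption.

```python
import hashlib
import os


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_shard(shard, root="."):
    """Return True if the shard's file matches its recorded sha256, or if none is recorded."""
    expected = shard.get("sha256")
    if expected is None:
        return True  # Assumption: no checksum recorded means nothing to verify.
    actual = sha256_of_file(os.path.join(root, shard["path"]))
    return actual == expected.lower()
```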
Best Practices
- Consistent shard sizes: Keep shards roughly equal for parallel processing
- Meaningful names: Include split and sequence in filenames
- Use compression: Save storage and bandwidth
- Include record counts: Enables progress tracking
- Add checksums: Verify integrity after transfer