# Artifacts Section
The artifacts section provides machine-actionable metadata about the actual dataset files. This enables reproducible downloads with integrity verification through checksums.
## Required Fields (when `artifacts` is present)

| Field | Type | Description |
|---|---|---|
| `files` | array | List of dataset files with metadata |

Each file object requires:

| Field | Type | Description |
|---|---|---|
| `sha256` | string | SHA-256 checksum (64 hex characters) |
| `size_bytes` | integer | File size in bytes |
| `split` | string | Split name (e.g., `train`, `validation`, `test`) |

Plus either `uri` or `path`:

| Field | Type | Description |
|---|---|---|
| `uri` | string (uri) | Resolvable URI for the file |
| `path` | string | Relative file path (used with `base_uri`) |
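For consumers that want to validate these requirements before downloading anything, here is a minimal sketch in Python (the function name and messages are illustrative, not part of the schema):

```python
import re

# 64 hex characters, per the sha256 field definition above.
SHA256_RE = re.compile(r"^[0-9a-fA-F]{64}$")

def check_file_entry(entry: dict) -> list[str]:
    """Return a list of problems found in a single artifacts.files entry."""
    problems = []
    if not SHA256_RE.match(str(entry.get("sha256", ""))):
        problems.append("sha256 must be 64 hex characters")
    if not isinstance(entry.get("size_bytes"), int) or entry["size_bytes"] < 0:
        problems.append("size_bytes must be a non-negative integer")
    if not entry.get("split"):
        problems.append("split is required")
    if not (entry.get("uri") or entry.get("path")):
        problems.append("each file needs either uri or path")
    return problems
```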
## Optional Fields

### Top-level

| Field | Type | Description |
|---|---|---|
| `base_uri` | string (uri) | Base URI for resolving relative paths |
| `packaging` | enum | How the dataset is packaged |
| `data_root_notes` | string | Notes on interpreting file paths |

### Per-file

| Field | Type | Description |
|---|---|---|
| `record_count` | integer | Approximate number of records in this file |
| `media_type` | string | MIME type (e.g., `application/jsonl`) |
| `compression` | enum | Compression format used |
| `format_notes` | string | Notes about file format/schema |
## Enum Values

### packaging

| Value | Description |
|---|---|
| `single-file` | Dataset in one file |
| `sharded` | Dataset split into numbered shards |
| `directory` | Dataset organized in a directory structure |
| `other` | Other packaging scheme |

### compression

| Value | Description |
|---|---|
| `none` | Uncompressed |
| `zip` | ZIP archive |
| `tar` | TAR archive |
| `gz` | Gzip compression |
| `zst` | Zstandard compression |
| `bz2` | Bzip2 compression |
| `xz` | XZ compression |
| `other` | Other compression |
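When a consumer opens a downloaded file, the `compression` value tells it which reader to use. A rough sketch of that dispatch (standard library only, so `zip`, `tar`, `zst`, and `other` are deliberately left out; `zst` typically needs a third-party package such as `zstandard`):

```python
import bz2
import gzip
import lzma

# Map stream-compression values from the enum above to an opener.
STREAM_OPENERS = {
    "none": open,
    "gz": gzip.open,
    "bz2": bz2.open,
    "xz": lzma.open,
}

def open_artifact(local_path: str, compression: str = "none"):
    """Open an already-downloaded artifact for reading raw bytes."""
    try:
        opener = STREAM_OPENERS[compression]
    except KeyError:
        raise NotImplementedError(f"no reader wired up for compression={compression!r}")
    return opener(local_path, "rb")
```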
## Examples

### Simple Single-File Dataset

```json
{
"artifacts": {
"packaging": "single-file",
"files": [
{
"uri": "https://cdn.example.org/datasets/hausa-news.jsonl",
"sha256": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
"size_bytes": 52428800,
"split": "train",
"record_count": 50000,
"media_type": "application/jsonl"
}
]
}
}
```
### Sharded Dataset with Base URI

```json
{
"artifacts": {
"packaging": "sharded",
"base_uri": "https://cdn.example.org/datasets/multilingual-qa/v1/",
"files": [
{
"path": "train-00000-of-00004.jsonl.gz",
"sha256": "a1b2c3d4e5f6...",
"size_bytes": 268435456,
"split": "train",
"record_count": 125000,
"compression": "gz"
},
{
"path": "train-00001-of-00004.jsonl.gz",
"sha256": "b2c3d4e5f6a1...",
"size_bytes": 268435456,
"split": "train",
"record_count": 125000,
"compression": "gz"
},
{
"path": "validation.jsonl.gz",
"sha256": "c3d4e5f6a1b2...",
"size_bytes": 26843545,
"split": "validation",
"record_count": 10000,
"compression": "gz"
},
{
"path": "test.jsonl.gz",
"sha256": "d4e5f6a1b2c3...",
"size_bytes": 26843545,
"split": "test",
"record_count": 10000,
"compression": "gz"
}
]
}
}
```
### Multiple Formats

```json
{
"artifacts": {
"packaging": "directory",
"base_uri": "s3://example-datasets/hausa-corpus/v2/",
"data_root_notes": "Parquet files recommended for large-scale processing. JSONL provided for compatibility.",
"files": [
{
"path": "parquet/train.parquet",
"sha256": "...",
"size_bytes": 419430400,
"split": "train",
"media_type": "application/vnd.apache.parquet"
},
{
"path": "jsonl/train.jsonl.zst",
"sha256": "...",
"size_bytes": 524288000,
"split": "train",
"media_type": "application/jsonl",
"compression": "zst"
}
]
}
}
```
## Field Details

### sha256

SHA-256 checksums enable integrity verification:

```bash
# Generate checksum
sha256sum train.jsonl
# Output: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855  train.jsonl
```
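The same verification can be done programmatically on the consuming side; a minimal sketch (the helper name is mine) that checks both `sha256` and `size_bytes` from a file entry:

```python
import hashlib
import os

def verify_artifact(local_path: str, expected_sha256: str, expected_size: int) -> bool:
    """Return True if the downloaded file matches size_bytes and sha256."""
    if os.path.getsize(local_path) != expected_size:
        return False
    digest = hashlib.sha256()
    with open(local_path, "rb") as f:
        # Hash in 1 MiB chunks so large shards do not need to fit in memory.
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256.lower()
```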
### split

Use consistent split names across files:

- `train` - Training data
- `validation` or `dev` - Validation/development data
- `test` - Test data
- Custom names for additional splits (e.g., `test-hard`, `train-small`)

Split names must match those defined in `data.splits` if that field is populated.
### uri vs path

Use `uri` when files have absolute URLs:

```json
{
"uri": "https://huggingface.co/datasets/example/corpus/resolve/main/train.jsonl"
}
```
Use `path` with `base_uri` for cleaner multi-file specifications:

```jsonc
{
"base_uri": "https://cdn.example.org/v1/",
"files": [
{ "path": "train.jsonl", "..." },
{ "path": "test.jsonl", "..." }
]
}
// Resolved URLs:
// https://cdn.example.org/v1/train.jsonl
// https://cdn.example.org/v1/test.jsonl
```
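A resolver can implement this with standard relative-URL joining; a small sketch (the helper name is mine) that reproduces the resolved URLs shown above, assuming `base_uri` ends with a trailing slash as in the example:

```python
from urllib.parse import urljoin

def resolve_uri(file_entry: dict, base_uri: str | None = None) -> str:
    """Prefer an explicit uri; otherwise join path onto base_uri."""
    if "uri" in file_entry:
        return file_entry["uri"]
    if base_uri is None:
        raise ValueError("entries that use path require a base_uri")
    # urljoin drops the final path segment unless the base ends with "/",
    # which is why the base_uri in the example above has a trailing slash.
    return urljoin(base_uri, file_entry["path"])

resolve_uri({"path": "train.jsonl"}, "https://cdn.example.org/v1/")
# -> 'https://cdn.example.org/v1/train.jsonl'
```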
### Supported URI Schemes

- `https://` - Standard web URLs
- `http://` - Unencrypted web (not recommended)
- `s3://` - Amazon S3
- `gs://` - Google Cloud Storage
- `hf://` - Hugging Face Hub
- `az://` - Azure Blob Storage
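The schema does not mandate any client library, but as one option, the fsspec ecosystem can read most of these schemes through a single interface, provided the matching backend packages (s3fs for `s3://`, gcsfs for `gs://`, adlfs for `az://`, huggingface_hub for `hf://`) are installed:

```python
import fsspec  # assumes fsspec plus the relevant backend package is installed

def read_first_bytes(uri: str, n: int = 1024) -> bytes:
    """Fetch the first n bytes of an artifact, whatever its URI scheme."""
    with fsspec.open(uri, "rb") as f:
        return f.read(n)
```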
### Connecting to Splits

If you define splits in `data.splits`, the split names in `artifacts.files` should correspond:

```jsonc
{
"data": {
"splits": [
{
"name": "train",
"share_of_examples": 0.8,
"selection_method": "random"
},
{
"name": "test",
"share_of_examples": 0.2,
"selection_method": "random"
}
]
},
"artifacts": {
"files": [
{ "split": "train", "..." },
{ "split": "test", "..." }
]
}
}
```
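A metadata linter or consumer might enforce that correspondence with a check along these lines (the function name is illustrative):

```python
def undeclared_splits(metadata: dict) -> set[str]:
    """Return split names used in artifacts.files but missing from data.splits."""
    declared = {s["name"] for s in metadata.get("data", {}).get("splits", [])}
    used = {f["split"] for f in metadata.get("artifacts", {}).get("files", [])}
    if not declared:
        # data.splits not populated, so there is nothing to cross-check.
        return set()
    return used - declared
```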
## See Also
- Access Section - High-level availability status
- Data Section - Split definitions
- Processing Section - How data was processed