# Artifacts Section

The `artifacts` section provides machine-actionable metadata about the actual dataset files, enabling reproducible downloads with checksum-based integrity verification.

## Required Fields (when `artifacts` is present)

| Field | Type | Description |
|-------|------|-------------|
| `files` | array | List of dataset files with metadata |

Each file object requires:

| Field | Type | Description |
|-------|------|-------------|
| `sha256` | string | SHA-256 checksum (64 hex characters) |
| `size_bytes` | integer | File size in bytes |
| `split` | string | Split name (e.g., `"train"`, `"validation"`, `"test"`) |

Plus either `uri` or `path`:

| Field | Type | Description |
|-------|------|-------------|
| `uri` | string (uri) | Resolvable URI for the file |
| `path` | string | Relative file path (used with `base_uri`) |

## Optional Fields

### Top-level

| Field | Type | Description |
|-------|------|-------------|
| `base_uri` | string (uri) | Base URI for resolving relative paths |
| `packaging` | enum | How the dataset is packaged |
| `data_root_notes` | string | Notes on interpreting file paths |

### Per-file

| Field | Type | Description |
|-------|------|-------------|
| `record_count` | integer | Approximate number of records in this file |
| `media_type` | string | MIME type (e.g., `application/jsonl`) |
| `compression` | enum | Compression format used |
| `format_notes` | string | Notes about file format/schema |

## Enum Values

### `packaging`

| Value | Description |
|-------|-------------|
| `single-file` | Dataset in one file |
| `sharded` | Dataset split into numbered shards |
| `directory` | Dataset organized in a directory structure |
| `other` | Other packaging scheme |

### `compression`

| Value | Description |
|-------|-------------|
| `none` | Uncompressed |
| `zip` | ZIP archive |
| `tar` | TAR archive |
| `gz` | gzip compression |
| `zst` | Zstandard compression |
| `bz2` | bzip2 compression |
| `xz` | XZ compression |
| `other` | Other compression |
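When consuming a manifest, a downloader needs to map each `compression` value to a decompressor. A minimal Python sketch; the `STREAM_OPENERS` table and `open_artifact` helper are illustrative assumptions, not part of the spec:

```python
import bz2
import gzip
import lzma

# Illustrative mapping from the `compression` enum to stream openers.
# "zst" would need the third-party `zstandard` package, and "zip"/"tar"
# are archive formats better handled with zipfile/tarfile, so all four
# are intentionally omitted here.
STREAM_OPENERS = {
    "none": open,
    "gz": gzip.open,
    "bz2": bz2.open,
    "xz": lzma.open,
}

def open_artifact(path, compression="none", mode="rt"):
    """Open a dataset file transparently, decompressing if needed."""
    try:
        opener = STREAM_OPENERS[compression]
    except KeyError:
        raise ValueError(f"no stream opener for compression={compression!r}")
    return opener(path, mode)
```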

## Examples

### Simple Single-File Dataset

```json
{
  "artifacts": {
    "packaging": "single-file",
    "files": [
      {
        "uri": "https://cdn.example.org/datasets/hausa-news.jsonl",
        "sha256": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
        "size_bytes": 52428800,
        "split": "train",
        "record_count": 50000,
        "media_type": "application/jsonl"
      }
    ]
  }
}
```

### Sharded Dataset with Base URI

```json
{
  "artifacts": {
    "packaging": "sharded",
    "base_uri": "https://cdn.example.org/datasets/multilingual-qa/v1/",
    "files": [
      {
        "path": "train-00000-of-00004.jsonl.gz",
        "sha256": "a1b2c3d4e5f6...",
        "size_bytes": 268435456,
        "split": "train",
        "record_count": 125000,
        "compression": "gz"
      },
      {
        "path": "train-00001-of-00004.jsonl.gz",
        "sha256": "b2c3d4e5f6a1...",
        "size_bytes": 268435456,
        "split": "train",
        "record_count": 125000,
        "compression": "gz"
      },
      {
        "path": "validation.jsonl.gz",
        "sha256": "c3d4e5f6a1b2...",
        "size_bytes": 26843545,
        "split": "validation",
        "record_count": 10000,
        "compression": "gz"
      },
      {
        "path": "test.jsonl.gz",
        "sha256": "d4e5f6a1b2c3...",
        "size_bytes": 26843545,
        "split": "test",
        "record_count": 10000,
        "compression": "gz"
      }
    ]
  }
}
```
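A consumer can tally shards and expected records per split from a manifest like the one above. A sketch of a hypothetical `summarize` helper, assuming the manifest has already been parsed into a dict and that relative paths resolve by simple URL joining against `base_uri`:

```python
from collections import defaultdict
from urllib.parse import urljoin

def summarize(artifacts):
    """Resolve each file's URL and total expected records per split."""
    base = artifacts.get("base_uri", "")
    totals = defaultdict(lambda: {"files": 0, "records": 0, "urls": []})
    for f in artifacts["files"]:
        entry = totals[f["split"]]
        entry["files"] += 1
        # record_count is optional, so treat a missing value as zero
        entry["records"] += f.get("record_count", 0)
        # an absolute uri wins; otherwise join path onto base_uri
        entry["urls"].append(f.get("uri") or urljoin(base, f["path"]))
    return dict(totals)
```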

### Multiple Formats

```json
{
  "artifacts": {
    "packaging": "directory",
    "base_uri": "s3://example-datasets/hausa-corpus/v2/",
    "data_root_notes": "Parquet files recommended for large-scale processing. JSONL provided for compatibility.",
    "files": [
      {
        "path": "parquet/train.parquet",
        "sha256": "...",
        "size_bytes": 419430400,
        "split": "train",
        "media_type": "application/vnd.apache.parquet"
      },
      {
        "path": "jsonl/train.jsonl.zst",
        "sha256": "...",
        "size_bytes": 524288000,
        "split": "train",
        "media_type": "application/jsonl",
        "compression": "zst"
      }
    ]
  }
}
```

## Field Details

### `sha256`

SHA-256 checksums enable integrity verification:

```sh
# Generate checksum
sha256sum train.jsonl
# Output: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855  train.jsonl
```
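The same check can be automated at download time. A sketch of a hypothetical `verify_file` helper that validates both `size_bytes` and `sha256` for a downloaded file, hashing in chunks so large shards don't need to fit in memory:

```python
import hashlib
import os

def verify_file(path, expected_sha256, expected_size):
    """Check a downloaded file against its manifest entry."""
    # Cheap size check first; skip hashing on an obvious mismatch.
    if os.path.getsize(path) != expected_size:
        return False
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in 1 MiB chunks to keep memory use flat.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() == expected_sha256
```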

### `split`

Use consistent split names across files:

- `train` - Training data
- `validation` or `dev` - Validation/development data
- `test` - Test data
- Custom names for additional splits (e.g., `test-hard`, `train-small`)

Split names must match those defined in `data.splits` if that field is populated.

### `uri` vs `path`

Use `uri` when files have absolute URLs:

```json
{
  "uri": "https://huggingface.co/datasets/example/corpus/resolve/main/train.jsonl"
}
```

Use `path` with `base_uri` for cleaner multi-file specifications:

```jsonc
{
  "base_uri": "https://cdn.example.org/v1/",
  "files": [
    { "path": "train.jsonl", "..." },
    { "path": "test.jsonl", "..." }
  ]
}
// Resolved URLs:
// https://cdn.example.org/v1/train.jsonl
// https://cdn.example.org/v1/test.jsonl
```
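This resolution logic is straightforward to implement. A sketch of a hypothetical `resolve` helper; note that Python's `urljoin` drops the final path segment unless the base ends with a trailing slash, so publishers should write `base_uri` with one, as in the examples above:

```python
from urllib.parse import urljoin

def resolve(entry, base_uri=None):
    """Return the absolute URL for a file entry (`uri` wins over `path`)."""
    if "uri" in entry:
        return entry["uri"]
    if base_uri is None:
        raise ValueError("path entries require a base_uri")
    # base_uri should end with "/" so the last segment is preserved
    return urljoin(base_uri, entry["path"])
```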

### Supported URI Schemes

- `https://` - Standard web URLs
- `http://` - Unencrypted web (not recommended)
- `s3://` - Amazon S3
- `gs://` - Google Cloud Storage
- `hf://` - Hugging Face Hub
- `az://` - Azure Blob Storage

## Connecting to Splits

If you define splits in `data.splits`, the split names in `artifacts.files` should correspond:

```jsonc
{
  "data": {
    "splits": [
      {
        "name": "train",
        "share_of_examples": 0.8,
        "selection_method": "random"
      },
      {
        "name": "test",
        "share_of_examples": 0.2,
        "selection_method": "random"
      }
    ]
  },
  "artifacts": {
    "files": [
      { "split": "train", "..." },
      { "split": "test", "..." }
    ]
  }
}
```
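This correspondence is easy to lint. A sketch of a hypothetical `check_split_consistency` helper that reports artifact split names missing from `data.splits`, assuming the full record has been parsed into a dict:

```python
def check_split_consistency(record):
    """Return artifact split names that don't appear in data.splits."""
    declared = {s["name"] for s in record.get("data", {}).get("splits", [])}
    used = {f["split"] for f in record.get("artifacts", {}).get("files", [])}
    # If no splits are declared, there is nothing to check against.
    return used - declared if declared else set()
```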

## See Also