# Artifacts Section

The `artifacts` section provides machine-actionable metadata about the actual dataset files, enabling reproducible downloads with checksum-based integrity verification.

## Required Fields (when `artifacts` is present)

| Field | Type | Description |
|-------|------|-------------|
| `files` | array | List of dataset files with metadata |

Each file object requires:

| Field | Type | Description |
|-------|------|-------------|
| `sha256` | string | SHA-256 checksum (64 hex characters) |
| `size_bytes` | integer | File size in bytes |
| `split` | string | Split name (e.g., `"train"`, `"validation"`, `"test"`) |

Plus either `uri` or `path`:

| Field | Type | Description |
|-------|------|-------------|
| `uri` | string (uri) | Resolvable URI for the file |
| `path` | string | Relative file path (used with `base_uri`) |

## Optional Fields

### Top-level

| Field | Type | Description |
|-------|------|-------------|
| `base_uri` | string (uri) | Base URI for resolving relative paths |
| `packaging` | enum | How the dataset is packaged |
| `data_root_notes` | string | Notes on interpreting file paths |

### Per-file

| Field | Type | Description |
|-------|------|-------------|
| `record_count` | integer | Approximate number of records in this file |
| `media_type` | string | MIME type (e.g., `application/jsonl`) |
| `compression` | enum | Compression format used |
| `format_notes` | string | Notes about file format/schema |

## Enum Values

### `packaging`

| Value | Description |
|-------|-------------|
| `single-file` | Dataset in one file |
| `sharded` | Dataset split into numbered shards |
| `directory` | Dataset organized in a directory structure |
| `other` | Other packaging scheme |

### `compression`

| Value | Description |
|-------|-------------|
| `none` | Uncompressed |
| `zip` | ZIP archive |
| `tar` | TAR archive |
| `gz` | gzip compression |
| `zst` | Zstandard compression |
| `bz2` | bzip2 compression |
| `xz` | XZ compression |
| `other` | Other compression |
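When consuming a manifest, a downloader needs to map each `compression` value to a decompressor. A minimal Python sketch; the `STREAM_OPENERS` table and `open_artifact` helper are illustrative assumptions, not part of the spec:

```python
import bz2
import gzip
import lzma

# Illustrative mapping from the `compression` enum to stream openers.
# "zst" would need the third-party `zstandard` package, and "zip"/"tar"
# are archive formats better handled with zipfile/tarfile, so all four
# are intentionally omitted here.
STREAM_OPENERS = {
    "none": open,
    "gz": gzip.open,
    "bz2": bz2.open,
    "xz": lzma.open,
}

def open_artifact(path, compression="none", mode="rt"):
    """Open a dataset file transparently, decompressing if needed."""
    try:
        opener = STREAM_OPENERS[compression]
    except KeyError:
        raise ValueError(f"no stream opener for compression={compression!r}")
    return opener(path, mode)
```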

## Examples

### Simple Single-File Dataset

```json
{
  "artifacts": {
    "packaging": "single-file",
    "files": [
      {
        "uri": "https://cdn.example.org/datasets/hausa-news.jsonl",
        "sha256": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
        "size_bytes": 52428800,
        "split": "train",
        "record_count": 50000,
        "media_type": "application/jsonl"
      }
    ]
  }
}
```

### Sharded Dataset with Base URI

```json
{
  "artifacts": {
    "packaging": "sharded",
    "base_uri": "https://cdn.example.org/datasets/multilingual-qa/v1/",
    "files": [
      {
        "path": "train-00000-of-00004.jsonl.gz",
        "sha256": "a1b2c3d4e5f6...",
        "size_bytes": 268435456,
        "split": "train",
        "record_count": 125000,
        "compression": "gz"
      },
      {
        "path": "train-00001-of-00004.jsonl.gz",
        "sha256": "b2c3d4e5f6a1...",
        "size_bytes": 268435456,
        "split": "train",
        "record_count": 125000,
        "compression": "gz"
      },
      {
        "path": "validation.jsonl.gz",
        "sha256": "c3d4e5f6a1b2...",
        "size_bytes": 26843545,
        "split": "validation",
        "record_count": 10000,
        "compression": "gz"
      },
      {
        "path": "test.jsonl.gz",
        "sha256": "d4e5f6a1b2c3...",
        "size_bytes": 26843545,
        "split": "test",
        "record_count": 10000,
        "compression": "gz"
      }
    ]
  }
}
```
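A consumer can tally shards and expected records per split from a manifest like the one above. A sketch of a hypothetical `summarize` helper, assuming the manifest has already been parsed into a dict and that relative paths resolve by simple URL joining against `base_uri`:

```python
from collections import defaultdict
from urllib.parse import urljoin

def summarize(artifacts):
    """Resolve each file's URL and total expected records per split."""
    base = artifacts.get("base_uri", "")
    totals = defaultdict(lambda: {"files": 0, "records": 0, "urls": []})
    for f in artifacts["files"]:
        entry = totals[f["split"]]
        entry["files"] += 1
        # record_count is optional, so treat a missing value as zero
        entry["records"] += f.get("record_count", 0)
        # an absolute uri wins; otherwise join path onto base_uri
        entry["urls"].append(f.get("uri") or urljoin(base, f["path"]))
    return dict(totals)
```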

### Multiple Formats

```json
{
  "artifacts": {
    "packaging": "directory",
    "base_uri": "s3://example-datasets/hausa-corpus/v2/",
    "data_root_notes": "Parquet files recommended for large-scale processing. JSONL provided for compatibility.",
    "files": [
      {
        "path": "parquet/train.parquet",
        "sha256": "...",
        "size_bytes": 419430400,
        "split": "train",
        "media_type": "application/vnd.apache.parquet"
      },
      {
        "path": "jsonl/train.jsonl.zst",
        "sha256": "...",
        "size_bytes": 524288000,
        "split": "train",
        "media_type": "application/jsonl",
        "compression": "zst"
      }
    ]
  }
}
```

## Field Details

### `sha256`

SHA-256 checksums enable integrity verification:

```sh
# Generate checksum
sha256sum train.jsonl
# Output: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855  train.jsonl
```
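The same check can be automated at download time. A sketch of a hypothetical `verify_file` helper that validates both `size_bytes` and `sha256` for a downloaded file, hashing in chunks so large shards don't need to fit in memory:

```python
import hashlib
import os

def verify_file(path, expected_sha256, expected_size):
    """Check a downloaded file against its manifest entry."""
    # Cheap size check first; skip hashing on an obvious mismatch.
    if os.path.getsize(path) != expected_size:
        return False
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in 1 MiB chunks to keep memory use flat.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() == expected_sha256
```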

### `split`

Use consistent split names across files:

- `train` - Training data
- `validation` or `dev` - Validation/development data
- `test` - Test data
- Custom names for additional splits (e.g., `test-hard`, `train-small`)

Split names must match those defined in `data.splits` if that field is populated.

### `uri` vs `path`

Use `uri` when files have absolute URLs:

```json
{
  "uri": "https://huggingface.co/datasets/example/corpus/resolve/main/train.jsonl"
}
```

Use `path` with `base_uri` for cleaner multi-file specifications:

```jsonc
{
  "base_uri": "https://cdn.example.org/v1/",
  "files": [
    { "path": "train.jsonl", "..." },
    { "path": "test.jsonl", "..." }
  ]
}
// Resolved URLs:
// https://cdn.example.org/v1/train.jsonl
// https://cdn.example.org/v1/test.jsonl
```
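This resolution logic is straightforward to implement. A sketch of a hypothetical `resolve` helper; note that Python's `urljoin` drops the final path segment unless the base ends with a trailing slash, so publishers should write `base_uri` with one, as in the examples above:

```python
from urllib.parse import urljoin

def resolve(entry, base_uri=None):
    """Return the absolute URL for a file entry (`uri` wins over `path`)."""
    if "uri" in entry:
        return entry["uri"]
    if base_uri is None:
        raise ValueError("path entries require a base_uri")
    # base_uri should end with "/" so the last segment is preserved
    return urljoin(base_uri, entry["path"])
```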

### Supported URI Schemes

- `https://` - Standard web URLs
- `http://` - Unencrypted web (not recommended)
- `s3://` - Amazon S3
- `gs://` - Google Cloud Storage
- `hf://` - Hugging Face Hub
- `az://` - Azure Blob Storage

## Connecting to Splits

If you define splits in `data.splits`, the split names in `artifacts.files` should correspond:

```jsonc
{
  "data": {
    "splits": [
      {
        "name": "train",
        "share_of_examples": 0.8,
        "selection_method": "random"
      },
      {
        "name": "test",
        "share_of_examples": 0.2,
        "selection_method": "random"
      }
    ]
  },
  "artifacts": {
    "files": [
      { "split": "train", "..." },
      { "split": "test", "..." }
    ]
  }
}
```
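This correspondence is easy to lint. A sketch of a hypothetical `check_split_consistency` helper that reports artifact split names missing from `data.splits`, assuming the full record has been parsed into a dict:

```python
def check_split_consistency(record):
    """Return artifact split names that don't appear in data.splits."""
    declared = {s["name"] for s in record.get("data", {}).get("splits", [])}
    used = {f["split"] for f in record.get("artifacts", {}).get("files", [])}
    # If no splits are declared, there is nothing to check against.
    return used - declared if declared else set()
```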

## See Also