Files & Artifacts

This guide explains how to document the physical files that make up your dataset, including file listings, checksums for integrity verification, and train/test/validation splits.

The Artifacts Section

The artifacts section describes the actual files users will download:

{
  "artifacts": {
    "base_uri": "https://cdn.meetkai.ai/datasets/my-dataset/v1/",
    "total_size_bytes": 1073741824,
    "files": [
      {
        "path": "train.jsonl",
        "size_bytes": 800000000,
        "checksum": "sha256:a1b2c3d4..."
      }
    ]
  }
}

Base URI

The base_uri provides a prefix for all file paths:

{
  "artifacts": {
    "base_uri": "https://cdn.meetkai.ai/datasets/my-dataset/v1.0.0/"
  }
}

With this base URI, a file with path train.jsonl would be downloadable from: https://cdn.meetkai.ai/datasets/my-dataset/v1.0.0/train.jsonl
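For example, a client might resolve the full URL with Python's urllib.parse.urljoin. A minimal sketch; note that the trailing slash on base_uri matters, because urljoin drops the last path segment of a base that lacks one:

from urllib.parse import urljoin

base_uri = "https://cdn.meetkai.ai/datasets/my-dataset/v1.0.0/"

# The trailing slash above is significant: without it, urljoin would
# replace "v1.0.0" instead of appending to it.
print(urljoin(base_uri, "train.jsonl"))
# https://cdn.meetkai.ai/datasets/my-dataset/v1.0.0/train.jsonl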

Alternative: Per-File URIs

If files are hosted in different locations, use uri on each file instead:

{
  "artifacts": {
    "files": [
      {
        "path": "train.jsonl",
        "uri": "https://storage.meetkai.ai/train.jsonl"
      },
      {
        "path": "test.jsonl",
        "uri": "https://cdn.other.com/datasets/test.jsonl"
      }
    ]
  }
}
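A downloader can treat the two styles uniformly by preferring a per-file uri and falling back to base_uri + path. A minimal Python sketch, assuming the document has been parsed into a dict (resolve_download_url is an illustrative helper name, not part of the schema):

from urllib.parse import urljoin

def resolve_download_url(artifacts: dict, file_entry: dict) -> str:
    # A per-file uri takes precedence; otherwise join base_uri and path.
    if "uri" in file_entry:
        return file_entry["uri"]
    return urljoin(artifacts["base_uri"], file_entry["path"])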

File Listings

Basic File Entry

Each file in the files array can include:

| Field | Type | Description |
|-------|------|-------------|
| path | string | File path relative to base_uri |
| uri | string | Full URL (alternative to base_uri + path) |
| size_bytes | integer | File size in bytes |
| checksum | string | Integrity hash (see below) |
| format | string | File format (jsonl, parquet, csv, etc.) |
| split | string | Dataset split (train, test, validation) |
| language | string | Language code if language-specific |
| description | string | Human-readable description |

Complete File Example

{
  "artifacts": {
    "base_uri": "https://cdn.meetkai.ai/datasets/my-dataset/",
    "files": [
      {
        "path": "data/train.parquet",
        "size_bytes": 524288000,
        "checksum": "sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
        "format": "parquet",
        "split": "train",
        "description": "Training data partition"
      }
    ]
  }
}

Checksums

Checksums allow users to verify file integrity after download.

Checksum Format

Use the format algorithm:hash:

"checksum": "sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"

Supported Algorithms

| Algorithm | Use Case | Example |
|-----------|----------|---------|
| sha256 | Recommended for most cases | sha256:e3b0c44... |
| sha512 | Higher security requirements | sha512:cf83e135... |
| md5 | Legacy compatibility only | md5:d41d8cd9... |

Generating Checksums

Linux/macOS

sha256sum train.jsonl
# or
shasum -a 256 train.jsonl

Windows

Get-FileHash train.jsonl -Algorithm SHA256
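Python (cross-platform)

If you are generating a manifest programmatically, a small Python helper can emit the algorithm:hash format directly. A sketch (file_checksum is an illustrative name, not a schema-defined tool):

import hashlib

def file_checksum(path: str, algorithm: str = "sha256") -> str:
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Stream in 1 MiB chunks so large data files never load fully into memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return f"{algorithm}:{h.hexdigest()}"

# file_checksum("train.jsonl") -> "sha256:e3b0c442..."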

Verifying Downloads

Users can verify downloaded files:

# Linux
echo "e3b0c44298fc1c14...  train.jsonl" | sha256sum -c

# macOS
echo "e3b0c44298fc1c14...  train.jsonl" | shasum -a 256 -c

Dataset Splits

Document train/test/validation splits in two complementary ways:

1. File-Level Splits

Tag individual files with their split:

{
  "artifacts": {
    "files": [
      {
        "path": "train.jsonl",
        "split": "train",
        "size_bytes": 800000000
      },
      {
        "path": "validation.jsonl",
        "split": "validation",
        "size_bytes": 100000000
      },
      {
        "path": "test.jsonl",
        "split": "test",
        "size_bytes": 100000000
      }
    ]
  }
}

2. Splits Summary

Provide an overview of split sizes:

{
  "artifacts": {
    "splits": {
      "train": {
        "examples": 80000,
        "size_bytes": 800000000
      },
      "validation": {
        "examples": 10000,
        "size_bytes": 100000000
      },
      "test": {
        "examples": 10000,
        "size_bytes": 100000000
      }
    }
  }
}

Split Fields

| Field | Type | Description |
|-------|------|-------------|
| examples | integer | Number of examples in the split |
| size_bytes | integer | Total size of split files |
| tokens | integer | Token count (for text data) |

Total Size

Track the total dataset size:

{
  "artifacts": {
    "total_size_bytes": 1073741824
  }
}

This should equal the sum of all file sizes:

train.jsonl (800MB) + validation.jsonl (100MB) + test.jsonl (100MB) + metadata.json (1MB) = ~1GB
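A publisher can enforce this, along with the per-split sums, before releasing a manifest. A rough Python sketch, assuming the document is loaded as a dict (check_sizes is an illustrative name):

from collections import defaultdict

def check_sizes(artifacts: dict) -> None:
    files = artifacts.get("files", [])
    total = sum(f.get("size_bytes", 0) for f in files)
    declared = artifacts.get("total_size_bytes")
    if declared is not None and declared != total:
        raise ValueError(f"total_size_bytes is {declared}, files sum to {total}")
    # Cross-check the optional splits summary against per-file split tags.
    by_split: dict[str, int] = defaultdict(int)
    for f in files:
        if "split" in f:
            by_split[f["split"]] += f.get("size_bytes", 0)
    for name, summary in artifacts.get("splits", {}).items():
        if summary.get("size_bytes") not in (None, by_split[name]):
            raise ValueError(
                f"split {name}: summary says {summary['size_bytes']}, "
                f"files sum to {by_split[name]}"
            )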

Archive Formats

For datasets distributed as archives:

{
  "artifacts": {
    "files": [
      {
        "path": "dataset-v1.0.tar.gz",
        "size_bytes": 524288000,
        "checksum": "sha256:abc123...",
        "format": "tar.gz",
        "description": "Compressed archive containing all data files"
      }
    ],
    "archive_contents": ["train.jsonl", "validation.jsonl", "test.jsonl", "metadata.json"]
  }
}
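The archive_contents list can be checked against the archive itself. A hedged sketch using Python's tarfile module (member names may carry a leading "./" depending on how the tarball was created, so real tooling would normalize paths first):

import tarfile

def check_archive_contents(archive_path: str, declared: list[str]) -> None:
    # Compare the declared archive_contents list with what the tarball holds.
    with tarfile.open(archive_path, "r:gz") as tar:
        actual = {member.name for member in tar.getmembers() if member.isfile()}
    missing = set(declared) - actual
    if missing:
        raise ValueError(f"listed in archive_contents but missing: {sorted(missing)}")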

Multi-Part Files

For large datasets split across multiple files:

{
  "artifacts": {
    "files": [
      {
        "path": "train-00000-of-00010.parquet",
        "split": "train",
        "size_bytes": 100000000,
        "checksum": "sha256:aaa..."
      },
      {
        "path": "train-00001-of-00010.parquet",
        "split": "train",
        "size_bytes": 100000000,
        "checksum": "sha256:bbb..."
      }
      // ... more parts
    ]
  }
}
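The NNNNN-of-NNNNN convention makes shard entries easy to generate. A small sketch (shard_entries is hypothetical; a real manifest would compute size_bytes and checksum per shard rather than assume a fixed size):

def shard_entries(split: str, num_shards: int, shard_size: int) -> list[dict]:
    # Emit one file entry per shard, following the NNNNN-of-NNNNN naming above.
    return [
        {
            "path": f"{split}-{i:05d}-of-{num_shards:05d}.parquet",
            "split": split,
            "size_bytes": shard_size,
        }
        for i in range(num_shards)
    ]

# shard_entries("train", 10, 100000000)[0]["path"] -> "train-00000-of-00010.parquet"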

Language-Specific Files

For multilingual datasets with separate files per language:

{
  "artifacts": {
    "files": [
      {
        "path": "en/train.jsonl",
        "language": "en",
        "split": "train",
        "size_bytes": 500000000
      },
      {
        "path": "es/train.jsonl",
        "language": "es",
        "split": "train",
        "size_bytes": 400000000
      },
      {
        "path": "fr/train.jsonl",
        "language": "fr",
        "split": "train",
        "size_bytes": 300000000
      }
    ]
  }
}
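Tagging both language and split lets consumers fetch only the slice they need. A minimal sketch, assuming the parsed document (select_files is an illustrative helper name):

def select_files(artifacts: dict, language: str, split: str) -> list[dict]:
    # Keep only files tagged with the requested language and split.
    return [
        f for f in artifacts.get("files", [])
        if f.get("language") == language and f.get("split") == split
    ]

# select_files(doc["artifacts"], "es", "train") -> [{"path": "es/train.jsonl", ...}]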

Complete Example

Here’s a comprehensive artifacts section:

{
  "artifacts": {
    "base_uri": "https://cdn.meetkai.ai/datasets/multilingual-qa/v1.0.0/",
    "total_size_bytes": 2147483648,
    "splits": {
      "train": {
        "examples": 200000,
        "size_bytes": 1717986918
      },
      "validation": {
        "examples": 25000,
        "size_bytes": 214748365
      },
      "test": {
        "examples": 25000,
        "size_bytes": 214748365
      }
    },
    "files": [
      {
        "path": "train.parquet",
        "size_bytes": 1717986918,
        "checksum": "sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
        "format": "parquet",
        "split": "train",
        "description": "Training partition with all languages"
      },
      {
        "path": "validation.parquet",
        "size_bytes": 214748365,
        "checksum": "sha256:d7a8fbb307d7809469ca9abcb0082e4f8d5651e46d3cdb762d02d0bf37c9e592",
        "format": "parquet",
        "split": "validation"
      },
      {
        "path": "test.parquet",
        "size_bytes": 214748365,
        "checksum": "sha256:9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
        "format": "parquet",
        "split": "test"
      },
      {
        "path": "README.md",
        "size_bytes": 5000,
        "format": "markdown",
        "description": "Dataset documentation"
      },
      {
        "path": "schema.json",
        "size_bytes": 2500,
        "format": "json",
        "description": "Record schema definition"
      }
    ]
  }
}

Best Practices

1. Always include checksums - SHA-256 is preferred
2. Use consistent paths - Prefer flat structure or logical directories
3. Document all files - Include metadata files, READMEs, schemas
4. Verify totals - Ensure total_size_bytes matches sum of file sizes
5. Tag splits consistently - Use train/validation/test naming
6. Include size_bytes - Helps users estimate download time

Conditional Requirements

The schema applies a conditional rule for public downloads: when access.availability is public-download, the artifacts section must say where the files can be fetched, for example via base_uri:

{
  "access": {
    "availability": "public-download"
  },
  "artifacts": {
    "base_uri": "https://cdn.meetkai.ai/data/" // This satisfies the requirement
  }
}
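One plausible reading of that rule, expressed as a Python sketch (check_public_download is hypothetical and not the schema's authoritative validator):

def check_public_download(doc: dict) -> None:
    # Only public-download datasets are subject to the rule.
    if doc.get("access", {}).get("availability") != "public-download":
        return
    artifacts = doc.get("artifacts", {})
    files = artifacts.get("files", [])
    # Either a shared base_uri or a uri on every file satisfies it.
    if "base_uri" not in artifacts and not (files and all("uri" in f for f in files)):
        raise ValueError("public-download requires base_uri or a uri on every file")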