# Files & Artifacts
This guide explains how to document the physical files that make up your dataset, including file listings, checksums for integrity verification, and train/test/validation splits.
## The Artifacts Section

The `artifacts` section describes the actual files users will download:
```json
{
  "artifacts": {
    "base_uri": "https://cdn.meetkai.ai/datasets/my-dataset/v1/",
    "total_size_bytes": 1073741824,
    "files": [
      {
        "path": "train.jsonl",
        "size_bytes": 800000000,
        "checksum": "sha256:a1b2c3d4..."
      }
    ]
  }
}
```
## Base URI

The `base_uri` provides a prefix for all file paths:
```json
{
  "artifacts": {
    "base_uri": "https://cdn.meetkai.ai/datasets/my-dataset/v1.0.0/"
  }
}
```
With this base URI, a file with path `train.jsonl` would be downloadable from:

```
https://cdn.meetkai.ai/datasets/my-dataset/v1.0.0/train.jsonl
```
### Alternative: Per-File URIs

If files are hosted in different locations, use `uri` on each file instead:
```json
{
  "artifacts": {
    "files": [
      {
        "path": "train.jsonl",
        "uri": "https://storage.meetkai.ai/train.jsonl"
      },
      {
        "path": "test.jsonl",
        "uri": "https://cdn.other.com/datasets/test.jsonl"
      }
    ]
  }
}
```
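A downloader consuming a record needs to handle both conventions: prefer a file's own `uri` and otherwise join `base_uri` with `path`. A minimal sketch in Python (the `resolve_url` helper is illustrative, not part of any published SDK):

```python
from urllib.parse import urljoin

def resolve_url(artifacts: dict, file_entry: dict) -> str:
    """Return the download URL for one file entry.

    A per-file "uri" wins; otherwise "path" is joined onto the
    artifacts-level "base_uri" (assumed to end with a slash).
    """
    if "uri" in file_entry:
        return file_entry["uri"]
    return urljoin(artifacts["base_uri"], file_entry["path"])

artifacts = {
    "base_uri": "https://cdn.meetkai.ai/datasets/my-dataset/v1.0.0/",
    "files": [{"path": "train.jsonl"}],
}
print(resolve_url(artifacts, artifacts["files"][0]))
# -> https://cdn.meetkai.ai/datasets/my-dataset/v1.0.0/train.jsonl
```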
## File Listings

### Basic File Entry

Each file in the `files` array can include:
| Field | Type | Description |
|---|---|---|
| `path` | string | File path relative to `base_uri` |
| `uri` | string | Full URL (alternative to `base_uri` + `path`) |
| `size_bytes` | integer | File size in bytes |
| `checksum` | string | Integrity hash (see below) |
| `format` | string | File format (`jsonl`, `parquet`, `csv`, etc.) |
| `split` | string | Dataset split (`train`, `test`, `validation`) |
| `language` | string | Language code if language-specific |
| `description` | string | Human-readable description |
### Complete File Example
```json
{
  "artifacts": {
    "base_uri": "https://cdn.meetkai.ai/datasets/my-dataset/",
    "files": [
      {
        "path": "data/train.parquet",
        "size_bytes": 524288000,
        "checksum": "sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
        "format": "parquet",
        "split": "train",
        "description": "Training data partition"
      }
    ]
  }
}
```
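`size_bytes` and `checksum` are tedious and error-prone to fill in by hand, so it is worth scripting them. A sketch (the `make_file_entry` helper is hypothetical):

```python
import hashlib
from pathlib import Path

def make_file_entry(path: str, **extra: str) -> dict:
    """Build a file entry with size and SHA-256 checksum read from disk."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Hash in chunks so large data files never sit fully in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return {
        "path": path,
        "size_bytes": Path(path).stat().st_size,
        "checksum": f"sha256:{digest.hexdigest()}",
        **extra,
    }

entry = make_file_entry("data/train.parquet", format="parquet", split="train")
```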
## Checksums
Checksums allow users to verify file integrity after download.
### Checksum Format

Use the format `algorithm:hash`:

```json
"checksum": "sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
```
### Supported Algorithms

| Algorithm | Use Case | Example |
|---|---|---|
| `sha256` | Recommended for most cases | `sha256:e3b0c44...` |
| `sha512` | Higher security requirements | `sha512:cf83e135...` |
| `md5` | Legacy compatibility only | `md5:d41d8cd9...` |
### Generating Checksums

#### Linux/macOS

```bash
sha256sum train.jsonl
# or
shasum -a 256 train.jsonl
```

#### Windows

```powershell
Get-FileHash train.jsonl -Algorithm SHA256
```
### Verifying Downloads

Users can verify downloaded files (note the two spaces between the hash and the filename, which the checksum-file format requires):

```bash
# Linux
echo "e3b0c44298fc1c14...  train.jsonl" | sha256sum -c

# macOS
echo "e3b0c44298fc1c14...  train.jsonl" | shasum -a 256 -c
```
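The `algorithm:hash` format also makes programmatic verification straightforward, since the prefix maps directly onto the standard hash constructors. A sketch (`verify_checksum` is illustrative, not a schema-provided tool):

```python
import hashlib

def verify_checksum(path: str, checksum: str) -> bool:
    """Check a downloaded file against an 'algorithm:hash' string."""
    algorithm, _, expected = checksum.partition(":")
    digest = hashlib.new(algorithm)  # accepts sha256, sha512, md5, ...
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected.lower()

ok = verify_checksum(
    "train.parquet",
    "sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
)
```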
## Dataset Splits
Document train/test/validation splits in two ways:
### 1. File-Level Splits
Tag individual files with their split:
```json
{
  "artifacts": {
    "files": [
      {
        "path": "train.jsonl",
        "split": "train",
        "size_bytes": 800000000
      },
      {
        "path": "validation.jsonl",
        "split": "validation",
        "size_bytes": 100000000
      },
      {
        "path": "test.jsonl",
        "split": "test",
        "size_bytes": 100000000
      }
    ]
  }
}
```
### 2. Splits Summary
Provide an overview of split sizes:
```json
{
  "artifacts": {
    "splits": {
      "train": {
        "examples": 80000,
        "size_bytes": 800000000
      },
      "validation": {
        "examples": 10000,
        "size_bytes": 100000000
      },
      "test": {
        "examples": 10000,
        "size_bytes": 100000000
      }
    }
  }
}
```
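If every file carries a `split` tag, the size portion of this summary can be derived rather than maintained by hand (example counts still have to come from the data itself). A sketch of that aggregation:

```python
from collections import defaultdict

def summarize_split_sizes(files: list[dict]) -> dict:
    """Sum per-file size_bytes into {split: {"size_bytes": total}}."""
    sizes: dict = defaultdict(int)
    for f in files:
        if "split" in f:
            sizes[f["split"]] += f.get("size_bytes", 0)
    return {split: {"size_bytes": total} for split, total in sizes.items()}
```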
### Split Fields

| Field | Type | Description |
|---|---|---|
| `examples` | integer | Number of examples in the split |
| `size_bytes` | integer | Total size of the split's files |
| `tokens` | integer | Token count (for text data) |
## Total Size
Track the total dataset size:
```json
{
  "artifacts": {
    "total_size_bytes": 1073741824
  }
}
```
This should equal the sum of all file sizes:

```
train.jsonl (800 MB) + validation.jsonl (100 MB) + test.jsonl (100 MB) + metadata.json (1 MB) = ~1 GB
```
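A quick consistency check before publishing catches drift between the declared total and the file listing. A minimal sketch, using the field names from the examples above:

```python
def check_total_size(artifacts: dict) -> None:
    """Raise if total_size_bytes disagrees with the sum of file sizes."""
    declared = artifacts["total_size_bytes"]
    actual = sum(f.get("size_bytes", 0) for f in artifacts["files"])
    if declared != actual:
        raise ValueError(
            f"total_size_bytes is {declared}, but files sum to {actual}"
        )
```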
## Archive Formats
For datasets distributed as archives:
```json
{
  "artifacts": {
    "files": [
      {
        "path": "dataset-v1.0.tar.gz",
        "size_bytes": 524288000,
        "checksum": "sha256:abc123...",
        "format": "tar.gz",
        "description": "Compressed archive containing all data files"
      }
    ],
    "archive_contents": ["train.jsonl", "validation.jsonl", "test.jsonl", "metadata.json"]
  }
}
```
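The `archive_contents` list can be checked against the archive itself before publishing. A sketch using Python's standard `tarfile` module (filenames taken from the example above):

```python
import tarfile

def check_archive_contents(archive_path: str, expected: list[str]) -> None:
    """Verify a .tar.gz archive contains every declared file."""
    with tarfile.open(archive_path, "r:gz") as tar:
        actual = {m.name for m in tar.getmembers() if m.isfile()}
    missing = set(expected) - actual
    if missing:
        raise ValueError(f"archive missing declared files: {sorted(missing)}")

check_archive_contents(
    "dataset-v1.0.tar.gz",
    ["train.jsonl", "validation.jsonl", "test.jsonl", "metadata.json"],
)
```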
## Multi-Part Files
For large datasets split across multiple files:
```json
{
  "artifacts": {
    "files": [
      {
        "path": "train-00000-of-00010.parquet",
        "split": "train",
        "size_bytes": 100000000,
        "checksum": "sha256:aaa..."
      },
      {
        "path": "train-00001-of-00010.parquet",
        "split": "train",
        "size_bytes": 100000000,
        "checksum": "sha256:bbb..."
      }
      // ... more parts
    ]
  }
}
```
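The `NNNNN-of-MMMMM` naming convention makes missing shards detectable. A sketch (the regex reflects the filenames shown above; it is not a rule the schema itself defines):

```python
import re

SHARD = re.compile(r"^(?P<stem>.+)-(?P<idx>\d{5})-of-(?P<total>\d{5})\.")

def check_shards(paths: list[str]) -> None:
    """Raise if any part of an NNNNN-of-MMMMM file set is missing."""
    groups: dict = {}
    for path in paths:
        m = SHARD.match(path)
        if m:
            key = (m["stem"], int(m["total"]))
            groups.setdefault(key, set()).add(int(m["idx"]))
    for (stem, total), seen in groups.items():
        missing = set(range(total)) - seen
        if missing:
            raise ValueError(f"{stem}: missing parts {sorted(missing)}")
```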
## Language-Specific Files
For multilingual datasets with separate files per language:
```json
{
  "artifacts": {
    "files": [
      {
        "path": "en/train.jsonl",
        "language": "en",
        "split": "train",
        "size_bytes": 500000000
      },
      {
        "path": "es/train.jsonl",
        "language": "es",
        "split": "train",
        "size_bytes": 400000000
      },
      {
        "path": "fr/train.jsonl",
        "language": "fr",
        "split": "train",
        "size_bytes": 300000000
      }
    ]
  }
}
```
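Consumers can then pull exactly the subset they need, e.g. one language's training data (a minimal filter sketch):

```python
def select_files(files: list[dict], language: str, split: str) -> list[dict]:
    """Pick file entries matching one language/split combination."""
    return [
        f for f in files
        if f.get("language") == language and f.get("split") == split
    ]
```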
## Complete Example

Here's a comprehensive `artifacts` section:
```json
{
  "artifacts": {
    "base_uri": "https://cdn.meetkai.ai/datasets/multilingual-qa/v1.0.0/",
    "total_size_bytes": 2147491148,
    "splits": {
      "train": {
        "examples": 200000,
        "size_bytes": 1717986918
      },
      "validation": {
        "examples": 25000,
        "size_bytes": 214748365
      },
      "test": {
        "examples": 25000,
        "size_bytes": 214748365
      }
    },
    "files": [
      {
        "path": "train.parquet",
        "size_bytes": 1717986918,
        "checksum": "sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
        "format": "parquet",
        "split": "train",
        "description": "Training partition with all languages"
      },
      {
        "path": "validation.parquet",
        "size_bytes": 214748365,
        "checksum": "sha256:d7a8fbb307d7809469ca9abcb0082e4f8d5651e46d3cdb762d02d0bf37c9e592",
        "format": "parquet",
        "split": "validation"
      },
      {
        "path": "test.parquet",
        "size_bytes": 214748365,
        "checksum": "sha256:9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
        "format": "parquet",
        "split": "test"
      },
      {
        "path": "README.md",
        "size_bytes": 5000,
        "format": "markdown",
        "description": "Dataset documentation"
      },
      {
        "path": "schema.json",
        "size_bytes": 2500,
        "format": "json",
        "description": "Record schema definition"
      }
    ]
  }
}
```
## Best Practices
1. **Always include checksums** - SHA-256 is preferred
2. **Use consistent paths** - Prefer a flat structure or logical directories
3. **Document all files** - Include metadata files, READMEs, schemas
4. **Verify totals** - Ensure `total_size_bytes` matches the sum of file sizes
5. **Tag splits consistently** - Use train/validation/test naming
6. **Include size_bytes** - Helps users estimate download time
## Conditional Requirements
The schema has a conditional rule for public downloads: when `access.availability` is `public-download`, the artifacts section must say where the files can be fetched:
```json
{
  "access": {
    "availability": "public-download"
  },
  "artifacts": {
    "base_uri": "https://cdn.meetkai.ai/data/" // This satisfies the requirement
  }
}
```
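A validator for this rule might look like the following sketch. It treats per-file `uri`s as an acceptable alternative to `base_uri`, which is an assumption based on the Per-File URIs section rather than a documented part of the rule:

```python
def check_public_download(record: dict) -> None:
    """Enforce: public-download datasets must say where their files live."""
    if record.get("access", {}).get("availability") != "public-download":
        return
    artifacts = record.get("artifacts", {})
    files = artifacts.get("files", [])
    has_base = "base_uri" in artifacts
    has_per_file = bool(files) and all("uri" in f for f in files)
    if not (has_base or has_per_file):
        raise ValueError(
            "public-download requires artifacts.base_uri or per-file uris"
        )
```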
## Related Documentation
- Artifacts Reference - All artifacts fields
- Access Reference - Availability options
- Minimal Example - Basic file documentation
- Conditional Rules - All if/then requirements