Data Card Integration
LLM Training Data and LLM Data Card are designed to work together. This guide explains how to link them for complete dataset documentation.
Why Link Them?
| LLM Training Data | LLM Data Card |
|---|---|
| Data format specification | Dataset documentation |
| How to read the data | Rights, provenance, safety |
| Technical structure | Metadata for discovery |
Together, they provide:
- Complete documentation for auditors and users
- Machine-readable format for training pipelines
- Discoverability through the registry
Linking from Training Data to Data Card
External Reference
Point to a hosted Data Card:
{
  "schema_version": "llm-training-data/v1",
  "data_card": {
    "uri": "https://datapass.meetkai.ai/registry/my-dataset/1.0.0/card.json",
    "sha256": "abc123..."
  },
  "dataset": { ... },
  "files": [ ... ]
}
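The sha256 value has to be computed from the card before the reference is written. A minimal sketch, assuming the checksum is the hex-encoded SHA-256 of the card file's raw bytes (this guide does not state the exact rule, so treat that as an assumption). The helper names sha256_of_file and external_card_ref are hypothetical and not part of any published tooling:

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def external_card_ref(card_path: Path, hosted_uri: str) -> dict:
    """Build the "data_card" object for an external reference.

    Assumption: "sha256" is the digest of the card file exactly as hosted.
    """
    return {"uri": hosted_uri, "sha256": sha256_of_file(card_path)}
```

The returned dictionary can be assigned to the data_card key of metadata.json before the file is serialized.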
Inline Embedding
Embed the Data Card directly:
{
  "schema_version": "llm-training-data/v1",
  "data_card": {
    "inline": true,
    "core": {
      "id": "my-dataset",
      "version": "1.0.0",
      "title": "My Training Dataset",
      "summary": "A dataset for SFT training",
      "maintainer": "My Org",
      "contact": "data@meetkai.ai"
    },
    "data": {
      "kind": "synthetic",
      "modalities": ["text"],
      "languages": ["en"],
      "size": { "examples": 100000 },
      "domains": ["general"],
      "record_format": "llm-training-data"
    },
    "rights": {
      "license": "MIT",
      "allows_commercial_use": true,
      "contains_personal_data": "none"
    },
    "provenance": {
      "source_types": ["synthetic-generation"]
    },
    "access": {
      "availability": "public-download",
      "url": "https://datapass.meetkai.ai/registry/my-dataset/1.0.0"
    }
  },
  "dataset": { ... },
  "files": [ ... ]
}
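If a standalone card already exists, the inline form can be derived from it instead of being written by hand. A rough sketch, assuming only the five sections shown in the example above are carried over and that the card's own schema_version and artifacts are left out (the example omits them, but this guide does not say whether that is required). The name inline_card is hypothetical:

```python
# Section names taken from the inline example in this guide.
CARD_SECTIONS = ("core", "data", "rights", "provenance", "access")


def inline_card(card: dict) -> dict:
    """Return a "data_card" object that embeds the card's sections inline."""
    embedded = {"inline": True}
    for section in CARD_SECTIONS:
        if section in card:
            embedded[section] = card[section]
    return embedded
```

Sections missing from the source card are skipped rather than filled with defaults, so an incomplete card stays visibly incomplete after embedding.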
Linking from Data Card to Training Data
In the LLM Data Card, set data.record_format to "llm-training-data" and reference metadata.json under artifacts.training_data_ref:
{
  "schema_version": "llm-datacard/v1.0",
  "core": { ... },
  "data": {
    "kind": "synthetic",
    "modalities": ["text"],
    "languages": ["en"],
    "size": { "examples": 100000 },
    "domains": ["general"],
    "record_format": "llm-training-data"
  },
  "artifacts": {
    "training_data_ref": {
      "metadata_uri": "s3://bucket/my-dataset/metadata.json",
      "metadata_sha256": "def456...",
      "training_objective": "sft"
    },
    "files": [
      {
        "uri": "s3://bucket/my-dataset/metadata.json",
        "size_bytes": 1234
      }
    ]
  },
  "rights": { ... },
  "provenance": { ... },
  "access": { ... }
}
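The artifacts block repeats facts about metadata.json (its URI, checksum, and size), so generating it from the file avoids drift. A minimal sketch, assuming metadata_sha256 is the hex SHA-256 of the file's raw bytes and size_bytes is its length in bytes. The function name training_data_artifacts is hypothetical:

```python
import hashlib
from pathlib import Path


def training_data_artifacts(metadata_path: Path, metadata_uri: str, objective: str) -> dict:
    """Build the card's "artifacts" object from a local metadata.json.

    Assumption: the checksum and size describe the file exactly as it
    will be published at metadata_uri.
    """
    raw = metadata_path.read_bytes()
    return {
        "training_data_ref": {
            "metadata_uri": metadata_uri,
            "metadata_sha256": hashlib.sha256(raw).hexdigest(),
            "training_objective": objective,
        },
        "files": [{"uri": metadata_uri, "size_bytes": len(raw)}],
    }
```

Calling it with the local path, the published URI, and "sft" yields a structure with the same shape as the artifacts object in the example above.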
Complete Bundle Structure
A fully documented dataset might look like:
my-dataset/
  card.json          # LLM Data Card
  metadata.json      # LLM Training Data (references card.json)
  assets.jsonl       # Asset index
  data/
    sft-train-00000.jsonl.gz
    sft-train-00001.jsonl.gz
card.json:
{
  "schema_version": "llm-datacard/v1.0",
  "core": { "id": "my-dataset", ... },
  "data": { "record_format": "llm-training-data", ... },
  "artifacts": {
    "training_data_ref": {
      "metadata_uri": "./metadata.json",
      "training_objective": "sft"
    }
  },
  ...
}
metadata.json:
{
  "schema_version": "llm-training-data/v1",
  "data_card": { "uri": "./card.json" },
  "dataset": { "name": "my-dataset" },
  "files": [ ... ]
}
Validation
Validate both files:
- Validate card.json against LLM Data Card schema
- Validate metadata.json against LLM Training Data schema
- Verify cross-references resolve correctly
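The third step can be sketched for a co-located bundle as follows. This is an illustration under stated assumptions, not an official validator: it covers only the cross-reference check (schema validation is out of scope), it resolves relative URIs against the bundle directory, it skips remote URIs such as https:// or s3://, and it treats checksums as hex SHA-256 of the raw file bytes. The function names are hypothetical:

```python
import hashlib
import json
from pathlib import Path


def _check_link(owner: str, uri, expected_sha256, bundle_dir: Path) -> list[str]:
    """Check one reference; return a list of problems (empty if fine)."""
    if not uri:
        return [f"{owner}: missing reference URI"]
    if "://" in uri:
        return []  # remote URI: not checked in this sketch
    target = bundle_dir / uri
    if not target.is_file():
        return [f"{owner}: {uri} does not exist"]
    if expected_sha256:
        actual = hashlib.sha256(target.read_bytes()).hexdigest()
        if actual != expected_sha256:
            return [f"{owner}: checksum mismatch for {uri}"]
    return []


def check_cross_references(bundle_dir: Path) -> list[str]:
    """Return problems found in the card <-> metadata links of a bundle."""
    problems: list[str] = []
    card = json.loads((bundle_dir / "card.json").read_text(encoding="utf-8"))
    meta = json.loads((bundle_dir / "metadata.json").read_text(encoding="utf-8"))

    # Card -> training data
    if card.get("data", {}).get("record_format") != "llm-training-data":
        problems.append('card.json: data.record_format is not "llm-training-data"')
    ref = card.get("artifacts", {}).get("training_data_ref", {})
    problems += _check_link(
        "card.json", ref.get("metadata_uri"), ref.get("metadata_sha256"), bundle_dir
    )

    # Training data -> card (an inline card has nothing to resolve)
    data_card = meta.get("data_card", {})
    if not data_card.get("inline"):
        problems += _check_link(
            "metadata.json", data_card.get("uri"), data_card.get("sha256"), bundle_dir
        )
    return problems
```

An empty list means both links resolve. Two co-located files cannot each store a checksum of the other, because writing one checksum changes the file the other checksum describes, so at most one direction of the link can be checksummed.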
Best Practices
- Always link - Every training dataset should have a Data Card
- Keep in sync - Update both when making changes
- Use checksums - Verify integrity of references
- Co-locate files - Keep card.json and metadata.json together
- Version together - Release both as a unit