Data Card Integration
LLM Training Data and LLM Data Card are designed to work together. This guide explains how to link them for complete dataset documentation.
Why Link Them?
| LLM Training Data | LLM Data Card |
|---|---|
| Data format specification | Dataset documentation |
| How to read the data | Rights, provenance, safety |
| Technical structure | Metadata for discovery |
Together, they provide:
- Complete documentation for auditors and users
- Machine-readable format for training pipelines
- Discoverability through the registry
Linking from Training Data to Data Card
External Reference
Point to a hosted Data Card:
{
  "schema_version": "llm-training-data/v1",
  "data_card": {
    "uri": "https://datapass.meetkai.ai/registry/my-dataset/1.0.0/card.json",
    "sha256": "abc123..."
  },
  "dataset": { ... },
  "files": [ ... ]
}
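The reference block can be generated rather than written by hand. The following is a minimal sketch, assuming the sha256 value is the lowercase hex SHA-256 digest of the card file's raw bytes (the guide does not state the exact encoding); the helper name data_card_ref is hypothetical.

```python
import hashlib


def data_card_ref(card_bytes: bytes, uri: str) -> dict:
    """Build the data_card block for metadata.json (hypothetical helper).

    Assumption: "sha256" holds the hex digest of the card file's raw bytes.
    """
    digest = hashlib.sha256(card_bytes).hexdigest()
    return {"uri": uri, "sha256": digest}
```

Hashing the raw bytes rather than the parsed JSON means any re-serialization of card.json changes the digest, so the reference should be regenerated whenever the card is rewritten.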
Inline Embedding
Embed the Data Card directly:
{
  "schema_version": "llm-training-data/v1",
  "data_card": {
    "inline": true,
    "core": {
      "id": "my-dataset",
      "version": "1.0.0",
      "title": "My Training Dataset",
      "summary": "A dataset for SFT training",
      "maintainer": "My Org",
      "contact": "data@meetkai.ai"
    },
    "data": {
      "kind": "synthetic",
      "modalities": ["text"],
      "languages": ["en"],
      "size": { "examples": 100000 },
      "domains": ["general"],
      "record_format": "llm-training-data"
    },
    "rights": {
      "license": "MIT",
      "allows_commercial_use": true,
      "contains_personal_data": "none"
    },
    "provenance": {
      "source_types": ["synthetic-generation"]
    },
    "access": {
      "availability": "public-download",
      "url": "https://datapass.meetkai.ai/registry/my-dataset/1.0.0"
    }
  },
  "dataset": { ... },
  "files": [ ... ]
}
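If a standalone card already exists, the inline block can be derived from it. This sketch assumes that only the five sections shown in the example above are embedded and that the card's own schema_version is dropped; neither rule is stated in this guide, and embed_card is a hypothetical helper.

```python
# Sections that appear in the inline example; whether other card sections
# may be embedded is not stated in the guide, so they are left out here.
INLINE_SECTIONS = ("core", "data", "rights", "provenance", "access")


def embed_card(metadata: dict, card: dict) -> dict:
    """Return a copy of metadata with the card embedded under data_card."""
    inline = {"inline": True}
    for section in INLINE_SECTIONS:
        if section in card:
            inline[section] = card[section]
    result = dict(metadata)  # shallow copy; the input dict is not modified
    result["data_card"] = inline
    return result
```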
Linking from Data Card to Training Data
In the LLM Data Card, set data.record_format to "llm-training-data" and reference metadata.json under artifacts.training_data_ref:
{
  "schema_version": "llm-datacard/v1.0",
  "core": { ... },
  "data": {
    "kind": "synthetic",
    "modalities": ["text"],
    "languages": ["en"],
    "size": { "examples": 100000 },
    "domains": ["general"],
    "record_format": "llm-training-data"
  },
  "artifacts": {
    "training_data_ref": {
      "metadata_uri": "s3://bucket/my-dataset/metadata.json",
      "metadata_sha256": "def456...",
      "training_objective": "sft"
    },
    "files": [
      {
        "uri": "s3://bucket/my-dataset/metadata.json",
        "size_bytes": 1234
      }
    ]
  },
  "rights": { ... },
  "provenance": { ... },
  "access": { ... }
}
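The card-side link can be checked once metadata.json has been fetched. The sketch below assumes metadata_sha256 is the hex SHA-256 digest of the raw metadata.json bytes and that fetching from metadata_uri is handled elsewhere; check_training_data_ref is a hypothetical helper, not part of either specification.

```python
import hashlib
import json


def check_training_data_ref(card: dict, metadata_bytes: bytes) -> list:
    """Return a list of problems found; an empty list means none."""
    problems = []
    if card.get("data", {}).get("record_format") != "llm-training-data":
        problems.append("data.record_format is not 'llm-training-data'")

    ref = card.get("artifacts", {}).get("training_data_ref")
    if ref is None:
        problems.append("artifacts.training_data_ref is missing")
        return problems

    # The checksum is absent in some examples, so only compare when present.
    expected = ref.get("metadata_sha256")
    if expected is not None:
        actual = hashlib.sha256(metadata_bytes).hexdigest()
        if actual != expected.lower():
            problems.append("metadata_sha256 does not match metadata.json")

    metadata = json.loads(metadata_bytes)
    version = str(metadata.get("schema_version", ""))
    if not version.startswith("llm-training-data/"):
        problems.append("metadata.json is not an LLM Training Data file")
    return problems
```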
Complete Bundle Structure
A fully documented dataset might look like:
my-dataset/
  card.json        # LLM Data Card
  metadata.json    # LLM Training Data (references card.json)
  assets.jsonl     # Asset index
  data/
    sft-train-00000.jsonl.gz
    sft-train-00001.jsonl.gz
card.json:
{
  "schema_version": "llm-datacard/v1.0",
  "core": { "id": "my-dataset", ... },
  "data": { "record_format": "llm-training-data", ... },
  "artifacts": {
    "training_data_ref": {
      "metadata_uri": "./metadata.json",
      "training_objective": "sft"
    }
  },
  ...
}
metadata.json:
{
  "schema_version": "llm-training-data/v1",
  "data_card": { "uri": "./card.json" },
  "dataset": { "name": "my-dataset" },
  "files": [ ... ]
}
Validation
Validate both files:
- Validate card.json against LLM Data Card schema
- Validate metadata.json against LLM Training Data schema
- Verify cross-references resolve correctly
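The cross-reference step for a co-located bundle could look roughly like the sketch below. It assumes relative URIs are resolved against the bundle directory and simply skips remote URIs (anything containing "://"); schema validation itself is omitted because the schema files are not part of this guide. The function name check_bundle is hypothetical.

```python
import json
from pathlib import Path


def check_bundle(bundle_dir: Path) -> list:
    """Check that card.json and metadata.json in bundle_dir point at each other."""
    problems = []
    card_path = bundle_dir / "card.json"
    meta_path = bundle_dir / "metadata.json"
    card = json.loads(card_path.read_text(encoding="utf-8"))
    meta = json.loads(meta_path.read_text(encoding="utf-8"))

    # card.json -> metadata.json
    ref = card.get("artifacts", {}).get("training_data_ref", {})
    meta_uri = ref.get("metadata_uri")
    if meta_uri is None:
        problems.append("card.json: training_data_ref.metadata_uri is missing")
    elif "://" not in meta_uri:
        if (bundle_dir / meta_uri).resolve() != meta_path.resolve():
            problems.append("card.json: metadata_uri does not resolve to metadata.json")

    # metadata.json -> card.json (an inline card has no URI to resolve)
    data_card = meta.get("data_card", {})
    if not data_card.get("inline"):
        card_uri = data_card.get("uri")
        if card_uri is None:
            problems.append("metadata.json: data_card.uri is missing")
        elif "://" not in card_uri:
            if (bundle_dir / card_uri).resolve() != card_path.resolve():
                problems.append("metadata.json: data_card.uri does not resolve to card.json")
    return problems
```

An empty list means both relative references resolve to the expected files; remote references would still need to be fetched and checked separately.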
Best Practices
- Always link - Every training dataset should have a Data Card
- Keep in sync - Update both when making changes
- Use checksums - Verify integrity of references
- Co-locate files - Keep card.json and metadata.json together
- Version together - Release both as a unit