LLM Data Card v1.0
A machine-readable data card for documenting datasets used to train and evaluate large language models
{
  "schema_version": "llm-datacard/v1.0",
  "core": {
    "id": "my-dataset",
    "version": "1.0.0",
    "title": "My Dataset",
    "summary": "A well-documented dataset for LLM training",
    "maintainer": "Your Organization",
    "contact": "data@meetkai.ai"
  },
  "data": {
    "kind": "real",
    "modalities": ["text"],
    "languages": ["en", "ar"],
    "size": { "examples": 100000 },
    "domains": ["news", "social-media"],
    "record_format": "plain-text"
  },
  "rights": { ... },
  "provenance": { ... },
  "access": { ... }
}
Why DataPass?
A standard designed for real-world data submission and low-resource language stewardship
Rights Clarity
Clear licensing, commercial use status, and personal data characterization in a machine-readable format.
Provenance Tracking
Document source types, geography, collection windows, and data origins with structured fields.
Access Information
Standardized availability status with request instructions and artifact pointers.
Language Stewardship
BCP-47 tags, autonyms, scripts, writing direction, and code-switching metadata for low-resource languages.
Schema Validation
Strict JSON Schema with additionalProperties: false ensures consistent, predictable data cards.
Community Driven
PR-based contribution workflow for data cards, schemas, and translations.
Strict Schema for Reliable Data Cards
The LLM Data Card v1.0 schema uses additionalProperties: false throughout, ensuring every field is intentional and validated.
- Required sections: core, data, rights, provenance, access
- Optional sections: artifacts, processing, quality, synthetic, use, governance, safety, community, sources, stats, extensions
- Conditional rules: fields that become required depending on the values of other fields
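The effect of the required/optional split combined with additionalProperties: false can be illustrated with a small sketch. This is not the published schema or an official validator. The function name `check_top_level` is hypothetical, the check covers only the top level of a card, and treating `schema_version` as an allowed top-level key is an assumption based on the example card.

```python
# Section names are taken from the lists above.
REQUIRED_SECTIONS = {"core", "data", "rights", "provenance", "access"}
OPTIONAL_SECTIONS = {
    "artifacts", "processing", "quality", "synthetic", "use", "governance",
    "safety", "community", "sources", "stats", "extensions",
}
# Assumption: "schema_version" is permitted at the top level, as in the example card.
ALLOWED_KEYS = REQUIRED_SECTIONS | OPTIONAL_SECTIONS | {"schema_version"}


def check_top_level(card: dict) -> list[str]:
    """Return a list of problems; an empty list means the top level looks fine."""
    present = set(card)
    problems = []
    # Every required section must be present.
    for name in sorted(REQUIRED_SECTIONS - present):
        problems.append(f"missing required section: {name}")
    # Mimics additionalProperties: false by rejecting any unknown key.
    for name in sorted(present - ALLOWED_KEYS):
        problems.append(f"unexpected property: {name}")
    return problems
```

A card with a misspelled section name, such as `provenence`, would be reported twice under this sketch: once for the missing `provenance` section and once for the unexpected property.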
Conditional Validation Examples
- data.kind = "synthetic" requires synthetic.generation_method
- rights.contains_personal_data != "none" requires rights.consent_mechanism
- access.availability = "restricted" requires request_instructions or url
- data.structures = "translation-pairs" requires languages.length >= 2
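The four rules above can be sketched as plain checks on a parsed card. This is a minimal illustration under stated assumptions, not the schema's own implementation. The function name `check_conditional_rules` is hypothetical. It assumes that `request_instructions` and `url` live under `access`, that `languages` refers to `data.languages`, that `data.structures` may be either a string or a list, and that the personal-data rule applies only when `rights.contains_personal_data` is present.

```python
def check_conditional_rules(card: dict) -> list[str]:
    """Return the conditional rules a card violates (empty list if none)."""
    data = card.get("data", {})
    rights = card.get("rights", {})
    access = card.get("access", {})
    synthetic = card.get("synthetic", {})
    problems = []

    # Rule 1: synthetic data must state how it was generated.
    if data.get("kind") == "synthetic" and not synthetic.get("generation_method"):
        problems.append("synthetic.generation_method")

    # Rule 2: any personal data requires a consent mechanism.
    # Assumption: a missing contains_personal_data field does not trigger the rule.
    personal = rights.get("contains_personal_data")
    if personal is not None and personal != "none" and not rights.get("consent_mechanism"):
        problems.append("rights.consent_mechanism")

    # Rule 3: restricted access needs request instructions or a URL.
    # Assumption: both fields sit under "access".
    if access.get("availability") == "restricted" and not (
        access.get("request_instructions") or access.get("url")
    ):
        problems.append("access.request_instructions or access.url")

    # Rule 4: translation pairs need at least two languages.
    # Assumption: data.structures may be a single string or a list of strings.
    structures = data.get("structures", [])
    if isinstance(structures, str):
        structures = [structures]
    if "translation-pairs" in structures and len(data.get("languages", [])) < 2:
        problems.append("data.languages (at least 2)")

    return problems
```

Each returned entry names the field the card would need to add. In a real JSON Schema these dependencies are typically expressed with `if`/`then` keywords rather than hand-written code.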
Ready to document your dataset?
Use our tools to create, validate, and publish your data card.