Changelog
This page documents all changes to the LLM Data Card schema, including new features, breaking changes, and migration guides.
Version 1.0 (Current)
Release Date: December 2024
Initial Release
The first stable release of the LLM Data Card schema, establishing a comprehensive structure for documenting datasets used to train and evaluate LLMs.
Core Sections (Required)
| Section | Purpose |
|---|---|
| `core` | Dataset identity, versioning, and contact information |
| `data` | Content description, modalities, languages, and size |
| `rights` | Licensing, commercial use, and personal data handling |
| `provenance` | Data origins, collection methods, and geography |
| `access` | Availability, URLs, and access instructions |
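
As a quick pre-check before full schema validation, a card can be tested for the five required sections listed above. This is a minimal sketch: `missingCoreSections` is a hypothetical helper, and it assumes each section is a top-level object on the card.

```javascript
// Required top-level sections, taken from the table above.
const REQUIRED_SECTIONS = ["core", "data", "rights", "provenance", "access"];

// Hypothetical helper: returns the names of required sections that are
// absent or are not plain objects. Assumes sections are top-level keys.
function missingCoreSections(card) {
  return REQUIRED_SECTIONS.filter((name) => {
    const section = card[name];
    return section === null || typeof section !== "object" || Array.isArray(section);
  });
}

const partialCard = { core: {}, data: {}, rights: {} };
console.log(missingCoreSections(partialCard)); // [ 'provenance', 'access' ]
```

A check like this only confirms that the sections exist; the fields inside each section still need to be validated against the schema itself.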
Optional Sections
| Section | Purpose |
|---|---|
| `synthetic` | Required when `data.kind` is `synthetic` or `hybrid` |
| `artifacts` | File information, checksums, and splits |
| `processing` | Data cleaning, filtering, and transformation steps |
| `quality` | Quality metrics and assessment methodology |
| `use` | Intended uses and out-of-scope applications |
| `governance` | Review status and documentation links |
| `safety` | Risk assessment and mitigations |
| `community` | Community engagement and contributions |
| `sources` | Detailed source documentation with relationships |
| `stats` | Statistical properties and distribution information |
| `extensions` | Custom metadata for organizations |
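
The `synthetic` row is the one optional section that becomes mandatory under a condition. A rough sketch of that rule in plain JavaScript follows; `syntheticSectionOk` is a hypothetical name, and the check only looks at whether the section is present, not at its contents.

```javascript
// Returns true when the card satisfies the rule that the `synthetic`
// section must be present if data.kind is "synthetic" or "hybrid".
// Hypothetical helper; it does not inspect the section's fields.
function syntheticSectionOk(card) {
  const kind = card.data?.kind;
  const needsSynthetic = kind === "synthetic" || kind === "hybrid";
  return !needsSynthetic || card.synthetic !== undefined;
}

console.log(syntheticSectionOk({ data: { kind: "hybrid" } })); // false
console.log(syntheticSectionOk({ data: { kind: "hybrid" }, synthetic: {} })); // true
```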
Data Section Features
- `primary_languages` - Specify which languages are primary vs. secondary in multilingual datasets
- `language_details` - Detailed per-language breakdown with example counts and shares
- `structures` - Document data structures like QA pairs, conversations, and translation pairs
- `task_types` - Specify intended ML tasks (classification, generation, etc.)
- `has_human_annotations` and `label_types` - Document human annotation details
Key Features
- JSON Schema Draft 2020-12 for modern validation
- BCP-47 language codes for standardized language identification
- SPDX license identifiers for unambiguous licensing
- Conditional requirements via if/then/else logic
- Extensibility through the `extensions` object
Conditional Validation Rules
The schema uses if/then rules to enforce requirements that depend on other fields:
- `structures` contains `translation-pairs`: requires at least 2 languages
- `record_format` is `other`: requires `record_format_notes`
- `has_human_annotations` is `true`: requires `label_types`
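
To make the three rules concrete, here is a hand-written version in plain JavaScript. The published schema expresses them with if/then; this sketch only re-implements the logic for illustration. It assumes all of the fields sit on the `data` section and that the language list is an array named `languages`; both are assumptions, not confirmed by the schema.

```javascript
// Hand-written version of the three conditional rules, for illustration only.
// Assumptions: every field lives on the `data` section, and the language
// list is an array named `languages` (hypothetical name).
function checkConditionalRules(data) {
  const errors = [];
  const structures = data.structures ?? [];
  const languages = data.languages ?? [];

  if (structures.includes("translation-pairs") && languages.length < 2) {
    errors.push("translation-pairs requires at least 2 languages");
  }
  if (data.record_format === "other" && !data.record_format_notes) {
    errors.push("record_format 'other' requires record_format_notes");
  }
  if (data.has_human_annotations === true && data.label_types === undefined) {
    errors.push("has_human_annotations requires label_types");
  }
  return errors;
}

const found = checkConditionalRules({
  structures: ["translation-pairs"],
  languages: ["en"],
  record_format: "other",
  has_human_annotations: false,
});
console.log(found.length); // 2
```

In practice a JSON Schema validator would report these failures directly, so a function like this is mainly useful for producing friendlier error messages.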
Versioning Policy
Semantic Versioning
The schema follows Semantic Versioning 2.0.0:
| Version Part | When Incremented | Example |
|---|---|---|
| MAJOR (X.0.0) | Breaking changes to required fields | Removing a required field |
| MINOR (0.X.0) | New optional fields, non-breaking changes | Adding the `processing` section |
| PATCH (0.0.X) | Documentation fixes, clarifications | Fixing typos in descriptions |
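
The table can be read as a small function from a kind of change to the next version number. The sketch below assumes three-part `MAJOR.MINOR.PATCH` strings; the change-kind labels (`breaking`, `optional-addition`, `docs-fix`) are hypothetical names chosen for this example.

```javascript
// Maps a kind of schema change to the next version, following the table
// above. The change-kind labels are hypothetical names for illustration.
function bumpVersion(version, changeKind) {
  const [major, minor, patch] = version.split(".").map(Number);
  switch (changeKind) {
    case "breaking": // e.g. removing a required field
      return `${major + 1}.0.0`;
    case "optional-addition": // e.g. adding an optional section
      return `${major}.${minor + 1}.0`;
    case "docs-fix": // e.g. fixing typos in descriptions
      return `${major}.${minor}.${patch + 1}`;
    default:
      throw new Error(`Unknown change kind: ${changeKind}`);
  }
}

console.log(bumpVersion("1.0.0", "optional-addition")); // 1.1.0
```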
Compatibility Guarantees
Forwards Compatible
Cards valid in minor version N remain valid in minor version N+1 of the same major version.
Graceful Degradation
New optional fields are ignored by older validators
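
One way to apply these two guarantees in tooling is to compare the card's version against the validator's. The sketch below assumes version strings always follow the `llm-datacard/vMAJOR.MINOR` pattern seen in `llm-datacard/v1.0`; the function names and the `v1.2` version in the usage line are hypothetical.

```javascript
// Parses strings such as "llm-datacard/v1.0". The pattern is inferred from
// the v1.0 identifier and is an assumption for any other version.
function parseSchemaVersion(value) {
  const match = /^llm-datacard\/v(\d+)\.(\d+)$/.exec(value);
  if (!match) throw new Error(`Unrecognized schema_version: ${value}`);
  return { major: Number(match[1]), minor: Number(match[2]) };
}

// Hypothetical helper:
//  "full"         - validator is the same or a newer minor version
//  "degraded"     - validator is older; newer optional fields are ignored
//  "incompatible" - major versions differ
function compatibility(validatorVersion, cardVersion) {
  const validator = parseSchemaVersion(validatorVersion);
  const card = parseSchemaVersion(cardVersion);
  if (validator.major !== card.major) return "incompatible";
  return validator.minor >= card.minor ? "full" : "degraded";
}

console.log(compatibility("llm-datacard/v1.0", "llm-datacard/v1.2")); // degraded
```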
Deprecation Process
When fields are deprecated:
- Field is marked deprecated in the schema with `deprecated: true`
- Validators emit warnings (not errors) for deprecated fields
- Documentation is updated with migration guidance
- After 2 minor versions, the field may be removed in the next major version
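
The warning step could look roughly like the following. This is a sketch, not a full JSON Schema traversal: it only walks nested `properties`, and the schema fragment and field names (`old_field`, `id`) are hypothetical.

```javascript
// Collects warnings for card fields whose schema entry carries
// `deprecated: true`. Only nested `properties` are walked.
function deprecationWarnings(schema, value, path = "") {
  const warnings = [];
  if (value === null || typeof value !== "object" || !schema.properties) {
    return warnings;
  }
  for (const [key, subschema] of Object.entries(schema.properties)) {
    if (!(key in value)) continue; // field not used by this card
    const fieldPath = path ? `${path}.${key}` : key;
    if (subschema.deprecated === true) {
      warnings.push(`Field "${fieldPath}" is deprecated`);
    }
    warnings.push(...deprecationWarnings(subschema, value[key], fieldPath));
  }
  return warnings;
}

// Hypothetical schema fragment: these field names are not from the real schema.
const exampleSchema = {
  properties: {
    core: { properties: { old_field: { deprecated: true }, id: {} } },
  },
};
const exampleCard = { core: { old_field: 1, id: "x" } };
console.log(deprecationWarnings(exampleSchema, exampleCard).join("\n"));
// Field "core.old_field" is deprecated
```

Because the function returns warnings instead of throwing, a card that uses deprecated fields still passes validation, which matches the warnings-not-errors behavior described above.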
Schema Validation
Version Detection
Always check schema_version before validation:
```javascript
// cardJson is assumed to hold the raw card as a JSON string
const card = JSON.parse(cardJson);
const version = card.schema_version;

if (version === "llm-datacard/v1.0") {
  // Use v1.0 schema
}
```
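
As more versions are released, the single `if` can grow into a lookup table that fails loudly on unknown versions. The sketch below assumes a hypothetical `SCHEMAS` registry; the schema object in it is a placeholder, where a real setup would load the published schema file.

```javascript
// Registry of known schema versions. The schema object is a placeholder;
// a real setup would load the published schema for each version.
const SCHEMAS = new Map([
  ["llm-datacard/v1.0", { $id: "placeholder-v1.0" }],
]);

// Hypothetical helper: parses a card and picks the schema matching its
// schema_version, throwing if the version is missing or unknown.
function selectSchema(cardJson) {
  const card = JSON.parse(cardJson);
  const version = card.schema_version;
  if (typeof version !== "string") {
    throw new Error("Card is missing schema_version");
  }
  const schema = SCHEMAS.get(version);
  if (!schema) {
    throw new Error(`Unsupported schema_version: ${version}`);
  }
  return { card, schema };
}

const { schema } = selectSchema('{"schema_version": "llm-datacard/v1.0"}');
console.log(schema.$id); // placeholder-v1.0
```

Rejecting unknown versions up front avoids validating a card against the wrong schema and reporting misleading errors.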
Planned Features
Under Consideration
- Lineage tracking - Document dataset derivation chains
- Benchmark results - Standardized performance metrics
- Update notifications - Webhook support for card changes
- Card signing - Cryptographic verification of card authenticity
- Federation - Cross-registry dataset discovery
How to Contribute
Have ideas for schema improvements?
- Check existing RFCs for similar proposals
- Open a discussion in the GitHub repository
- Submit a formal RFC if there’s community interest
Related Resources
- Schema Overview - Design principles and philosophy
- Validation Tool - Test your cards against any version
- Field Reference - Complete field documentation