Changelog

This page documents all changes to the LLM Data Card schema, including new features, breaking changes, and migration guides.

Version 1.0 (Current)

Release Date: December 2025

Initial Release

The first stable release of the LLM Data Card schema, establishing a comprehensive structure for documenting datasets used to train and evaluate LLMs.

Core Sections (Required)

Section	Purpose
`core`	Dataset identity, versioning, and contact information
`data`	Content description, modalities, languages, and size
`rights`	Licensing, commercial use, and personal data handling
`provenance`	Data origins, collection methods, and geography
`access`	Availability, URLs, and access instructions
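As a quick pre-check before full schema validation, the presence of the required sections listed above can be tested directly. This is a minimal sketch: `missingCoreSections` is a hypothetical helper name, and it assumes each section is a top-level key of the card object. It does not check the contents of any section.

```javascript
// Required sections, taken from the table above.
const CORE_SECTIONS = ["core", "data", "rights", "provenance", "access"];

// Hypothetical helper: returns the names of required sections that are
// absent from the card (assumes sections are top-level object keys).
function missingCoreSections(card) {
  return CORE_SECTIONS.filter(
    (name) => card[name] === undefined || card[name] === null
  );
}

// Example: a draft card that only has two of the five sections so far.
const draftCard = { core: {}, data: {} };
console.log(missingCoreSections(draftCard));
// prints [ 'rights', 'provenance', 'access' ]
```

An empty result only means the sections exist; the schema itself remains the authority on whether a card is valid.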

Optional Sections

Section	Purpose
`synthetic`	Required when `data.kind` is `synthetic` or `hybrid`
`artifacts`	File information, checksums, and splits
`processing`	Data cleaning, filtering, and transformation steps
`quality`	Quality metrics and assessment methodology
`use`	Intended uses and out-of-scope applications
`governance`	Review status and documentation links
`safety`	Risk assessment and mitigations
`community`	Community engagement and contributions
`sources`	Detailed source documentation with relationships
`stats`	Statistical properties and distribution information
`extensions`	Custom metadata for organizations
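The one conditional entry in this table, `synthetic`, can be illustrated with a small check. The sketch below is not the schema's own logic: `checkSyntheticSection` is a hypothetical helper, and it assumes `data.kind` and `synthetic` are read from the card exactly as they are named above.

```javascript
// Hypothetical helper: checks that `synthetic` is present when
// data.kind is "synthetic" or "hybrid".
// Returns an error message, or null when the rule is satisfied.
function checkSyntheticSection(card) {
  const kind = card.data && card.data.kind;
  const needsSynthetic = kind === "synthetic" || kind === "hybrid";
  if (needsSynthetic && card.synthetic === undefined) {
    return `data.kind is "${kind}" but the synthetic section is missing`;
  }
  return null;
}

// Example: a hybrid dataset without a synthetic section fails the check.
console.log(checkSyntheticSection({ data: { kind: "hybrid" } }) !== null);
// prints true
```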

Data Section Features

primary_languages - Specify which languages are primary vs. secondary in multilingual datasets
language_details - Detailed per-language breakdown with example counts and shares
structures - Document data structures like QA pairs, conversations, and translation pairs
task_types - Specify intended ML tasks (classification, generation, etc.)
has_human_annotations and label_types - Document human annotation details

Key Features

JSON Schema Draft 2020-12 for modern validation
BCP-47 language codes for standardized language identification
SPDX license identifiers for unambiguous licensing
Conditional requirements via if/then/else logic
Extensibility through the extensions object
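To illustrate the BCP-47 point, language tags can be checked for well-formedness before a card is submitted. The sketch below uses the standard `Intl.getCanonicalLocales` function, which throws a `RangeError` for structurally invalid tags. `isWellFormedLanguageTag` is a hypothetical helper name, and how the schema itself validates language codes is not shown on this page, so treat this as an independent pre-check.

```javascript
// Hypothetical helper: true when `tag` is a structurally valid BCP-47 tag.
// Intl.getCanonicalLocales throws a RangeError for malformed tags.
function isWellFormedLanguageTag(tag) {
  try {
    Intl.getCanonicalLocales(tag);
    return true;
  } catch (err) {
    if (err instanceof RangeError) return false;
    throw err; // anything else is unexpected, so surface it
  }
}

console.log(isWellFormedLanguageTag("pt-BR")); // prints true
console.log(isWellFormedLanguageTag("en_US")); // prints false
console.log(Intl.getCanonicalLocales("EN-us")); // prints [ 'en-US' ]
```

This checks structure only. A tag can be well-formed without its subtags being registered, so it does not guarantee that the tag names a real language.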

Conditional Validation Rules

The schema uses if/then rules to enforce dependencies between fields:

If structures contains translation-pairs, at least 2 languages are required
If record_format is other, record_format_notes is required
If has_human_annotations is true, label_types is required
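The three rules above can be mirrored in plain JavaScript to show what they mean in practice. This is a sketch under assumptions rather than the schema's implementation: `checkConditionalRules` is a hypothetical helper, all fields are assumed to live in the `data` section, and the language list is assumed to be an array named `data.languages`, a name that does not appear on this page.

```javascript
// Hypothetical helper mirroring the three rules above. Field locations
// (everything under `data`, plus a `data.languages` array) are assumptions.
function checkConditionalRules(data) {
  const errors = [];
  const structures = data.structures || [];
  const languages = data.languages || [];

  if (structures.includes("translation-pairs") && languages.length < 2) {
    errors.push("translation-pairs requires at least 2 languages");
  }
  if (data.record_format === "other" && data.record_format_notes === undefined) {
    errors.push("record_format 'other' requires record_format_notes");
  }
  if (data.has_human_annotations === true && data.label_types === undefined) {
    errors.push("has_human_annotations requires label_types");
  }
  return errors;
}

// Example: breaks the first two rules, satisfies the third.
const errors = checkConditionalRules({
  structures: ["translation-pairs"],
  languages: ["en"],
  record_format: "other",
  has_human_annotations: false,
});
console.log(errors.length); // prints 2
```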

Versioning Policy

Semantic Versioning

The schema follows Semantic Versioning 2.0.0:

Version Part	When Incremented	Example
MAJOR (X.0.0)	Breaking changes to required fields	Removing a required field
MINOR (0.X.0)	New optional fields, non-breaking changes	Adding `processing` section
PATCH (0.0.X)	Documentation fixes, clarifications	Fixing typos in descriptions
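The table can be read as a rule for classifying the difference between two version numbers. The sketch below assumes plain `X.Y.Z` strings with no pre-release or build suffixes; `bumpType` is a hypothetical helper name.

```javascript
// Hypothetical helper: classifies the change between two "X.Y.Z" versions
// by the most significant part that differs.
function bumpType(from, to) {
  const [fMajor, fMinor, fPatch] = from.split(".").map(Number);
  const [tMajor, tMinor, tPatch] = to.split(".").map(Number);
  if (tMajor !== fMajor) return "major";
  if (tMinor !== fMinor) return "minor";
  if (tPatch !== fPatch) return "patch";
  return "none";
}

console.log(bumpType("1.0.0", "1.1.0")); // prints minor
console.log(bumpType("1.4.2", "2.0.0")); // prints major
```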

Compatibility Guarantees

Backward Compatible

Cards valid in one minor version remain valid in every later minor version of the same major version

Graceful Degradation

New optional fields are ignored by older validators

Deprecation Process

When fields are deprecated:

The field is marked in the schema with deprecated: true
Validators emit warnings (not errors) for deprecated fields
Documentation is updated with migration guidance
After 2 minor versions, the field may be removed in the next major version
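The warning step in the process above could look roughly like the following. This is a sketch, not the behavior of any particular validator: `deprecationWarnings` is a hypothetical helper that only follows nested `properties` keywords, and the toy schema and its `old_contact` field are invented for illustration.

```javascript
// Hypothetical helper: walks `properties` in a schema and returns a warning
// for every field that is marked `deprecated: true` and present in the value.
function deprecationWarnings(schema, value, path = "") {
  const warnings = [];
  if (!schema || !schema.properties || typeof value !== "object" || value === null) {
    return warnings;
  }
  for (const [key, subschema] of Object.entries(schema.properties)) {
    if (!(key in value)) continue;
    const fieldPath = path ? `${path}.${key}` : key;
    if (subschema.deprecated === true) {
      warnings.push(`Field "${fieldPath}" is deprecated`);
    }
    // Recurse so that nested deprecated fields are reported too.
    warnings.push(...deprecationWarnings(subschema, value[key], fieldPath));
  }
  return warnings;
}

// Toy schema for illustration only; `old_contact` is not a real field.
const toySchema = {
  properties: {
    core: {
      properties: {
        old_contact: { type: "string", deprecated: true },
        name: { type: "string" },
      },
    },
  },
};
const toyCard = { core: { old_contact: "a@example.org", name: "demo" } };
console.log(deprecationWarnings(toySchema, toyCard));
// prints [ 'Field "core.old_contact" is deprecated' ]
```

Because the result is a list of warnings rather than a thrown error, a card that uses a deprecated field still passes validation, which matches the second step above.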

Schema Validation

Version Detection

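Always check schema_version before validating a card:

// cardJson holds the raw JSON text of a data card
const card = JSON.parse(cardJson);
const version = card.schema_version;

if (version === "llm-datacard/v1.0") {
  // Use v1.0 schema
} else {
  // Fail loudly rather than validating against the wrong schema
  throw new Error(`Unsupported schema_version: ${version}`);
}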
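When more versions exist, an exact string comparison can be replaced by parsing the version out of `schema_version`. The sketch below assumes the identifier always has the form `llm-datacard/v<major>.<minor>`, generalized from the single `llm-datacard/v1.0` value shown on this page. `parseSchemaVersion`, `canValidate`, and `SUPPORTED_MAJORS` are hypothetical names.

```javascript
// Hypothetical helper: parses "llm-datacard/v<major>.<minor>".
// Returns { major, minor }, or null when the string does not match.
function parseSchemaVersion(schemaVersion) {
  const match = /^llm-datacard\/v(\d+)\.(\d+)$/.exec(schemaVersion || "");
  if (!match) return null;
  return { major: Number(match[1]), minor: Number(match[2]) };
}

// Assumed set of major versions this validator knows how to handle.
const SUPPORTED_MAJORS = new Set([1]);

// True when the card declares a version with a supported major number.
function canValidate(card) {
  const parsed = parseSchemaVersion(card.schema_version);
  return parsed !== null && SUPPORTED_MAJORS.has(parsed.major);
}

console.log(canValidate({ schema_version: "llm-datacard/v1.0" })); // prints true
console.log(canValidate({ schema_version: "llm-datacard/v2.0" })); // prints false
console.log(canValidate({})); // prints false
```

Matching on the major version alone follows the compatibility guarantee that cards stay valid across minor versions within the same major version.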

Planned Features

Under Consideration

Lineage tracking - Document dataset derivation chains
Benchmark results - Standardized performance metrics
Update notifications - Webhook support for card changes
Card signing - Cryptographic verification of card authenticity
Federation - Cross-registry dataset discovery

How to Contribute

Have ideas for schema improvements?

Check existing RFCs for similar proposals
Open a discussion in the GitHub repository
Submit a formal RFC if there’s community interest

Schema Overview - Design principles and philosophy
Validation Tool - Test your cards against any version
Field Reference - Complete field documentation