Changelog

This page documents all changes to the LLM Data Card schema, including new features, breaking changes, and migration guides.

Version 1.0 (Current)

Release Date: December 2024

Initial Release

The first stable release of the LLM Data Card schema, establishing a comprehensive structure for documenting datasets used to train and evaluate LLMs.

Core Sections (Required)

Section       Purpose
core          Dataset identity, versioning, and contact information
data          Content description, modalities, languages, and size
rights        Licensing, commercial use, and personal data handling
provenance    Data origins, collection methods, and geography
access        Availability, URLs, and access instructions
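
As a rough illustration of what "required" means for these five sections, the sketch below reports which of them are absent from a parsed card. The section names come from the table above; the helper name missingCoreSections and the assumption that each section is a JSON object are illustrative only, and a real validator would rely on the schema itself.

```javascript
// Names of the five required sections, taken from the table above.
const CORE_SECTIONS = ["core", "data", "rights", "provenance", "access"];

// Returns the required sections that are absent from a parsed card.
// Assumption: a section counts as present only if it is a plain object.
function missingCoreSections(card) {
  return CORE_SECTIONS.filter((name) => {
    const section = card[name];
    return (
      typeof section !== "object" || section === null || Array.isArray(section)
    );
  });
}

// Hypothetical card with only two sections filled in.
const partialCard = { core: {}, data: {} };
console.log(missingCoreSections(partialCard));
// rights, provenance, and access are reported as missing
```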

Optional Sections

Section       Purpose
synthetic     Required when data.kind is synthetic or hybrid
artifacts     File information, checksums, and splits
processing    Data cleaning, filtering, and transformation steps
quality       Quality metrics and assessment methodology
use           Intended uses and out-of-scope applications
governance    Review status and documentation links
safety        Risk assessment and mitigations
community     Community engagement and contributions
sources       Detailed source documentation with relationships
stats         Statistical properties and distribution information
extensions    Custom metadata for organizations
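
The synthetic section is the one optional section that becomes mandatory under a condition. A minimal sketch of that check, assuming the card has already been parsed into an object (the function name syntheticSectionOk is hypothetical, and the value "natural" in the usage lines is invented for illustration):

```javascript
// Returns true when the card satisfies the rule
// "synthetic is required when data.kind is synthetic or hybrid".
function syntheticSectionOk(card) {
  const kind = card.data ? card.data.kind : undefined;
  const needsSynthetic = kind === "synthetic" || kind === "hybrid";
  return !needsSynthetic || card.synthetic !== undefined;
}

// "natural" is a made-up kind value used only to show the non-triggering case.
console.log(syntheticSectionOk({ data: { kind: "natural" } })); // true
console.log(syntheticSectionOk({ data: { kind: "hybrid" } })); // false
console.log(syntheticSectionOk({ data: { kind: "hybrid" }, synthetic: {} })); // true
```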

Data Section Features

  • primary_languages - Specify which languages are primary vs. secondary in multilingual datasets
  • language_details - Detailed per-language breakdown with example counts and shares
  • structures - Document data structures like QA pairs, conversations, and translation pairs
  • task_types - Specify intended ML tasks (classification, generation, etc.)
  • has_human_annotations and label_types - Document human annotation details

Key Features

  • JSON Schema Draft 2020-12 for modern validation
  • BCP-47 language codes for standardized language identification
  • SPDX license identifiers for unambiguous licensing
  • Conditional requirements via if/then/else logic
  • Extensibility through the extensions object
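
To make the if/then/else item concrete, here is one way a conditional requirement could be written in JSON Schema Draft 2020-12 syntax, shown as a JavaScript object. It expresses "has_human_annotations is true requires label_types". This is a sketch of the technique, not a copy of the published schema, whose actual wording may differ.

```javascript
// Sketch of an if/then rule in JSON Schema Draft 2020-12 syntax.
// Assumption: the rule is attached to the object that holds both fields.
const annotationRule = {
  if: {
    properties: { has_human_annotations: { const: true } },
    // Without this, "if" would also match when the field is absent.
    required: ["has_human_annotations"],
  },
  then: { required: ["label_types"] },
};

console.log(JSON.stringify(annotationRule, null, 2));
```

The required keyword inside if matters: a properties check passes vacuously when the property is missing, so omitting it would make label_types mandatory on every card that leaves has_human_annotations out.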

Conditional Validation Rules

The schema uses if/then rules to enforce requirements that depend on other field values:

  • If structures contains translation-pairs, at least 2 languages are required
  • If record_format is other, record_format_notes is required
  • If has_human_annotations is true, label_types is required
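
The three rules above can be sketched as a plain function over the data section. Several details here are assumptions made for illustration: that all of these fields live in the data section, that the language list is an array named languages, and the function name checkConditionalRules. The schema's own if/then rules remain the authority.

```javascript
// Checks the three conditional rules against a parsed "data" section.
// Returns a list of human-readable problems (empty when all rules hold).
function checkConditionalRules(data) {
  const errors = [];
  const structures = data.structures || [];
  const languages = data.languages || []; // assumed field name

  if (structures.includes("translation-pairs") && languages.length < 2) {
    errors.push("translation-pairs requires at least 2 languages");
  }
  if (data.record_format === "other" && data.record_format_notes === undefined) {
    errors.push("record_format 'other' requires record_format_notes");
  }
  if (data.has_human_annotations === true && data.label_types === undefined) {
    errors.push("has_human_annotations requires label_types");
  }
  return errors;
}

// Hypothetical data section that breaks the first rule only.
console.log(
  checkConditionalRules({ structures: ["translation-pairs"], languages: ["en"] })
);
```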

Versioning Policy

Semantic Versioning

The schema follows Semantic Versioning 2.0.0:

Version Part     When Incremented                             Example
MAJOR (X.0.0)    Breaking changes to required fields          Removing a required field
MINOR (0.X.0)    New optional fields, non-breaking changes    Adding processing section
PATCH (0.0.X)    Documentation fixes, clarifications          Fixing typos in descriptions
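
The table can be read as a small decision procedure: classify the change, then bump the matching part and reset the parts to its right. The sketch below shows that arithmetic. The change labels in CHANGE_TO_PART and the function name nextVersion are hypothetical names chosen for this example.

```javascript
// Hypothetical change labels mapped to the version part from the table above.
const CHANGE_TO_PART = {
  "remove-required-field": "major",
  "add-optional-field": "minor",
  "fix-documentation": "patch",
};

// Bumps one part of a MAJOR.MINOR.PATCH string and resets the lower parts.
function nextVersion(version, part) {
  const [major, minor, patch] = version.split(".").map(Number);
  switch (part) {
    case "major":
      return `${major + 1}.0.0`;
    case "minor":
      return `${major}.${minor + 1}.0`;
    case "patch":
      return `${major}.${minor}.${patch + 1}`;
    default:
      throw new Error(`Unknown version part: ${part}`);
  }
}

console.log(nextVersion("1.0.0", CHANGE_TO_PART["add-optional-field"])); // 1.1.0
```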

Compatibility Guarantees

Forwards Compatible

Cards valid in version N remain valid in version N+1 (within the same major version)

Graceful Degradation

New optional fields are ignored by older validators
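
The first guarantee can be expressed as a version comparison: a schema should accept a card written for the same major version and an equal or lower minor version. The sketch below assumes identifiers follow the llm-datacard/vMAJOR.MINOR pattern seen in schema_version; identifiers other than llm-datacard/v1.0 are hypothetical, as are the function names.

```javascript
// Parses an identifier such as "llm-datacard/v1.0" into numeric parts.
// Returns null when the string does not match the assumed pattern.
function parseSchemaVersion(id) {
  const match = /^llm-datacard\/v(\d+)\.(\d+)$/.exec(id);
  if (!match) return null;
  return { major: Number(match[1]), minor: Number(match[2]) };
}

// True when a schema at schemaVersionId is expected to accept a card
// written for cardVersionId: same major, card minor not newer.
function canValidate(cardVersionId, schemaVersionId) {
  const card = parseSchemaVersion(cardVersionId);
  const schema = parseSchemaVersion(schemaVersionId);
  if (!card || !schema) return false;
  return card.major === schema.major && card.minor <= schema.minor;
}

// "v1.1" and "v2.0" are hypothetical future versions.
console.log(canValidate("llm-datacard/v1.0", "llm-datacard/v1.1")); // true
console.log(canValidate("llm-datacard/v1.0", "llm-datacard/v2.0")); // false
```

This covers only the first guarantee. Under graceful degradation, a card written for a newer minor version may still pass an older validator, so a false result here means "not guaranteed" rather than "invalid".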

Deprecation Process

When fields are deprecated:

  1. The field is marked deprecated in the schema with deprecated: true
  2. Validators emit warnings (not errors) for deprecated fields
  3. Documentation is updated with migration guidance
  4. After 2 minor versions, the field may be removed in the next major version
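
Step 2 of the process above can be sketched as a walk over the schema's properties that collects a warning for every deprecated field the card actually uses. The schema fragment and the field name old_field are invented for illustration, and the walker only follows nested properties objects, which is a simplification of full JSON Schema traversal.

```javascript
// Hypothetical schema fragment; "old_field" is invented for illustration.
const exampleSchema = {
  properties: {
    name: { type: "string" },
    old_field: { type: "string", deprecated: true },
  },
};

// Collects warnings for deprecated fields that are present in the instance.
function deprecationWarnings(schema, instance, path = "") {
  const warnings = [];
  const props = schema.properties || {};
  for (const [key, sub] of Object.entries(props)) {
    if (!(key in instance)) continue;
    const here = path ? `${path}.${key}` : key;
    if (sub.deprecated === true) {
      warnings.push(`Field "${here}" is deprecated`);
    }
    const value = instance[key];
    if (value !== null && typeof value === "object" && !Array.isArray(value)) {
      // Recurse into nested objects described by the sub-schema.
      warnings.push(...deprecationWarnings(sub, value, here));
    }
  }
  return warnings;
}

// The card stays valid; the caller only receives warnings.
console.log(deprecationWarnings(exampleSchema, { name: "demo", old_field: "x" }));
```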

Schema Validation

Version Detection

Always check schema_version before validation:

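// cardJson holds the raw JSON text of the data card
const card = JSON.parse(cardJson);
const version = card.schema_version;

if (version === "llm-datacard/v1.0") {
  // Use v1.0 schema
} else {
  throw new Error(`Unsupported schema_version: ${version}`);
}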
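
Once more than one schema version exists, a lookup table tends to scale better than a chain of if statements. The sketch below assumes a Map from schema_version to a validator function. The validator registered for v1.0 is a stand-in that only checks for a core object, not a real schema validation, and the names VALIDATORS and validateByVersion are hypothetical.

```javascript
// Maps each supported schema_version to a validator function.
// The v1.0 entry is a placeholder check, not the real schema.
const VALIDATORS = new Map([
  [
    "llm-datacard/v1.0",
    (card) => typeof card.core === "object" && card.core !== null,
  ],
]);

// Parses the card, picks the validator for its version, and runs it.
function validateByVersion(cardJson) {
  const card = JSON.parse(cardJson);
  const version = card?.schema_version;
  const validate = VALIDATORS.get(version);
  if (validate === undefined) {
    throw new Error(`Unsupported schema_version: ${String(version)}`);
  }
  return validate(card);
}

console.log(
  validateByVersion('{"schema_version": "llm-datacard/v1.0", "core": {}}')
); // true
```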

Planned Features

Under Consideration

  • Lineage tracking - Document dataset derivation chains
  • Benchmark results - Standardized performance metrics
  • Update notifications - Webhook support for card changes
  • Card signing - Cryptographic verification of card authenticity
  • Federation - Cross-registry dataset discovery

How to Contribute

Have ideas for schema improvements?

  1. Check existing RFCs for similar proposals
  2. Open a discussion in the GitHub repository
  3. Submit a formal RFC if there’s community interest