Schema Overview
The LLM Data Card schema provides a standardized way to document datasets used for training, evaluating, and fine-tuning large language models. This page explains the design principles and structure of the schema.
Design Principles
1. Comprehensive Yet Minimal
The schema balances thoroughness with usability. Required fields capture essential information, while optional sections allow detailed documentation when needed.
Required Sections
- core - Identity and contact
- data - Content description
- rights - Licensing
- provenance - Origins
- access - How to get it
Optional Sections
- synthetic - AI generation details
- artifacts - File information
- use - Intended applications
- governance - Review status
- safety - Risk assessment
2. Machine-Readable
Every field has a defined type and constraints. This enables:
- Automated validation - Check cards against the schema
- Programmatic access - Build tools that consume data cards
- Registry search - Filter datasets by specific criteria
- Compliance checking - Verify licensing and privacy requirements
3. Conditional Requirements
Some fields become required based on other values. For example:
- If data.kind is synthetic or hybrid, the synthetic section is required
- If rights.contains_personal_data is not none, consent_mechanism is required
- If access.availability is on-request, request_instructions is required
This ensures cards contain appropriate detail without burdening simple datasets.
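The three rules above can be sketched as a plain function. This is an illustration, not the schema's own logic: `checkConditionalRules` is a hypothetical helper, and placing consent_mechanism under rights and request_instructions under access is an assumption based on the field names.

```javascript
// Sketch only. Assumed (not confirmed by the schema): consent_mechanism
// lives under rights, and request_instructions lives under access.
function checkConditionalRules(card) {
  const problems = [];

  // Rule 1: synthetic or hybrid data needs the synthetic section.
  const kind = card.data?.kind;
  if ((kind === "synthetic" || kind === "hybrid") && card.synthetic === undefined) {
    problems.push(`synthetic section is required when data.kind is ${kind}`);
  }

  // Rule 2: any personal-data value other than "none" needs a consent mechanism.
  const personal = card.rights?.contains_personal_data;
  if (personal !== undefined && personal !== "none" &&
      card.rights.consent_mechanism === undefined) {
    problems.push("consent_mechanism is required when personal data is present");
  }

  // Rule 3: on-request access needs instructions for making the request.
  if (card.access?.availability === "on-request" &&
      card.access.request_instructions === undefined) {
    problems.push("request_instructions is required when availability is on-request");
  }

  return problems;
}

const problems = checkConditionalRules({
  data: { kind: "hybrid" },
  rights: { contains_personal_data: "none" },
  access: { availability: "on-request" },
});
console.log(problems); // two problems: synthetic section, request_instructions
```

A card with no triggering values returns an empty array, which is how simple datasets avoid the extra fields.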
4. Standards-Based
The schema uses established standards where possible:
| Data | Standard |
|---|---|
| Languages | BCP-47 language tags |
| Licenses | SPDX identifiers |
| Dates | ISO 8601 format |
| Schema | JSON Schema Draft 2020-12 |
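Two of these standards can be spot-checked with built-in JavaScript features. The helpers below are hypothetical and make two assumptions: dates use the calendar form YYYY-MM-DD (ISO 8601 permits other forms), and a structural check of language tags is enough. SPDX identifiers are omitted because checking them needs the SPDX license list.

```javascript
// BCP-47: Intl.getCanonicalLocales throws a RangeError for malformed tags.
// It checks structure only, so a well-formed but unregistered tag still passes.
function isWellFormedLanguageTag(tag) {
  if (typeof tag !== "string") return false;
  try {
    Intl.getCanonicalLocales(tag);
    return true;
  } catch (err) {
    return false;
  }
}

// ISO 8601 calendar date (assumed form: YYYY-MM-DD).
// The round trip through Date rejects impossible days such as February 30.
function isIsoCalendarDate(value) {
  const match = /^(\d{4})-(\d{2})-(\d{2})$/.exec(value);
  if (!match) return false;
  const year = Number(match[1]);
  const month = Number(match[2]);
  const day = Number(match[3]);
  const date = new Date(Date.UTC(year, month - 1, day));
  return (
    date.getUTCFullYear() === year &&
    date.getUTCMonth() === month - 1 &&
    date.getUTCDate() === day
  );
}

console.log(isWellFormedLanguageTag("en-US")); // true
console.log(isWellFormedLanguageTag("en_US")); // false
console.log(isIsoCalendarDate("2024-02-29")); // true
console.log(isIsoCalendarDate("2024-02-30")); // false
```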
5. Extensible
The extensions object allows custom metadata without breaking schema validation:
{
  "extensions": {
    "my-org": {
      "internal_id": "DS-2024-001",
      "review_committee": "data-governance"
    }
  }
}
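When tooling writes into extensions, it helps to touch only one namespace and leave the rest of the card alone. The helper below is a hypothetical sketch of that idea and is not part of the schema or its tooling.

```javascript
// Returns a copy of the card with `metadata` merged into one extension
// namespace. Other namespaces and the original card are left unchanged.
function withExtension(card, namespace, metadata) {
  return {
    ...card,
    extensions: {
      ...(card.extensions ?? {}),
      [namespace]: {
        ...(card.extensions?.[namespace] ?? {}),
        ...metadata,
      },
    },
  };
}

const card = { extensions: { "my-org": { internal_id: "DS-2024-001" } } };
const updated = withExtension(card, "my-org", { review_committee: "data-governance" });
console.log(updated.extensions["my-org"]);
// { internal_id: 'DS-2024-001', review_committee: 'data-governance' }
```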
Schema Structure
Section Hierarchy
DataCard
├── schema_version (required)
├── core (required)
├── data (required)
├── rights (required)
├── provenance (required)
├── access (required)
├── synthetic (conditional)
├── artifacts (optional)
├── processing (optional)
├── quality (optional)
├── use (optional)
├── governance (optional)
├── safety (optional)
├── community (optional)
├── sources (optional)
├── stats (optional)
└── extensions (optional)
Required vs. Conditional
| Section | Status | Trigger |
|---|---|---|
| core | Required | Always |
| data | Required | Always |
| rights | Required | Always |
| provenance | Required | Always |
| access | Required | Always |
| synthetic | Conditional | data.kind is synthetic or hybrid |
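A quick structural check can follow directly from the section hierarchy. The sketch below uses the section names listed above; `checkTopLevel` is a hypothetical helper, and it only reports unrecognized keys because this page does not say whether the schema rejects them.

```javascript
// Section names copied from the hierarchy on this page.
const REQUIRED_SECTIONS = ["schema_version", "core", "data", "rights", "provenance", "access"];
const OTHER_SECTIONS = [
  "synthetic", "artifacts", "processing", "quality", "use", "governance",
  "safety", "community", "sources", "stats", "extensions",
];
const KNOWN_SECTIONS = new Set([...REQUIRED_SECTIONS, ...OTHER_SECTIONS]);

// Reports required sections that are absent and top-level keys that the
// hierarchy does not list. It does not inspect the contents of any section.
function checkTopLevel(card) {
  return {
    missing: REQUIRED_SECTIONS.filter((key) => !(key in card)),
    unknown: Object.keys(card).filter((key) => !KNOWN_SECTIONS.has(key)),
  };
}

console.log(checkTopLevel({ schema_version: "llm-datacard/v1.0", core: {}, notes: "draft" }));
// { missing: [ 'data', 'rights', 'provenance', 'access' ], unknown: [ 'notes' ] }
```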
JSON Schema Specification
The schema uses JSON Schema Draft 2020-12 with these features:
Type System
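Each field declares one of the JSON Schema types below. The property names are placeholders, not fields from the schema:
{
  "properties": {
    "a_string": { "type": "string" },
    "a_number": { "type": "number" },
    "a_boolean": { "type": "boolean" },
    "an_array": { "type": "array" },
    "an_object": { "type": "object" }
  }
}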
Enumerations
Many fields use enums to restrict values:
{
  "data.kind": {
    "enum": ["real", "synthetic", "hybrid"]
  },
  "access.availability": {
    "enum": ["public-download", "restricted", "on-request", "not-available"]
  }
}
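For tools that want a friendly message before running full validation, the two enums above can be checked by hand. This is a minimal sketch with hypothetical helper names; it covers only the two fields shown here and skips fields that are absent.

```javascript
// Allowed values copied from the enums shown on this page.
const ENUMS = {
  "data.kind": ["real", "synthetic", "hybrid"],
  "access.availability": ["public-download", "restricted", "on-request", "not-available"],
};

// Follows a dotted path such as "data.kind"; returns undefined if any step is missing.
function getPath(obj, path) {
  return path.split(".").reduce((value, key) => (value == null ? undefined : value[key]), obj);
}

// Returns one message per field whose value is present but not allowed.
function checkEnums(card) {
  const errors = [];
  for (const [path, allowed] of Object.entries(ENUMS)) {
    const value = getPath(card, path);
    if (value !== undefined && !allowed.includes(value)) {
      errors.push(`${path} must be one of: ${allowed.join(", ")}`);
    }
  }
  return errors;
}

console.log(checkEnums({ data: { kind: "scraped" } }));
// [ 'data.kind must be one of: real, synthetic, hybrid' ]
```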
Conditional Logic
The schema uses if/then/else for conditional requirements:
{
  "if": {
    "properties": {
      "data": {
        "properties": {
          "kind": { "enum": ["synthetic", "hybrid"] }
        }
      }
    }
  },
  "then": {
    "required": ["synthetic"]
  }
}
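To see how this rule evaluates, the sketch below implements only the keywords it uses (properties, enum, required, if/then). It is a teaching aid with a hypothetical `matches` function, not a JSON Schema implementation; use a real validator in practice. One detail it makes visible: properties constrains only keys that are present, so the if branch also matches a card that has no data.kind at all.

```javascript
function isObject(value) {
  return value !== null && typeof value === "object" && !Array.isArray(value);
}

// Evaluates a tiny subset of JSON Schema: enum, required, properties, if/then.
function matches(schema, value) {
  if (schema.enum && !schema.enum.includes(value)) return false;
  if (schema.required && isObject(value)) {
    if (!schema.required.every((key) => key in value)) return false;
  }
  if (schema.properties && isObject(value)) {
    for (const [key, subschema] of Object.entries(schema.properties)) {
      // Absent keys are skipped: "properties" never requires presence.
      if (key in value && !matches(subschema, value[key])) return false;
    }
  }
  if (schema.if && schema.then && matches(schema.if, value)) {
    if (!matches(schema.then, value)) return false;
  }
  return true;
}

// The conditional rule from this page.
const rule = {
  if: { properties: { data: { properties: { kind: { enum: ["synthetic", "hybrid"] } } } } },
  then: { required: ["synthetic"] },
};

console.log(matches(rule, { data: { kind: "real" } })); // true
console.log(matches(rule, { data: { kind: "synthetic" } })); // false
console.log(matches(rule, { data: { kind: "hybrid" }, synthetic: {} })); // true
console.log(matches(rule, {})); // false: the "if" matched vacuously
```

In the last case the card would already fail because data is a required section, so the vacuous match only adds a second error about the synthetic section.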
Versioning
Semantic Versioning
The schema follows Semantic Versioning:
- MAJOR (1.x.x) - Breaking changes to required fields
- MINOR (x.1.x) - New optional fields, non-breaking changes
- PATCH (x.x.1) - Documentation fixes, clarifications
Current Version
The current schema version is 1.0. Specify it in your data card:
{
"schema_version": "llm-datacard/v1.0"
}
Backwards Compatibility
- New optional fields don’t break existing cards
- Required field changes are versioned
- Deprecated fields emit warnings, not errors
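A consumer can use the schema_version string to decide whether it understands a card. The sketch below is hypothetical: the optional patch component and the compatibility policy (same MAJOR, card MINOR no newer than the tool's) are assumptions drawn from the versioning rules above, not documented behavior.

```javascript
// Parses strings like "llm-datacard/v1.0". A third (patch) component is
// accepted as an assumption; this page only shows the two-part form.
function parseSchemaVersion(value) {
  const match = /^llm-datacard\/v(\d+)\.(\d+)(?:\.(\d+))?$/.exec(value);
  if (!match) return null;
  return {
    major: Number(match[1]),
    minor: Number(match[2]),
    patch: Number(match[3] ?? 0),
  };
}

// Assumed policy: a tool can read a card when MAJOR matches and the card's
// MINOR is not newer than the tool's. PATCH is ignored.
function canRead(cardVersion, toolVersion) {
  const card = parseSchemaVersion(cardVersion);
  const tool = parseSchemaVersion(toolVersion);
  if (!card || !tool) return false;
  return card.major === tool.major && card.minor <= tool.minor;
}

console.log(parseSchemaVersion("llm-datacard/v1.0")); // { major: 1, minor: 0, patch: 0 }
console.log(canRead("llm-datacard/v1.0", "llm-datacard/v1.2")); // true
console.log(canRead("llm-datacard/v2.0", "llm-datacard/v1.2")); // false
```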
Validation
Using the Validator
Validate your data card against the schema:
- Go to the Validator tool
- Paste or upload your JSON
- Review any errors or warnings
Programmatic Validation
Use Ajv for JSON Schema Draft 2020-12:
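import Ajv from "ajv/dist/2020";
import addFormats from "ajv-formats";
// Assumes `schema` (the data card JSON Schema) and `dataCard` are already parsed objects.
const ajv = new Ajv({ allErrors: true }); // report every error, not only the first
addFormats(ajv); // adds support for "format" keywords such as date
const validate = ajv.compile(schema);
const valid = validate(dataCard);
if (!valid) console.error(validate.errors); // array of Ajv error objects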
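After a failed validation, Ajv exposes the failures on `validate.errors` as an array of objects with fields such as `instancePath` and `message`. The helper below is a hypothetical sketch that turns that array into one readable line per error; the sample error objects are hand-written for illustration, not captured validator output.

```javascript
// Formats Ajv-style error objects as "<location>: <message>" lines.
// An empty instancePath means the error applies to the card as a whole.
function formatErrors(errors) {
  return (errors ?? []).map((err) => {
    const where = err.instancePath === "" ? "(root)" : err.instancePath;
    return `${where}: ${err.message}`;
  });
}

// Hand-written sample in the shape Ajv uses.
const sampleErrors = [
  { instancePath: "", message: "must have required property 'synthetic'" },
  { instancePath: "/data/kind", message: "must be equal to one of the allowed values" },
];
console.log(formatErrors(sampleErrors).join("\n"));
```

Passing `validate.errors` directly also works when validation succeeded, since a null value yields an empty list.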
Next Steps
- Field Reference - Detailed field documentation
- Examples - Complete data card examples
- Conditional Rules - All if/then requirements
- Changelog - Version history