Schema Overview

The LLM Data Card schema provides a standardized way to document datasets used for training, evaluating, and fine-tuning large language models. This page explains the design principles and structure of the schema.

Design Principles

1. Comprehensive Yet Minimal

The schema balances thoroughness with usability. Required fields capture essential information, while optional sections allow detailed documentation when needed.

Required Sections

  • core - Identity and contact
  • data - Content description
  • rights - Licensing
  • provenance - Origins
  • access - How to get it

Optional Sections

  • synthetic - AI generation details
  • artifacts - File information
  • use - Intended applications
  • governance - Review status
  • safety - Risk assessment
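
A card for a simple, fully public dataset can therefore stay small. The sketch below shows the overall shape using only the required sections; the section contents are placeholders, not the actual field names:

const minimalCard = {
  schema_version: "llm-datacard/v1.0",
  core: {},        // identity and contact
  data: {},        // content description
  rights: {},      // licensing
  provenance: {},  // origins
  access: {}       // how to get it
};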

2. Machine-Readable

Every field has a defined type and constraints. This enables:

  • Automated validation - Check cards against the schema
  • Programmatic access - Build tools that consume data cards
  • Registry search - Filter datasets by specific criteria (see the sketch after this list)
  • Compliance checking - Verify licensing and privacy requirements
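
For example, a registry tool could filter an index of cards by availability. This is a minimal sketch; the CardIndexEntry shape is assumed for illustration and is not part of the schema tooling:

// Hypothetical index entry holding just the field needed for this filter.
interface CardIndexEntry {
  access: { availability: string };
}

// Keep only datasets that can be downloaded without a request process.
function publicDownloads(cards: CardIndexEntry[]): CardIndexEntry[] {
  return cards.filter((card) => card.access.availability === "public-download");
}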

3. Conditional Requirements

Some fields become required based on other values. For example:

  • If data.kind is synthetic or hybrid, the synthetic section is required
  • If rights.contains_personal_data is not none, consent_mechanism is required
  • If access.availability is on-request, request_instructions is required

This ensures cards contain appropriate detail without burdening simple datasets.
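
As an example of the first rule, a card fragment like the one below would only validate with a synthetic section present (its contents are omitted here):

// data.kind of "synthetic" (or "hybrid") makes the synthetic section mandatory.
const syntheticCard = {
  data: { kind: "synthetic" },
  synthetic: { /* AI generation details */ }
};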

4. Standards-Based

The schema uses established standards where possible:

Data          Standard
Languages     BCP-47 language tags
Licenses      SPDX identifiers
Dates         ISO 8601 format
Schema        JSON Schema Draft 2020-12
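
Put together, a card fragment might look like the sketch below; the property names languages, license, and created are illustrative assumptions, so check the schema for the exact names:

// Property names here are placeholders; only the value formats follow the standards above.
const standardsExample = {
  data: { languages: ["en", "de-CH"] },    // BCP-47 language tags
  rights: { license: "CC-BY-4.0" },        // SPDX identifier
  provenance: { created: "2024-05-14" }    // ISO 8601 date
};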

5. Extensible

The extensions object allows custom metadata without breaking schema validation:

{
  "extensions": {
    "my-org": {
      "internal_id": "DS-2024-001",
      "review_committee": "data-governance"
    }
  }
}

Schema Structure

Section Hierarchy

DataCard
├── schema_version    (required)
├── core              (required)
├── data              (required)
├── rights            (required)
├── provenance        (required)
├── access            (required)
├── synthetic         (conditional)
├── artifacts         (optional)
├── processing        (optional)
├── quality           (optional)
├── use               (optional)
├── governance        (optional)
├── safety            (optional)
├── community         (optional)
├── sources           (optional)
├── stats             (optional)
└── extensions        (optional)
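
For tooling written against this hierarchy, the top level can be sketched as a type. The interface below is illustrative only: each section is left as an opaque object rather than fully typed.

// Illustrative top-level shape; section contents are deliberately untyped here.
interface DataCard {
  schema_version: string;
  core: object;
  data: object;
  rights: object;
  provenance: object;
  access: object;
  synthetic?: object;   // required when data.kind is synthetic or hybrid
  artifacts?: object;
  processing?: object;
  quality?: object;
  use?: object;
  governance?: object;
  safety?: object;
  community?: object;
  sources?: object;
  stats?: object;
  extensions?: Record<string, object>;
}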

Required vs. Conditional

Section       Status        Trigger
core          Required      Always
data          Required      Always
rights        Required      Always
provenance    Required      Always
access        Required      Always
synthetic     Conditional   data.kind is synthetic or hybrid

JSON Schema Specification

The schema uses JSON Schema Draft 2020-12 with these features:

Type System

Every field declares one of the JSON Schema primitive types:

  • string
  • number
  • boolean
  • array
  • object
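
An individual property definition pairs one of these types with whatever constraints apply. The fragment below, written as a TypeScript object, is purely illustrative; the property name and the maxLength constraint are not taken from the schema:

// Hypothetical property definition: a string capped at 200 characters.
const titleProperty = {
  type: "string",
  maxLength: 200
};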

Enumerations

Many fields use enums to restrict values:

{
  "data.kind": {
    "enum": ["real", "synthetic", "hybrid"]
  },
  "access.availability": {
    "enum": ["public-download", "restricted", "on-request", "not-available"]
  }
}
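
Tools that consume cards can mirror these enums as literal unions; the type names below are illustrative, while the values come from the schema:

// Literal unions mirroring the enums above; the type names are not part of the schema.
type DataKind = "real" | "synthetic" | "hybrid";
type Availability = "public-download" | "restricted" | "on-request" | "not-available";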

Conditional Logic

The schema uses if/then/else for conditional requirements:

{
  "if": {
    "properties": {
      "data": {
        "properties": {
          "kind": { "enum": ["synthetic", "hybrid"] }
        }
      }
    }
  },
  "then": {
    "required": ["synthetic"]
  }
}

Versioning

Semantic Versioning

The schema follows Semantic Versioning:

  • MAJOR (1.x.x) - Breaking changes to required fields
  • MINOR (x.1.x) - New optional fields, non-breaking changes
  • PATCH (x.x.1) - Documentation fixes, clarifications

Current Version

The current schema version is 1.0. Specify it in your data card:

{
  "schema_version": "llm-datacard/v1.0"
}

Backwards Compatibility

  • New optional fields don’t break existing cards
  • Required field changes are versioned
  • Deprecated fields emit warnings, not errors
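
A consumer can use the version string to decide whether it understands a card. The sketch below assumes the llm-datacard/vMAJOR.MINOR format shown above and accepts any card whose major version matches the one the tool supports:

// Rough compatibility check: a matching major version means no breaking changes.
const SUPPORTED_MAJOR = 1;

function isSupportedVersion(schemaVersion: string): boolean {
  const match = /^llm-datacard\/v(\d+)\.(\d+)$/.exec(schemaVersion);
  return match !== null && Number(match[1]) === SUPPORTED_MAJOR;
}

// isSupportedVersion("llm-datacard/v1.0") -> true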

Validation

Using the Validator

Validate your data card against the schema:

  1. Go to the Validator tool
  2. Paste or upload your JSON
  3. Review any errors or warnings

Programmatic Validation

Use Ajv for JSON Schema Draft 2020-12:

import Ajv from "ajv/dist/2020";
import addFormats from "ajv-formats";

const ajv = new Ajv({ allErrors: true });
addFormats(ajv);

// `schema` is the LLM Data Card JSON Schema and `dataCard` is the card
// being checked; load both from wherever they live in your project.
const validate = ajv.compile(schema);
const valid = validate(dataCard);

if (!valid) {
  console.error(validate.errors);
}

Next Steps