Schema Overview

The LLM Data Card schema provides a standardized way to document datasets used for training, evaluating, and fine-tuning large language models. This page explains the design principles and structure of the schema.

Design Principles

1. Comprehensive Yet Minimal

The schema balances thoroughness with usability. Required fields capture essential information, while optional sections allow detailed documentation when needed.

Required Sections

  • core - Identity and contact
  • data - Content description
  • rights - Licensing
  • provenance - Origins
  • access - How to get it

Optional Sections

  • synthetic - AI generation details
  • artifacts - File information
  • use - Intended applications
  • governance - Review status
  • safety - Risk assessment
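
A card for a simple, fully public dataset can therefore stay small. The sketch below shows the overall shape using only the required sections; the section contents are placeholders, not the actual field names:

const minimalCard = {
  schema_version: "llm-datacard/v1.0",
  core: {},        // identity and contact
  data: {},        // content description
  rights: {},      // licensing
  provenance: {},  // origins
  access: {}       // how to get it
};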

2. Machine-Readable

Every field has a defined type and constraints. This enables:

  • Automated validation - Check cards against the schema
  • Programmatic access - Build tools that consume data cards
  • Registry search - Filter datasets by specific criteria (see the sketch after this list)
  • Compliance checking - Verify licensing and privacy requirements
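
For example, a registry tool could filter an index of cards by availability. This is a minimal sketch; the CardIndexEntry shape is assumed for illustration and is not part of the schema tooling:

// Hypothetical index entry holding just the field needed for this filter.
interface CardIndexEntry {
  access: { availability: string };
}

// Keep only datasets that can be downloaded without a request process.
function publicDownloads(cards: CardIndexEntry[]): CardIndexEntry[] {
  return cards.filter((card) => card.access.availability === "public-download");
}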

3. Conditional Requirements

Some fields become required based on other values. For example:

  • If data.kind is synthetic or hybrid, the synthetic section is required
  • If rights.contains_personal_data is not none, consent_mechanism is required
  • If access.availability is on-request, request_instructions is required

This ensures cards contain appropriate detail without burdening simple datasets.
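
As an example of the first rule, a card fragment like the one below would only validate with a synthetic section present (its contents are omitted here):

// data.kind of "synthetic" (or "hybrid") makes the synthetic section mandatory.
const syntheticCard = {
  data: { kind: "synthetic" },
  synthetic: { /* AI generation details */ }
};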

4. Standards-Based

The schema uses established standards where possible:

Data          Standard
Languages     BCP-47 language tags
Licenses      SPDX identifiers
Dates         ISO 8601 format
Schema        JSON Schema Draft 2020-12
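
Put together, a card fragment might look like the sketch below; the property names languages, license, and created are illustrative assumptions, so check the schema for the exact names:

// Property names here are placeholders; only the value formats follow the standards above.
const standardsExample = {
  data: { languages: ["en", "de-CH"] },    // BCP-47 language tags
  rights: { license: "CC-BY-4.0" },        // SPDX identifier
  provenance: { created: "2024-05-14" }    // ISO 8601 date
};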

5. Extensible

The extensions object allows custom metadata without breaking schema validation:

{
  "extensions": {
    "my-org": {
      "internal_id": "DS-2024-001",
      "review_committee": "data-governance"
    }
  }
}

Schema Structure

Section Hierarchy

DataCard
├── schema_version    (required)
├── core              (required)
├── data              (required)
├── rights            (required)
├── provenance        (required)
├── access            (required)
├── synthetic         (conditional)
├── artifacts         (optional)
├── processing        (optional)
├── quality           (optional)
├── use               (optional)
├── governance        (optional)
├── safety            (optional)
├── community         (optional)
├── sources           (optional)
├── stats             (optional)
└── extensions        (optional)
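
For tooling written against this hierarchy, the top level can be sketched as a type. The interface below is illustrative only: each section is left as an opaque object rather than fully typed.

// Illustrative top-level shape; section contents are deliberately untyped here.
interface DataCard {
  schema_version: string;
  core: object;
  data: object;
  rights: object;
  provenance: object;
  access: object;
  synthetic?: object;   // required when data.kind is synthetic or hybrid
  artifacts?: object;
  processing?: object;
  quality?: object;
  use?: object;
  governance?: object;
  safety?: object;
  community?: object;
  sources?: object;
  stats?: object;
  extensions?: Record<string, object>;
}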

Required vs. Conditional

Section       Status        Trigger
core          Required      Always
data          Required      Always
rights        Required      Always
provenance    Required      Always
access        Required      Always
synthetic     Conditional   data.kind is synthetic or hybrid

JSON Schema Specification

The schema uses JSON Schema Draft 2020-12 with these features:

Type System

Every field declares one of the JSON Schema primitive types:

  • string
  • number
  • boolean
  • array
  • object
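
An individual property definition pairs one of these types with whatever constraints apply. The fragment below, written as a TypeScript object, is purely illustrative; the property name and the maxLength constraint are not taken from the schema:

// Hypothetical property definition: a string capped at 200 characters.
const titleProperty = {
  type: "string",
  maxLength: 200
};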

Enumerations

Many fields use enums to restrict values:

{
  "data.kind": {
    "enum": ["real", "synthetic", "hybrid"]
  },
  "access.availability": {
    "enum": ["public-download", "restricted", "on-request", "not-available"]
  }
}
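
Tools that consume cards can mirror these enums as literal unions; the type names below are illustrative, while the values come from the schema:

// Literal unions mirroring the enums above; the type names are not part of the schema.
type DataKind = "real" | "synthetic" | "hybrid";
type Availability = "public-download" | "restricted" | "on-request" | "not-available";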

Conditional Logic

The schema uses if/then/else for conditional requirements:

{
  "if": {
    "properties": {
      "data": {
        "properties": {
          "kind": { "enum": ["synthetic", "hybrid"] }
        }
      }
    }
  },
  "then": {
    "required": ["synthetic"]
  }
}

Versioning

Semantic Versioning

The schema follows Semantic Versioning:

  • MAJOR (1.x.x) - Breaking changes to required fields
  • MINOR (x.1.x) - New optional fields, non-breaking changes
  • PATCH (x.x.1) - Documentation fixes, clarifications

Current Version

The current schema version is 1.0. Specify it in your data card:

{
  "schema_version": "llm-datacard/v1.0"
}

Backwards Compatibility

  • New optional fields don’t break existing cards
  • Required field changes are versioned
  • Deprecated fields emit warnings, not errors
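
A consumer can use the version string to decide whether it understands a card. The sketch below assumes the llm-datacard/vMAJOR.MINOR format shown above and accepts any card whose major version matches the one the tool supports:

// Rough compatibility check: a matching major version means no breaking changes.
const SUPPORTED_MAJOR = 1;

function isSupportedVersion(schemaVersion: string): boolean {
  const match = /^llm-datacard\/v(\d+)\.(\d+)$/.exec(schemaVersion);
  return match !== null && Number(match[1]) === SUPPORTED_MAJOR;
}

// isSupportedVersion("llm-datacard/v1.0") -> true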

Validation

Using the Validator

Validate your data card against the schema:

  1. Go to the Validator tool
  2. Paste or upload your JSON
  3. Review any errors or warnings

Programmatic Validation

Use Ajv for JSON Schema Draft 2020-12:

import Ajv from "ajv/dist/2020";
import addFormats from "ajv-formats";

const ajv = new Ajv({ allErrors: true });
addFormats(ajv);

// `schema` is the LLM Data Card JSON Schema and `dataCard` is the card
// being checked; load both from wherever they live in your project.
const validate = ajv.compile(schema);
const valid = validate(dataCard);

if (!valid) {
  console.error(validate.errors);
}

Next Steps