LLM Data Card v1.0

A machine-readable data card for documenting datasets used to train and evaluate large language models

datacard.json
{
  "schema_version": "llm-datacard/v1.0",
  "core": {
    "id": "my-dataset",
    "version": "1.0.0",
    "title": "My Dataset",
    "summary": "A well-documented dataset for LLM training",
    "maintainer": "Your Organization",
    "contact": "data@meetkai.ai"
  },
  "data": {
    "kind": "real",
    "modalities": ["text"],
    "languages": ["en", "ar"],
    "size": { "examples": 100000 },
    "domains": ["news", "social-media"],
    "record_format": "plain-text"
  },
  "rights": { ... },
  "provenance": { ... },
  "access": { ... }
}

Why DataPass?

A standard designed for real-world data submission and low-resource language stewardship

Rights Clarity

Clear licensing, commercial use status, and personal data characterization in a machine-readable format.

Provenance Tracking

Document source types, geography, collection windows, and data origins with structured fields.

Access Information

Standardized availability status with request instructions and artifact pointers.

Language Stewardship

BCP-47 tags, autonyms, scripts, writing direction, and code-switching metadata for low-resource languages.

Schema Validation

Strict JSON Schema with additionalProperties: false ensures consistent, predictable data cards.

Community Driven

PR-based contribution workflow for data cards, schemas, and translations.

Strict Schema for Reliable Data Cards

The LLM Data Card v1.0 schema sets additionalProperties: false throughout, so a card that contains any field the schema does not define fails validation. Every field in a valid card is therefore intentional and checked.

  • Required sections: core, data, rights, provenance, access
  • Optional sections: artifacts, processing, quality, synthetic, use, governance, safety, community, sources, stats, extensions
  • Conditional rules: some fields become required depending on the values of other fields
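
To illustrate what strict validation means in practice, here is a minimal Python sketch that checks only the top-level section names listed above. It reports any missing required section and rejects any key outside the known set, which mirrors the effect of additionalProperties: false at the top level. The helper name check_sections is hypothetical, and treating schema_version as an allowed top-level key is an assumption drawn from the example card. The published JSON Schema remains the authoritative definition.

```python
# Top-level section names as listed in the text above.
REQUIRED_SECTIONS = {"core", "data", "rights", "provenance", "access"}
OPTIONAL_SECTIONS = {
    "artifacts", "processing", "quality", "synthetic", "use", "governance",
    "safety", "community", "sources", "stats", "extensions",
}
# Assumption: schema_version is an allowed top-level key, as in the example card.
ALLOWED_KEYS = REQUIRED_SECTIONS | OPTIONAL_SECTIONS | {"schema_version"}


def check_sections(card: dict) -> list[str]:
    """Return a list of problems; an empty list means this check passed."""
    keys = set(card)
    problems = []
    for name in sorted(REQUIRED_SECTIONS - keys):
        problems.append(f"missing required section: {name}")
    for name in sorted(keys - ALLOWED_KEYS):
        problems.append(f"unknown top-level key: {name}")
    return problems


# Illustrative card: "access" is missing and "notes" is not a known section.
card = {
    "schema_version": "llm-datacard/v1.0",
    "core": {}, "data": {}, "rights": {}, "provenance": {},
    "notes": "free-form text",
}
for problem in check_sections(card):
    print(problem)
```

This sketch inspects only top-level names. A full validator would also apply the same strictness inside each section.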

Conditional Validation Examples

  • data.kind = "synthetic" requires synthetic.generation_method
  • rights.contains_personal_data != "none" requires rights.consent_mechanism
  • access.availability = "restricted" requires request_instructions or url
  • data.structures = "translation-pairs" requires at least two entries in data.languages
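
The four rules above can be sketched as plain Python checks on a parsed card. This is an illustration rather than the project's validator. The helper name check_conditional_rules is hypothetical, and the code assumes that request_instructions and url sit under access, that languages sits under data (as in the example card), and that data.structures may be either a single string or a list of strings. The URL in the sample card is a placeholder.

```python
def check_conditional_rules(card: dict) -> list[str]:
    """Hypothetical helper: apply the four example rules to a parsed card."""
    data = card.get("data", {})
    rights = card.get("rights", {})
    access = card.get("access", {})
    synthetic = card.get("synthetic", {})
    problems = []

    # Rule 1: synthetic data must say how it was generated.
    if data.get("kind") == "synthetic" and "generation_method" not in synthetic:
        problems.append("synthetic.generation_method is required when data.kind is 'synthetic'")

    # Rule 2: personal data requires a consent mechanism.
    # Assumption: an absent contains_personal_data field is left to the
    # required-field checks, so this rule fires only when a value is present.
    personal = rights.get("contains_personal_data")
    if personal is not None and personal != "none" and "consent_mechanism" not in rights:
        problems.append("rights.consent_mechanism is required when personal data is present")

    # Rule 3: restricted access needs a way to request the data.
    # Assumption: both fields live under "access".
    if access.get("availability") == "restricted" and not (
        access.get("request_instructions") or access.get("url")
    ):
        problems.append("restricted access requires request_instructions or url")

    # Rule 4: translation pairs need at least two languages.
    structures = data.get("structures", [])
    if isinstance(structures, str):
        structures = [structures]
    if "translation-pairs" in structures and len(data.get("languages", [])) < 2:
        problems.append("translation-pairs requires at least two languages")

    return problems


# Illustrative card that breaks rules 1 and 4 but satisfies rules 2 and 3.
card = {
    "data": {"kind": "synthetic", "structures": ["translation-pairs"], "languages": ["en"]},
    "rights": {"contains_personal_data": "none"},
    "access": {"availability": "restricted", "url": "https://example.org/request"},
}
for problem in check_conditional_rules(card):
    print(problem)
```

In the actual schema, rules of this kind are more likely expressed declaratively (for example with JSON Schema if/then constructs) than as imperative code. The imperative form is used here only because it is easier to read.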

Ready to document your dataset?

Use our tools to create, validate, and publish your data card.