About | DataPass

DataPass is an open standard for documenting datasets used to train and evaluate large language models. The LLM Data Card v1.0 schema provides a machine-readable format that captures essential metadata about rights, provenance, access, and data characteristics.

Why We Built This

As LLMs become more prevalent, transparent and standardized dataset documentation has become critical. Three groups depend on it: dataset publishers need a guided way to produce valid cards, model builders need fast search and machine-readable export, and auditors need clear provenance and rights information.

Core Principles

Strictness by design: The schema uses additionalProperties: false throughout, ensuring every field is intentional and validated.
Language stewardship: First-class support for BCP-47 tags, autonyms, writing direction, and code-switching metadata for low-resource languages.
Practical data submission: Designed for real-world workflows with artifacts, processing pipelines, and quality metrics.
Community-driven: PR-based contribution workflow for data cards, schemas, and translations.
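
To make the strictness principle concrete, here is a minimal sketch of how additionalProperties: false behaves under JSON Schema semantics. The schema fragment and its field names (name, license, language) are hypothetical and are not taken from the LLM Data Card v1.0 schema. The hand-written check_card helper covers only the properties, required, and additionalProperties keywords, so treat it as an illustration rather than a substitute for a full JSON Schema validator.

```python
# Hypothetical schema fragment; the field names are illustrative only
# and are NOT the actual LLM Data Card v1.0 fields.
CARD_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "license": {"type": "string"},
        "language": {"type": "string"},  # e.g. a BCP-47 tag such as "sw-KE"
    },
    "required": ["name", "license"],
    "additionalProperties": False,
}


def check_card(card: dict, schema: dict = CARD_SCHEMA) -> list:
    """Return a list of problems; an empty list means this toy check passed."""
    problems = []
    allowed = schema.get("properties", {})

    # Every required field must be present.
    for key in schema.get("required", []):
        if key not in card:
            problems.append(f"missing required field: {key}")

    # With additionalProperties set to false, any key that is not
    # declared under "properties" is rejected.
    if schema.get("additionalProperties", True) is False:
        for key in card:
            if key not in allowed:
                problems.append(f"unexpected field: {key}")

    return problems


if __name__ == "__main__":
    good = {"name": "demo-corpus", "license": "CC-BY-4.0", "language": "sw-KE"}
    typo = {"name": "demo-corpus", "licence": "CC-BY-4.0"}  # misspelled key
    print(check_card(good))
    print(check_card(typo))
```

In this sketch the misspelled licence key is reported twice over: once because the required license field is missing, and once because licence is not a declared property. Without additionalProperties: false, only the first problem would surface and the stray key would pass unnoticed.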

Get Involved

DataPass is an open project. You can contribute by:

Submitting data cards for your datasets
Contributing record schemas or extensions
Helping translate the documentation
Reporting issues or suggesting improvements

Visit our GitHub repository to get started.