About
DataPass is an open standard for documenting datasets used to train and evaluate large language models. The LLM Data Card v1.0 schema provides a machine-readable format that captures essential metadata about rights, provenance, access, and data characteristics.
Why We Built This
As LLMs become more prevalent, the need for transparent, standardized dataset documentation has become critical. Dataset publishers need a guided way to produce valid cards, model builders need fast search and machine-readable export, and auditors need clear provenance and rights information.
Core Principles
- Strictness by design: The schema uses additionalProperties: false throughout, ensuring every field is intentional and validated.
- Language stewardship: First-class support for BCP-47 tags, autonyms, writing direction, and code-switching metadata for low-resource languages.
- Practical data submission: Designed for real-world workflows with artifacts, processing pipelines, and quality metrics.
- Community-driven: PR-based contribution workflow for data cards, schemas, and translations.
Get Involved
DataPass is an open project. You can contribute by:
- Submitting data cards for your datasets
- Contributing record schemas or extensions
- Helping translate the documentation
- Reporting issues or suggesting improvements
Visit our GitHub repository to get started. GitHub