Training Data + Metadata for LLMs
DataPass centers on the LLM Training Data spec, which packages records for SFT, preference, RFT, and CPT objectives as a metadata.json file plus JSONL shards. Data Cards remain available for dataset documentation.
{
  "schema_version": "llm-training-data/v1",
  "dataset": {
    "name": "telco-support-sft",
    "dataset_id": "telco-support-sft-v1",
    "description": "Anonymized broadband troubleshooting conversations",
    "license": "Proprietary",
    "contact": "data@telco.meetkai.ai",
    "tags": ["telecom", "customer-support", "broadband"]
  },
  "files": [
    {
      "split": "train",
      "objective": "sft",
      "shards": [
        { "path": "data/train-00000.jsonl", "record_count": 1 },
        { "path": "data/train-00001.jsonl", "record_count": 1 }
      ]
    }
  ]
}

{
  "schema_version": "llm-datacard/v1.0",
  "core": {
    "id": "telco-support-sft",
    "version": "1.0.0",
    "title": "Telco Support Conversations",
    "summary": "Anonymized customer support chats for broadband troubleshooting",
    "maintainer": "MeetKai Telco",
    "contact": "data@telco.meetkai.ai"
  },
  "data": {
    "kind": "real",
    "modalities": ["text"],
    "languages": ["en"],
    "size": { "examples": 250000 },
    "domains": ["customer-support", "telecom"],
    "record_format": "chat",
    "task_types": ["supervised-finetuning"],
    "structures": ["conversations"]
  },
  "rights": {
    "license": "Proprietary",
    "allows_commercial_use": false,
    "contains_personal_data": "pseudonymous"
  },
  "provenance": {
    "source_types": ["support-chats"]
  },
  "access": {
    "availability": "restricted",
    "request_instructions": "Contact data@telco.meetkai.ai for access review."
  }
}

Start with LLM Training Data
Package SFT, preference, RFT, or CPT records with a standard metadata.json and JSONL layout.
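As a rough illustration of this layout, the sketch below reads a metadata.json and checks that every shard listed under files[].shards[] exists and holds the declared record_count, counting one record per non-empty JSONL line. The function name check_shards and the assumption that shard paths are relative to the directory containing metadata.json are illustrative and not taken from the spec.

```python
import json
from pathlib import Path


def check_shards(package_root):
    """Return a list of problems found in the shards that metadata.json declares.

    Assumption (not from the spec): shard paths are relative to the
    directory that holds metadata.json.
    """
    root = Path(package_root)
    metadata = json.loads((root / "metadata.json").read_text(encoding="utf-8"))
    problems = []
    for entry in metadata.get("files", []):
        for shard in entry.get("shards", []):
            shard_path = root / shard["path"]
            if not shard_path.is_file():
                problems.append(f"{shard['path']}: file not found")
                continue
            with shard_path.open(encoding="utf-8") as handle:
                # One JSON object per non-empty line; json.loads raises on a bad line.
                records = [json.loads(line) for line in handle if line.strip()]
            if len(records) != shard["record_count"]:
                problems.append(
                    f"{shard['path']}: expected {shard['record_count']} records, "
                    f"found {len(records)}"
                )
    return problems
```

An empty list means every declared shard was found and its line count matched; the check says nothing about whether individual records follow the format for their objective.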
Training Data Overview
Understand the format, objectives, and file structure.
Explore the format
Training Data Quickstart
Create a minimal SFT dataset in minutes.
Start the quickstart
Telco Quickstart
Guidance for telecom teams on what data to share and how.
Read the telco guide
Sample Packages
Copy realistic metadata.json + JSONL examples.
View sample packages
Why DataPass?
A standard designed for real-world data submission and low-resource language stewardship
Rights Clarity
Clear licensing, commercial use status, and personal data characterization in a machine-readable format.
Provenance Tracking
Document source types, geography, collection windows, and data origins with structured fields.
Access Information
Standardized availability status with request instructions and artifact pointers.
Language Stewardship
BCP-47 tags, autonyms, scripts, writing direction, and code-switching metadata for low-resource languages.
Schema Validation
Strict JSON Schema with additionalProperties: false ensures consistent, predictable data cards.
Community Driven
PR-based contribution workflow for data cards, schemas, and translations.
Strict Schema for Reliable Data Cards
The LLM Data Card v1.0 schema sets additionalProperties: false throughout, so a card containing any field the schema does not define fails validation.
- Required sections: core, data, rights, provenance, access
- Optional sections: artifacts, processing, quality, synthetic, use, governance, safety, community, sources, stats, extensions
- Conditional rules: certain fields become required depending on the values of other fields
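The section lists above can be checked with a few lines of standard-library Python. This is a minimal sketch of the top-level check only, not the schema itself: check_sections is a hypothetical helper, and allowing schema_version at the top level is an assumption based on the sample card.

```python
REQUIRED_SECTIONS = {"core", "data", "rights", "provenance", "access"}
OPTIONAL_SECTIONS = {
    "artifacts", "processing", "quality", "synthetic", "use", "governance",
    "safety", "community", "sources", "stats", "extensions",
}


def check_sections(card):
    """Return (missing, unknown) top-level section names for a data card dict."""
    # Assumption: schema_version is the only other key allowed at the top level.
    allowed = REQUIRED_SECTIONS | OPTIONAL_SECTIONS | {"schema_version"}
    present = set(card)
    missing = sorted(REQUIRED_SECTIONS - present)
    # Mirrors additionalProperties: false, where undeclared keys are rejected.
    unknown = sorted(present - allowed)
    return missing, unknown
```

A card that adds an undeclared top-level key such as "notes" would show up in the unknown list, which is the behavior additionalProperties: false enforces at every level of the real schema.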
Conditional Validation Examples
- data.kind = "synthetic" requires synthetic.generation_method
- rights.contains_personal_data != "none" requires rights.consent_mechanism
- access.availability = "restricted" requires request_instructions or url
- data.structures = "translation-pairs" requires languages.length >= 2
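The four rules above can be sketched as plain Python checks over a parsed data card. This is an illustration under stated assumptions, not the schema's actual implementation: conditional_errors is a hypothetical name, request_instructions and url are assumed to live under access, and languages is assumed to mean data.languages.

```python
def conditional_errors(card):
    """Return one message per conditional rule that the card dict violates."""
    data = card.get("data", {})
    rights = card.get("rights", {})
    access = card.get("access", {})
    errors = []

    # Rule 1: synthetic data must say how it was generated.
    generation_method = card.get("synthetic", {}).get("generation_method")
    if data.get("kind") == "synthetic" and not generation_method:
        errors.append("synthetic.generation_method is required")

    # Rule 2: any personal data needs a consent mechanism.
    # Assumption: a missing contains_personal_data is left to the base schema.
    personal = rights.get("contains_personal_data")
    if personal not in (None, "none") and not rights.get("consent_mechanism"):
        errors.append("rights.consent_mechanism is required")

    # Rule 3: restricted access needs a way to request it.
    # Assumption: both fields sit under the access section.
    has_route = access.get("request_instructions") or access.get("url")
    if access.get("availability") == "restricted" and not has_route:
        errors.append("access.request_instructions or access.url is required")

    # Rule 4: translation pairs need at least two languages.
    # Assumption: "languages" refers to data.languages.
    if "translation-pairs" in data.get("structures", []):
        if len(data.get("languages", [])) < 2:
            errors.append("data.languages needs at least 2 entries")

    return errors
```

For example, a card with data.kind set to "synthetic" and no synthetic section yields a single message naming synthetic.generation_method.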
Ready to document your dataset?
Use our tools to create, validate, and publish your data card.