LLM Training Data Standard v1

Training Data + Metadata for LLMs

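DataPass centers on the LLM Training Data spec: each package is a metadata.json file plus JSONL shards, covering supervised fine-tuning (SFT), preference, reinforcement fine-tuning (RFT), and continued pretraining (CPT) data. Data Cards remain available for dataset documentation.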


metadata.json
{
  "schema_version": "llm-training-data/v1",
  "dataset": {
    "name": "telco-support-sft",
    "dataset_id": "telco-support-sft-v1",
    "description": "Anonymized broadband troubleshooting conversations",
    "license": "Proprietary",
    "contact": "data@telco.meetkai.ai",
    "tags": ["telecom", "customer-support", "broadband"]
  },
  "files": [
    {
      "split": "train",
      "objective": "sft",
      "shards": [
        { "path": "data/train-00000.jsonl", "record_count": 1 },
        { "path": "data/train-00001.jsonl", "record_count": 1 }
      ]
    }
  ]
}
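
As a rough illustration of how a consumer might use the manifest above, the following Python sketch reads metadata.json and compares each shard's declared record_count with the number of JSON lines actually on disk. It relies only on the fields shown in the sample (files, shards, path, record_count). The function names and the choice to resolve shard paths relative to the folder holding metadata.json are assumptions made for this example, not part of the spec.

```python
import json
from pathlib import Path


def count_jsonl_records(path: Path) -> int:
    """Count non-empty lines in a JSONL file, parsing each one as JSON."""
    count = 0
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                json.loads(line)  # raises json.JSONDecodeError on a malformed record
                count += 1
    return count


def check_package(root: Path) -> list[str]:
    """Return a list of problems; an empty list means every count matches.

    Assumption: shard paths are relative to the folder containing metadata.json.
    """
    metadata = json.loads((root / "metadata.json").read_text(encoding="utf-8"))
    problems = []
    for entry in metadata["files"]:
        for shard in entry["shards"]:
            shard_path = root / shard["path"]
            if not shard_path.is_file():
                problems.append(f"missing shard: {shard['path']}")
                continue
            actual = count_jsonl_records(shard_path)
            if actual != shard["record_count"]:
                problems.append(
                    f"{shard['path']}: declared {shard['record_count']}, found {actual}"
                )
    return problems
```

Calling check_package(Path("my-package")) on a folder laid out like the sample would return an empty list when both shards exist and each holds exactly one record.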

Why DataPass?

A standard designed for real-world data submission and low-resource language stewardship

Rights Clarity

Clear licensing, commercial use status, and personal data characterization in a machine-readable format.

Provenance Tracking

Document source types, geography, collection windows, and data origins with structured fields.

Access Information

Standardized availability status with request instructions and artifact pointers.

Language Stewardship

BCP-47 tags, autonyms, scripts, writing direction, and code-switching metadata for low-resource languages.

Schema Validation

A strict JSON Schema with additionalProperties: false rejects any field the schema does not declare, keeping data cards consistent and predictable.

Community Driven

PR-based contribution workflow for data cards, schemas, and translations.

Strict Schema for Reliable Data Cards

The LLM Data Card v1.0 schema uses additionalProperties: false throughout, ensuring every field is intentional and validated.

  • Required sections: core, data, rights, provenance, access
  • Optional sections: artifacts, processing, quality, synthetic, use, governance, safety, community, sources, stats, extensions
  • Conditional rules: some fields become required depending on the values of other fields
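
To make the effect of additionalProperties: false concrete, here is a minimal Python sketch of the top-level check only. The section names come from the lists above. The function name check_sections is hypothetical, and the sketch assumes the listed sections are the only top-level keys a card may carry, which this page does not confirm. The real schema is a JSON Schema document and applies the same strictness inside each section as well.

```python
# Section names taken from the lists above.
REQUIRED_SECTIONS = {"core", "data", "rights", "provenance", "access"}
OPTIONAL_SECTIONS = {
    "artifacts", "processing", "quality", "synthetic", "use", "governance",
    "safety", "community", "sources", "stats", "extensions",
}


def check_sections(card: dict) -> list[str]:
    """Report missing required sections and any undeclared top-level keys.

    Assumption: no top-level keys exist beyond the sections listed above.
    """
    present = set(card)
    errors = [
        f"missing required section: {name}"
        for name in sorted(REQUIRED_SECTIONS - present)
    ]
    # Mirrors additionalProperties: false, at the top level only.
    errors += [
        f"unknown section: {name}"
        for name in sorted(present - REQUIRED_SECTIONS - OPTIONAL_SECTIONS)
    ]
    return errors
```

A card with all five required sections and no extras passes with an empty list. A misspelled key such as "provenence" is reported twice over: once as a missing required section and once as an unknown one.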

Conditional Validation Examples

  • data.kind = "synthetic" requires synthetic.generation_method
  • rights.contains_personal_data != "none" requires rights.consent_mechanism
  • access.availability = "restricted" requires request_instructions or url
  • data.structures = "translation-pairs" requires languages.length >= 2
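
The four rules above can be sketched as plain Python checks. This is an illustration of the logic, not the schema's actual implementation, which would normally express such rules in JSON Schema itself. Several details are assumptions: request_instructions and url are taken to live under access, languages is taken to be a list under data, data.structures is accepted as either a single string or a list, and a missing rights.contains_personal_data is treated as not triggering the consent rule. The helper names get_path and check_conditionals are hypothetical.

```python
def get_path(card: dict, dotted: str):
    """Follow a dotted path such as "data.kind"; return None if any key is absent."""
    node = card
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def check_conditionals(card: dict) -> list[str]:
    """Apply the four conditional rules and return a list of violations."""
    errors = []

    if get_path(card, "data.kind") == "synthetic":
        if not get_path(card, "synthetic.generation_method"):
            errors.append("synthetic.generation_method is required")

    personal = get_path(card, "rights.contains_personal_data")
    # Assumption: an absent value does not trigger the rule.
    if personal is not None and personal != "none":
        if not get_path(card, "rights.consent_mechanism"):
            errors.append("rights.consent_mechanism is required")

    if get_path(card, "access.availability") == "restricted":
        # Assumption: both fields sit under "access".
        has_instructions = get_path(card, "access.request_instructions")
        has_url = get_path(card, "access.url")
        if not (has_instructions or has_url):
            errors.append("access.request_instructions or access.url is required")

    structures = get_path(card, "data.structures")
    # Assumption: structures may be a single string or a list of strings.
    if isinstance(structures, str):
        structures = [structures]
    if isinstance(structures, list) and "translation-pairs" in structures:
        languages = get_path(card, "data.languages")  # assumed location
        if not isinstance(languages, list) or len(languages) < 2:
            errors.append("at least two languages are required")

    return errors
```

For example, a card with data.kind set to "synthetic" and no synthetic section yields a single violation naming synthetic.generation_method.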

Ready to document your dataset?

Use our tools to create, validate, and publish your data card.