Data Card Integration

LLM Training Data and LLM Data Card are designed to work together. This guide explains how to link them for complete dataset documentation.

LLM Training Data            LLM Data Card
Data format specification    Dataset documentation
How to read the data         Rights, provenance, safety
Technical structure          Metadata for discovery

Together, they provide:

  • Complete documentation for auditors and users
  • Machine-readable format for training pipelines
  • Discoverability through the registry

Linking from Training Data to Data Card

External Reference

Point to a hosted Data Card:

{
  "schema_version": "llm-training-data/v1",
  "data_card": {
    "uri": "https://datapass.meetkai.ai/registry/my-dataset/1.0.0/card.json",
    "sha256": "abc123..."
  },
  "dataset": { ... },
  "files": [ ... ]
}
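
The sha256 field lets a consumer confirm it fetched exactly the card the producer published. A minimal verification sketch in Python, using only the standard library and assuming the training-data metadata above is saved locally as metadata.json:

import hashlib
import json
import urllib.request

# Load the training-data metadata and read the external card reference.
with open("metadata.json", "r", encoding="utf-8") as f:
    metadata = json.load(f)

card_ref = metadata["data_card"]

# Fetch the hosted Data Card and hash the raw bytes before parsing.
with urllib.request.urlopen(card_ref["uri"]) as response:
    card_bytes = response.read()

digest = hashlib.sha256(card_bytes).hexdigest()
if digest != card_ref["sha256"]:
    raise ValueError(f"Data Card checksum mismatch: got {digest}")

card = json.loads(card_bytes)
print("Verified Data Card:", card["core"]["title"])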

Inline Embedding

Embed the Data Card directly:

{
  "schema_version": "llm-training-data/v1",
  "data_card": {
    "inline": true,
    "core": {
      "id": "my-dataset",
      "version": "1.0.0",
      "title": "My Training Dataset",
      "summary": "A dataset for SFT training",
      "maintainer": "My Org",
      "contact": "data@meetkai.ai"
    },
    "data": {
      "kind": "synthetic",
      "modalities": ["text"],
      "languages": ["en"],
      "size": { "examples": 100000 },
      "domains": ["general"],
      "record_format": "llm-training-data"
    },
    "rights": {
      "license": "MIT",
      "allows_commercial_use": true,
      "contains_personal_data": "none"
    },
    "provenance": {
      "source_types": ["synthetic-generation"]
    },
    "access": {
      "availability": "public-download",
      "url": "https://datapass.meetkai.ai/registry/my-dataset/1.0.0"
    }
  },
  "dataset": { ... },
  "files": [ ... ]
}
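
A consumer can support both forms by branching on the inline flag. The extraction below is only a sketch of one way to do that, not part of either schema, and again assumes the metadata is saved locally as metadata.json:

import json

with open("metadata.json", "r", encoding="utf-8") as f:
    metadata = json.load(f)

card_ref = metadata.get("data_card", {})

if card_ref.get("inline"):
    # Card sections (core, data, rights, provenance, access) are embedded directly.
    card = {key: value for key, value in card_ref.items() if key != "inline"}
    print("Inline card:", card["core"]["title"])
else:
    # Card is hosted elsewhere; resolve card_ref["uri"] and verify its sha256.
    print("External card at:", card_ref.get("uri"))

Inline embedding keeps a bundle self-contained, while the external reference avoids duplicating a card that already lives in the registry.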

Linking from Data Card to Training Data

In the LLM Data Card, set "record_format": "llm-training-data" in the data section and reference the dataset's metadata.json via artifacts.training_data_ref:

{
  "schema_version": "llm-datacard/v1.0",
  "core": { ... },
  "data": {
    "kind": "synthetic",
    "modalities": ["text"],
    "languages": ["en"],
    "size": { "examples": 100000 },
    "domains": ["general"],
    "record_format": "llm-training-data"
  },
  "artifacts": {
    "training_data_ref": {
      "metadata_uri": "s3://bucket/my-dataset/metadata.json",
      "metadata_sha256": "def456...",
      "training_objective": "sft"
    },
    "files": [
      {
        "uri": "s3://bucket/my-dataset/metadata.json",
        "size_bytes": 1234
      }
    ]
  },
  "rights": { ... },
  "provenance": { ... },
  "access": { ... }
}
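
Going the other direction, a consumer that starts from the Data Card can use training_data_ref to locate and verify the training-data metadata. A sketch that assumes the referenced metadata.json has already been downloaded next to the card (the example above points at an s3:// URI):

import hashlib
import json

with open("card.json", "r", encoding="utf-8") as f:
    card = json.load(f)

ref = card["artifacts"]["training_data_ref"]
print("Training objective:", ref["training_objective"])

# Verify the downloaded metadata.json against the recorded digest, if one is given.
with open("metadata.json", "rb") as f:
    metadata_bytes = f.read()

if "metadata_sha256" in ref:
    digest = hashlib.sha256(metadata_bytes).hexdigest()
    if digest != ref["metadata_sha256"]:
        raise ValueError(f"metadata.json checksum mismatch: got {digest}")

metadata = json.loads(metadata_bytes)
assert metadata["schema_version"] == "llm-training-data/v1"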

Complete Bundle Structure

A fully documented dataset might look like:

my-dataset/
  card.json              # LLM Data Card
  metadata.json          # LLM Training Data (references card.json)
  assets.jsonl           # Asset index
  data/
    sft-train-00000.jsonl.gz
    sft-train-00001.jsonl.gz

card.json:

{
  "schema_version": "llm-datacard/v1.0",
  "core": { "id": "my-dataset", ... },
  "data": { "record_format": "llm-training-data", ... },
  "artifacts": {
    "training_data_ref": {
      "metadata_uri": "./metadata.json",
      "training_objective": "sft"
    }
  },
  ...
}

metadata.json:

{
  "schema_version": "llm-training-data/v1",
  "data_card": { "uri": "./card.json" },
  "dataset": { "name": "my-dataset" },
  "files": [ ... ]
}
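
Because card.json and metadata.json sit side by side, the relative references can be checked mechanically. A small sketch, assumed to run from inside the my-dataset/ directory:

import json
from pathlib import Path

bundle = Path(".")
card = json.loads((bundle / "card.json").read_text(encoding="utf-8"))
metadata = json.loads((bundle / "metadata.json").read_text(encoding="utf-8"))

# card.json -> metadata.json
metadata_uri = card["artifacts"]["training_data_ref"]["metadata_uri"]
assert (bundle / metadata_uri).resolve() == (bundle / "metadata.json").resolve()

# metadata.json -> card.json
card_uri = metadata["data_card"]["uri"]
assert (bundle / card_uri).resolve() == (bundle / "card.json").resolve()

print("Cross-references resolve for", metadata["dataset"]["name"])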

Validation

Validate both files, as sketched below:

  1. Validate card.json against LLM Data Card schema
  2. Validate metadata.json against LLM Training Data schema
  3. Verify cross-references resolve correctly
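
Steps 1 and 2 can be automated with any JSON Schema validator; step 3 corresponds to the bundle check sketched in the previous section. A minimal sketch using the Python jsonschema package, where the local schema file names are only illustrative:

import json
from jsonschema import validate  # pip install jsonschema

def load(path):
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

# 1. Validate card.json against the LLM Data Card schema.
validate(instance=load("card.json"), schema=load("llm-datacard.schema.json"))

# 2. Validate metadata.json against the LLM Training Data schema.
validate(instance=load("metadata.json"), schema=load("llm-training-data.schema.json"))

print("Both documents are schema-valid.")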

Best Practices

  1. Always link - Every training dataset should have a Data Card
  2. Keep in sync - Update both when making changes
  3. Use checksums - Verify integrity of references
  4. Co-locate files - Keep card.json and metadata.json together
  5. Version together - Release both as a unit