Quickstart

This guide will help you create a valid LLM Data Card v1.0 in just a few minutes.

Minimal Valid Card

Here’s the smallest valid data card you can create:

{
  "schema_version": "llm-datacard/v1.0",
  "core": {
    "id": "my-dataset",
    "version": "1.0.0",
    "title": "My Dataset",
    "summary": "A brief description of what this dataset contains.",
    "maintainer": "Your Organization",
    "contact": "data@meetkai.ai"
  },
  "data": {
    "kind": "real",
    "modalities": ["text"],
    "languages": ["en"],
    "size": {
      "examples": 10000
    },
    "domains": ["news"],
    "record_format": "plain-text"
  },
  "rights": {
    "license": "CC-BY-4.0",
    "allows_commercial_use": true,
    "contains_personal_data": "none"
  },
  "provenance": {
    "source_types": ["web-scrape"]
  },
  "access": {
    "availability": "public-download",
    "url": "https://datapass.meetkai.ai/registry/my-dataset/1.0.0"
  }
}

Step-by-Step Guide

  1. Set the schema version

    Always start with "schema_version": "llm-datacard/v1.0".

  2. Fill in core identity

    The core section identifies your dataset:

    • id: A machine-friendly slug (letters, numbers, dots, hyphens, underscores)
    • version: Your version label (e.g., “1.0.0” or “2025-01-15”)
    • title: Human-readable name
    • summary: 1-3 sentence description
    • maintainer: Who maintains this dataset
    • contact: Email or URL for questions

  3. Describe the data

    The data section describes what’s in your dataset:

    • kind: “real”, “synthetic”, or “hybrid”
    • modalities: Array of values from “text”, “speech”, “audio”, “image”, “video”, “code”, “multimodal”
    • languages: Array of BCP-47 tags (e.g., “en”, “ar”, “ha-Latn-NG”)
    • size.examples: Number of examples/records
    • domains: Content domains (e.g., “news”, “social-media”, “health”)
    • record_format: Structure of each example

  4. Specify rights

    The rights section covers licensing:

    • license: SPDX identifier preferred (e.g., “MIT”, “CC-BY-4.0”)
    • allows_commercial_use: Boolean
    • contains_personal_data: “none”, “de_minimis”, “pseudonymous”, or “direct”

  5. Document provenance

    The provenance section explains where data came from:

    • source_types: Array of source types

  6. Define access

    The access section explains how to get the dataset:

    • availability: “public-download”, “restricted”, “on-request”, or “not-available”
    • Include url or request_instructions as appropriate
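
As a quick pre-flight check before running full schema validation, the allowed values listed in the steps above can be tested with a few lines of JavaScript. This is a minimal sketch, not part of the schema tooling: checkBasics is a hypothetical helper, and the id pattern is an assumption that approximates "letters, numbers, dots, hyphens, underscores" (the schema's actual pattern may be stricter).

```javascript
// Allowed values, copied from the step-by-step guide.
const KINDS = ["real", "synthetic", "hybrid"];
const MODALITIES = ["text", "speech", "audio", "image", "video", "code", "multimodal"];
const PERSONAL_DATA = ["none", "de_minimis", "pseudonymous", "direct"];
const AVAILABILITY = ["public-download", "restricted", "on-request", "not-available"];

// Assumption: approximates the documented slug rule; the real schema may differ.
const ID_PATTERN = /^[A-Za-z0-9._-]+$/;

// Hypothetical helper: returns a list of problems (empty if none were found).
function checkBasics(card) {
  const problems = [];
  if (card.schema_version !== "llm-datacard/v1.0") {
    problems.push("schema_version must be llm-datacard/v1.0");
  }
  if (!ID_PATTERN.test(card.core?.id ?? "")) {
    problems.push("core.id is not a valid slug");
  }
  if (!KINDS.includes(card.data?.kind)) {
    problems.push("data.kind has an unknown value");
  }
  if (!(card.data?.modalities ?? []).every((m) => MODALITIES.includes(m))) {
    problems.push("data.modalities has an unknown value");
  }
  if (typeof card.rights?.allows_commercial_use !== "boolean") {
    problems.push("rights.allows_commercial_use must be a boolean");
  }
  if (!PERSONAL_DATA.includes(card.rights?.contains_personal_data)) {
    problems.push("rights.contains_personal_data has an unknown value");
  }
  if (!AVAILABILITY.includes(card.access?.availability)) {
    problems.push("access.availability has an unknown value");
  }
  return problems;
}

// Example draft with one mistake: the id contains a space.
const draft = {
  schema_version: "llm-datacard/v1.0",
  core: { id: "my dataset" },
  data: { kind: "real", modalities: ["text"] },
  rights: { license: "CC-BY-4.0", allows_commercial_use: true, contains_personal_data: "none" },
  access: { availability: "public-download" },
};

console.log(checkBasics(draft)); // [ 'core.id is not a valid slug' ]
```

A check like this only catches obvious slips early; the schema remains the source of truth for what counts as a valid card.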

Common Conditional Requirements

The schema enforces conditional requirements: some fields become required depending on the values of other fields. Here are the most common ones:

When…                                                    You must also provide…
data.kind is “synthetic” or “hybrid”                     synthetic.generation_method and synthetic.share_of_dataset
rights.contains_personal_data is not “none”              rights.consent_mechanism
access.availability is “restricted” or “on-request”      access.request_instructions or access.url
access.availability is “not-available”                   access.not_available_reason
data.has_human_annotations is true                       data.label_types
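
The rules above can be mirrored in plain JavaScript to see which conditional fields a draft card is still missing. This is an illustrative sketch only: checkConditionals is a hypothetical helper, it assumes the dotted names are paths from the card root (so synthetic is a top-level section), and it tests only whether a field is present, not whether its value is valid.

```javascript
// Hypothetical helper: lists conditional fields that are required but absent.
function checkConditionals(card) {
  const { data = {}, rights = {}, access = {}, synthetic = {} } = card;
  const missing = [];

  // Synthetic or hybrid data must describe how it was generated.
  if (data.kind === "synthetic" || data.kind === "hybrid") {
    if (synthetic.generation_method === undefined) missing.push("synthetic.generation_method");
    if (synthetic.share_of_dataset === undefined) missing.push("synthetic.share_of_dataset");
  }

  // Any personal data level other than "none" needs a consent mechanism.
  const personal = rights.contains_personal_data;
  if (personal !== undefined && personal !== "none" && rights.consent_mechanism === undefined) {
    missing.push("rights.consent_mechanism");
  }

  // Restricted or on-request access needs instructions or a URL.
  if (access.availability === "restricted" || access.availability === "on-request") {
    if (access.request_instructions === undefined && access.url === undefined) {
      missing.push("access.request_instructions or access.url");
    }
  }

  // Unavailable datasets must say why.
  if (access.availability === "not-available" && access.not_available_reason === undefined) {
    missing.push("access.not_available_reason");
  }

  // Human annotations require the label types to be listed.
  if (data.has_human_annotations === true && data.label_types === undefined) {
    missing.push("data.label_types");
  }

  return missing;
}

// Example: a hybrid dataset with pseudonymous data, available on request via a URL.
const example = {
  data: { kind: "hybrid" },
  rights: { contains_personal_data: "pseudonymous" },
  access: { availability: "on-request", url: "https://datapass.meetkai.ai/registry/my-dataset/1.0.0" },
};

console.log(checkConditionals(example));
// [ 'synthetic.generation_method', 'synthetic.share_of_dataset', 'rights.consent_mechanism' ]
```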

Validate Your Card

Use our Validator tool to check your card against the schema, or run validation locally:

npm install ajv ajv-formats

import Ajv from "ajv/dist/2020";
import addFormats from "ajv-formats";

const ajv = new Ajv({ allErrors: true });
addFormats(ajv);

const schema = await fetch("https://datapass.meetkai.ai/schemas/llm-datacard/v1.0/schema.json").then((r) => r.json());

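const validate = ajv.compile(schema);

// yourDataCard is the parsed card object (for example, the result of JSON.parse).
const valid = validate(yourDataCard);

if (!valid) {
  // With allErrors enabled, validate.errors lists every failure, not just the first.
  console.log(validate.errors);
}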
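
The raw validate.errors array can be hard to scan. The sketch below shows one way to turn it into one line per problem, assuming the error objects carry the instancePath and message properties that Ajv v8 produces. formatErrors is a hypothetical helper, and sampleErrors is hand-written for illustration rather than real validator output.

```javascript
// Hypothetical helper: one readable line per Ajv error.
// Accepts null or undefined (Ajv leaves errors as null when validation passes).
function formatErrors(errors) {
  return (errors ?? []).map((e) => `${e.instancePath || "(root)"}: ${e.message}`);
}

// Hand-written sample in the shape of Ajv error objects.
const sampleErrors = [
  { instancePath: "/core", message: "must have required property 'id'" },
  { instancePath: "", message: "must have required property 'rights'" },
];

console.log(formatErrors(sampleErrors).join("\n"));
// /core: must have required property 'id'
// (root): must have required property 'rights'
```

In the validation snippet above, this would replace console.log(validate.errors) with console.log(formatErrors(validate.errors).join("\n")).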

Next Steps