Quickstart

This guide will help you create a valid LLM Data Card v1.0 in just a few minutes.

Minimal Valid Card

Here’s the smallest valid data card you can create:

{
  "schema_version": "llm-datacard/v1.0",
  "core": {
    "id": "my-dataset",
    "version": "1.0.0",
    "title": "My Dataset",
    "summary": "A brief description of what this dataset contains.",
    "maintainer": "Your Organization",
    "contact": "data@meetkai.ai"
  },
  "data": {
    "kind": "real",
    "modalities": ["text"],
    "languages": ["en"],
    "size": {
      "examples": 10000
    },
    "domains": ["news"],
    "record_format": "plain-text"
  },
  "rights": {
    "license": "CC-BY-4.0",
    "allows_commercial_use": true,
    "contains_personal_data": "none"
  },
  "provenance": {
    "source_types": ["web-scrape"]
  },
  "access": {
    "availability": "public-download",
    "url": "https://datapass.meetkai.ai/registry/my-dataset/1.0.0"
  }
}

Step-by-Step Guide

Set the schema version

Always start with "schema_version": "llm-datacard/v1.0".

Fill in core identity

The core section identifies your dataset:
- id: A machine-friendly slug (letters, numbers, dots, hyphens, underscores)
- version: Your version label (e.g., “1.0.0” or “2025-01-15”)
- title: Human-readable name
- summary: 1-3 sentence description
- maintainer: Who maintains this dataset
- contact: Email or URL for questions
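
As a quick pre-check before running full validation, the id rule above can be sketched as a regular expression. The pattern and the helper names (`isValidId`, `toSlug`) are assumptions derived from the description "letters, numbers, dots, hyphens, underscores"; the official schema may be stricter.

```javascript
// Assumed pattern, based only on the prose description of `id`.
// The official schema may impose additional limits (e.g. length).
const ID_PATTERN = /^[A-Za-z0-9._-]+$/;

// Returns true when `id` is a non-empty string of allowed characters.
function isValidId(id) {
  return typeof id === "string" && ID_PATTERN.test(id);
}

// Hypothetical helper: derive a candidate id from a human-readable title.
function toSlug(title) {
  return title
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9._-]+/g, "-") // collapse disallowed runs into one hyphen
    .replace(/^-+|-+$/g, "");       // drop leading and trailing hyphens
}

console.log(isValidId("my-dataset"));  // true
console.log(isValidId("My Dataset!")); // false
console.log(toSlug("My Dataset"));     // my-dataset
```

A lowercase slug is a style choice in this sketch, not a requirement stated by the guide.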

Describe the data

The data section describes what’s in your dataset:
- kind: “real”, “synthetic”, or “hybrid”
- modalities: Array of values from “text”, “speech”, “audio”, “image”, “video”, “code”, “multimodal”
- languages: Array of BCP-47 tags (e.g., “en”, “ar”, “ha-Latn-NG”)
- size.examples: Number of examples/records
- domains: Content domains (e.g., “news”, “social-media”, “health”)
- record_format: Structure of each example
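
One way to catch malformed language tags early is the built-in `Intl.getCanonicalLocales`, which throws a `RangeError` for tags that are not structurally valid BCP-47. The helper name `isWellFormedLanguageTag` is hypothetical, and this is only a sketch: a well-formed tag is not necessarily a registered language, and the schema may apply its own checks.

```javascript
// Returns true when `tag` is a structurally valid BCP-47 language tag.
// Structural validity only: the subtags are not checked against a registry.
function isWellFormedLanguageTag(tag) {
  if (typeof tag !== "string") return false;
  try {
    Intl.getCanonicalLocales(tag);
    return true;
  } catch (err) {
    if (err instanceof RangeError) return false;
    throw err; // unexpected failure: surface it
  }
}

for (const tag of ["en", "ar", "ha-Latn-NG", "en_US"]) {
  console.log(tag, isWellFormedLanguageTag(tag));
}
// en true
// ar true
// ha-Latn-NG true
// en_US false
```

The underscore form `en_US` is a common mistake carried over from POSIX locale names; BCP-47 uses hyphens.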

Specify rights

The rights section covers licensing:
- license: SPDX identifier preferred (e.g., “MIT”, “CC-BY-4.0”)
- allows_commercial_use: Boolean
- contains_personal_data: “none”, “de_minimis”, “pseudonymous”, or “direct”
- legal_basis: Required when personal data or child data is present (e.g., “explicit-consent”, “terms-of-service”, “publicly-available”)

Document provenance

The provenance section explains where data came from:
- source_types: Array of source types (e.g., “web-scrape”)

Define access

The access section explains how to get the dataset:
- availability: “public-download”, “restricted”, “on-request”, or “not-available”
- Include url or request_instructions as appropriate

Common Conditional Requirements

The schema enforces conditional validation rules: setting one field can make other fields required. The most common cases are:

| When… | You must also provide… |
| --- | --- |
| `data.kind` is “synthetic” or “hybrid” | `synthetic.generation_method` and `synthetic.share_of_dataset` |
| `rights.contains_personal_data` is not “none” | `rights.legal_basis` |
| `rights.child_data` is `true` | `rights.legal_basis` |
| `access.availability` is “restricted” or “on-request” | `access.request_instructions` or `access.url` |
| `access.availability` is “not-available” | `access.not_available_reason` |
| `data.has_human_annotations` is `true` | `data.label_types` |
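
The rules in this table can be sketched as a small pre-flight check that lists which conditionally required fields are missing. This is an illustration only, not the schema's own logic: the function name `checkConditionalRequirements` is hypothetical, it checks presence rather than value types, and it assumes a missing `contains_personal_data` does not trigger the `legal_basis` rule.

```javascript
// Returns the conditionally required fields that are absent from `card`.
// Presence check only; the real schema also validates types and values.
function checkConditionalRequirements(card) {
  const problems = [];
  const data = card.data ?? {};
  const rights = card.rights ?? {};
  const access = card.access ?? {};
  const synthetic = card.synthetic ?? {};
  const missing = (value) => value === undefined || value === null;

  if (data.kind === "synthetic" || data.kind === "hybrid") {
    if (missing(synthetic.generation_method)) problems.push("synthetic.generation_method");
    if (missing(synthetic.share_of_dataset)) problems.push("synthetic.share_of_dataset");
  }

  // Assumption: an absent contains_personal_data is left to base validation.
  const personal = rights.contains_personal_data;
  const hasPersonalData = !missing(personal) && personal !== "none";
  if ((hasPersonalData || rights.child_data === true) && missing(rights.legal_basis)) {
    problems.push("rights.legal_basis");
  }

  const gated = access.availability === "restricted" || access.availability === "on-request";
  if (gated && missing(access.request_instructions) && missing(access.url)) {
    problems.push("access.request_instructions or access.url");
  }
  if (access.availability === "not-available" && missing(access.not_available_reason)) {
    problems.push("access.not_available_reason");
  }
  if (data.has_human_annotations === true && missing(data.label_types)) {
    problems.push("data.label_types");
  }
  return problems;
}

// Illustrative draft card fragment (not a complete, valid card).
const draft = {
  data: { kind: "hybrid", has_human_annotations: true },
  rights: { contains_personal_data: "pseudonymous" },
  access: { availability: "public-download", url: "https://datapass.meetkai.ai/registry/my-dataset/1.0.0" },
};
console.log(checkConditionalRequirements(draft).join("\n"));
// synthetic.generation_method
// synthetic.share_of_dataset
// rights.legal_basis
// data.label_types
```

Schema validation remains the source of truth; a check like this is mainly useful for producing friendlier messages while drafting a card.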

Validate Your Card

Use our Validator tool to check your card against the schema, or run validation locally:

npm install ajv ajv-formats

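// Run as an ES module (.mjs or "type": "module") on Node 18+,
// which provides top-level await and a global fetch.
import { readFile } from "node:fs/promises";
import Ajv from "ajv/dist/2020"; // Ajv build for JSON Schema draft 2020-12
import addFormats from "ajv-formats";

// allErrors: report every violation instead of stopping at the first.
const ajv = new Ajv({ allErrors: true });
addFormats(ajv);

const schema = await fetch("https://datapass.meetkai.ai/schemas/llm-datacard/v1.0/schema.json").then((r) => r.json());

// Assumption: your card is saved as datacard.json in the current directory.
const yourDataCard = JSON.parse(await readFile("datacard.json", "utf8"));

const validate = ajv.compile(schema);
const valid = validate(yourDataCard);

if (!valid) {
  // Each error names the failing location (instancePath) and the reason (message).
  console.error(validate.errors);
}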
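
The raw error array can be hard to scan. A minimal sketch of a formatter follows, assuming the Ajv v8 error shape (`instancePath` and `message` fields). The function name `formatAjvErrors` is hypothetical, and the sample error is hand-written for illustration rather than real validator output.

```javascript
// Turns Ajv-style error objects into one readable line each.
// Accepts null/undefined (Ajv sets `errors` to null when validation passes).
function formatAjvErrors(errors) {
  return (errors ?? []).map((e) => {
    const where = e.instancePath || "(root)"; // empty path means the top-level object
    return `${where}: ${e.message}`;
  });
}

// Hand-written sample in the Ajv v8 error shape, for illustration only.
const sampleErrors = [
  {
    instancePath: "/rights",
    keyword: "required",
    params: { missingProperty: "legal_basis" },
    message: "must have required property 'legal_basis'",
  },
];

console.log(formatAjvErrors(sampleErrors).join("\n"));
// /rights: must have required property 'legal_basis'
```

In the validation script, this would be called as `formatAjvErrors(validate.errors)` inside the `if (!valid)` branch.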

Next Steps

- Read the Field Reference for detailed documentation
- See Examples for real-world data cards
- Learn about Validation Rules for conditional requirements