FAQ

General Questions

What is an LLM Data Card?

An LLM Data Card is a machine-readable document that describes a dataset used for training or evaluating large language models. It captures essential metadata about the dataset’s identity, contents, rights, provenance, and access information.
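A minimal card might look like the following. This is an illustrative sketch only: the section names follow the rest of this FAQ, but the top-level identity fields (id, version) and the exact set of required fields are assumptions here, so check the schema itself:

```json
{
  "id": "my-dataset",
  "version": "1.0.0",
  "data": {
    "kind": "web-crawl",
    "size": { "examples": 120000 }
  },
  "access": {
    "availability": "public-download",
    "url": "https://example.org/my-dataset"
  },
  "rights": { "contains_personal_data": "none" }
}
```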

Why use a data card?

Data cards help with:

  • Transparency: Clear documentation of what’s in a dataset
  • Reproducibility: Artifact pointers with checksums for exact reproduction
  • Compliance: Structured rights and consent information
  • Discovery: Machine-readable format enables search and filtering
  • Standardization: Common vocabulary across datasets

How strict is the schema?

Very strict. The schema uses additionalProperties: false everywhere, which means:

  • Only documented fields are allowed
  • Typos in field names will cause validation errors
  • You cannot add custom fields outside of the extensions section
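Concretely, every object in the schema follows a pattern along these lines (an illustrative fragment, not the actual schema text):

```json
{
  "type": "object",
  "properties": {
    "generation_method": { "type": "string" },
    "share_of_dataset": { "type": "number" }
  },
  "additionalProperties": false
}
```

With additionalProperties set to false, a misspelled key such as generation_methd is rejected outright rather than silently ignored.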

Technical Questions

Do I have to publish artifacts?

No. The artifacts section is optional. However, if your access.availability is public-download, you should provide at least one of:

  • access.url (a landing page or download URL), or
  • artifacts.base_uri, or
  • artifacts.files[].uri for each file
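For example, a public-download card could point at its files like this. The access and artifacts field names follow the list above; the per-file checksum key (sha256) is an assumption based on the schema's stated support for checksums:

```json
{
  "access": {
    "availability": "public-download",
    "url": "https://example.org/my-dataset"
  },
  "artifacts": {
    "base_uri": "https://example.org/my-dataset/v1/",
    "files": [
      { "uri": "train.jsonl", "sha256": "..." }
    ]
  }
}
```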

What if I don’t know the token count?

The data.size.tokens field is optional. Only data.size.examples is required. Provide what you know.

What BCP-47 tags should I use?

Use the most specific tag appropriate for your data:

  • Simple: en, ar, ha
  • With script: ha-Latn, sr-Cyrl
  • With region: en-US, ha-Latn-NG

For low-resource languages, also fill in language_details with:

  • autonym - The language’s name for itself
  • iso639_3 - Three-letter ISO code
  • glottocode - Glottolog identifier
  • Script and writing direction
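Putting that together, a low-resource language entry might look like this. The autonym, ISO 639-3 code, and glottocode shown are real values for Hausa, but the exact field names for the language tag, script, and direction are assumptions, not confirmed by the schema:

```json
{
  "language": "ha-Latn-NG",
  "language_details": {
    "autonym": "Harshen Hausa",
    "iso639_3": "hau",
    "glottocode": "haus1257",
    "script": "Latn",
    "direction": "ltr"
  }
}
```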

How do I document synthetic data?

If data.kind is “synthetic” or “hybrid”, you must include the synthetic section with at least:

```json
{
  "synthetic": {
    "generation_method": "prompted-dialogue",
    "share_of_dataset": 0.5
  }
}
```

What’s the difference between personal data levels?

| Level | Meaning |
| --- | --- |
| none | No personal data |
| de_minimis | Incidental personal data (e.g., names in news articles) |
| pseudonymous | Data linked to pseudonyms (usernames, IDs) |
| direct | Directly identifies individuals |

If not none, you must provide rights.consent_mechanism.
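For example, a card for forum data keyed to usernames might declare the following. The consent_mechanism value here is illustrative; check whether the schema constrains it to an enum:

```json
{
  "rights": {
    "contains_personal_data": "pseudonymous",
    "consent_mechanism": "platform-terms-of-service"
  }
}
```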

Validation Questions

Why am I getting “unknown property” errors?

The schema is strict. Check for:

  • Typos in field names
  • Fields in the wrong section
  • camelCase vs. snake_case (the schema uses snake_case)

Why is a field “required” that shouldn’t be?

The schema has conditional requirements. Common ones:

| When… | Then required… |
| --- | --- |
| data.kind is “synthetic” or “hybrid” | synthetic.generation_method, synthetic.share_of_dataset |
| rights.contains_personal_data != “none” | rights.consent_mechanism |
| data.has_human_annotations is true | data.label_types |
| access.availability is “not-available” | access.not_available_reason |
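The schema enforces these conditionals itself, but as a sketch of the logic, a hand-rolled pre-check could look like this. It is illustrative only, not the official validator; the type shape mirrors the field names in the table above:

```typescript
// Illustrative pre-check for the conditional requirements listed above.
// Not the official validator; covers only these four rules.
type Card = {
  data?: { kind?: string; has_human_annotations?: boolean; label_types?: string[] };
  synthetic?: { generation_method?: string; share_of_dataset?: number };
  rights?: { contains_personal_data?: string; consent_mechanism?: string };
  access?: { availability?: string; not_available_reason?: string };
};

function conditionalErrors(card: Card): string[] {
  const errors: string[] = [];
  // data.kind "synthetic"/"hybrid" => synthetic fields required
  if (["synthetic", "hybrid"].includes(card.data?.kind ?? "")) {
    if (card.synthetic?.generation_method === undefined)
      errors.push("synthetic.generation_method is required");
    if (card.synthetic?.share_of_dataset === undefined)
      errors.push("synthetic.share_of_dataset is required");
  }
  // personal data other than "none" => consent mechanism required
  if (card.rights?.contains_personal_data && card.rights.contains_personal_data !== "none") {
    if (!card.rights.consent_mechanism)
      errors.push("rights.consent_mechanism is required");
  }
  // human annotations => label types required
  if (card.data?.has_human_annotations === true && !card.data.label_types)
    errors.push("data.label_types is required");
  // not-available => reason required
  if (card.access?.availability === "not-available" && !card.access.not_available_reason)
    errors.push("access.not_available_reason is required");
  return errors;
}

console.log(conditionalErrors({ data: { kind: "synthetic" } }));
// → two "required" errors for the synthetic section
```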

Can I validate offline?

Yes! Use Ajv with JSON Schema Draft 2020-12:

```ts
import Ajv from "ajv/dist/2020";
import addFormats from "ajv-formats";
import schema from "./schema.json";
import card from "./card.json"; // the card you want to validate

const ajv = new Ajv({ allErrors: true });
addFormats(ajv);
const validate = ajv.compile(schema);

if (!validate(card)) {
  console.error(validate.errors);
}
```

Contributing Questions

How do I submit a data card?

  1. Fork the repository
  2. Add your card to registry/{dataset-id}/{version}/card.json
  3. Open a pull request
  4. CI will validate your card automatically

Can I add custom fields?

Yes, use the extensions section:

```json
{
  "extensions": {
    "my_org": {
      "custom_field": "value",
      "internal_id": "12345"
    }
  }
}
```

The extensions section allows any JSON object.

How do I contribute a record schema?

If your dataset uses record_format: "json-structured", you can contribute a schema for your record format:

  1. Add your schema to gallery/record-schemas/{name}/schema.json
  2. Include a meta.json with title, description, and compatibility info
  3. Open a pull request
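A meta.json for a record schema might look like this. This is an illustrative sketch: title and description come from the step above, but the shape of the compatibility information is an assumption:

```json
{
  "title": "Chat transcript records",
  "description": "One JSON object per conversation turn.",
  "compatibility": {
    "record_format": "json-structured"
  }
}
```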

How do I help translate?

See our Translation Guide for instructions on contributing translations for the documentation and UI.