FAQ

General Questions

What is an LLM Data Card?

An LLM Data Card is a machine-readable document that describes a dataset used for training or evaluating large language models. It captures essential metadata about the dataset’s identity, contents, rights, provenance, and access information.
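A minimal card might look like the following. This is an illustrative sketch only: the section names follow the rest of this FAQ, but the top-level identity fields (id, version) and the exact set of required fields are assumptions here, so check the schema itself:

```json
{
  "id": "my-dataset",
  "version": "1.0.0",
  "data": {
    "kind": "web-crawl",
    "size": { "examples": 120000 }
  },
  "access": {
    "availability": "public-download",
    "url": "https://example.org/my-dataset"
  },
  "rights": { "contains_personal_data": "none" }
}
```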

Why use a data card?

Data cards help with:

  • Transparency: Clear documentation of what’s in a dataset
  • Reproducibility: Artifact pointers with checksums for exact reproduction
  • Compliance: Structured rights and consent information
  • Discovery: Machine-readable format enables search and filtering
  • Standardization: Common vocabulary across datasets

How strict is the schema?

Very strict. The schema uses additionalProperties: false everywhere, which means:

  • Only documented fields are allowed
  • Typos in field names will cause validation errors
  • You cannot add custom fields outside of the extensions section
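Concretely, every object in the schema follows a pattern along these lines (an illustrative fragment, not the actual schema text):

```json
{
  "type": "object",
  "properties": {
    "generation_method": { "type": "string" },
    "share_of_dataset": { "type": "number" }
  },
  "additionalProperties": false
}
```

With additionalProperties set to false, a misspelled key such as generation_methd is rejected outright rather than silently ignored.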

Technical Questions

Do I have to publish artifacts?

No. The artifacts section is optional. However, if your access.availability is public-download, you should provide at least one of:

  • access.url (a landing page or download URL), or
  • artifacts.base_uri, or
  • artifacts.files[].uri for each file
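For example, a public-download card could point at its files like this. The access and artifacts field names follow the list above; the per-file checksum key (sha256) is an assumption based on the schema's stated support for checksums:

```json
{
  "access": {
    "availability": "public-download",
    "url": "https://example.org/my-dataset"
  },
  "artifacts": {
    "base_uri": "https://example.org/my-dataset/v1/",
    "files": [
      { "uri": "train.jsonl", "sha256": "..." }
    ]
  }
}
```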

What if I don’t know the token count?

The data.size.tokens field is optional. Only data.size.examples is required. Provide what you know.

What BCP-47 tags should I use?

Use the most specific tag appropriate for your data:

  • Simple: en, ar, ha
  • With script: ha-Latn, sr-Cyrl
  • With region: en-US, ha-Latn-NG

For low-resource languages, also fill in language_details with:

  • autonym - The language’s name for itself
  • iso639_3 - Three-letter ISO code
  • glottocode - Glottolog identifier
  • Script and writing direction
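Putting that together, a low-resource language entry might look like this. The autonym, ISO 639-3 code, and glottocode shown are real values for Hausa, but the exact field names for the language tag, script, and direction are assumptions, not confirmed by the schema:

```json
{
  "language": "ha-Latn-NG",
  "language_details": {
    "autonym": "Harshen Hausa",
    "iso639_3": "hau",
    "glottocode": "haus1257",
    "script": "Latn",
    "direction": "ltr"
  }
}
```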

How do I document synthetic data?

If data.kind is “synthetic” or “hybrid”, you must include the synthetic section with at least:

```json
{
  "synthetic": {
    "generation_method": "prompted-dialogue",
    "share_of_dataset": 0.5
  }
}
```

What’s the difference between personal data levels?

| Level | Meaning |
| --- | --- |
| none | No personal data |
| de_minimis | Incidental personal data (e.g., names in news articles) |
| pseudonymous | Data linked to pseudonyms (usernames, IDs) |
| direct | Directly identifies individuals |

If not none, you must provide rights.consent_mechanism.
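For example, a card for forum data keyed to usernames might declare the following. The consent_mechanism value here is illustrative; check whether the schema constrains it to an enum:

```json
{
  "rights": {
    "contains_personal_data": "pseudonymous",
    "consent_mechanism": "platform-terms-of-service"
  }
}
```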

Validation Questions

Why am I getting “unknown property” errors?

The schema is strict. Check for:

  • Typos in field names
  • Fields in the wrong section
  • camelCase vs. snake_case (the schema uses snake_case)

Why is a field “required” that shouldn’t be?

The schema has conditional requirements. Common ones:

| When… | Then required… |
| --- | --- |
| data.kind is “synthetic” or “hybrid” | synthetic.generation_method, synthetic.share_of_dataset |
| rights.contains_personal_data != “none” | rights.consent_mechanism |
| data.has_human_annotations is true | data.label_types |
| access.availability is “not-available” | access.not_available_reason |
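The schema enforces these conditionals itself, but as a sketch of the logic, a hand-rolled pre-check could look like this. It is illustrative only, not the official validator; the type shape mirrors the field names in the table above:

```typescript
// Illustrative pre-check for the conditional requirements listed above.
// Not the official validator; covers only these four rules.
type Card = {
  data?: { kind?: string; has_human_annotations?: boolean; label_types?: string[] };
  synthetic?: { generation_method?: string; share_of_dataset?: number };
  rights?: { contains_personal_data?: string; consent_mechanism?: string };
  access?: { availability?: string; not_available_reason?: string };
};

function conditionalErrors(card: Card): string[] {
  const errors: string[] = [];
  // data.kind "synthetic"/"hybrid" => synthetic fields required
  if (["synthetic", "hybrid"].includes(card.data?.kind ?? "")) {
    if (card.synthetic?.generation_method === undefined)
      errors.push("synthetic.generation_method is required");
    if (card.synthetic?.share_of_dataset === undefined)
      errors.push("synthetic.share_of_dataset is required");
  }
  // personal data other than "none" => consent mechanism required
  if (card.rights?.contains_personal_data && card.rights.contains_personal_data !== "none") {
    if (!card.rights.consent_mechanism)
      errors.push("rights.consent_mechanism is required");
  }
  // human annotations => label types required
  if (card.data?.has_human_annotations === true && !card.data.label_types)
    errors.push("data.label_types is required");
  // not-available => reason required
  if (card.access?.availability === "not-available" && !card.access.not_available_reason)
    errors.push("access.not_available_reason is required");
  return errors;
}

console.log(conditionalErrors({ data: { kind: "synthetic" } }));
// → two "required" errors for the synthetic section
```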

Can I validate offline?

Yes! Use Ajv with JSON Schema Draft 2020-12:

```ts
import Ajv from "ajv/dist/2020";
import addFormats from "ajv-formats";
import schema from "./schema.json";
import card from "./card.json"; // the card you want to validate

const ajv = new Ajv({ allErrors: true });
addFormats(ajv);
const validate = ajv.compile(schema);

if (!validate(card)) {
  console.error(validate.errors);
}
```

Contributing Questions

How do I submit a data card?

  1. Fork the repository
  2. Add your card to registry/{dataset-id}/{version}/card.json
  3. Open a pull request
  4. CI will validate your card automatically

Can I add custom fields?

Yes, use the extensions section:

```json
{
  "extensions": {
    "my_org": {
      "custom_field": "value",
      "internal_id": "12345"
    }
  }
}
```

The extensions section allows any JSON object.

How do I contribute a record schema?

If your dataset uses record_format: "json-structured", you can contribute a schema for your record format:

  1. Add your schema to gallery/record-schemas/{name}/schema.json
  2. Include a meta.json with title, description, and compatibility info
  3. Open a pull request
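A meta.json for a record schema might look like this. This is an illustrative sketch: title and description come from the step above, but the shape of the compatibility information is an assumption:

```json
{
  "title": "Chat transcript records",
  "description": "One JSON object per conversation turn.",
  "compatibility": {
    "record_format": "json-structured"
  }
}
```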

How do I help translate?

See our Translation Guide for instructions on contributing translations for the documentation and UI.