# FAQ

## General Questions

### What is an LLM Data Card?
An LLM Data Card is a machine-readable document that describes a dataset used for training or evaluating large language models. It captures essential metadata about the dataset’s identity, contents, rights, provenance, and access information.
### Why use a data card?
Data cards help with:
- Transparency: Clear documentation of what’s in a dataset
- Reproducibility: Artifact pointers with checksums for exact reproduction
- Compliance: Structured rights and consent information
- Discovery: Machine-readable format enables search and filtering
- Standardization: Common vocabulary across datasets
### How strict is the schema?

Very strict. The schema uses `additionalProperties: false` everywhere, which means:
- Only documented fields are allowed
- Typos in field names will cause validation errors
- You cannot add custom fields outside of the `extensions` section
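As a simplified sketch of what `additionalProperties: false` does (the field names here are illustrative, not taken from the real schema), any key not listed under `properties` fails validation:

```json
{
  "type": "object",
  "properties": {
    "name": { "type": "string" }
  },
  "additionalProperties": false
}
```

Against this fragment, `{"name": "x"}` validates, but `{"name": "x", "nmae": "y"}` is rejected, which is why typos surface as "unknown property" errors rather than being silently ignored.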
## Technical Questions

### Do I have to publish artifacts?

No. The `artifacts` section is optional. However, if your `access.availability` is `public-download`, you should provide at least one of:
- `access.url` (a landing page or download URL), or
- `artifacts.base_uri`, or
- `artifacts.files[].uri` for each file
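For example, a publicly downloadable dataset's card might declare (the URL and values here are illustrative):

```json
{
  "access": {
    "availability": "public-download",
    "url": "https://example.org/datasets/my-dataset"
  }
}
```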
### What if I don’t know the token count?

The `data.size.tokens` field is optional. Only `data.size.examples` is required. Provide what you know.
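A minimal size declaration might look like this (counts are illustrative); `tokens` can simply be omitted if unknown:

```json
{
  "data": {
    "size": {
      "examples": 120000,
      "tokens": 45000000
    }
  }
}
```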
### What BCP-47 tags should I use?
Use the most specific tag appropriate for your data:
- Simple: `en`, `ar`, `ha`
- With script: `ha-Latn`, `sr-Cyrl`
- With region: `en-US`, `ha-Latn-NG`
For low-resource languages, also fill in `language_details` with:

- `autonym` - The language’s name in itself
- `iso639_3` - Three-letter ISO code
- `glottocode` - Glottolog identifier
- Script and writing direction
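A sketch of what this could look like for Hausa written in Latin script in Nigeria. Note that the exact key names for the tag, script, and direction fields are assumptions here, not confirmed by the schema:

```json
{
  "language": ["ha-Latn-NG"],
  "language_details": [
    {
      "tag": "ha-Latn-NG",
      "autonym": "Harshen Hausa",
      "iso639_3": "hau",
      "glottocode": "haus1257",
      "script": "Latn",
      "direction": "ltr"
    }
  ]
}
```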
### How do I document synthetic data?

If `data.kind` is `"synthetic"` or `"hybrid"`, you must include the `synthetic` section with at least:
```json
{
  "synthetic": {
    "generation_method": "prompted-dialogue",
    "share_of_dataset": 0.5
  }
}
```
### What’s the difference between personal data levels?

| Level | Meaning |
|---|---|
| `none` | No personal data |
| `de_minimis` | Incidental personal data (e.g., names in news articles) |
| `pseudonymous` | Data linked to pseudonyms (usernames, IDs) |
| `direct` | Directly identifies individuals |
If not `none`, you must provide `rights.consent_mechanism`.
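For instance, a card for a dataset of forum posts keyed by username might declare (the `consent_mechanism` value is illustrative; check the schema's enum, if any, for accepted values):

```json
{
  "rights": {
    "contains_personal_data": "pseudonymous",
    "consent_mechanism": "platform-terms-of-service"
  }
}
```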
## Validation Questions

### Why am I getting “unknown property” errors?
The schema is strict. Check for:
- Typos in field names
- Fields in the wrong section
- camelCase vs. snake_case (the schema uses snake_case)
### Why is a field “required” that shouldn’t be?
The schema has conditional requirements. Common ones:
| When… | Then required… |
|---|---|
| `data.kind` is `"synthetic"` or `"hybrid"` | `synthetic.generation_method`, `synthetic.share_of_dataset` |
| `rights.contains_personal_data` != `"none"` | `rights.consent_mechanism` |
| `data.has_human_annotations` is `true` | `data.label_types` |
| `access.availability` is `"not-available"` | `access.not_available_reason` |
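As an example of the third rule, a card for a human-labeled dataset would pair the flag with its label types (the label names here are illustrative):

```json
{
  "data": {
    "has_human_annotations": true,
    "label_types": ["sentiment", "toxicity"]
  }
}
```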
### Can I validate offline?
Yes! Use Ajv with JSON Schema Draft 2020-12:
```js
import Ajv from "ajv/dist/2020";
import addFormats from "ajv-formats";
import schema from "./schema.json";
import card from "./card.json";

const ajv = new Ajv({ allErrors: true });
addFormats(ajv);
const validate = ajv.compile(schema);

// Validate a card and report every error found
const valid = validate(card);
if (!valid) console.error(validate.errors);
```
## Contributing Questions

### How do I submit a data card?
1. Fork the repository
2. Add your card to `registry/{dataset-id}/{version}/card.json`
3. Open a pull request
4. CI will validate your card automatically
### Can I add custom fields?

Yes, use the `extensions` section:
```json
{
  "extensions": {
    "my_org": {
      "custom_field": "value",
      "internal_id": "12345"
    }
  }
}
```
The `extensions` section allows any JSON object.
### How do I contribute a record schema?

If your dataset uses `record_format: "json-structured"`, you can contribute a schema for your record format:
1. Add your schema to `gallery/record-schemas/{name}/schema.json`
2. Include a `meta.json` with title, description, and compatibility info
3. Open a pull request
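A `meta.json` might look like the sketch below. Only title, description, and compatibility info are mentioned above; the exact key names and the shape of the compatibility field are assumptions:

```json
{
  "title": "Chat transcript records",
  "description": "One JSON object per conversation turn, with role and text fields",
  "compatibility": ["json-structured"]
}
```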
### How do I help translate?
See our Translation Guide for instructions on contributing translations for the documentation and UI.