Multilingual Dataset
This example shows how to document a multilingual dataset with 10 languages, including primary language designation and detailed metadata.
When to Use This Pattern
Use this structure when:
- Your dataset contains multiple languages
- You need to indicate primary vs. secondary languages
- You want to document language-specific details
Complete Example
{
  "schema_version": "llm-datacard/v1.0",
  "core": {
    "id": "multilingual-qa",
    "version": "1.0.0",
    "title": "Multilingual QA Dataset",
    "summary": "A hybrid question-answering dataset combining real-world questions with synthetic answers, covering 10 languages with emphasis on factual knowledge and reading comprehension.",
    "maintainer": "Global AI Research Consortium",
    "contact": "datasets@globalairc.org",
    "preferred_citation": "Global AI Research Consortium (2025). Multilingual QA Dataset. DataPass Registry."
  },
  "data": {
    "kind": "hybrid",
    "modalities": ["text"],
    "languages": ["en", "es", "fr", "de", "zh-Hans", "ja", "ar", "pt", "ru", "hi"],
    "primary_languages": ["en", "es", "zh-Hans"],
    "size": {
      "examples": 250000,
      "tokens": 45000000,
      "tokenizer": "cl100k_base"
    },
    "domains": ["general", "scientific", "technical"],
    "structures": ["qa-pairs"],
    "task_types": ["supervised-finetuning"],
    "record_format": "json-structured",
    "has_human_annotations": true,
    "label_types": ["classification-labels", "ratings"]
  },
  "rights": {
    "license": "CC-BY-NC-SA-4.0",
    "license_url": "https://creativecommons.org/licenses/by-nc-sa/4.0/",
    "attribution_required": true,
    "allows_commercial_use": false,
    "contains_personal_data": "none",
    "restricted_uses": ["Commercial deployment without license agreement"]
  },
  "provenance": {
    "source_types": ["crowdsourced", "synthetic-generation"],
    "geography": { "scope": "global" },
    "collection_start_date": "2024-01-01",
    "collection_end_date": "2024-10-15",
    "collection_notes": "Questions were crowdsourced from native speakers. A subset of answers was generated using GPT-4 and validated by human annotators."
  },
  "access": {
    "availability": "restricted",
    "url": "https://datapass.meetkai.ai/registry/multilingual-qa/1.0.0",
    "terms_url": "https://datapass.meetkai.ai/registry/multilingual-qa/1.0.0/terms",
    "request_instructions": "Submit a data access request form at the dataset URL. Academic researchers are typically approved within 48 hours. Commercial entities require a separate license agreement."
  },
  "use": {
    "intended_uses": [
      "Multilingual QA system training",
      "Reading comprehension research",
      "Cross-lingual transfer learning"
    ],
    "out_of_scope_uses": ["Medical or legal advice systems", "Production systems without human oversight"]
  },
  "governance": {
    "review_status": "audited",
    "last_reviewed": "2024-12-01",
    "documentation_url": "https://datapass.meetkai.ai/registry/multilingual-qa/1.0.0/docs"
  },
  "safety": {
    "content_risk_level": "low",
    "known_risky_categories": [],
    "mitigations": "All content reviewed by human annotators. Automated toxicity filtering applied during data processing."
  }
}
Key Multilingual Fields
Language Codes
Use BCP-47 language tags:
"languages": ["en", "es", "fr", "de", "zh-Hans", "ja", "ar", "pt", "ru", "hi"]
| Code | Language | Script Variant |
|---|---|---|
| en | English | Default (Latin) |
| es | Spanish | Default (Latin) |
| zh-Hans | Chinese | Simplified |
| zh-Hant | Chinese | Traditional |
| ja | Japanese | Default (mixed) |
| ar | Arabic | Default (Arabic) |
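A quick way to catch malformed tags before publishing a card is a shape check. The sketch below is only an approximation: it assumes tags follow the conventional `language[-Script][-Region]` casing used on this page and does not implement the full BCP-47 grammar or consult the subtag registry. The names `TAG_PATTERN` and `looks_like_language_tag` are hypothetical helpers, not part of any DataPass tooling.

```python
import re

# Simplified shape: 2-3 letter language, optional 4-letter script,
# optional 2-letter or 3-digit region. Assumption: this covers only the
# common subset of BCP-47 shown on this page, with conventional casing.
TAG_PATTERN = re.compile(r"[a-z]{2,3}(-[A-Z][a-z]{3})?(-([A-Z]{2}|[0-9]{3}))?")


def looks_like_language_tag(tag: str) -> bool:
    """Return True if tag matches the simplified language[-Script][-Region] shape."""
    return TAG_PATTERN.fullmatch(tag) is not None


languages = ["en", "es", "fr", "de", "zh-Hans", "ja", "ar", "pt", "ru", "hi"]

# Collect any tags that do not match the simplified shape.
malformed = [tag for tag in languages if not looks_like_language_tag(tag)]
print(malformed)
```

A tag that passes this check is not guaranteed to be registered; it only has a plausible shape. Full validation would need the IANA subtag registry.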
Primary Languages
When languages are unevenly represented in your dataset, specify the primary ones:
"primary_languages": ["en", "es", "zh-Hans"]
This indicates that English, Spanish, and Simplified Chinese are the main languages and that the remaining seven are secondary.
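In the example, every primary language also appears in `languages`. Assuming that is the intended relationship (the text here does not state it as a schema rule), a small consistency check might look like the following. The function name `undeclared_primary_languages` is hypothetical.

```python
def undeclared_primary_languages(data: dict) -> list:
    """Return primary language tags that are missing from the languages list.

    Assumption: primary_languages is meant to be a subset of languages.
    """
    declared = set(data.get("languages", []))
    return [tag for tag in data.get("primary_languages", []) if tag not in declared]


# The "data" section values from the example on this page.
data = {
    "languages": ["en", "es", "fr", "de", "zh-Hans", "ja", "ar", "pt", "ru", "hi"],
    "primary_languages": ["en", "es", "zh-Hans"],
}

missing = undeclared_primary_languages(data)
print(missing)
```

An empty result means every primary language is also declared in `languages`.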
Language Distribution (Advanced)
For a detailed per-language breakdown, use the language_details object (omitted from the complete example above):
"language_details": {
  "en": {
    "examples": 50000,
    "share": 0.20
  },
  "es": {
    "examples": 40000,
    "share": 0.16
  }
}
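The shares above agree with the example's total of 250,000 examples (50,000 / 250,000 = 0.20 and 40,000 / 250,000 = 0.16). Assuming `share` is meant to be each language's fraction of `size.examples`, which the numbers suggest but the text does not state, a sketch of a consistency check could be written as follows. The function name and the tolerance value are illustrative choices.

```python
def share_mismatches(language_details: dict, total_examples: int,
                     tolerance: float = 0.005) -> dict:
    """Return {tag: expected_share} for entries whose share disagrees with examples.

    Assumption: share == examples / total_examples, within a small tolerance.
    """
    mismatches = {}
    for tag, details in language_details.items():
        expected = details["examples"] / total_examples
        if abs(expected - details["share"]) > tolerance:
            mismatches[tag] = round(expected, 4)
    return mismatches


language_details = {
    "en": {"examples": 50000, "share": 0.20},
    "es": {"examples": 40000, "share": 0.16},
}

# 250000 is size.examples from the complete example.
print(share_mismatches(language_details, total_examples=250000))
```

The tolerance leaves room for shares that were rounded to two decimal places when the card was written.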
Translation Pairs
If your dataset contains translation pairs, add translation-pairs to structures:
"structures": ["translation-pairs"]
Human Annotations
This dataset includes human annotations:
"has_human_annotations": true,
"label_types": ["classification-labels", "ratings"]
When has_human_annotations is true, you must also list the included label types in label_types.
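The rule above can be expressed as a simple conditional check. This is a minimal sketch, not the registry's validator; the function name `annotation_errors` and the error wording are hypothetical.

```python
def annotation_errors(data: dict) -> list:
    """Return error messages if human annotations are declared without label types."""
    errors = []
    # label_types must be present and non-empty when annotations are declared.
    if data.get("has_human_annotations") is True and not data.get("label_types"):
        errors.append(
            "label_types must be non-empty when has_human_annotations is true"
        )
    return errors


valid = {
    "has_human_annotations": True,
    "label_types": ["classification-labels", "ratings"],
}
invalid = {"has_human_annotations": True, "label_types": []}

print(annotation_errors(valid))
print(annotation_errors(invalid))
```

Cards that set has_human_annotations to false pass this check whether or not label_types is present.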
Related Guides
- Multilingual Datasets Guide - Detailed BCP-47 usage
- Data Reference - All data section fields
Try It
- Open in Builder - Create your own multilingual card
- Validate this example - Check schema compliance
- View in Registry - See the full registry entry