Multilingual Datasets
This guide explains how to properly document languages in your data card using BCP-47 tags, primary language designation, and detailed language breakdowns.
Language Codes (BCP-47)
The schema uses BCP-47 language tags. These are standardized codes that identify languages and their variants.
Basic Language Tags
For most languages, use the two-letter ISO 639-1 code:
| Code | Language |
|---|---|
| en | English |
| es | Spanish |
| fr | French |
| de | German |
| ja | Japanese |
| ar | Arabic |
| pt | Portuguese |
| ru | Russian |
| hi | Hindi |
| ko | Korean |
"languages": ["en", "es", "fr", "de"]
Script Variants
Some languages are written in more than one script. Add a four-letter ISO 15924 script subtag to distinguish them:
| Code | Language | Script |
|---|---|---|
| zh-Hans | Chinese | Simplified |
| zh-Hant | Chinese | Traditional |
| sr-Cyrl | Serbian | Cyrillic |
| sr-Latn | Serbian | Latin |
"languages": ["zh-Hans", "zh-Hant"]
Regional Variants
For region-specific variants, add a two-letter ISO 3166-1 region subtag:
| Code | Variant |
|---|---|
| en-US | American English |
| en-GB | British English |
| pt-BR | Brazilian Portuguese |
| pt-PT | European Portuguese |
| es-MX | Mexican Spanish |
| es-ES | European Spanish |
"languages": ["en-US", "en-GB", "pt-BR", "pt-PT"]
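The language, script, and region subtags described above can be pulled apart with a small parser. The following is a minimal sketch, not part of the schema tooling: the function name split_tag is hypothetical, and the pattern only covers the language[-script][-region] shapes used in this guide. Full BCP-47 also allows variant, extension, and private-use subtags, which this sketch rejects.

```python
import re

# Simplified pattern (an assumption for illustration): a 2-3 letter language,
# an optional 4-letter script, and an optional region (2 letters or 3 digits).
_TAG_RE = re.compile(
    r"(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"
)


def split_tag(tag: str) -> dict:
    """Split a simple BCP-47 tag into language, script, and region subtags."""
    match = _TAG_RE.fullmatch(tag)
    if match is None:
        raise ValueError(f"not a simple language[-script][-region] tag: {tag!r}")
    parts = match.groupdict()
    # Apply the conventional casing: language lower, script Title, region UPPER.
    return {
        "language": parts["language"].lower(),
        "script": parts["script"].title() if parts["script"] else None,
        "region": parts["region"].upper() if parts["region"] else None,
    }


print(split_tag("zh-Hans"))
print(split_tag("pt-BR"))
print(split_tag("en"))
```

A tag such as sr-Latn-RS, which carries both a script and a region, is also accepted by this pattern.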
Less Common Languages
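For languages without a two-letter code, use the ISO 639-3 three-letter code. BCP-47 prefers the two-letter code wherever one exists, and the languages below also have ISO 639-1 codes (ha, yo, ig, am, sw), so confirm which form your tooling expects: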
| Code | Language |
|---|---|
| hau | Hausa |
| yor | Yoruba |
| ibo | Igbo |
| amh | Amharic |
| swa | Swahili |
"languages": ["hau", "yor", "ibo"]
Primary Languages
When your dataset has unequal language distribution, use primary_languages to indicate which languages are the focus:
{
  "data": {
    "languages": ["en", "es", "fr", "de", "it", "pt", "nl", "pl", "ru", "ja"],
    "primary_languages": ["en", "es", "fr"]
  }
}
This indicates that while the dataset contains 10 languages, English, Spanish, and French are the primary focus.
When to Use Primary Languages
Use primary_languages when:
- Some languages have significantly more data than others
- The dataset was designed around specific languages
- Quality or coverage varies by language
- You want to set user expectations about which languages are well covered
Language Details
For precise language breakdowns, use the language_details object:
{
  "data": {
    "languages": ["en", "es", "fr", "de"],
    "language_details": {
      "en": {
        "examples": 50000,
        "share": 0.4,
        "notes": "Includes US and UK English variants"
      },
      "es": {
        "examples": 35000,
        "share": 0.28
      },
      "fr": {
        "examples": 25000,
        "share": 0.2
      },
      "de": {
        "examples": 15000,
        "share": 0.12
      }
    }
  }
}
Language Details Fields
| Field | Type | Description |
|---|---|---|
| examples | integer | Number of examples in this language |
| tokens | integer | Token count for this language |
| share | number | Fraction of dataset (0.0 to 1.0) |
| notes | string | Additional language-specific notes |
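The share field can be derived from the examples counts rather than typed by hand. The sketch below assumes that share is measured by example count (the guide does not say whether examples or tokens define it), and the helper name add_shares is hypothetical.

```python
def add_shares(language_details: dict) -> dict:
    """Return a copy of language_details with share = examples / total examples."""
    total = sum(entry["examples"] for entry in language_details.values())
    if total <= 0:
        raise ValueError("total example count must be positive")
    return {
        # Round to 4 decimal places (an arbitrary choice) to keep the card readable.
        tag: {**entry, "share": round(entry["examples"] / total, 4)}
        for tag, entry in language_details.items()
    }


details = {
    "en": {"examples": 50000},
    "es": {"examples": 35000},
    "fr": {"examples": 25000},
    "de": {"examples": 15000},
}
print(add_shares(details))
```

With the counts from the example above (125,000 examples in total), this reproduces the shares 0.4, 0.28, 0.2, and 0.12.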
Shares Must Sum to 1.0
When you provide share values, they should sum to approximately 1.0 across all languages; small deviations caused by rounding are expected:
"language_details": {
  "en": { "share": 0.50 },
  "es": { "share": 0.30 },
  "fr": { "share": 0.20 }
}
Here the shares total 0.50 + 0.30 + 0.20 = 1.0.
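A check like this is easy to automate before publishing a data card. The sketch below is one possible approach: the function name shares_sum_to_one and the default tolerance of 0.01 are assumptions, since the guide only says "approximately" and does not define how much deviation is acceptable.

```python
import math


def shares_sum_to_one(language_details: dict, tolerance: float = 0.01) -> bool:
    """Check that the share values present sum to 1.0 within a tolerance."""
    shares = [e["share"] for e in language_details.values() if "share" in e]
    if not shares:
        return True  # nothing to check when no shares are given
    # fsum avoids accumulating floating-point error across many languages.
    return math.isclose(math.fsum(shares), 1.0, abs_tol=tolerance)


print(shares_sum_to_one({
    "en": {"share": 0.50},
    "es": {"share": 0.30},
    "fr": {"share": 0.20},
}))
```

An absolute tolerance is used rather than exact equality because shares rounded to two decimal places rarely add up to exactly 1.0.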
Translation Pairs
If your dataset contains parallel text (translations), add translation-pairs to structures:
{
  "data": {
    "languages": ["en", "fr"],
    "structures": ["translation-pairs"]
  }
}
Documenting Translation Direction
Use provenance.collection_notes to clarify the translation direction where relevant:
{
  "data": {
    "languages": ["en", "fr", "de"],
    "structures": ["translation-pairs"]
  },
  "provenance": {
    "collection_notes": "All translations are from English to French and German. English is the source language."
  }
}
Complete Examples
Monolingual Dataset
{
  "data": {
    "languages": ["en"],
    "modalities": ["text"]
  }
}
Bilingual Translation Dataset
{
  "data": {
    "languages": ["en", "fr"],
    "structures": ["translation-pairs"],
    "modalities": ["text"]
  }
}
Multilingual with Details
{
  "data": {
    "languages": ["en", "es", "fr", "de", "zh-Hans", "ja"],
    "primary_languages": ["en", "es"],
    "language_details": {
      "en": { "examples": 100000, "share": 0.4 },
      "es": { "examples": 75000, "share": 0.3 },
      "fr": { "examples": 30000, "share": 0.12 },
      "de": { "examples": 25000, "share": 0.1 },
      "zh-Hans": { "examples": 12500, "share": 0.05 },
      "ja": { "examples": 7500, "share": 0.03 }
    },
    "modalities": ["text"],
    "domains": ["general"]
  }
}
Low-Resource Languages
{
  "data": {
    "languages": ["hau", "yor", "ibo"],
    "modalities": ["text"],
    "domains": ["news"],
    "language_details": {
      "hau": { "examples": 50000, "notes": "Nigerian Hausa" },
      "yor": { "examples": 30000, "notes": "Standard Yoruba" },
      "ibo": { "examples": 25000, "notes": "Central Igbo dialect" }
    }
  }
}
Validation Rules
The schema enforces these language-related rules:
| Rule | Trigger | Requirement |
|---|---|---|
| Translation pairs | structures contains translation-pairs | At least 2 languages required |
| Primary subset | primary_languages specified | Must be subset of languages |
| Language details keys | language_details specified | Keys must exist in languages |
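The three rules in the table can be checked locally before running the real schema validator. The sketch below is illustrative only: check_language_rules is a hypothetical helper that takes the contents of the data object, and its error messages are invented for this example rather than taken from the schema's actual output.

```python
from typing import List


def check_language_rules(data: dict) -> List[str]:
    """Return a list of rule violations for the language fields of a data object."""
    errors = []
    languages = set(data.get("languages", []))

    # Rule 1: translation pairs need at least two languages.
    if "translation-pairs" in data.get("structures", []) and len(languages) < 2:
        errors.append("translation-pairs requires at least 2 languages")

    # Rule 2: primary_languages must be a subset of languages.
    extra_primary = sorted(set(data.get("primary_languages", [])) - languages)
    if extra_primary:
        errors.append(f"primary_languages not in languages: {extra_primary}")

    # Rule 3: every language_details key must appear in languages.
    extra_details = sorted(set(data.get("language_details", {})) - languages)
    if extra_details:
        errors.append(f"language_details keys not in languages: {extra_details}")

    return errors


print(check_language_rules({
    "languages": ["en"],
    "structures": ["translation-pairs"],
    "primary_languages": ["en", "fr"],
}))
```

Returning all violations as a list, instead of raising on the first one, lets a card author fix every problem in a single pass.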
Finding Language Codes
Resources for finding the correct BCP-47 tag:
- IANA Language Subtag Registry - Official registry
- Language Subtag Lookup - Interactive search tool
- ISO 639-3 Code Tables - Three-letter codes
Related Documentation
- Data Reference - All data section fields
- Multilingual Example - Complete multilingual data card
- Conditional Rules - All if/then requirements