Multilingual Datasets

This guide explains how to properly document languages in your data card using BCP-47 tags, primary language designation, and detailed language breakdowns.

Language Codes (BCP-47)

The schema uses BCP-47 language tags. These are standardized codes that identify languages and their variants.

Basic Language Tags

For most languages, use the two-letter ISO 639-1 code:

Code	Language
`en`	English
`es`	Spanish
`fr`	French
`de`	German
`ja`	Japanese
`ar`	Arabic
`pt`	Portuguese
`ru`	Russian
`hi`	Hindi
`ko`	Korean

"languages": ["en", "es", "fr", "de"]

Script Variants

Some languages are written in more than one script. Add a script subtag to distinguish them:

Code	Language	Script
`zh-Hans`	Chinese	Simplified
`zh-Hant`	Chinese	Traditional
`sr-Cyrl`	Serbian	Cyrillic
`sr-Latn`	Serbian	Latin

"languages": ["zh-Hans", "zh-Hant"]

Regional Variants

For region-specific variations, add a region subtag:

Code	Variant
`en-US`	American English
`en-GB`	British English
`pt-BR`	Brazilian Portuguese
`pt-PT`	European Portuguese
`es-MX`	Mexican Spanish
`es-ES`	European Spanish

"languages": ["en-US", "en-GB", "pt-BR", "pt-PT"]

Less Common Languages

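For languages without a two-letter code, use the ISO 639-3 three-letter code. BCP-47 uses the shortest available code, so a language that has an ISO 639-1 code must use it: Hausa is `ha` (not `hau`), Yoruba is `yo`, Igbo is `ig`, Amharic is `am`, and Swahili is `sw`.

Code	Language
`yue`	Cantonese
`ceb`	Cebuano
`fil`	Filipino
`haw`	Hawaiian
`pcm`	Nigerian Pidgin

"languages": ["yue", "ceb", "fil"]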

Primary Languages

When your dataset has unequal language distribution, use primary_languages to indicate which languages are the focus:

{
  "data": {
    "languages": ["en", "es", "fr", "de", "it", "pt", "nl", "pl", "ru", "ja"],
    "primary_languages": ["en", "es", "fr"]
  }
}

This indicates that while the dataset contains 10 languages, English, Spanish, and French are the primary focus.

When to Use Primary Languages

Use primary_languages when:

Some languages have significantly more data than others
The dataset was designed around specific languages
Quality or coverage varies by language
You want users to know up front which languages the dataset covers best

Language Details

For precise language breakdowns, use the language_details object:

{
  "data": {
    "languages": ["en", "es", "fr", "de"],
    "language_details": {
      "en": {
        "examples": 50000,
        "share": 0.4,
        "notes": "Includes US and UK English variants"
      },
      "es": {
        "examples": 35000,
        "share": 0.28
      },
      "fr": {
        "examples": 25000,
        "share": 0.2
      },
      "de": {
        "examples": 15000,
        "share": 0.12
      }
    }
  }
}

Language Details Fields

Field	Type	Description
`examples`	integer	Number of examples in this language
`tokens`	integer	Token count for this language
`share`	number	Fraction of dataset (0.0 to 1.0)
`notes`	string	Additional language-specific notes
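
If you already have per-language example counts, the share values can be derived from them rather than typed by hand. The following is a minimal sketch; the helper name shares_from_examples and the choice to round to two decimals are assumptions for illustration, not part of the schema.

```python
def shares_from_examples(example_counts: dict[str, int], digits: int = 2) -> dict[str, float]:
    """Convert per-language example counts into fractions of the whole dataset."""
    total = sum(example_counts.values())
    if total <= 0:
        raise ValueError("example counts must sum to a positive number")
    # Rounding is an illustrative choice; rounded shares may not sum to exactly 1.0.
    return {lang: round(count / total, digits) for lang, count in example_counts.items()}


# Counts taken from the language_details example above.
counts = {"en": 50000, "es": 35000, "fr": 25000, "de": 15000}
shares = shares_from_examples(counts)
print(shares)
```

Because the values are rounded, the resulting shares can drift slightly from a total of 1.0 on datasets with many languages.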

Shares Must Sum to 1.0

When using share values, they should sum to approximately 1.0. In this example, 0.50 + 0.30 + 0.20 = 1.0:

"language_details": {
  "en": { "share": 0.50 },
  "es": { "share": 0.30 },
  "fr": { "share": 0.20 }
}
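
A check along these lines can catch shares that drift too far from 1.0 before a card is published. This is a sketch only: the function name shares_sum_to_one and the 0.01 tolerance are assumptions, since this guide does not state how close "approximately" has to be.

```python
import math


def shares_sum_to_one(language_details: dict, tolerance: float = 0.01) -> bool:
    """Return True if the share values present sum to 1.0 within the tolerance."""
    shares = [entry["share"] for entry in language_details.values() if "share" in entry]
    if not shares:
        return True  # no share values were given, so there is nothing to check
    # abs_tol is an assumed tolerance, not a value defined by the schema.
    return math.isclose(sum(shares), 1.0, abs_tol=tolerance)


details = {
    "en": {"share": 0.50},
    "es": {"share": 0.30},
    "fr": {"share": 0.20},
}
print(shares_sum_to_one(details))
```

An absolute tolerance is used because floating-point addition and rounded shares rarely land on exactly 1.0.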

Translation Pairs

If your dataset contains parallel text (translations), add translation-pairs to structures:

{
  "data": {
    "languages": ["en", "fr"],
    "structures": ["translation-pairs"]
  }
}

Documenting Translation Direction

If translation direction matters, state it in provenance.collection_notes:

{
  "data": {
    "languages": ["en", "fr", "de"],
    "structures": ["translation-pairs"]
  },
  "provenance": {
    "collection_notes": "All translations are from English to French and German. English is the source language."
  }
}

Complete Examples

Monolingual Dataset

{
  "data": {
    "languages": ["en"],
    "modalities": ["text"]
  }
}

Bilingual Translation Dataset

{
  "data": {
    "languages": ["en", "fr"],
    "structures": ["translation-pairs"],
    "modalities": ["text"]
  }
}

Multilingual with Details

{
  "data": {
    "languages": ["en", "es", "fr", "de", "zh-Hans", "ja"],
    "primary_languages": ["en", "es"],
    "language_details": {
      "en": { "examples": 100000, "share": 0.4 },
      "es": { "examples": 75000, "share": 0.3 },
      "fr": { "examples": 30000, "share": 0.12 },
      "de": { "examples": 25000, "share": 0.1 },
      "zh-Hans": { "examples": 12500, "share": 0.05 },
      "ja": { "examples": 7500, "share": 0.03 }
    },
    "modalities": ["text"],
    "domains": ["general"]
  }
}

Low-Resource Languages

{
  "data": {
    "languages": ["ha", "yo", "ig"],
    "modalities": ["text"],
    "domains": ["news"],
    "language_details": {
      "ha": { "examples": 50000, "notes": "Nigerian Hausa" },
      "yo": { "examples": 30000, "notes": "Standard Yoruba" },
      "ig": { "examples": 25000, "notes": "Central Igbo dialect" }
    }
  }
}

Validation Rules

The schema enforces these language-related rules:

Rule	Trigger	Requirement
Translation pairs	`structures` contains `translation-pairs`	At least 2 languages required
Primary subset	`primary_languages` specified	Must be subset of `languages`
Language details keys	`language_details` specified	Keys must exist in `languages`
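
The three rules in the table can be approximated locally before running the real schema validator. The sketch below is not the schema's own implementation; check_language_rules is a hypothetical helper that takes the card's data section as a plain dictionary and compares tags as exact, case-sensitive strings.

```python
def check_language_rules(data: dict) -> list[str]:
    """Return a list of rule violations for a card's data section (empty if none)."""
    errors = []
    languages = data.get("languages", [])

    # Rule 1: translation pairs need at least two distinct languages.
    if "translation-pairs" in data.get("structures", []) and len(set(languages)) < 2:
        errors.append("translation-pairs requires at least 2 languages")

    # Rule 2: primary_languages must be a subset of languages.
    not_listed = [tag for tag in data.get("primary_languages", []) if tag not in languages]
    if not_listed:
        errors.append(f"primary_languages not in languages: {not_listed}")

    # Rule 3: every language_details key must appear in languages.
    unknown_keys = [tag for tag in data.get("language_details", {}) if tag not in languages]
    if unknown_keys:
        errors.append(f"language_details keys not in languages: {unknown_keys}")

    return errors


card_data = {
    "languages": ["en", "fr"],
    "structures": ["translation-pairs"],
    "primary_languages": ["en"],
    "language_details": {"en": {"share": 0.6}, "fr": {"share": 0.4}},
}
print(check_language_rules(card_data))
```

Returning a list of messages rather than raising on the first problem lets you report every violation in one pass.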

Finding Language Codes

Resources for finding the correct BCP-47 tag:

IANA Language Subtag Registry - Official registry
Language Subtag Lookup - Interactive search tool
ISO 639-3 Code Tables - Three-letter codes
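
Before looking a tag up in these registries, a quick shape check can catch typos such as a missing hyphen or a full language name used in place of a code. The sketch below covers only the language, script, and region subtags described in this guide; the pattern and the split_tag helper are illustrative assumptions, and the check does not confirm that a subtag is actually registered.

```python
import re

# Simplified shape: language (2-3 letters), optional script (4 letters),
# optional region (2 letters or 3 digits). Full BCP-47 also allows variants,
# extensions, and private-use subtags, which this sketch ignores.
TAG_PATTERN = re.compile(
    r"(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"
)


def split_tag(tag: str) -> dict:
    """Split a tag into its language, script, and region subtags."""
    match = TAG_PATTERN.fullmatch(tag)
    if match is None:
        raise ValueError(f"not a simple language[-script][-region] tag: {tag!r}")
    # Drop subtags that are absent from the tag.
    return {name: value for name, value in match.groupdict().items() if value is not None}


for tag in ["en", "zh-Hans", "pt-BR"]:
    print(tag, split_tag(tag))
```

A tag that passes this check can still be invalid, so the registry remains the source of truth.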

Data Reference - All data section fields
Multilingual Example - Complete multilingual data card
Conditional Rules - All if/then requirements