Multilingual Datasets

This guide explains how to properly document languages in your data card using BCP-47 tags, primary language designation, and detailed language breakdowns.

Language Codes (BCP-47)

The schema uses BCP-47 language tags. These are standardized codes that identify languages and their variants.
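
To make the tag structure concrete before the subsections below walk through it, here is a minimal Python sketch that splits a simple tag into its language, script, and region subtags. The `split_tag` helper and its regular expression are illustrative assumptions, not part of the schema. The pattern covers only the language[-Script][-REGION] shapes used in this guide and expects conventional casing; full BCP-47 is case-insensitive and also allows variant, extension, and private-use subtags.

```python
import re

# Simplified pattern (an assumption for illustration): a 2-3 letter language
# subtag, an optional 4-letter script subtag, and an optional region subtag
# (2 letters or 3 digits). Real BCP-47 tags can carry more subtags than this.
TAG_PATTERN = re.compile(
    r"^(?P<language>[a-z]{2,3})"
    r"(?:-(?P<script>[A-Z][a-z]{3}))?"
    r"(?:-(?P<region>[A-Z]{2}|[0-9]{3}))?$"
)


def split_tag(tag: str) -> dict:
    """Return the subtags present in a simple tag, or raise ValueError."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"not a simple language[-Script][-REGION] tag: {tag!r}")
    # Keep only the subtags that actually appear in the tag.
    return {name: value for name, value in match.groupdict().items() if value}


for tag in ["en", "zh-Hans", "pt-BR"]:
    print(tag, split_tag(tag))
```

A check like this only confirms the shape of a tag. It does not confirm that the subtags are registered, so it is no substitute for looking the code up.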

Basic Language Tags

For most widely used languages, use the two-letter ISO 639-1 code:

Code  Language
en    English
es    Spanish
fr    French
de    German
ja    Japanese
ar    Arabic
pt    Portuguese
ru    Russian
hi    Hindi
ko    Korean

"languages": ["en", "es", "fr", "de"]

Script Variants

Some languages have multiple writing systems. Use the script subtag to distinguish:

Code     Language  Script
zh-Hans  Chinese   Simplified
zh-Hant  Chinese   Traditional
sr-Cyrl  Serbian   Cyrillic
sr-Latn  Serbian   Latin

"languages": ["zh-Hans", "zh-Hant"]

Regional Variants

For region-specific variations, add a region subtag:

Code   Variant
en-US  American English
en-GB  British English
pt-BR  Brazilian Portuguese
pt-PT  European Portuguese
es-MX  Mexican Spanish
es-ES  European Spanish

"languages": ["en-US", "en-GB", "pt-BR", "pt-PT"]

Less Common Languages

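BCP-47 always takes the shortest available ISO 639 code, so use a three-letter ISO 639-3 code only when the language has no two-letter ISO 639-1 code:

Code  Language
fil   Filipino
haw   Hawaiian
ceb   Cebuano
yue   Cantonese
pcm   Nigerian Pidgin

"languages": ["fil", "haw", "ceb"]

Many lower-resource languages do have two-letter codes and must use them: Hausa is ha, Yoruba is yo, Igbo is ig, Amharic is am, and Swahili is sw.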

Primary Languages

When your dataset has unequal language distribution, use primary_languages to indicate which languages are the focus:

{
  "data": {
    "languages": ["en", "es", "fr", "de", "it", "pt", "nl", "pl", "ru", "ja"],
    "primary_languages": ["en", "es", "fr"]
  }
}

This indicates that while the dataset contains 10 languages, English, Spanish, and French are the primary focus.

When to Use Primary Languages

Use primary_languages when:

  • Some languages have significantly more data than others
  • The dataset was designed around specific languages
  • Quality or coverage varies by language
  • You want to set user expectations

Language Details

For precise language breakdowns, use the language_details object:

{
  "data": {
    "languages": ["en", "es", "fr", "de"],
    "language_details": {
      "en": {
        "examples": 50000,
        "share": 0.4,
        "notes": "Includes US and UK English variants"
      },
      "es": {
        "examples": 35000,
        "share": 0.28
      },
      "fr": {
        "examples": 25000,
        "share": 0.2
      },
      "de": {
        "examples": 15000,
        "share": 0.12
      }
    }
  }
}

Language Details Fields

Field     Type     Description
examples  integer  Number of examples in this language
tokens    integer  Token count for this language
share     number   Fraction of dataset (0.0 to 1.0)
notes     string   Additional language-specific notes
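
Because share is a fraction of the whole dataset, it can be derived from the examples counts rather than typed by hand. The Python sketch below shows one way to do that. The `add_shares` helper is hypothetical, and rounding to two decimals is an assumption made for readability, not a schema requirement.

```python
def add_shares(language_details: dict, ndigits: int = 2) -> dict:
    """Return a copy of language_details with `share` derived from `examples`."""
    total = sum(entry["examples"] for entry in language_details.values())
    if total <= 0:
        raise ValueError("total example count must be positive")
    return {
        # Copy each entry and add its fraction of the total example count.
        lang: {**entry, "share": round(entry["examples"] / total, ndigits)}
        for lang, entry in language_details.items()
    }


# Counts taken from the language_details example above.
details = {
    "en": {"examples": 50000, "notes": "Includes US and UK English variants"},
    "es": {"examples": 35000},
    "fr": {"examples": 25000},
    "de": {"examples": 15000},
}

for lang, entry in add_shares(details).items():
    print(lang, entry["share"])
```

Rounded shares can drift slightly from a total of 1.0 when the counts do not divide evenly, so it is worth re-checking the sum after rounding.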

Shares Must Sum to 1.0

When using share values, they should sum to approximately 1.0:

"language_details": {
  "en": { "share": 0.50 },
  "es": { "share": 0.30 },
  "fr": { "share": 0.20 }
}
// Total: 0.50 + 0.30 + 0.20 = 1.0
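
A quick pre-submission check for this rule might look like the Python sketch below. The `shares_sum_to_one` helper is hypothetical, and the 0.01 tolerance is an assumption; this guide does not state how much deviation the schema accepts.

```python
import math


def shares_sum_to_one(language_details: dict, tolerance: float = 0.01) -> bool:
    """Check that the provided share values add up to roughly 1.0."""
    shares = [entry["share"] for entry in language_details.values() if "share" in entry]
    if not shares:
        return True  # No shares were given, so there is nothing to check.
    # abs_tol absorbs rounding error in hand-entered or rounded shares.
    return math.isclose(sum(shares), 1.0, abs_tol=tolerance)


details = {
    "en": {"share": 0.50},
    "es": {"share": 0.30},
    "fr": {"share": 0.20},
}
print(shares_sum_to_one(details))
```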

Translation Pairs

If your dataset contains parallel text (translations), add translation-pairs to structures:

{
  "data": {
    "languages": ["en", "fr"],
    "structures": ["translation-pairs"]
  }
}

Documenting Translation Direction

Use provenance.collection_notes to record the translation direction when it matters:

{
  "data": {
    "languages": ["en", "fr", "de"],
    "structures": ["translation-pairs"]
  },
  "provenance": {
    "collection_notes": "All translations are from English to French and German. English is the source language."
  }
}

Complete Examples

Monolingual Dataset

{
  "data": {
    "languages": ["en"],
    "modalities": ["text"]
  }
}

Bilingual Translation Dataset

{
  "data": {
    "languages": ["en", "fr"],
    "structures": ["translation-pairs"],
    "modalities": ["text"]
  }
}

Multilingual with Details

{
  "data": {
    "languages": ["en", "es", "fr", "de", "zh-Hans", "ja"],
    "primary_languages": ["en", "es"],
    "language_details": {
      "en": { "examples": 100000, "share": 0.4 },
      "es": { "examples": 75000, "share": 0.3 },
      "fr": { "examples": 30000, "share": 0.12 },
      "de": { "examples": 25000, "share": 0.1 },
      "zh-Hans": { "examples": 12500, "share": 0.05 },
      "ja": { "examples": 7500, "share": 0.03 }
    },
    "modalities": ["text"],
    "domains": ["general"]
  }
}

Low-Resource Languages

{
  "data": {
    "languages": ["ha", "yo", "ig"],
    "modalities": ["text"],
    "domains": ["news"],
    "language_details": {
      "ha": { "examples": 50000, "notes": "Nigerian Hausa" },
      "yo": { "examples": 30000, "notes": "Standard Yoruba" },
      "ig": { "examples": 25000, "notes": "Central Igbo dialect" }
    }
  }
}

Validation Rules

The schema enforces these language-related rules:

Rule                   Trigger                                Requirement
Translation pairs      structures contains translation-pairs  At least 2 languages required
Primary subset         primary_languages specified            Must be a subset of languages
Language details keys  language_details specified             Keys must exist in languages
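
The three rules above can be checked locally before a data card is submitted. The Python sketch below is one possible reading of them. The `check_language_rules` helper is hypothetical and is not the schema's own validator; it takes the "data" object of a card as a plain dict.

```python
def check_language_rules(data: dict) -> list:
    """Return a list of rule violations for the `data` object of a data card."""
    errors = []
    languages = data.get("languages", [])

    # Rule 1: translation pairs need at least two languages.
    if "translation-pairs" in data.get("structures", []) and len(languages) < 2:
        errors.append("translation-pairs requires at least 2 languages")

    # Rule 2: primary_languages must be a subset of languages.
    extra = [lang for lang in data.get("primary_languages", []) if lang not in languages]
    if extra:
        errors.append(f"primary_languages not listed in languages: {extra}")

    # Rule 3: every language_details key must appear in languages.
    unknown = [lang for lang in data.get("language_details", {}) if lang not in languages]
    if unknown:
        errors.append(f"language_details keys not listed in languages: {unknown}")

    return errors


card_data = {
    "languages": ["en", "fr"],
    "structures": ["translation-pairs"],
    "primary_languages": ["en"],
    "language_details": {"en": {"examples": 1000}, "fr": {"examples": 1000}},
}
print(check_language_rules(card_data))
```

Returning a list of messages rather than raising on the first failure lets an author fix every problem in one pass.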

Finding Language Codes

Resources for finding the correct BCP-47 tag: