Multilingual Dataset

This example shows how to document a multilingual dataset covering 10 languages, with designated primary languages and detailed metadata.

When to Use This Pattern

Use this structure when:

  • Your dataset contains multiple languages
  • You need to indicate primary vs. secondary languages
  • You want to document language-specific details

Complete Example

{
  "schema_version": "llm-datacard/v1.0",
  "core": {
    "id": "multilingual-qa",
    "version": "1.0.0",
    "title": "Multilingual QA Dataset",
    "summary": "A hybrid question-answering dataset combining real-world questions with synthetic answers, covering 10 languages with emphasis on factual knowledge and reading comprehension.",
    "maintainer": "Global AI Research Consortium",
    "contact": "datasets@globalairc.org",
    "preferred_citation": "Global AI Research Consortium (2025). Multilingual QA Dataset. DataPass Registry."
  },
  "data": {
    "kind": "hybrid",
    "modalities": ["text"],
    "languages": ["en", "es", "fr", "de", "zh-Hans", "ja", "ar", "pt", "ru", "hi"],
    "primary_languages": ["en", "es", "zh-Hans"],
    "size": {
      "examples": 250000,
      "tokens": 45000000,
      "tokenizer": "cl100k_base"
    },
    "domains": ["general", "scientific", "technical"],
    "structures": ["qa-pairs"],
    "task_types": ["supervised-finetuning"],
    "record_format": "json-structured",
    "has_human_annotations": true,
    "label_types": ["classification-labels", "ratings"]
  },
  "rights": {
    "license": "CC-BY-NC-SA-4.0",
    "license_url": "https://creativecommons.org/licenses/by-nc-sa/4.0/",
    "attribution_required": true,
    "allows_commercial_use": false,
    "contains_personal_data": "none",
    "restricted_uses": ["Commercial deployment without license agreement"]
  },
  "provenance": {
    "source_types": ["crowdsourced", "synthetic-generation"],
    "geography": { "scope": "global" },
    "collection_start_date": "2024-01-01",
    "collection_end_date": "2024-10-15",
    "collection_notes": "Questions were crowdsourced from native speakers. A subset of answers were generated using GPT-4 and validated by human annotators."
  },
  "access": {
    "availability": "restricted",
    "url": "https://datapass.meetkai.ai/registry/multilingual-qa/1.0.0",
    "terms_url": "https://datapass.meetkai.ai/registry/multilingual-qa/1.0.0/terms",
    "request_instructions": "Submit a data access request form at the dataset URL. Academic researchers are typically approved within 48 hours. Commercial entities require a separate license agreement."
  },
  "use": {
    "intended_uses": [
      "Multilingual QA system training",
      "Reading comprehension research",
      "Cross-lingual transfer learning"
    ],
    "out_of_scope_uses": ["Medical or legal advice systems", "Production systems without human oversight"]
  },
  "governance": {
    "review_status": "audited",
    "last_reviewed": "2024-12-01",
    "documentation_url": "https://datapass.meetkai.ai/registry/multilingual-qa/1.0.0/docs"
  },
  "safety": {
    "content_risk_level": "low",
    "known_risky_categories": [],
    "mitigations": "All content reviewed by human annotators. Automated toxicity filtering applied during data processing."
  }
}

Key Multilingual Fields

Language Codes

Use BCP-47 language tags:

"languages": ["en", "es", "fr", "de", "zh-Hans", "ja", "ar", "pt", "ru", "hi"]

Code      Language   Script Variant
en        English    Default (Latin)
es        Spanish    Default (Latin)
zh-Hans   Chinese    Simplified
zh-Hant   Chinese    Traditional
ja        Japanese   Default (mixed)
ar        Arabic     Default (Arabic)
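
As a rough sanity check on the codes listed above, the following Python sketch flags tags that do not follow the language or language-Script shape used in this example. The pattern is a deliberate simplification and an assumption of this sketch: full BCP-47 also permits region, variant, and extension subtags, which it rejects. The function name invalid_language_tags is hypothetical and not part of the datacard schema.

```python
import re

# Simplified shape: a 2-3 letter lowercase language subtag, optionally followed
# by a 4-letter title-case script subtag (for example "zh-Hans").
# Assumption: this covers only the tags used in this example, not all of BCP-47.
SIMPLE_TAG = re.compile(r"[a-z]{2,3}(-[A-Z][a-z]{3})?")


def invalid_language_tags(tags):
    """Return the tags that do not match the simplified language[-Script] shape."""
    return [tag for tag in tags if not SIMPLE_TAG.fullmatch(tag)]


languages = ["en", "es", "fr", "de", "zh-Hans", "ja", "ar", "pt", "ru", "hi"]
print(invalid_language_tags(languages))
```

A tag such as "english" or "zh_Hans" would be reported by this check, while every code in the example passes.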

Primary Languages

When your dataset has unequal language distribution, specify primary languages:

"primary_languages": ["en", "es", "zh-Hans"]

This indicates that English, Spanish, and Simplified Chinese are the main languages, with others being secondary.
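
A consistency check can catch a primary language that was never declared. The Python sketch below assumes that every entry in primary_languages should also appear in languages; the text implies this but does not state it as a schema rule. The helper name unlisted_primary_languages is hypothetical.

```python
def unlisted_primary_languages(data):
    """Return primary language tags that are missing from the languages list.

    Assumption: primary_languages is expected to be a subset of languages.
    """
    declared = set(data.get("languages", []))
    return [tag for tag in data.get("primary_languages", []) if tag not in declared]


data_section = {
    "languages": ["en", "es", "fr", "de", "zh-Hans", "ja", "ar", "pt", "ru", "hi"],
    "primary_languages": ["en", "es", "zh-Hans"],
}
print(unlisted_primary_languages(data_section))
```

An empty result means every primary language is also declared in languages.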

Language Distribution (Advanced)

For detailed language breakdowns, use the language_details object (not shown in this example):

"language_details": {
  "en": {
    "examples": 50000,
    "share": 0.20
  },
  "es": {
    "examples": 40000,
    "share": 0.16
  }
}
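
The share values above are consistent with each language's example count divided by the dataset total of 250000 examples. The Python sketch below builds a language_details object on that assumption; the schema's exact definition of share is not given here, and the helper name build_language_details is hypothetical.

```python
def build_language_details(example_counts, total_examples):
    """Build a language_details mapping from per-language example counts.

    Assumption: share is the fraction of the dataset's total examples.
    """
    if total_examples <= 0:
        raise ValueError("total_examples must be positive")
    return {
        tag: {"examples": count, "share": round(count / total_examples, 4)}
        for tag, count in example_counts.items()
    }


details = build_language_details({"en": 50000, "es": 40000}, 250000)
print(details)
```

Deriving share from the counts, rather than typing both by hand, keeps the two fields from drifting apart when the dataset is updated.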

Translation Pairs

If your dataset contains translation pairs, add translation-pairs to structures:

"structures": ["translation-pairs"]

Human Annotations

This dataset includes human annotations:

"has_human_annotations": true,
"label_types": ["classification-labels", "ratings"]

When has_human_annotations is true, you must also list the included label types in label_types.
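
The rule above can be expressed as a small pre-submission check. This Python sketch is illustrative only: the function name annotation_errors and the error wording are hypothetical, and the registry's own validation may behave differently.

```python
def annotation_errors(data):
    """Return a list of problems with the human-annotation fields of a data section."""
    errors = []
    # Flag a card that declares human annotations but lists no label types.
    if data.get("has_human_annotations") is True and not data.get("label_types"):
        errors.append(
            "label_types must be non-empty when has_human_annotations is true"
        )
    return errors


data_section = {
    "has_human_annotations": True,
    "label_types": ["classification-labels", "ratings"],
}
print(annotation_errors(data_section))
```

A card that sets has_human_annotations to true but omits label_types, or leaves it empty, would produce one error.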
