Data Section

The `data` section gives a high-level description of what the dataset contains: its kind, modalities, languages, size, domains, and record format.

Required Fields

Field	Type	Description
`kind`	enum	Whether the dataset is “real”, “synthetic”, or “hybrid”
`modalities`	array	Data modalities present: “text”, “speech”, “audio”, “image”, “video”, “code”, “multimodal”, “multi”
`languages`	array	Languages as BCP-47 tags (e.g., “en”, “ar”, “ha-Latn-NG”)
`size`	object	Dataset size; must include an `examples` count
`domains`	array	Content domains (e.g., “news”, “social-media”, “health”)
`record_format`	enum	Structure of each example
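
As a rough illustration of the table above, the following sketch reports which required fields are absent from a `data` section. The function name `missing_required_fields` and the `size.examples` label are hypothetical and not part of the schema; only the field names come from the table.

```python
# Required field names are taken from the table above.
REQUIRED_FIELDS = ("kind", "modalities", "languages", "size", "domains", "record_format")


def missing_required_fields(data: dict) -> list:
    """Return the required fields absent from a data section, in table order."""
    missing = [name for name in REQUIRED_FIELDS if name not in data]
    # `size` must itself carry an `examples` count.
    size = data.get("size")
    if isinstance(size, dict) and "examples" not in size:
        missing.append("size.examples")  # hypothetical label for the nested field
    return missing
```

An empty list means every required field is present. This sketch checks presence only, not types or allowed values.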

Optional Fields

Field	Type	Description
`primary_languages`	array	Subset of languages that dominate the dataset
`language_details`	array	Per-language metadata for low-resource settings
`language_shares`	object	Approximate fraction of examples per language, each a value between 0 and 1
`structures`	array	High-level structure of training examples
`task_types`	array	Primary LLM tasks this dataset supports
`record_format_notes`	string	Notes about record format (required if `record_format="other"`)
`splits`	array	Dataset split definitions
`has_human_annotations`	boolean	Whether dataset has human-provided labels
`label_types`	array	Types of labels (required if `has_human_annotations=true`)
`annotation_details`	object	Details about human annotations
`reasoning_traces`	object	Information about reasoning traces
`modality_metadata`	object	Modality-specific metadata

Enum Values

kind

real - Real-world data
synthetic - Synthetically generated data
hybrid - Mix of real and synthetic

modalities

text, speech, audio, image, video, code, multimodal, multi

record_format

plain-text - Plain text documents
prompt-completion - Prompt/completion pairs
chat-messages - Conversational messages
document-pairs - Pairs of documents
audio-with-transcript - Audio with transcriptions
json-structured - Structured JSON records
other - Other format (requires `record_format_notes`)

structures

documents, conversation, instruction-response, translation-pairs, qa-pairs, classification-examples, summarization, other

task_types

pretraining, supervised-finetuning, preference-training, reward-modeling, evaluation, distillation, augmentation, other
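
The enum lists above can be checked mechanically. The sketch below copies the allowed values from this section and reports any value outside them. The names `ALLOWED` and `invalid_enum_values` are hypothetical, and the sketch assumes every value is a string.

```python
# Allowed values copied from the enum lists above.
ALLOWED = {
    "kind": {"real", "synthetic", "hybrid"},
    "modalities": {"text", "speech", "audio", "image", "video", "code",
                   "multimodal", "multi"},
    "record_format": {"plain-text", "prompt-completion", "chat-messages",
                      "document-pairs", "audio-with-transcript",
                      "json-structured", "other"},
    "structures": {"documents", "conversation", "instruction-response",
                   "translation-pairs", "qa-pairs", "classification-examples",
                   "summarization", "other"},
    "task_types": {"pretraining", "supervised-finetuning", "preference-training",
                   "reward-modeling", "evaluation", "distillation",
                   "augmentation", "other"},
}


def invalid_enum_values(data: dict) -> dict:
    """Map each enum-valued field to the values that are not allowed."""
    problems = {}
    for field, allowed in ALLOWED.items():
        if field not in data:
            continue  # absence is a separate concern from invalid values
        value = data[field]
        # Scalar enums (kind, record_format) and arrays are handled alike.
        values = value if isinstance(value, list) else [value]
        bad = [v for v in values if v not in allowed]
        if bad:
            problems[field] = bad
    return problems
```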

Conditional Rules

When…	You must provide…
`record_format` is “other”	`record_format_notes`
`has_human_annotations` is true	`label_types`
`structures` contains “translation-pairs”	At least two entries in the `languages` array
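
The three rules above could be enforced along the following lines. This is a minimal sketch, not a reference validator: the function name `conditional_rule_violations` is hypothetical, and treating an empty string or empty list as "not provided" and counting only distinct language tags are assumptions the rules do not state.

```python
def conditional_rule_violations(data: dict) -> list:
    """Return a message for each conditional rule that the data section breaks."""
    errors = []

    # Rule 1: record_format "other" needs explanatory notes.
    if data.get("record_format") == "other" and not data.get("record_format_notes"):
        errors.append("record_format_notes is required when record_format is 'other'")

    # Rule 2: human annotations need their label types listed.
    if data.get("has_human_annotations") is True and not data.get("label_types"):
        errors.append("label_types is required when has_human_annotations is true")

    # Rule 3: translation pairs need at least two languages.
    # Assumption: duplicate tags count once.
    if "translation-pairs" in data.get("structures", []):
        if len(set(data.get("languages", []))) < 2:
            errors.append("translation-pairs requires at least 2 languages")

    return errors
```

For instance, a section with `"record_format": "other"` and no `record_format_notes` yields one message, while a section that satisfies all three rules yields an empty list.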

Example

{
  "data": {
    "kind": "real",
    "modalities": ["text"],
    "languages": ["ha-Latn-NG", "en"],
    "primary_languages": ["ha-Latn-NG"],
    "language_details": [
      {
        "tag": "ha-Latn-NG",
        "autonym": "Hausa",
        "iso639_3": "hau",
        "script": "Latn",
        "writing_direction": "ltr",
        "varieties": ["Standard Hausa", "Kano dialect"],
        "code_switching": true,
        "code_switch_languages": ["en"]
      }
    ],
    "language_shares": {
      "ha-Latn-NG": 0.85,
      "en": 0.15
    },
    "size": {
      "examples": 50000,
      "tokens": 12000000
    },
    "domains": ["news", "social-media"],
    "structures": ["documents"],
    "task_types": ["pretraining", "supervised-finetuning"],
    "record_format": "plain-text",
    "has_human_annotations": true,
    "label_types": ["topic", "sentiment"]
  }
}
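
The `language_shares` object in the example can be sanity-checked against `languages`. The sketch below verifies that each share lies between 0 and 1 and that each key is a declared language. The second check is an assumption, since this section does not say that share keys must appear in `languages`, and the function name `language_share_problems` is hypothetical.

```python
def language_share_problems(data: dict) -> list:
    """Return a message for each suspicious entry in language_shares."""
    problems = []
    declared = set(data.get("languages", []))
    for tag, share in data.get("language_shares", {}).items():
        # Shares are fractions of examples, so they must lie in [0, 1].
        if not 0 <= share <= 1:
            problems.append(f"{tag}: share {share} is outside the 0-1 range")
        # Assumption: every share key should be a declared language.
        if tag not in declared:
            problems.append(f"{tag}: not listed in languages")
    return problems
```

The sketch does not require the shares to sum to 1, because the field is described as approximate.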

Language Details

The `language_details` array is especially important for low-resource languages. Each entry describes one language tag and can also carry a `glottocode` and `orthography_notes`:

{
  "language_details": [
    {
      "tag": "ha-Latn-NG",
      "autonym": "Hausa",
      "iso639_3": "hau",
      "glottocode": "haus1257",
      "script": "Latn",
      "writing_direction": "ltr",
      "varieties": ["Standard Hausa", "Kano dialect"],
      "orthography_notes": "Uses standard Latin orthography with hooked letters",
      "code_switching": true,
      "code_switch_languages": ["en", "ar"]
    }
  ]
}
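
Because a BCP-47 tag such as "ha-Latn-NG" already names a script, an entry's `script` field can be cross-checked against its `tag`. The sketch below is one possible consistency check, not something this section requires. The function names are hypothetical, and the tag parsing is deliberately simplified: it takes the first four-letter alphabetic subtag before any single-character subtag as the script.

```python
from typing import Optional


def script_from_tag(tag: str) -> Optional[str]:
    """Return the script subtag of a BCP-47 tag, or None if it has none."""
    for subtag in tag.split("-")[1:]:
        if len(subtag) == 1:
            break  # a singleton starts an extension or private-use section
        if len(subtag) == 4 and subtag.isalpha():
            return subtag.title()  # scripts are conventionally title-cased
    return None


def script_mismatches(language_details: list) -> list:
    """Return the tags whose embedded script differs from the `script` field."""
    mismatched = []
    for entry in language_details:
        tag_script = script_from_tag(entry.get("tag", ""))
        declared = entry.get("script")
        # Only compare when both the tag and the entry name a script.
        if tag_script and declared and tag_script != declared:
            mismatched.append(entry["tag"])
    return mismatched
```

Entries whose tag carries no script subtag, such as "en", are skipped rather than flagged.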