Data Section

The data section provides a high-level description of what the dataset contains.

Required Fields

Field          Type    Description
kind           enum    Whether the dataset is “real”, “synthetic”, or “hybrid”
modalities     array   Data modalities present: “text”, “speech”, “audio”, “image”, “video”, “code”, “multimodal”, “multi”
languages      array   Languages as BCP-47 tags (e.g., “en”, “ar”, “ha-Latn-NG”)
size           object  Dataset size with required examples count
domains        array   Content domains (e.g., “news”, “social-media”, “health”)
record_format  enum    Structure of each example
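
As a rough illustration of the table above, the following Python sketch lists which required fields are absent from a data section. The function name missing_required_fields and the "size.examples" label are hypothetical and not part of the schema; the check covers presence only and does not validate types.

```python
# Required field names taken from the Required Fields table.
REQUIRED_FIELDS = ("kind", "modalities", "languages", "size", "domains", "record_format")


def missing_required_fields(data: dict) -> list[str]:
    """Return the required field names that are absent from a `data` section."""
    missing = [name for name in REQUIRED_FIELDS if name not in data]
    # `size` is an object whose `examples` count is itself required.
    size = data.get("size")
    if isinstance(size, dict) and "examples" not in size:
        missing.append("size.examples")  # hypothetical label for the nested key
    return missing
```

An empty list means every required field is present; type and enum checks would need to be layered on top.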

Optional Fields

Field                  Type     Description
primary_languages      array    Subset of languages that dominate the dataset
language_details       array    Per-language metadata for low-resource settings
language_shares        object   Approximate fraction of examples per language (0-1)
structures             array    High-level structure of training examples
task_types             array    Primary LLM tasks this dataset supports
record_format_notes    string   Notes about record format (required if record_format="other")
splits                 array    Dataset split definitions
has_human_annotations  boolean  Whether dataset has human-provided labels
label_types            array    Types of labels (required if has_human_annotations=true)
annotation_details     object   Details about human annotations
reasoning_traces       object   Information about reasoning traces
modality_metadata      object   Modality-specific metadata

Enum Values

kind

  • real - Real-world data
  • synthetic - Synthetically generated data
  • hybrid - Mix of real and synthetic

modalities

  • text, speech, audio, image, video, code, multimodal, multi

record_format

  • plain-text - Plain text documents
  • prompt-completion - Prompt/completion pairs
  • chat-messages - Conversational messages
  • document-pairs - Pairs of documents
  • audio-with-transcript - Audio with transcriptions
  • json-structured - Structured JSON records
  • other - Other format (requires record_format_notes)

structures

  • documents, conversation, instruction-response, translation-pairs, qa-pairs, classification-examples, summarization, other

task_types

  • pretraining, supervised-finetuning, preference-training, reward-modeling, evaluation, distillation, augmentation, other
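
The enum lists above could be checked with a small Python sketch like the one below. The allowed values are copied from this section; the function name invalid_enum_values and the shape of its return value are assumptions made for illustration.

```python
# Fields that hold a single enum value.
SCALAR_ENUMS = {
    "kind": {"real", "synthetic", "hybrid"},
    "record_format": {
        "plain-text", "prompt-completion", "chat-messages", "document-pairs",
        "audio-with-transcript", "json-structured", "other",
    },
}

# Fields that hold an array of enum values.
ARRAY_ENUMS = {
    "modalities": {"text", "speech", "audio", "image", "video", "code", "multimodal", "multi"},
    "structures": {
        "documents", "conversation", "instruction-response", "translation-pairs",
        "qa-pairs", "classification-examples", "summarization", "other",
    },
    "task_types": {
        "pretraining", "supervised-finetuning", "preference-training",
        "reward-modeling", "evaluation", "distillation", "augmentation", "other",
    },
}


def invalid_enum_values(data: dict) -> dict[str, list]:
    """Map each enum field to the values in `data` that are not allowed."""
    problems = {}
    for field, allowed in SCALAR_ENUMS.items():
        if field in data:
            value = data[field]
            if not isinstance(value, str) or value not in allowed:
                problems[field] = [value]
    for field, allowed in ARRAY_ENUMS.items():
        bad = [value for value in data.get(field, []) if value not in allowed]
        if bad:
            problems[field] = bad
    return problems
```

Fields that are absent are skipped here, so this check is independent of whether a field is required.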

Conditional Rules

When…                                    You must provide…
record_format is “other”                 record_format_notes
has_human_annotations is true            label_types
structures contains “translation-pairs”  At least 2 languages in languages array
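
The three conditional rules can be expressed as a minimal Python sketch. The function name conditional_rule_violations and the message wording are hypothetical. Two further assumptions are made here: an empty string or empty array counts as "not provided", and "at least 2 languages" means two distinct tags.

```python
def conditional_rule_violations(data: dict) -> list[str]:
    """Return one message per conditional rule that a `data` section breaks."""
    violations = []

    # Rule 1: record_format "other" requires record_format_notes.
    if data.get("record_format") == "other" and not data.get("record_format_notes"):
        violations.append("record_format_notes is required when record_format is 'other'")

    # Rule 2: has_human_annotations true requires label_types.
    if data.get("has_human_annotations") is True and not data.get("label_types"):
        violations.append("label_types is required when has_human_annotations is true")

    # Rule 3: translation pairs need at least two languages.
    # Assumption: duplicates are collapsed before counting.
    if "translation-pairs" in data.get("structures", []):
        if len(set(data.get("languages", []))) < 2:
            violations.append("languages needs at least 2 entries for translation-pairs")

    return violations
```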

Example

{
  "data": {
    "kind": "real",
    "modalities": ["text"],
    "languages": ["ha-Latn-NG", "en"],
    "primary_languages": ["ha-Latn-NG"],
    "language_details": [
      {
        "tag": "ha-Latn-NG",
        "autonym": "Hausa",
        "iso639_3": "hau",
        "script": "Latn",
        "writing_direction": "ltr",
        "varieties": ["Standard Hausa", "Kano dialect"],
        "code_switching": true,
        "code_switch_languages": ["en"]
      }
    ],
    "language_shares": {
      "ha-Latn-NG": 0.85,
      "en": 0.15
    },
    "size": {
      "examples": 50000,
      "tokens": 12000000
    },
    "domains": ["news", "social-media"],
    "structures": ["documents"],
    "task_types": ["pretraining", "supervised-finetuning"],
    "record_format": "plain-text",
    "has_human_annotations": true,
    "label_types": ["topic", "sentiment"]
  }
}
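
In the example above, language_shares assigns 0.85 and 0.15 to the two listed languages. A sketch of a consistency check in Python follows. The function name language_share_problems and the tolerance parameter are hypothetical. The schema describes each share only as an approximate fraction between 0 and 1, so the expectations that every key appears in languages and that the shares sum to roughly 1 are assumptions, not documented rules.

```python
import math


def language_share_problems(data: dict, tolerance: float = 0.05) -> list[str]:
    """Return messages describing inconsistent `language_shares` entries."""
    shares = data.get("language_shares", {})
    languages = set(data.get("languages", []))
    problems = []
    for tag, share in shares.items():
        # Assumption: every key should also be listed in `languages`.
        if tag not in languages:
            problems.append(f"{tag}: not listed in languages")
        # Documented range: each share is a fraction between 0 and 1.
        if not 0 <= share <= 1:
            problems.append(f"{tag}: share {share} is outside 0-1")
    # Assumption: shares are approximate, so only a loose sum check is applied.
    if shares and not math.isclose(sum(shares.values()), 1.0, abs_tol=tolerance):
        problems.append("shares do not sum to approximately 1")
    return problems
```

The loose tolerance reflects that the shares are described as approximate; a stricter or looser value may suit a given dataset.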

Language Details

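The language_details array is especially important for low-resource languages. Each entry can record identifiers (tag, iso639_3, glottocode), the script and writing direction, dialect varieties, orthography notes, and code-switching behavior: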

{
  "language_details": [
    {
      "tag": "ha-Latn-NG",
      "autonym": "Hausa",
      "iso639_3": "hau",
      "glottocode": "haus1257",
      "script": "Latn",
      "writing_direction": "ltr",
      "varieties": ["Standard Hausa", "Kano dialect"],
      "orthography_notes": "Uses standard Latin orthography with hooked letters",
      "code_switching": true,
      "code_switch_languages": ["en", "ar"]
    }
  ]
}
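
Because primary_languages is described as a subset of languages, and each language_details entry carries its own tag, the language fields can be cross-checked against one another. The Python sketch below does this. The function name language_cross_reference_problems is hypothetical, and the expectation that every language_details tag also appears in languages is an assumption rather than a documented rule.

```python
def language_cross_reference_problems(data: dict) -> list[str]:
    """Return messages for language tags that do not appear in `languages`."""
    languages = set(data.get("languages", []))
    problems = []
    # Documented: primary_languages is a subset of languages.
    for tag in data.get("primary_languages", []):
        if tag not in languages:
            problems.append(f"primary_languages: {tag} is not in languages")
    # Assumption: each language_details entry describes a listed language.
    for entry in data.get("language_details", []):
        tag = entry.get("tag")
        if tag not in languages:
            problems.append(f"language_details: {tag} is not in languages")
    return problems
```

The function expects the full data section, so a standalone language_details fragment like the one above would need its languages array alongside it.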