Data Section
The data section provides a high-level description of what the dataset contains.
Required Fields
| Field | Type | Description |
|---|---|---|
| kind | enum | Whether the dataset is “real”, “synthetic”, or “hybrid” |
| modalities | array | Data modalities present: “text”, “speech”, “audio”, “image”, “video”, “code”, “multimodal”, “multi” |
| languages | array | Languages as BCP-47 tags (e.g., “en”, “ar”, “ha-Latn-NG”) |
| size | object | Dataset size; the examples count is required |
| domains | array | Content domains (e.g., “news”, “social-media”, “health”) |
| record_format | enum | Structure of each example |
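A presence check for the required fields can be sketched in a few lines of Python. This is a minimal illustration, not part of the specification: the helper name `missing_required_fields` and the `size.examples` label are hypothetical, and the data section is assumed to be already parsed into a dictionary.

```python
# Field names come from the Required Fields table; the helper itself is hypothetical.
REQUIRED_FIELDS = ("kind", "modalities", "languages", "size", "domains", "record_format")


def missing_required_fields(data: dict) -> list:
    """Return the names of required fields absent from a parsed data section."""
    missing = [name for name in REQUIRED_FIELDS if name not in data]
    # The size object must itself carry an examples count.
    size = data.get("size")
    if isinstance(size, dict) and "examples" not in size:
        missing.append("size.examples")
    return missing


if __name__ == "__main__":
    incomplete = {"kind": "real", "modalities": ["text"], "size": {"tokens": 100}}
    print(missing_required_fields(incomplete))
```

The function only checks presence; types and allowed values would need separate checks.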
Optional Fields
| Field | Type | Description |
|---|---|---|
| primary_languages | array | Subset of languages that dominate the dataset |
| language_details | array | Per-language metadata for low-resource settings |
| language_shares | object | Approximate fraction of examples per language (0-1) |
| structures | array | High-level structure of training examples |
| task_types | array | Primary LLM tasks this dataset supports |
| record_format_notes | string | Notes about record format (required if record_format="other") |
| splits | array | Dataset split definitions |
| has_human_annotations | boolean | Whether dataset has human-provided labels |
| label_types | array | Types of labels (required if has_human_annotations=true) |
| annotation_details | object | Details about human annotations |
| reasoning_traces | object | Information about reasoning traces |
| modality_metadata | object | Modality-specific metadata |
Enum Values
kind
real - Real-world data
synthetic - Synthetically generated data
hybrid - Mix of real and synthetic
modalities
text, speech, audio, image, video, code, multimodal, multi
record_format
plain-text - Plain text documents
prompt-completion - Prompt/completion pairs
chat-messages - Conversational messages
document-pairs - Pairs of documents
audio-with-transcript - Audio with transcriptions
json-structured - Structured JSON records
other - Other format (requires record_format_notes)
structures
documents, conversation, instruction-response, translation-pairs, qa-pairs, classification-examples, summarization, other
task_types
pretraining, supervised-finetuning, preference-training, reward-modeling, evaluation, distillation, augmentation, other
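The enum lists above translate directly into a lookup table. The sketch below is one possible way to flag values outside those lists; the `ENUMS` mapping and the `invalid_enum_values` helper are hypothetical names, and the code assumes that single-valued fields hold a string and array fields hold a list of strings.

```python
# Allowed values copied from the Enum Values lists; names here are illustrative.
ENUMS = {
    "kind": ("real", "synthetic", "hybrid"),
    "modalities": ("text", "speech", "audio", "image", "video", "code",
                   "multimodal", "multi"),
    "record_format": ("plain-text", "prompt-completion", "chat-messages",
                      "document-pairs", "audio-with-transcript",
                      "json-structured", "other"),
    "structures": ("documents", "conversation", "instruction-response",
                   "translation-pairs", "qa-pairs", "classification-examples",
                   "summarization", "other"),
    "task_types": ("pretraining", "supervised-finetuning", "preference-training",
                   "reward-modeling", "evaluation", "distillation",
                   "augmentation", "other"),
}


def invalid_enum_values(data: dict) -> dict:
    """Map each enum field to the values it contains that are not allowed."""
    problems = {}
    for field, allowed in ENUMS.items():
        if field not in data:
            continue  # Presence of required fields is a separate concern.
        value = data[field]
        values = value if isinstance(value, list) else [value]
        bad = [v for v in values if v not in allowed]
        if bad:
            problems[field] = bad
    return problems
```

An empty dictionary means every enum field that is present uses only listed values.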
Conditional Rules
| When… | You must provide… |
|---|---|
| record_format is “other” | record_format_notes |
| has_human_annotations is true | label_types |
| structures contains “translation-pairs” | At least 2 languages in languages array |
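The three conditional rules can be expressed as a small checker. This is a sketch under stated assumptions rather than a reference implementation: the function name is hypothetical, an empty string or empty list is treated as "not provided", and the language count is taken over distinct tags.

```python
def conditional_rule_violations(data: dict) -> list:
    """Return one message per conditional rule that a data section breaks."""
    errors = []

    # Rule 1: record_format "other" needs explanatory notes.
    if data.get("record_format") == "other" and not data.get("record_format_notes"):
        errors.append("record_format_notes is required when record_format is 'other'")

    # Rule 2: human annotations need their label types listed.
    if data.get("has_human_annotations") is True and not data.get("label_types"):
        errors.append("label_types is required when has_human_annotations is true")

    # Rule 3: translation pairs need at least two languages.
    # Assumption: duplicates in the languages array do not count twice.
    if "translation-pairs" in data.get("structures", []):
        if len(set(data.get("languages", []))) < 2:
            errors.append("translation-pairs requires at least 2 languages")

    return errors


if __name__ == "__main__":
    section = {
        "record_format": "other",
        "has_human_annotations": True,
        "structures": ["translation-pairs"],
        "languages": ["en"],
    }
    for message in conditional_rule_violations(section):
        print(message)
```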
Example
{
  "data": {
    "kind": "real",
    "modalities": ["text"],
    "languages": ["ha-Latn-NG", "en"],
    "primary_languages": ["ha-Latn-NG"],
    "language_details": [
      {
        "tag": "ha-Latn-NG",
        "autonym": "Hausa",
        "iso639_3": "hau",
        "script": "Latn",
        "writing_direction": "ltr",
        "varieties": ["Standard Hausa", "Kano dialect"],
        "code_switching": true,
        "code_switch_languages": ["en"]
      }
    ],
    "language_shares": {
      "ha-Latn-NG": 0.85,
      "en": 0.15
    },
    "size": {
      "examples": 50000,
      "tokens": 12000000
    },
    "domains": ["news", "social-media"],
    "structures": ["documents"],
    "task_types": ["pretraining", "supervised-finetuning"],
    "record_format": "plain-text",
    "has_human_annotations": true,
    "label_types": ["topic", "sentiment"]
  }
}
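In the example, the language_shares values are fractions between 0 and 1, and every key also appears in languages. A sanity check along those lines might look like the sketch below. The helper name and the `tolerance` parameter are hypothetical, and two of the checks are assumptions rather than stated rules: that every key must be listed in languages, and that the shares should not add up to noticeably more than 1.

```python
def language_share_problems(data: dict, tolerance: float = 0.01) -> list:
    """Return human-readable problems found in a section's language_shares."""
    shares = data.get("language_shares", {})
    languages = set(data.get("languages", []))
    problems = []

    for tag, share in shares.items():
        # Assumption: shares should only be reported for declared languages.
        if tag not in languages:
            problems.append(f"{tag}: not listed in languages")
        # Each share is a fraction of examples, so it must lie in the 0-1 range.
        if not 0 <= share <= 1:
            problems.append(f"{tag}: share {share} is outside 0-1")

    # Assumption: shares are approximate, so allow a small overshoot only.
    if sum(shares.values()) > 1 + tolerance:
        problems.append("shares sum to more than 1")

    return problems
```

Shares are described as approximate, so the sketch does not require them to sum to exactly 1.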
Language Details
The language_details array is especially important for low-resource languages:
{
  "language_details": [
    {
      "tag": "ha-Latn-NG",
      "autonym": "Hausa",
      "iso639_3": "hau",
      "glottocode": "haus1257",
      "script": "Latn",
      "writing_direction": "ltr",
      "varieties": ["Standard Hausa", "Kano dialect"],
      "orthography_notes": "Uses standard Latin orthography with hooked letters",
      "code_switching": true,
      "code_switch_languages": ["en", "ar"]
    }
  ]
}
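Because a language_details entry carries both a BCP-47 tag and a separate script field, the two can be cross-checked. The sketch below is a simplified illustration: both function names are hypothetical, the tag parsing covers only the common language-script-region shape (not the full BCP-47 grammar), and requiring the two values to agree is an assumption, not a documented rule.

```python
from typing import Optional


def script_from_tag(tag: str) -> Optional[str]:
    """Return the script subtag of a simple BCP-47 tag, or None if absent."""
    for subtag in tag.split("-")[1:]:
        if len(subtag) == 1:
            break  # Singleton: extension or private-use subtags follow.
        if len(subtag) == 4 and subtag.isalpha():
            return subtag.title()  # Script subtags are conventionally title case.
    return None


def script_mismatches(details: list) -> list:
    """List entries whose declared script disagrees with the tag's script subtag."""
    problems = []
    for entry in details:
        tag_script = script_from_tag(entry.get("tag", ""))
        declared = entry.get("script")
        # Only compare when both sources of the script are present.
        if tag_script and declared and tag_script != declared:
            problems.append(f"{entry['tag']}: tag says {tag_script}, script says {declared}")
    return problems


if __name__ == "__main__":
    details = [{"tag": "ha-Latn-NG", "script": "Latn"}]
    print(script_mismatches(details))
```

Entries whose tag has no script subtag, such as "en", are skipped rather than reported.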