Metadata Reference
The metadata.json file is the entry point for an LLM Training Data dataset. It defines dataset identity, defaults, toolsets, and file listings.
Required Fields
| Field | Type | Description |
|---|---|---|
| schema_version | string | Must be "llm-training-data/v1" |
| dataset | object | Dataset identification and metadata |
| files | array | List of data file configurations |
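As a minimal sketch of how a loader might check these required fields: the helper name `check_required_fields` is hypothetical (it is not part of any published tooling), and it assumes the table's type names map to Python's `str`, `dict`, and `list`.

```python
# Hypothetical helper: the function name and messages are illustrative only.
REQUIRED_FIELDS = {
    "schema_version": str,  # "string" in the table
    "dataset": dict,        # "object" in the table
    "files": list,          # "array" in the table
}
EXPECTED_SCHEMA_VERSION = "llm-training-data/v1"


def check_required_fields(metadata: dict) -> list[str]:
    """Return a list of problems; an empty list means the check passed."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in metadata:
            problems.append(f"missing required field: {field}")
        elif not isinstance(metadata[field], expected_type):
            problems.append(f"{field} must be of type {expected_type.__name__}")
    version = metadata.get("schema_version")
    if isinstance(version, str) and version != EXPECTED_SCHEMA_VERSION:
        problems.append(f"unsupported schema_version: {version}")
    return problems
```

Returning a list of problems rather than raising on the first one lets a caller report everything wrong with a metadata.json file in a single pass.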
Optional Fields
| Field | Type | Description |
|---|---|---|
| data_card | object | Reference to LLM Data Card |
| defaults | object | Default values for records |
| toolsets | array | Tool definitions |
| asset_index | object | Reference to assets.jsonl |
| extensions | object | Custom vendor fields |
Full Example
{
  "schema_version": "llm-training-data/v1",
  "dataset": {
    "name": "my-sft-dataset",
    "dataset_id": "my-sft-dataset-v1",
    "description": "A comprehensive SFT dataset",
    "created_at": "2024-12-15T00:00:00Z",
    "license": "MIT",
    "contact": "data@meetkai.ai",
    "tags": ["sft", "multimodal", "tools"]
  },
  "data_card": {
    "uri": "https://datapass.meetkai.ai/registry/my-sft-dataset/1.0.0/card.json"
  },
  "defaults": {
    "system": [
      { "type": "text", "text": "You are a helpful assistant." }
    ],
    "toolset_id": "default-tools",
    "reasoning_policy": "train"
  },
  "toolsets": [
    {
      "id": "default-tools",
      "tools": [
        {
          "name": "calculator",
          "description": "Evaluate a math expression",
          "input_schema": {
            "type": "object",
            "properties": {
              "expression": { "type": "string" }
            },
            "required": ["expression"]
          }
        }
      ]
    }
  ],
  "asset_index": {
    "path": "assets.jsonl",
    "format": "jsonl"
  },
  "files": [
    {
      "split": "train",
      "objective": "sft",
      "toolset_id": "default-tools",
      "shards": [
        {
          "path": "data/sft-train-00000.jsonl.gz",
          "compression": "gzip",
          "record_count": 50000,
          "sha256": "abc123..."
        }
      ]
    }
  ]
}
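To illustrate how a consumer might walk this structure, the sketch below sums `record_count` per split from an already parsed metadata object. The function name is hypothetical. Treating a missing `record_count` as 0 and a missing `split` as "unknown" are assumptions made for the sketch, not documented behavior.

```python
def count_records_by_split(metadata: dict) -> dict[str, int]:
    """Sum the declared record_count of every shard, grouped by split."""
    totals: dict[str, int] = {}
    for entry in metadata.get("files", []):
        # Assumption: entries without a split are grouped under "unknown".
        split = entry.get("split", "unknown")
        for shard in entry.get("shards", []):
            # Assumption: a shard without record_count contributes 0.
            totals[split] = totals.get(split, 0) + shard.get("record_count", 0)
    return totals
```

For the example above, this yields a total of 50000 for the train split.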
Field Details
dataset
The dataset object contains identity information:
| Field | Type | Required | Description |
|---|---|---|---|
| name | string | Yes | Human-readable dataset name |
| dataset_id | string | No | Machine-friendly identifier |
| description | string | No | Brief description |
| created_at | string | No | ISO 8601 datetime |
| license | string | No | SPDX license identifier |
| contact | string | No | Contact email or URL |
| tags | array | No | Categorization tags |
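A possible check for the dataset object is sketched below. The helper name `check_dataset` is hypothetical, and requiring `tags` to contain only strings is an assumption based on the example rather than on the table. Note that Python's `datetime.fromisoformat` accepts only a subset of ISO 8601, so a stricter or more lenient parser may be appropriate in practice.

```python
from datetime import datetime


def check_dataset(dataset: dict) -> list[str]:
    """Return a list of problems found in the dataset object."""
    problems = []
    if not isinstance(dataset.get("name"), str):
        problems.append("dataset.name is required and must be a string")
    created_at = dataset.get("created_at")
    if created_at is not None:
        try:
            # fromisoformat accepts a trailing "Z" only from Python 3.11 on,
            # so normalize it to an explicit UTC offset first.
            datetime.fromisoformat(created_at.replace("Z", "+00:00"))
        except (AttributeError, ValueError):
            problems.append("dataset.created_at is not an ISO 8601 datetime")
    tags = dataset.get("tags")
    # Assumption: tags is an array of strings, as in the full example.
    if tags is not None and not (
        isinstance(tags, list) and all(isinstance(tag, str) for tag in tags)
    ):
        problems.append("dataset.tags must be an array of strings")
    return problems
```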
data_card
Reference to an LLM Data Card for provenance documentation:
External reference:
{
  "data_card": {
    "uri": "https://datapass.meetkai.ai/registry/my-dataset/card.json",
    "sha256": "abc123..."
  }
}
Inline embedding:
{
  "data_card": {
    "inline": true,
    "core": { ... },
    "data": { ... },
    "rights": { ... },
    "provenance": { ... },
    "access": { ... }
  }
}
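A reader of metadata.json has to tell the two forms apart before using the card. The sketch below shows one way to do that and to check a fetched card against the optional `sha256` value. Both function names are hypothetical, and it is an assumption that `sha256` holds the hex digest of the card's raw bytes.

```python
import hashlib


def data_card_mode(data_card: dict) -> str:
    """Classify a data_card object as "inline" or "external"."""
    if data_card.get("inline") is True:
        return "inline"
    if "uri" in data_card:
        return "external"
    raise ValueError("data_card has neither inline: true nor a uri")


def card_matches_digest(card_bytes: bytes, expected_sha256: str) -> bool:
    """Compare fetched card bytes with the declared digest.

    Assumption: the digest is a hex string computed over the raw bytes.
    """
    return hashlib.sha256(card_bytes).hexdigest() == expected_sha256.lower()
```

Fetching the card from its `uri` is left out on purpose; the digest check works on whatever bytes the caller retrieved.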
defaults
Default values applied to records:
| Field | Type | Description |
|---|---|---|
| system | array | Default system message content parts |
| toolset_id | string | Default toolset for records |
| reasoning_policy | string | How to handle reasoning content |
reasoning_policy values:
- train: Include reasoning in training loss (default)
- drop: Remove reasoning content before training
- mask: Keep in context but exclude from loss
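One way to read the three values is as a rule for building a per-token loss mask. The sketch below is an interpretation, not the reference implementation: the `Segment` type, its `kind` labels ("reasoning" and "answer"), and the function name are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    kind: str          # hypothetical labels: "reasoning" or "answer"
    tokens: list[str]


def apply_reasoning_policy(segments: list[Segment], policy: str = "train"):
    """Return (tokens, loss_mask); loss_mask[i] is True if token i counts toward the loss."""
    if policy not in {"train", "drop", "mask"}:
        raise ValueError(f"unknown reasoning_policy: {policy}")
    tokens: list[str] = []
    loss_mask: list[bool] = []
    for segment in segments:
        is_reasoning = segment.kind == "reasoning"
        if is_reasoning and policy == "drop":
            continue  # drop: reasoning never reaches the model
        # mask: reasoning stays in context but is excluded from the loss
        in_loss = not (is_reasoning and policy == "mask")
        tokens.extend(segment.tokens)
        loss_mask.extend([in_loss] * len(segment.tokens))
    return tokens, loss_mask
```

With `train`, every token is kept and counted; with `drop`, reasoning tokens disappear from the sequence; with `mask`, the sequence is unchanged but reasoning positions carry a False mask entry.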
toolsets
Array of tool definitions. See Toolsets Reference.
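Because `defaults.toolset_id` and each `files` entry can name a toolset by `id`, it seems reasonable to check that every such reference resolves to an entry in `toolsets`. The sketch below does that; the function name is hypothetical, and whether the format itself requires this check is an assumption.

```python
def find_unknown_toolset_refs(metadata: dict) -> list[str]:
    """List every toolset_id reference that has no matching toolsets[].id."""
    known = {toolset["id"] for toolset in metadata.get("toolsets", []) if "id" in toolset}
    refs = []
    default_ref = metadata.get("defaults", {}).get("toolset_id")
    if default_ref is not None:
        refs.append(("defaults.toolset_id", default_ref))
    for i, entry in enumerate(metadata.get("files", [])):
        if "toolset_id" in entry:
            refs.append((f"files[{i}].toolset_id", entry["toolset_id"]))
    return [
        f"{where} refers to unknown toolset {ref!r}"
        for where, ref in refs
        if ref not in known
    ]
```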
asset_index
Pointer to the asset deduplication index:
{
  "asset_index": {
    "path": "assets.jsonl",
    "format": "jsonl"
  }
}
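Reading the index could look like the sketch below. It makes two assumptions that this page does not state: that `path` is resolved relative to the directory containing metadata.json, and that each non-empty line of the file is one JSON object. The function name is hypothetical, and no particular fields of an index entry are assumed.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_asset_index(metadata_path: Path, metadata: dict) -> Iterator[dict]:
    """Yield one parsed entry per non-empty line of the asset index."""
    index = metadata.get("asset_index")
    if index is None:
        return  # asset_index is optional, so there may be nothing to read
    if index.get("format", "jsonl") != "jsonl":
        raise ValueError(f"unsupported asset index format: {index['format']}")
    # Assumption: the path is relative to the metadata.json directory.
    index_path = metadata_path.parent / index["path"]
    with index_path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)
```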
files
Array of file configurations. See Shards Reference.
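The shard fields shown in the full example (`path`, `compression`, `record_count`, `sha256`) lend themselves to an integrity check before training. The sketch below is one possible version, with a hypothetical function name. It assumes that `sha256` is the hex digest of the file as stored (that is, of the compressed bytes) and that each non-empty line is one record; the Shards Reference is the authority on both points.

```python
import gzip
import hashlib
from pathlib import Path


def verify_shard(root: Path, shard: dict) -> list[str]:
    """Compare a shard file on disk with its declared sha256 and record_count."""
    problems = []
    raw = (root / shard["path"]).read_bytes()
    expected_digest = shard.get("sha256")
    # Assumption: the digest covers the stored (possibly compressed) bytes.
    if expected_digest is not None and hashlib.sha256(raw).hexdigest() != expected_digest:
        problems.append("sha256 mismatch")
    data = gzip.decompress(raw) if shard.get("compression") == "gzip" else raw
    # Assumption: one record per non-empty line.
    count = sum(1 for line in data.splitlines() if line.strip())
    expected_count = shard.get("record_count")
    if expected_count is not None and count != expected_count:
        problems.append(f"record_count mismatch: expected {expected_count}, found {count}")
    return problems
```

This version reads the whole shard into memory for brevity; a streaming variant would be preferable for large shards.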