Metadata Reference

The metadata.json file is the entry point for an LLM Training Data dataset. It defines dataset identity, defaults, toolsets, and file listings.

Required Fields

Field           Type    Description
schema_version  string  Must be "llm-training-data/v1"
dataset         object  Dataset identification and metadata
files           array   List of data file configurations
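
As a rough illustration of the table above, the following Python sketch checks that a parsed metadata.json carries the three required fields with the expected JSON types. The function name check_required_fields and the wording of the messages are hypothetical, not part of the format.

```python
from typing import Any

# Required top-level fields and the Python type each JSON type maps to.
REQUIRED_FIELDS = {
    "schema_version": str,
    "dataset": dict,
    "files": list,
}
EXPECTED_SCHEMA_VERSION = "llm-training-data/v1"


def check_required_fields(metadata: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in metadata:
            problems.append(f"missing required field: {field}")
        elif not isinstance(metadata[field], expected_type):
            problems.append(f"{field} has the wrong type, expected {expected_type.__name__}")
    version = metadata.get("schema_version")
    if isinstance(version, str) and version != EXPECTED_SCHEMA_VERSION:
        problems.append(f"schema_version must be {EXPECTED_SCHEMA_VERSION!r}")
    return problems
```

Calling check_required_fields(json.load(handle)) on an opened metadata.json would return an empty list when all three fields are present and well-typed.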

Optional Fields

Field        Type    Description
data_card    object  Reference to LLM Data Card
defaults     object  Default values for records
toolsets     array   Tool definitions
asset_index  object  Reference to assets.jsonl
extensions   object  Custom vendor fields

Full Example

{
  "schema_version": "llm-training-data/v1",
  "dataset": {
    "name": "my-sft-dataset",
    "dataset_id": "my-sft-dataset-v1",
    "description": "A comprehensive SFT dataset",
    "created_at": "2024-12-15T00:00:00Z",
    "license": "MIT",
    "contact": "data@meetkai.ai",
    "tags": ["sft", "multimodal", "tools"]
  },
  "data_card": {
    "uri": "https://datapass.meetkai.ai/registry/my-sft-dataset/1.0.0/card.json"
  },
  "defaults": {
    "system": [
      { "type": "text", "text": "You are a helpful assistant." }
    ],
    "toolset_id": "default-tools",
    "reasoning_policy": "train"
  },
  "toolsets": [
    {
      "id": "default-tools",
      "tools": [
        {
          "name": "calculator",
          "description": "Evaluate a math expression",
          "input_schema": {
            "type": "object",
            "properties": {
              "expression": { "type": "string" }
            },
            "required": ["expression"]
          }
        }
      ]
    }
  ],
  "asset_index": {
    "path": "assets.jsonl",
    "format": "jsonl"
  },
  "files": [
    {
      "split": "train",
      "objective": "sft",
      "toolset_id": "default-tools",
      "shards": [
        {
          "path": "data/sft-train-00000.jsonl.gz",
          "compression": "gzip",
          "record_count": 50000,
          "sha256": "abc123..."
        }
      ]
    }
  ]
}

Field Details

dataset

The dataset object contains identity information:

Field        Type    Required  Description
name         string  Yes       Human-readable dataset name
dataset_id   string  No        Machine-friendly identifier
description  string  No        Brief description
created_at   string  No        ISO 8601 datetime
license      string  No        SPDX license identifier
contact      string  No        Contact email or URL
tags         array   No        Categorization tags
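
A minimal sketch of checking the dataset object against this table, assuming only what the table states: name is required, and created_at should parse as an ISO 8601 datetime. The helper name check_dataset is hypothetical, and Python's datetime.fromisoformat accepts only a subset of ISO 8601, so treat the date check as approximate.

```python
from datetime import datetime


def check_dataset(dataset: dict) -> list[str]:
    """Return a list of problems found in the dataset object."""
    problems = []
    if not isinstance(dataset.get("name"), str):
        problems.append("dataset.name is required and must be a string")
    created_at = dataset.get("created_at")
    if created_at is not None:
        try:
            # fromisoformat() rejects a trailing "Z" before Python 3.11,
            # so rewrite it as an explicit UTC offset first.
            datetime.fromisoformat(str(created_at).replace("Z", "+00:00"))
        except ValueError:
            problems.append("dataset.created_at is not a parseable ISO 8601 datetime")
    tags = dataset.get("tags")
    if tags is not None and not isinstance(tags, list):
        problems.append("dataset.tags must be an array")
    return problems
```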

data_card

A reference to an LLM Data Card for provenance documentation. The card can be referenced externally or embedded inline.

External reference:

{
  "data_card": {
    "uri": "https://datapass.meetkai.ai/registry/my-dataset/card.json",
    "sha256": "abc123..."
  }
}

Inline embedding:

{
  "data_card": {
    "inline": true,
    "core": { ... },
    "data": { ... },
    "rights": { ... },
    "provenance": { ... },
    "access": { ... }
  }
}
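
One way a loader might tell the two forms apart is sketched below. It assumes that an inline card is marked by "inline": true and an external one by the presence of "uri", as in the two examples above; the function name data_card_form and the "unknown" fallback are hypothetical.

```python
def data_card_form(data_card: dict) -> str:
    """Classify a data_card object as "inline", "external", or "unknown"."""
    if data_card.get("inline") is True:
        return "inline"
    if "uri" in data_card:
        return "external"
    # Neither marker is present; the caller decides how to handle this.
    return "unknown"
```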

defaults

Default values applied to records:

Field             Type    Description
system            array   Default system message content parts
toolset_id        string  Default toolset for records
reasoning_policy  string  How to handle reasoning content

reasoning_policy values:

  • train - Include reasoning in training loss (default)
  • drop - Remove reasoning content before training
  • mask - Keep in context but exclude from loss
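
The three policies can be read as two yes/no decisions: whether reasoning stays in the context, and whether it contributes to the loss. The sketch below encodes that reading and falls back to train when no policy is set, as the list above states. The names resolve_reasoning_policy and policy_effect are hypothetical, and how a training pipeline acts on the flags is left open.

```python
# Each policy maps to (keep_in_context, include_in_loss).
POLICY_EFFECTS = {
    "train": (True, True),    # reasoning is kept and trained on
    "drop": (False, False),   # reasoning is removed before training
    "mask": (True, False),    # reasoning is kept but excluded from the loss
}


def resolve_reasoning_policy(metadata: dict) -> str:
    """Return defaults.reasoning_policy, using "train" when it is absent."""
    policy = metadata.get("defaults", {}).get("reasoning_policy", "train")
    if policy not in POLICY_EFFECTS:
        raise ValueError(f"unknown reasoning_policy: {policy!r}")
    return policy


def policy_effect(policy: str) -> tuple[bool, bool]:
    """Return (keep_in_context, include_in_loss) for a policy name."""
    return POLICY_EFFECTS[policy]
```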

toolsets

Array of toolset objects, each with an id and a tools array of tool definitions. See Toolsets Reference.
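
In the full example, defaults.toolset_id and files[].toolset_id both name a toolset by its id. The sketch below looks for references that do not match any declared toolset. That such references must resolve is an assumption drawn from the example, not a rule stated here, and find_unknown_toolset_refs is a hypothetical name.

```python
def find_unknown_toolset_refs(metadata: dict) -> list[str]:
    """List toolset_id references that match no entry in toolsets."""
    known_ids = {toolset["id"] for toolset in metadata.get("toolsets", [])}

    # Collect (location, toolset_id) pairs from defaults and from each file entry.
    references = []
    default_id = metadata.get("defaults", {}).get("toolset_id")
    if default_id is not None:
        references.append(("defaults.toolset_id", default_id))
    for index, entry in enumerate(metadata.get("files", [])):
        if "toolset_id" in entry:
            references.append((f"files[{index}].toolset_id", entry["toolset_id"]))

    return [f"{where} -> {ref}" for where, ref in references if ref not in known_ids]
```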

asset_index

Pointer to the asset deduplication index:

{
  "asset_index": {
    "path": "assets.jsonl",
    "format": "jsonl"
  }
}
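
A minimal sketch of following this pointer is shown below. It assumes that path is resolved relative to the directory containing metadata.json and that the index holds one JSON object per line; neither detail is confirmed on this page, and read_asset_index is a hypothetical helper.

```python
import json
from pathlib import Path


def read_asset_index(metadata_path: str) -> list[dict]:
    """Load the records of the asset index named by metadata.json."""
    metadata_file = Path(metadata_path)
    metadata = json.loads(metadata_file.read_text(encoding="utf-8"))
    index = metadata.get("asset_index")
    if index is None:
        return []  # asset_index is optional
    if index.get("format", "jsonl") != "jsonl":
        raise ValueError(f"unsupported asset index format: {index.get('format')!r}")
    # Assumption: the path is relative to the metadata.json directory.
    index_file = metadata_file.parent / index["path"]
    with index_file.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```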

files

Array of file configurations. See Shards Reference.
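
Each shard in the full example carries path, compression, record_count, and sha256. The sketch below checks one shard against those values. It assumes that sha256 is the hex digest of the file as stored on disk (compressed), that record_count counts non-empty lines after decompression, and that path is relative to a base directory supplied by the caller. None of these is confirmed here, and check_shard is a hypothetical helper.

```python
import gzip
import hashlib
from pathlib import Path


def check_shard(base_dir: str, shard: dict) -> list[str]:
    """Compare a shard file on disk with its metadata entry."""
    problems = []
    # Reads the whole shard into memory, which is fine for a sketch.
    raw = (Path(base_dir) / shard["path"]).read_bytes()

    expected_hash = shard.get("sha256")
    if expected_hash is not None and hashlib.sha256(raw).hexdigest() != expected_hash:
        problems.append("sha256 mismatch")

    # Only gzip is handled; any other value is treated as uncompressed.
    data = gzip.decompress(raw) if shard.get("compression") == "gzip" else raw
    found = sum(1 for line in data.splitlines() if line.strip())
    expected_count = shard.get("record_count")
    if expected_count is not None and found != expected_count:
        problems.append(f"record_count mismatch: expected {expected_count}, found {found}")
    return problems
```

For large shards, a streaming variant that hashes and counts in chunks would avoid holding the file in memory.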