Records Overview

Each line in a JSONL shard file is a record. The record structure depends on the training objective specified in the file configuration.

Base Record Structure

All records share these common fields:

Field        Type     Required   Description
id           string   No         Unique record identifier
messages     array    Yes        Conversation messages
toolset_id   string   No         Override file/default toolset
metadata     object   No         Custom record metadata
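
As a rough illustration, the base fields in the table above could be checked one JSONL line at a time. This is a minimal Python sketch, not the actual validator: the function name `check_base_fields` and its error messages are hypothetical, and only the field names, types, and required flags come from the table.

```python
import json

# Field name -> (expected Python type, required), taken from the table above.
BASE_FIELDS = {
    "id": (str, False),
    "messages": (list, True),
    "toolset_id": (str, False),
    "metadata": (dict, False),
}


def check_base_fields(line):
    """Return a list of problems found in one JSONL line (empty if none)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["record must be a JSON object"]

    problems = []
    for name, (expected_type, required) in BASE_FIELDS.items():
        if name not in record:
            # Only required fields are reported when absent.
            if required:
                problems.append(f"missing required field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(
                f"field {name} must be of type {expected_type.__name__}"
            )
    return problems
```

Calling `check_base_fields('{"id": "a"}')` reports the missing `messages` field, while a line containing only `"messages"` passes, since every other base field is optional.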

Record Types

Training Objective   Record Schema       Key Requirement
SFT                  sft_record          At least one assistant message
Preference           preference_record   Candidates with chosen/rejected labels
RFT                  rft_record          Reference object with grading criteria
CPT                  cpt_record          Document-only, no assistant/tool messages

Message Structure

Each message is a JSON object of the following form:

{
  "role": "user",
  "content": [
    { "type": "text", "text": "Hello!" }
  ],
  "name": "optional-name",
  "annotations": {}
}
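
When generating shards programmatically, a small helper can keep messages in this shape. The sketch below is an assumption-laden example: `text_message` is a hypothetical helper, `name` is treated as optional (as the placeholder value above suggests), and `annotations` is left out because this page does not describe its contents.

```python
import json
from typing import Optional


def text_message(role: str, text: str, name: Optional[str] = None) -> dict:
    """Build a message whose content is a single text part."""
    message = {
        "role": role,
        # Content is an array of parts, even when there is only one.
        "content": [{"type": "text", "text": text}],
    }
    if name is not None:
        message["name"] = name
    return message


record = {"messages": [text_message("user", "Hello!")]}
line = json.dumps(record)  # one record serialized as a single JSONL line
```

Because `json.dumps` emits no newlines by default, `line` can be written to a shard file followed by a single `\n`.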

Roles

Role        Description             Used In
system      System instructions     All objectives
user        User input              All objectives
assistant   Model response          SFT, Preference, RFT
tool        Tool execution result   SFT, Preference, RFT
document    Document context        CPT

Content Parts

Content is always an array of content parts. See Content Parts Reference.

Validation

Records are validated against objective-specific schemas:

  • SFT: Must have at least one message with role: "assistant"
  • Preference: Must have candidates array with chosen and rejected labels
  • RFT: Must have reference object
  • CPT: Must NOT have any assistant or tool messages
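
The four rules above can be sketched as a pre-flight check run before uploading a shard. This is an illustrative Python sketch, not the real schema validation: `validate_record` and the lowercase objective keys are hypothetical, and the field names `candidates` and `reference` are taken from the rules above. The layout of the chosen/rejected labels inside `candidates` is not specified here, so the sketch only checks that the array exists.

```python
def validate_record(record: dict, objective: str) -> list:
    """Return rule violations for one record under the given objective."""
    roles = [m.get("role") for m in record.get("messages", [])]
    errors = []
    if objective == "sft":
        if "assistant" not in roles:
            errors.append("SFT: needs at least one assistant message")
    elif objective == "preference":
        # Label layout inside candidates is not checked (unspecified here).
        if not isinstance(record.get("candidates"), list):
            errors.append("Preference: needs a candidates array")
    elif objective == "rft":
        if not isinstance(record.get("reference"), dict):
            errors.append("RFT: needs a reference object")
    elif objective == "cpt":
        if any(role in ("assistant", "tool") for role in roles):
            errors.append("CPT: assistant and tool messages are not allowed")
    else:
        errors.append(f"unknown objective: {objective}")
    return errors
```

An empty list means the record passed these basic checks; it does not guarantee the record satisfies the full objective-specific schema.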

Record ID Best Practices

While id is optional, it’s recommended for:

  • Debugging and error tracking
  • Deduplication
  • Reproducibility

Use meaningful, unique IDs:

{"id": "math-001", "messages": [...]}
{"id": "code-review-042", "messages": [...]}
{"id": "uuid-550e8400-e29b-41d4-a716-446655440000", "messages": [...]}

Metadata

The metadata field can store custom information:

{
  "id": "example",
  "messages": [...],
  "metadata": {
    "source": "human-annotation",
    "annotator_id": "ann-123",
    "quality_score": 0.95
  }
}
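
One way such metadata might be used is to filter records before building a shard. The sketch below is only an example: `keep_high_quality` is a hypothetical helper, `quality_score` is the custom key from the example above rather than a field the format defines, and the 0.9 threshold is arbitrary.

```python
def keep_high_quality(records, threshold=0.9):
    """Keep records whose metadata.quality_score is at least `threshold`."""
    kept = []
    for record in records:
        # metadata is optional, so fall back to an empty dict.
        score = record.get("metadata", {}).get("quality_score")
        if isinstance(score, (int, float)) and score >= threshold:
            kept.append(record)
    return kept
```

Records without a `metadata` object or without a numeric `quality_score` are dropped by this filter, which may or may not be the behavior you want.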