Records Overview

Each line in a JSONL shard file is a record. The record structure depends on the training objective specified in the file configuration.
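
As a minimal sketch of the one-record-per-line layout, the following Python parses JSONL lines into records. The function name `iter_records` is hypothetical and not part of any official tooling, and skipping blank lines is an assumption rather than documented behavior.

```python
import json
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per non-empty JSONL line (hypothetical helper)."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            # Assumption: blank lines are skipped rather than treated as errors.
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number}: invalid JSON ({exc.msg})") from exc
        if not isinstance(record, dict):
            # Each record is expected to be a JSON object, not an array or scalar.
            raise ValueError(f"line {line_number}: record must be a JSON object")
        yield record
```

A file object opened in text mode can be passed directly, since iterating over it yields one line at a time.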

Base Record Structure

All records share these common fields:

Field	Type	Required	Description
`id`	string	No	Unique record identifier
`messages`	array	Yes	Conversation messages
`toolset_id`	string	No	Overrides the file-level or default toolset for this record
`metadata`	object	No	Custom record metadata

Record Types

Training Objective	Record Schema	Key Requirement
SFT	`sft_record`	At least one assistant message
Preference	`preference_record`	Candidates with chosen/rejected labels
RFT	`rft_record`	Reference object with grading criteria
CPT	`cpt_record`	Document-only, no assistant/tool messages

Message Structure

Each message has:

{
  "role": "user",
  "content": [{ "type": "text", "text": "Hello!" }],
  "name": "optional-name",
  "annotations": {}
}

Roles

Role	Description	Used In
`system`	System instructions	All objectives
`user`	User input	All objectives
`assistant`	Model response	SFT, Preference, RFT
`tool`	Tool execution result	SFT, Preference, RFT
`document`	Document context	CPT

Content Parts

Content is always an array of content parts. See Content Parts Reference.
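
Because content is an array even for a single piece of text, a small helper can avoid passing a bare string by mistake. The sketch below is illustrative only: `text_message` and `ROLES` are hypothetical names, and the only content part type it produces is the `text` part shown in the message example above.

```python
from typing import Optional

# Roles taken from the roles table above.
ROLES = {"system", "user", "assistant", "tool", "document"}


def text_message(role: str, text: str, name: Optional[str] = None) -> dict:
    """Build a message whose content is a one-element array with a text part."""
    if role not in ROLES:
        raise ValueError(f"unknown role: {role!r}")
    message = {
        "role": role,
        # Content is always an array of parts, never a plain string.
        "content": [{"type": "text", "text": text}],
    }
    if name is not None:
        message["name"] = name
    return message
```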

Validation

Records are validated against objective-specific schemas:

SFT: Must have at least one message with role: "assistant"
Preference: Must have a candidates array with chosen and rejected labels
RFT: Must have a reference object
CPT: Must NOT have any assistant or tool messages
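
The rules above can be approximated with a short pre-flight check. This is a sketch under stated assumptions, not the actual validator: the function name `validate_record` and the objective strings `"sft"`, `"preference"`, `"rft"`, and `"cpt"` are hypothetical. The layout of labels inside each candidate is not specified in this section, so the sketch only checks that `candidates` is a non-empty array.

```python
from typing import List


def validate_record(record: dict, objective: str) -> List[str]:
    """Return a list of problems; an empty list means these checks passed."""
    messages = record.get("messages")
    if not isinstance(messages, list):
        return ["'messages' is required and must be an array"]

    # Collect the roles that appear in the record's messages.
    roles = {
        m["role"]
        for m in messages
        if isinstance(m, dict) and isinstance(m.get("role"), str)
    }

    errors: List[str] = []
    if objective == "sft":
        if "assistant" not in roles:
            errors.append("SFT records need at least one assistant message")
    elif objective == "preference":
        candidates = record.get("candidates")
        # Assumption: only presence is checked; label layout is not verified.
        if not isinstance(candidates, list) or not candidates:
            errors.append("Preference records need a non-empty 'candidates' array")
    elif objective == "rft":
        if not isinstance(record.get("reference"), dict):
            errors.append("RFT records need a 'reference' object")
    elif objective == "cpt":
        forbidden = sorted(roles & {"assistant", "tool"})
        if forbidden:
            errors.append(f"CPT records must not contain {forbidden} messages")
    else:
        errors.append(f"unknown objective: {objective!r}")
    return errors
```

Returning a list of messages rather than raising on the first failure makes it easier to report every problem in a shard in one pass.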

Record ID Best Practices

While id is optional, it’s recommended for:

Debugging and error tracking
Deduplication
Reproducibility

Use meaningful, unique IDs:

{"id": "math-001", "messages": [...]}
{"id": "code-review-042", "messages": [...]}
{"id": "uuid-550e8400-e29b-41d4-a716-446655440000", "messages": [...]}
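
To illustrate the deduplication use case, the following sketch reports IDs that occur more than once in a collection of parsed records. The helper name `duplicate_ids` is hypothetical, and records without an `id` are simply ignored, since the field is optional.

```python
from collections import Counter
from typing import Iterable, List


def duplicate_ids(records: Iterable[dict]) -> List[str]:
    """Return the sorted list of record IDs that occur more than once."""
    # Records without an "id" cannot be compared, so they are skipped.
    counts = Counter(r["id"] for r in records if isinstance(r.get("id"), str))
    return sorted(record_id for record_id, n in counts.items() if n > 1)
```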

Metadata

The metadata field stores custom, user-defined information about a record:

{
  "id": "example",
  "messages": [...],
  "metadata": {
    "source": "human-annotation",
    "annotator_id": "ann-123",
    "quality_score": 0.95
  }
}
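
One way such metadata might be used downstream is to filter records before training. The sketch below assumes a numeric `quality_score` key like the one in the example above; that key is custom to the example rather than a reserved field, and `filter_by_quality` is a hypothetical helper name.

```python
from typing import Iterable, List


def filter_by_quality(records: Iterable[dict], min_score: float) -> List[dict]:
    """Keep records whose metadata.quality_score is at least min_score."""
    kept = []
    for record in records:
        metadata = record.get("metadata")
        if not isinstance(metadata, dict):
            # Metadata is optional; records without it are dropped here.
            continue
        score = metadata.get("quality_score")
        if isinstance(score, (int, float)) and score >= min_score:
            kept.append(record)
    return kept
```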