RFT Records
RFT Records
Reinforcement fine-tuning (RFT) records include reference information for automated grading, enabling training with verifiers.
Requirements
- reference object with grading criteria or correct answers
Basic Example
{
"id": "rft-001",
"messages": [
{
"role": "user",
"content": [{ "type": "text", "text": "What is 15 + 27?" }]
},
{
"role": "assistant",
"content": [{ "type": "text", "text": "42" }]
}
],
"reference": {
"grading": { "type": "exact_match" },
"answer": "42"
}
}
Reference Structure
The reference object is flexible but typically contains:
| Field | Description |
|---|---|
grading | Grading method configuration |
answer / answers | Expected answer(s) |
rubric | Rubric criteria for complex tasks |
Grading Types
Exact Match
{
"reference": {
"grading": { "type": "exact_match" },
"answer": "42"
}
}
Numeric Tolerance
{
"reference": {
"grading": { "type": "numeric", "tolerance": 0.01 },
"answer": 3.14159
}
}
Multiple Correct Answers
{
"reference": {
"grading": { "type": "any_of" },
"answers": ["Paris", "paris", "PARIS"]
}
}
Rubric-Based
{
"reference": {
"grading": { "type": "rubric" },
"rubric": {
"must_have": ["mentions Paris", "correct country"],
"nice_to_have": ["includes population", "mentions landmarks"],
"must_not_have": ["incorrect facts", "harmful content"]
}
}
}
Multi-Step Problems
For problems requiring multiple steps:
{
"id": "rft-multi-001",
"messages": [
{
"role": "user",
"content": [{ "type": "text", "text": "A store sells apples for $2 each. If you buy 5 apples and pay with a $20 bill, how much change do you get?" }]
},
{
"role": "assistant",
"content": [
{ "type": "reasoning", "text": "Cost = 5 * $2 = $10. Change = $20 - $10 = $10." },
{ "type": "text", "text": "$10" }
]
}
],
"reference": {
"grading": { "type": "exact_match", "format": "currency" },
"answers": {
"final": "$10",
"intermediate": {
"cost": "$10"
}
}
}
}
With Tool Use
{
"id": "rft-tool-001",
"messages": [
{
"role": "user",
"content": [{ "type": "text", "text": "Calculate the compound interest on $1000 at 5% for 3 years." }]
},
{
"role": "assistant",
"content": [
{ "type": "tool_call", "name": "calculator", "call_id": "c1", "arguments": { "expression": "1000 * (1.05 ^ 3)" } }
]
},
{
"role": "tool",
"content": [
{ "type": "tool_result", "name": "calculator", "call_id": "c1", "result": { "value": 1157.625 } }
]
},
{
"role": "assistant",
"content": [{ "type": "text", "text": "The total after 3 years is $1,157.63, so the compound interest earned is $157.63." }]
}
],
"reference": {
"grading": { "type": "numeric", "tolerance": 0.01 },
"answers": {
"total": 1157.63,
"interest": 157.63
}
}
}
Best Practices
- Clear grading criteria - Specify exactly how answers should be evaluated
- Handle edge cases - Include acceptable variations in answers
- Intermediate steps - For multi-step problems, track intermediate answers
- Consistent format - Use the same reference structure across similar tasks