Telco Quickstart
This guide helps mobile operator teams package training data using the LLM Training Data format. It focuses on practical data sources, the right training objective for each source, and exactly how the JSONL records should look. It assumes a mobile-first context common across African markets (prepaid dominance, multi-SIM usage, multilingual support).
What Data Mobile Operators Can Provide (and where it fits)
Conversational support interactions (SFT + Preference)
- Live chat transcripts, email threads, and ticket conversations
- Call-center summaries, chatbot logs, and agent playbooks
- Escalation summaries and handoff notes
Format hint: Convert each conversation into a messages array with user and assistant roles. For preference data, keep the prompt in messages and put the alternative responses in candidates.
Knowledge base + policy docs (CPT)
- FAQ pages, troubleshooting guides, policy docs, runbooks
- SIM registration, SIM swap policy, APN setup, roaming guidance, device setup instructions
- Network change procedures and internal reference docs
Format hint: CPT records are document-only. Use role: "document" and do not include assistant or tool messages.
Troubleshooting workflows + diagnostics (CPT or SFT)
- Step-by-step troubleshooting decision trees (no service, call drops, data throttling)
- Common outage flows and resolution paths
- “If X then Y” playbooks and scripts
Format hint:
- CPT: flatten each workflow into a single document record.
- SFT: turn each workflow into a prompt + ideal response.
Product, plan, and device catalogs (CPT or SFT)
- Plan descriptions, bundle structures, coverage maps, device and handset specs
- Prepaid/postpaid eligibility, top-up rules, bundle activation windows
Format hint: Use content parts with type: "json" if you want to preserve structure.
QA, scoring, and review data (RL data)
- Pairwise comparisons of answers (chosen vs. rejected)
- Ranked lists of candidate responses
- Rubric-scored QA or verifier outputs
Format hint: Use the preference objective for pairwise/ranked data and rft for rubric-scored data with a reference object.
Usually excluded or heavily reduced
- Call detail records (CDRs), precise location data, IMSI/IMEI
- Raw network logs with subscriber identifiers
- Billing or payment data
If these are needed, reduce or aggregate them before formatting.
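As one way to reduce identifiers before formatting, the sketch below masks number-like subscriber identifiers in free text. The patterns are assumptions, not part of the format: it treats any bare 15-digit number as an IMSI/IMEI and any 9 to 14 digit number (with an optional leading "+") as a phone number. The function name redact_identifiers and the placeholder tokens are hypothetical, and a regex pass like this does not replace a privacy review.

```python
import re

# Assumption: IMSI and IMEI values appear as bare 15-digit numbers, and phone
# numbers (MSISDNs) as 9 to 14 digits with an optional leading "+".
# Real exports vary, so treat these patterns as a starting point only.
LONG_ID = re.compile(r"(?<!\d)\d{15}(?!\d)")
MSISDN = re.compile(r"(?<![\d+])\+?\d{9,14}(?!\d)")


def redact_identifiers(text: str) -> str:
    """Replace likely subscriber identifiers with placeholder tokens."""
    text = LONG_ID.sub("[SUBSCRIBER_ID]", text)
    return MSISDN.sub("[PHONE]", text)
```

Running the 15-digit pattern first keeps long identifiers from being partly matched as phone numbers. Short numbers such as "30 seconds" or a four-digit short code are left alone.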
Map Data Types to Objectives
| Data Type | Best Objective | Why it fits |
|---|---|---|
| Support chats / tickets | sft | Trains instruction-following behaviors |
| Support response comparisons | preference | Optimizes for better responses |
| Knowledge base / manuals | cpt | Large document corpora without conversation |
| Rubric-scored QA | rft | Teaches the model to satisfy grading criteria |
| Structured plans / device catalogs | cpt or sft | Can be docs (CPT) or Q&A (SFT) |
Package Format at a Glance
Every package has a metadata.json plus one or more JSONL shard files.
telco-training-package/
  metadata.json
  data/
    train-00000.jsonl
    train-00001.jsonl
    validation-00000.jsonl
A minimal metadata.json describing multiple objectives:
{
  "schema_version": "llm-training-data/v1",
  "dataset": {
    "name": "telco-support-mix",
    "dataset_id": "telco-support-mix-v1",
    "description": "Support conversations + KB docs"
  },
  "files": [
    {
      "split": "train",
      "objective": "sft",
      "format": "jsonl",
      "shards": [{ "path": "data/train-00000.jsonl", "record_count": 12000 }]
    },
    {
      "split": "train",
      "objective": "cpt",
      "format": "jsonl",
      "shards": [{ "path": "data/cpt-train-00000.jsonl", "record_count": 50000 }]
    }
  ]
}
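Hand-typed record_count values tend to drift from the shard contents. The sketch below derives one entry of the files list from the shards on disk, assuming one JSON record per non-empty line. The helper names count_records and build_files_entry are hypothetical; only the field names come from the metadata.json example above.

```python
from pathlib import Path


def count_records(shard_path: Path) -> int:
    """Count non-empty lines, assuming one JSON record per line."""
    with shard_path.open(encoding="utf-8") as fh:
        return sum(1 for line in fh if line.strip())


def build_files_entry(
    package_dir: Path, split: str, objective: str, shard_paths: list[str]
) -> dict:
    """Build one entry of the "files" list with counts read from disk."""
    return {
        "split": split,
        "objective": objective,
        "format": "jsonl",
        "shards": [
            # Paths are kept relative to the package root, as in the example.
            {"path": p, "record_count": count_records(package_dir / p)}
            for p in shard_paths
        ],
    }
```

For example, build_files_entry(Path("telco-training-package"), "train", "sft", ["data/train-00000.jsonl"]) would return a dictionary that can be appended to the files list before metadata.json is written.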
What CPT Data Looks Like
CPT records are document-only and should not include assistant or tool messages.
{
  "id": "kb-001",
  "messages": [
    {
      "role": "document",
      "content": [
        {
          "type": "text",
          "text": "APN setup (Android): 1) Settings → Network & Internet → Mobile network → Access Point Names..."
        }
      ]
    }
  ],
  "metadata": {
    "source": "internal-kb",
    "doc_type": "troubleshooting"
  }
}
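A knowledge-base article can be wrapped into this shape with a small helper. This is a minimal sketch that mirrors the record above; the function names make_cpt_record and is_document_only are hypothetical, and the check treats "document-only" as "every message has role document", which is an assumption about how strictly the rule is applied.

```python
def make_cpt_record(record_id: str, text: str, source: str, doc_type: str) -> dict:
    """Wrap one document's text in a CPT record shaped like the example."""
    return {
        "id": record_id,
        "messages": [
            {"role": "document", "content": [{"type": "text", "text": text}]}
        ],
        "metadata": {"source": source, "doc_type": doc_type},
    }


def is_document_only(record: dict) -> bool:
    """True when the record has messages and all of them use role "document"."""
    messages = record.get("messages") or []
    return bool(messages) and all(m.get("role") == "document" for m in messages)
```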
What SFT Data Looks Like
SFT records must include at least one assistant message.
{
  "id": "sft-042",
  "messages": [
    {
      "role": "user",
      "content": [{ "type": "text", "text": "My mobile data is slow in the evening. What should I check?" }]
    },
    {
      "role": "assistant",
      "content": [
        {
          "type": "text",
          "text": "Check signal bars and network mode, then run a speed test during peak hours. If speeds are still low, share the results and your location with support so we can check local congestion or coverage."
        }
      ]
    }
  ]
}
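Chat or ticket exports usually need a small conversion step to reach this shape. The sketch below assumes a transcript is available as (speaker, text) pairs with the speaker labels "customer" and "agent"; both the input shape and the name transcript_to_sft are hypothetical, so adjust the mapping to whatever your export actually uses.

```python
# Assumed speaker labels from a support export; not defined by the format.
ROLE_MAP = {"customer": "user", "agent": "assistant"}


def transcript_to_sft(record_id: str, turns: list[tuple[str, str]]) -> dict:
    """Convert (speaker, text) turns into an SFT record with a messages array."""
    messages = [
        {
            "role": ROLE_MAP[speaker],
            "content": [{"type": "text", "text": text.strip()}],
        }
        for speaker, text in turns
    ]
    # SFT records must include at least one assistant message.
    if not any(m["role"] == "assistant" for m in messages):
        raise ValueError("SFT records need at least one assistant message")
    return {"id": record_id, "messages": messages}
```

An unknown speaker label raises a KeyError here, which is deliberate: it is easier to extend the mapping than to discover mislabeled roles after training.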
What RL Data Looks Like (Preference + RFT)
Preference (pairwise or ranked)
Preference records keep the prompt in messages and store candidate responses in candidates with label values.
{
  "id": "pref-001",
  "messages": [
    {
      "role": "user",
      "content": [{ "type": "text", "text": "How do I fix 'no service' on my phone?" }]
    }
  ],
  "candidates": [
    {
      "label": "chosen",
      "messages": [
        {
          "role": "assistant",
          "content": [{ "type": "text", "text": "Toggle airplane mode, then restart the phone. Check that your SIM is seated properly and that your account is active. If it persists, support can check for a local outage." }]
        }
      ]
    },
    {
      "label": "rejected",
      "messages": [
        {
          "role": "assistant",
          "content": [{ "type": "text", "text": "It will probably fix itself. Just wait." }]
        }
      ]
    }
  ]
}
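Pairwise QA reviews map onto this record directly. The following sketch builds a preference record from a prompt and two reviewed answers; the helper names text_message and make_preference_record are hypothetical. It covers only the chosen/rejected case shown above, because the label values for ranked lists are not shown here.

```python
def text_message(role: str, text: str) -> dict:
    """Build a single-part text message."""
    return {"role": role, "content": [{"type": "text", "text": text}]}


def make_preference_record(
    record_id: str, prompt: str, chosen: str, rejected: str
) -> dict:
    """Build a pairwise preference record shaped like the example above."""
    # Identical candidates carry no preference signal, so reject them early.
    if chosen.strip() == rejected.strip():
        raise ValueError("chosen and rejected responses are identical")
    return {
        "id": record_id,
        "messages": [text_message("user", prompt)],
        "candidates": [
            {"label": "chosen", "messages": [text_message("assistant", chosen)]},
            {"label": "rejected", "messages": [text_message("assistant", rejected)]},
        ],
    }
```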
RFT (rubric-scored)
RFT records require a reference object describing the grading criteria, expected format, or target answer.
{
  "id": "rft-010",
  "messages": [
    {
      "role": "user",
      "content": [{ "type": "text", "text": "Summarize today's cell-site outage for leadership." }]
    }
  ],
  "reference": {
    "rubric": {
      "criteria": ["accuracy", "clarity", "actionability"],
      "scale": [1, 5]
    },
    "notes": "Include timeline, impact, and next steps."
  }
}
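If rubrics live in a QA tool, a helper like the one below can attach them to prompts. It is a sketch under two assumptions: that scale is a [low, high] pair, as the example suggests, and that a rubric needs at least one criterion. The name make_rft_record is hypothetical.

```python
def make_rft_record(
    record_id: str,
    prompt: str,
    criteria: list[str],
    scale: tuple[int, int],
    notes: str = "",
) -> dict:
    """Build an RFT record with a rubric-style reference object."""
    low, high = scale
    if not criteria:
        raise ValueError("rubric needs at least one criterion")
    # Assumption: scale means [lowest score, highest score].
    if low >= high:
        raise ValueError("scale must be [low, high] with low < high")
    reference = {"rubric": {"criteria": list(criteria), "scale": [low, high]}}
    if notes:
        reference["notes"] = notes
    return {
        "id": record_id,
        "messages": [
            {"role": "user", "content": [{"type": "text", "text": prompt}]}
        ],
        "reference": reference,
    }
```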
Quickstart Steps
1. Pick data sources + objectives: Start with the data types above and choose the objective that matches the training goal (CPT, SFT, preference, or RFT).
2. Format JSONL records by objective: CPT = document-only, SFT = conversation with assistant responses, preference = candidates with chosen/rejected, RFT = rubric-scored with a reference object.
3. Create metadata.json: Describe the dataset and list each shard file with its split and objective. See the Metadata Reference for all fields.
4. Package shards + validate: Put JSONL files under data/, then validate metadata.json with the DataPass Validator. Use the Sample Packages if you want a copyable template.
5. Compliance + publishing (last): After the data is formatted, run your privacy/legal review and decide whether to publish a Data Card. If you do, be explicit about rights, provenance, access, and personal data handling.
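Before running the DataPass Validator, a local pre-check can catch the per-objective rules summarized in this guide. The sketch below is not the validator and does not reproduce its interface, which is not shown here; the function names check_record and check_shard_lines are hypothetical, and the checks cover only the rules stated in this guide.

```python
import json


def check_record(record: dict, objective: str) -> list[str]:
    """Return a list of problems; an empty list means these checks passed."""
    problems = []
    roles = [m.get("role") for m in record.get("messages", [])]
    if objective == "cpt":
        if any(r in ("assistant", "tool") for r in roles):
            problems.append("cpt record contains assistant or tool messages")
    elif objective == "sft":
        if "assistant" not in roles:
            problems.append("sft record has no assistant message")
    elif objective == "preference":
        if not record.get("candidates"):
            problems.append("preference record has no candidates")
    elif objective == "rft":
        if not isinstance(record.get("reference"), dict):
            problems.append("rft record has no reference object")
    else:
        problems.append(f"unknown objective: {objective}")
    return problems


def check_shard_lines(lines, objective: str) -> list[tuple[int, str]]:
    """Check JSONL lines one by one; return (line_number, problem) pairs."""
    found = []
    for n, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines such as a trailing newline
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            found.append((n, "not valid JSON"))
            continue
        if not isinstance(record, dict):
            found.append((n, "record is not a JSON object"))
            continue
        found.extend((n, p) for p in check_record(record, objective))
    return found
```

Because check_shard_lines accepts any iterable of strings, an open file handle for a shard under data/ can be passed in directly, together with the objective listed for that shard in metadata.json.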