Telco Quickstart

This guide helps mobile operator teams package training data using the LLM Training Data format. It focuses on practical data sources, the right training objective for each source, and exactly how the JSONL records should look. It assumes a mobile-first context common across African markets (prepaid dominance, multi-SIM usage, multilingual support).

What Data Mobile Operators Can Provide (and where it fits)

Conversational support interactions (SFT + Preference)

Live chat transcripts, email threads, and ticket conversations
Call-center summaries, chatbot logs, and agent playbooks
Escalation summaries and handoff notes

Format hint: Convert each conversation into a messages array with user + assistant roles. For preference data, keep the prompt in messages and put alternate responses in candidates.
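
As a rough illustration of that conversion, the sketch below turns a list of (speaker, text) turns into one record with a messages array. The speaker labels in ROLE_MAP, the function name, and the sample turns are hypothetical; map them to whatever your chat export actually contains.

```python
import json

# Hypothetical speaker labels; adjust to the labels in your own chat export.
ROLE_MAP = {"customer": "user", "agent": "assistant"}

def transcript_to_sft(record_id, turns):
    """Convert (speaker, text) turns into a record with a messages array."""
    messages = []
    for speaker, text in turns:
        role = ROLE_MAP[speaker]  # raises KeyError on an unmapped speaker label
        messages.append(
            {"role": role, "content": [{"type": "text", "text": text.strip()}]}
        )
    return {"id": record_id, "messages": messages}

# Made-up sample turns, for illustration only.
turns = [
    ("customer", "My bundle did not activate after I topped up. "),
    ("agent", "Please confirm your balance first, then retry the bundle purchase."),
]
record = transcript_to_sft("sft-chat-0001", turns)
print(json.dumps(record, ensure_ascii=False))  # one JSONL line
```

Failing loudly on an unknown speaker label is deliberate here: silently dropping system or bot turns could change the meaning of a conversation.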

Knowledge base + policy docs (CPT)

FAQ pages, troubleshooting guides, policy docs, runbooks
SIM registration, SIM swap policy, APN setup, roaming guidance, device setup instructions
Network change procedures and internal reference docs

Format hint: CPT records are document-only. Use role: "document" and do not include assistant or tool messages.
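
A minimal sketch of wrapping one knowledge-base page as a document-only record follows. The function name, the title-plus-body layout, and the placeholder text are assumptions made for illustration; only the role, content, and metadata.source fields come from the examples in this guide.

```python
def kb_doc_to_cpt(record_id, title, body, source="internal-kb"):
    """Wrap one knowledge-base document as a document-only record."""
    # Assumption: title and body are joined into a single text part.
    text = f"{title}\n\n{body}".strip()
    return {
        "id": record_id,
        "messages": [
            {"role": "document", "content": [{"type": "text", "text": text}]}
        ],
        "metadata": {"source": source},
    }

# Placeholder body text; substitute the real document content.
rec = kb_doc_to_cpt("kb-sim-swap", "SIM swap policy", "<policy text goes here>")
```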

Troubleshooting workflows + diagnostics (CPT or SFT)

Step-by-step troubleshooting decision trees (no service, call drops, data throttling)
Common outage flows and resolution paths
“If X then Y” playbooks and scripts

Format hint:

CPT: flatten each workflow into a single document record.
SFT: turn each workflow into a prompt + ideal response.
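
One way to flatten a workflow for CPT is sketched below. It assumes a hypothetical in-memory tree in which each node has a "check" question plus "yes" and "no" branches, and each leaf is a resolution step; real decision trees will likely need a different loader. The sample tree is invented for the example.

```python
def flatten_workflow(node, depth=0):
    """Render a nested yes/no decision tree as indented 'If ... is ...' lines."""
    indent = "  " * depth
    if isinstance(node, str):  # leaf: a resolution step
        return [indent + node]
    check = node["check"]
    lines = []
    for answer in ("yes", "no"):
        lines.append(f"{indent}If '{check}' is {answer}:")
        lines.extend(flatten_workflow(node[answer], depth + 1))
    return lines

# Hypothetical tree shape and content, for illustration only.
tree = {
    "check": "Phone shows signal bars",
    "yes": "Restart the phone and retry the call.",
    "no": {
        "check": "SIM is detected",
        "yes": "Toggle airplane mode, then check for a local outage.",
        "no": "Reseat the SIM card.",
    },
}
text = "\n".join(flatten_workflow(tree))
cpt_record = {
    "id": "wf-no-service",
    "messages": [{"role": "document", "content": [{"type": "text", "text": text}]}],
}
```

For SFT, the same tree could instead be walked once per leaf, with the path of checks forming the prompt and the leaf forming the ideal response.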

Product, plan, and device catalogs (CPT or SFT)

Plan descriptions, bundle structures, coverage maps, device and handset specs
Prepaid/postpaid eligibility, top-up rules, bundle activation windows

Format hint: Use content parts with type: "json" if you want to preserve structure.
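
The sketch below shows what a structured catalog entry might look like as a content part. This guide only states that the part has type: "json"; the name of the key that carries the payload (written here as "json") is an assumption, so check the format specification before relying on it. The plan values are invented.

```python
import json

# Invented plan data, for illustration only.
plan = {"name": "Weekly 2GB", "type": "prepaid", "validity_days": 7, "data_mb": 2048}

record = {
    "id": "catalog-plan-001",
    "messages": [
        {
            "role": "document",
            # ASSUMPTION: the payload key is called "json"; confirm against the spec.
            "content": [{"type": "json", "json": plan}],
        }
    ],
}

# json.dumps without indent keeps the record on a single JSONL line.
line = json.dumps(record, ensure_ascii=False)
```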

QA, scoring, and review data (Preference + RFT)

Pairwise comparisons of answers (chosen vs. rejected)
Ranked lists of candidate responses
Rubric-scored QA or verifier outputs

Format hint: Use the preference objective for pairwise or ranked data. Use rft for rubric-scored data, and include a reference object in each record.

Usually excluded or heavily reduced

Call detail records (CDRs), precise location data, IMSI/IMEI
Raw network logs with subscriber identifiers
Billing or payment data

If any of these are needed, reduce them first (drop fields and subscriber identifiers you do not need) or aggregate them, and only then format the result.
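
As a first-pass illustration of reducing identifiers in free text, the sketch below masks long digit runs. The patterns and placeholder tokens are assumptions: they treat any 15-digit number as an IMSI or IMEI and any 10 to 13 digit number as a phone number, which will both over-match and under-match on real logs. It is not a substitute for a privacy review.

```python
import re

# Illustrative patterns only; tune them to your numbering plans and log formats.
IMSI_IMEI = re.compile(r"\b\d{15}\b")      # IMSI and IMEI values are typically 15 digits
MSISDN = re.compile(r"\+?\b\d{10,13}\b")   # rough match for phone numbers

def redact(text):
    """Replace likely subscriber and device identifiers with placeholder tokens."""
    text = IMSI_IMEI.sub("[DEVICE_OR_SUBSCRIBER_ID]", text)
    return MSISDN.sub("[PHONE_NUMBER]", text)

# Made-up identifiers, for illustration only.
sample = "IMEI 356938035643809 called from +254712345678"
print(redact(sample))
```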

Map Data Types to Objectives

Data Type	Best Objective	Why it fits
Support chats / tickets	`sft`	Trains instruction-following behaviors
Support response comparisons	`preference`	Optimizes for better responses
Knowledge base / manuals	`cpt`	Large document corpora without conversation
Rubric-scored QA	`rft`	Teaches the model to satisfy grading criteria
Structured plans / device catalogs	`cpt` or `sft`	Can be docs (CPT) or Q&A (SFT)

Package Format at a Glance

Every package has a metadata.json plus one or more JSONL shard files.

telco-training-package/
  metadata.json
  data/
    train-00000.jsonl
    train-00001.jsonl
    validation-00000.jsonl

A minimal metadata.json describing multiple objectives:

{
  "schema_version": "llm-training-data/v1",
  "dataset": {
    "name": "telco-support-mix",
    "dataset_id": "telco-support-mix-v1",
    "description": "Support conversations + KB docs"
  },
  "files": [
    {
      "split": "train",
      "objective": "sft",
      "format": "jsonl",
      "shards": [{ "path": "data/train-00000.jsonl", "record_count": 12000 }]
    },
    {
      "split": "train",
      "objective": "cpt",
      "format": "jsonl",
      "shards": [{ "path": "data/cpt-train-00000.jsonl", "record_count": 50000 }]
    }
  ]
}
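
Hand-typed record counts tend to drift from the shard contents, so one option is to generate metadata.json from the files themselves. The sketch below is an assumption about how that could be done, not part of the format: the helper names are hypothetical, it emits only the fields shown in the example above, and it writes one shard per files entry.

```python
import json
import tempfile
from pathlib import Path

def count_records(path):
    """Count non-empty lines (JSONL records) in one shard."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())

def build_metadata(root, name, dataset_id, description, shards):
    """shards: list of (split, objective, path relative to root) tuples."""
    files = [
        {
            "split": split,
            "objective": objective,
            "format": "jsonl",
            "shards": [{"path": rel, "record_count": count_records(Path(root) / rel)}],
        }
        for split, objective, rel in shards
    ]
    return {
        "schema_version": "llm-training-data/v1",
        "dataset": {"name": name, "dataset_id": dataset_id, "description": description},
        "files": files,
    }

# Demo in a throwaway directory with a two-record shard.
with tempfile.TemporaryDirectory() as root:
    (Path(root) / "data").mkdir()
    (Path(root) / "data/train-00000.jsonl").write_text(
        '{"id": "a"}\n{"id": "b"}\n', encoding="utf-8"
    )
    meta = build_metadata(
        root, "telco-support-mix", "telco-support-mix-v1",
        "Support conversations + KB docs",
        [("train", "sft", "data/train-00000.jsonl")],
    )
    (Path(root) / "metadata.json").write_text(json.dumps(meta, indent=2), encoding="utf-8")
```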

What CPT Data Looks Like

CPT records are document-only and should not include assistant or tool messages.

{
  "id": "kb-001",
  "messages": [
    {
      "role": "document",
      "content": [
        {
          "type": "text",
          "text": "APN setup (Android): 1) Settings → Network & Internet → Mobile network → Access Point Names..."
        }
      ]
    }
  ],
  "metadata": {
    "source": "internal-kb",
    "doc_type": "troubleshooting"
  }
}
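
Before packaging, a quick local check can catch records that are not document-only. The function below is a hypothetical pre-check based only on the rule stated in this section; it does not reproduce any official validator.

```python
def cpt_problems(record):
    """Return a list of problems; an empty list means the record looks document-only."""
    roles = [m.get("role") for m in record.get("messages", [])]
    if not roles:
        return ["no messages"]
    # Any role other than "document" (assistant, tool, user, ...) is flagged.
    return [f"unexpected role: {role}" for role in roles if role != "document"]

ok = {"id": "kb-001", "messages": [{"role": "document", "content": []}]}
bad = {"id": "kb-002", "messages": [{"role": "document"}, {"role": "assistant"}]}
print(cpt_problems(ok), cpt_problems(bad))
```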

What SFT Data Looks Like

SFT records must include at least one assistant message.

{
  "id": "sft-042",
  "messages": [
    {
      "role": "user",
      "content": [{ "type": "text", "text": "My mobile data is slow in the evening. What should I check?" }]
    },
    {
      "role": "assistant",
      "content": [
        {
          "type": "text",
          "text": "Check signal bars and network mode, then run a speed test during peak hours. If speeds are still low, share the results and your location with support so we can check local congestion or coverage."
        }
      ]
    }
  ]
}
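
The at-least-one-assistant rule is easy to check across a whole shard. The sketch below scans JSONL text and reports the ids of records that break it; the function name and the sample lines are hypothetical.

```python
import json

def sft_missing_assistant(jsonl_text):
    """Return the ids of records that contain no assistant message."""
    missing = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        roles = {m["role"] for m in record["messages"]}
        if "assistant" not in roles:
            missing.append(record["id"])
    return missing

# Two made-up records: the second has only a user turn.
shard = "\n".join([
    json.dumps({"id": "sft-1", "messages": [{"role": "user"}, {"role": "assistant"}]}),
    json.dumps({"id": "sft-2", "messages": [{"role": "user"}]}),
])
print(sft_missing_assistant(shard))
```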

What RL Data Looks Like (Preference + RFT)

Preference (pairwise or ranked)

Preference records keep the prompt in messages and store the candidate responses in candidates, each with a label such as chosen or rejected.

{
  "id": "pref-001",
  "messages": [
    {
      "role": "user",
      "content": [{ "type": "text", "text": "How do I fix 'no service' on my phone?" }]
    }
  ],
  "candidates": [
    {
      "label": "chosen",
      "messages": [
        {
          "role": "assistant",
          "content": [{ "type": "text", "text": "Toggle airplane mode, then restart the phone. Check that your SIM is seated properly and that your account is active. If it persists, support can check for a local outage." }]
        }
      ]
    },
    {
      "label": "rejected",
      "messages": [
        {
          "role": "assistant",
          "content": [{ "type": "text", "text": "It will probably fix itself. Just wait." }]
        }
      ]
    }
  ]
}
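
If your review data is a ranked list rather than pairs, one common approach is to expand it into every (better, worse) pair. The sketch below does that; it is an assumption about how to handle rankings, not something the format prescribes, and the helper names and id scheme are hypothetical.

```python
from itertools import combinations

def text_message(role, text):
    return {"role": role, "content": [{"type": "text", "text": text}]}

def ranked_to_pairs(prompt_id, prompt, ranked_responses):
    """ranked_responses is ordered best first; returns one record per (better, worse) pair."""
    records = []
    # combinations() keeps input order, so the first item of each pair ranks higher.
    for n, (better, worse) in enumerate(combinations(ranked_responses, 2)):
        records.append({
            "id": f"{prompt_id}-pair-{n}",
            "messages": [text_message("user", prompt)],
            "candidates": [
                {"label": "chosen", "messages": [text_message("assistant", better)]},
                {"label": "rejected", "messages": [text_message("assistant", worse)]},
            ],
        })
    return records

pairs = ranked_to_pairs("pref-002", "How do I set up roaming?", ["A", "B", "C"])
```

A ranking of n responses yields n(n-1)/2 records, so long rankings may be worth subsampling.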

RFT (rubric-scored)

RFT records require a reference object describing the grading criteria, expected format, or target answer.

{
  "id": "rft-010",
  "messages": [
    {
      "role": "user",
      "content": [{ "type": "text", "text": "Summarize today's cell-site outage for leadership." }]
    }
  ],
  "reference": {
    "rubric": {
      "criteria": ["accuracy", "clarity", "actionability"],
      "scale": [1, 5]
    },
    "notes": "Include timeline, impact, and next steps."
  }
}
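
A small local check can confirm that each RFT record carries a usable reference object. The sketch below is a hypothetical pre-check; the rubric checks assume the shape used in the example above (a criteria list and a [min, max] scale), which may not be the only shape the format allows.

```python
def rft_problems(record):
    """Return a list of problems with the record's reference object."""
    reference = record.get("reference")
    if not isinstance(reference, dict) or not reference:
        return ["missing or empty reference object"]
    problems = []
    rubric = reference.get("rubric")
    if rubric is not None:
        # ASSUMPTION: rubric mirrors the example above (criteria list, [min, max] scale).
        if not rubric.get("criteria"):
            problems.append("rubric has no criteria")
        scale = rubric.get("scale", [])
        if len(scale) != 2 or scale[0] >= scale[1]:
            problems.append("scale should be [min, max] with min < max")
    return problems

good = {"id": "rft-010", "reference": {"rubric": {"criteria": ["accuracy"], "scale": [1, 5]}}}
print(rft_problems(good), rft_problems({"id": "rft-011"}))
```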

Quickstart Steps

1. Pick data sources + objectives: Start with the data types above and choose the objective that matches the training goal (CPT, SFT, preference, or RFT).
2. Format JSONL records by objective: CPT = document-only; SFT = conversation with assistant responses; preference = candidates labeled chosen/rejected; RFT = rubric-scored with a reference object.
3. Create metadata.json: Describe the dataset and list each shard file with its split and objective. See the Metadata Reference for all fields.
4. Package shards + validate: Put JSONL files under data/, then validate metadata.json with the DataPass Validator. Use the Sample Packages if you want a copyable template.
5. Compliance + publishing (last): After the data is formatted, run your privacy/legal review and decide whether to publish a Data Card. If you do, be explicit about rights, provenance, access, and personal data handling.
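
The packaging step can be sketched as a small shard writer that follows the split-NNNNN.jsonl naming used in the package layout above. The function name and the default shard size are assumptions; pick a size that suits your tooling.

```python
import json
import tempfile
from pathlib import Path

def write_shards(records, out_dir, split="train", shard_size=10000):
    """Write records to <out_dir>/<split>-NNNNN.jsonl, shard_size records per file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for start in range(0, len(records), shard_size):
        chunk = records[start:start + shard_size]
        path = out_dir / f"{split}-{start // shard_size:05d}.jsonl"
        with open(path, "w", encoding="utf-8") as f:
            for record in chunk:
                # One JSON object per line; ensure_ascii=False keeps non-English text readable.
                f.write(json.dumps(record, ensure_ascii=False) + "\n")
        written.append((path.name, len(chunk)))
    return written

# Demo: five dummy records, two per shard, written to a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    shards = write_shards([{"id": f"r{i}"} for i in range(5)], Path(tmp) / "data", shard_size=2)
```

The returned (file name, record count) pairs map directly onto the path and record_count fields of each shard entry in metadata.json.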