Synthetic Section

The synthetic section documents how synthetic data was generated. This section becomes required when data.kind is set to “synthetic” or “hybrid”.

Required Fields (when synthetic/hybrid)

Field              Type    Description
generation_method  string  How the synthetic data was generated
share_of_dataset   number  Approximate fraction of examples that are synthetic (0-1)

Optional Fields

Field             Type    Description
generation_model  string  Model or system used for generation
seed_data         object  Information about seed data used
privacy_controls  array   Privacy-related controls applied
quality_checks    array   Quality checks applied to synthetic data

seed_data Object

Field         Type    Description
dataset_refs  array   IDs of datasets used as seed
description   string  Description of seed data and usage
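
The field tables above can be mirrored in code when building cards programmatically. The following is a minimal Python sketch, not part of the schema itself: the class names SeedData and SyntheticSection and the to_dict helper are illustrative assumptions, and both seed_data fields are treated as optional only because the examples in this section omit one or the other.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class SeedData:
    # Assumption: both fields are optional inside seed_data.
    dataset_refs: List[str] = field(default_factory=list)
    description: Optional[str] = None


@dataclass
class SyntheticSection:
    # Required when data.kind is "synthetic" or "hybrid".
    generation_method: str
    share_of_dataset: float
    # Optional fields.
    generation_model: Optional[str] = None
    seed_data: Optional[SeedData] = None
    privacy_controls: List[str] = field(default_factory=list)
    quality_checks: List[str] = field(default_factory=list)

    def to_dict(self) -> Dict[str, Any]:
        """Serialize to a plain dict, dropping unset or empty optional fields."""
        return _prune(asdict(self))


def _prune(obj: Dict[str, Any]) -> Dict[str, Any]:
    """Recursively remove None values, empty lists, and empty dicts."""
    out: Dict[str, Any] = {}
    for key, value in obj.items():
        if isinstance(value, dict):
            value = _prune(value)
        if value is None or value == [] or value == {}:
            continue
        out[key] = value
    return out
```

Calling to_dict() on an instance yields a dict that can be placed under the "synthetic" key of a card and serialized with json.dumps.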

Examples

Fully Synthetic Dataset

{
  "data": {
    "kind": "synthetic",
    "modalities": ["text"],
    "languages": ["en"],
    "size": { "examples": 100000 },
    "domains": ["general"],
    "record_format": "chat-messages"
  },
  "synthetic": {
    "share_of_dataset": 1.0,
    "generation_method": "prompted-dialogue",
    "generation_model": "Claude 3 Opus",
    "seed_data": {
      "description": "Topics and personas derived from manually curated seed list of 500 conversation starters."
    },
    "privacy_controls": ["no-real-names", "no-pii-in-prompts"],
    "quality_checks": ["manual-spot-check-5%", "toxicity-filter-threshold-0.1", "coherence-score-min-0.8"]
  }
}

Hybrid Dataset

{
  "data": {
    "kind": "hybrid",
    "modalities": ["text"],
    "languages": ["ha-Latn-NG", "en"],
    "size": { "examples": 75000 },
    "domains": ["news"],
    "record_format": "plain-text"
  },
  "synthetic": {
    "share_of_dataset": 0.4,
    "generation_method": "backtranslation",
    "generation_model": "NLLB-200-3.3B",
    "seed_data": {
      "dataset_refs": ["hausa-news-corpus:2024-01"],
      "description": "40% of examples generated by round-trip translation (Hausa->English->Hausa) of real news articles to augment training data."
    },
    "quality_checks": ["bleu-threshold-30", "human-fluency-check-1%"]
  }
}

Distillation Dataset

{
  "data": {
    "kind": "synthetic",
    "modalities": ["text"],
    "languages": ["en"],
    "size": { "examples": 500000 },
    "domains": ["general"],
    "record_format": "prompt-completion",
    "task_types": ["distillation"]
  },
  "synthetic": {
    "share_of_dataset": 1.0,
    "generation_method": "teacher-model-distillation",
    "generation_model": "GPT-4-turbo-2024-04-09",
    "seed_data": {
      "dataset_refs": ["open-prompts-v2:1.0.0"],
      "description": "Responses generated from teacher model on curated prompt set."
    },
    "privacy_controls": ["prompt-filtering-pii", "response-pii-scrubbing"],
    "quality_checks": ["format-validation", "response-length-bounds", "refusal-rate-monitoring"]
  }
}

Field Details

generation_method

Common generation methods include:

Method                      Description
prompted-dialogue           LLM generates conversations from prompts
backtranslation             Round-trip translation for augmentation
paraphrase                  LLM rewrites existing content
templated-augmentation      Rule-based generation from templates
teacher-model-distillation  Larger model generates training data for smaller model
self-instruct               Model generates its own instruction-following examples
data-to-text                Generating text from structured data
text-to-data                Generating structured data from text

generation_model

Specify the model name and, where one exists, the exact version or snapshot identifier, so the generation step can be reproduced:

  • Claude 3 Opus
  • GPT-4-turbo-2024-04-09
  • Llama-3-70B-Instruct
  • NLLB-200-3.3B
  • Custom fine-tuned model (internal)

seed_data

Document the data used as input to the generation process:

{
  "seed_data": {
    "dataset_refs": ["common-crawl:2024-05", "wikipedia-en:20240101"],
    "description": "Seed prompts extracted from web text. Personal information and copyrighted content filtered before use as generation seeds."
  }
}

Use dataset_refs to link to other data cards when possible. Format: dataset-id:version.
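
A reference in the dataset-id:version format can be split into its two parts before looking up the linked card. The sketch below is one possible approach under stated assumptions: the names DatasetRef and parse_dataset_ref are hypothetical, and it assumes dataset IDs never contain a colon, which this section does not state.

```python
from typing import NamedTuple


class DatasetRef(NamedTuple):
    dataset_id: str
    version: str


def parse_dataset_ref(ref: str) -> DatasetRef:
    """Split a 'dataset-id:version' string at the first colon.

    Assumption: the dataset ID itself contains no colon.
    """
    dataset_id, sep, version = ref.partition(":")
    if not sep or not dataset_id or not version:
        raise ValueError(f"expected 'dataset-id:version', got {ref!r}")
    return DatasetRef(dataset_id, version)
```

For example, parse_dataset_ref("open-prompts-v2:1.0.0") returns a tuple with dataset_id "open-prompts-v2" and version "1.0.0", while a string with no version part raises ValueError.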

privacy_controls

Document safeguards against data leakage:

  • no-real-names - No real personal names in generation
  • no-pii-in-prompts - PII filtered from seed/prompt data
  • response-pii-scrubbing - PII removed from generated output
  • differential-privacy - DP techniques applied
  • prompt-filtering-pii - PII removed before prompting
  • synthetic-entities-only - All named entities are synthetic

quality_checks

Document quality assurance steps:

  • manual-spot-check-N% - Human review of N% sample
  • toxicity-filter-threshold-X - Toxicity score filtering
  • coherence-score-min-X - Minimum coherence threshold
  • format-validation - Output format verification
  • factuality-check - Fact verification process
  • deduplication - Duplicate removal
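
Several of the tags above carry a trailing numeric parameter (N% or X). Tooling that wants to read that parameter could split it off as sketched below. This is only an illustration: nothing in this section says the schema parses these strings, and the function name parse_quality_check and the returned tuple shape are assumptions.

```python
import re
from typing import Optional, Tuple

# A tag name, then an optional "-<number>" suffix, optionally followed by "%".
_CHECK_PATTERN = re.compile(r"^(?P<name>.+?)-(?P<value>\d+(?:\.\d+)?)(?P<pct>%)?$")


def parse_quality_check(tag: str) -> Tuple[str, Optional[float], bool]:
    """Return (name, value, is_percent); value is None for tags with no number."""
    match = _CHECK_PATTERN.match(tag)
    if match is None:
        return tag, None, False
    return match["name"], float(match["value"]), match["pct"] is not None
```

With this sketch, "manual-spot-check-5%" splits into the name "manual-spot-check" with value 5.0 flagged as a percentage, and "format-validation" is returned unchanged with no value.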

Conditional Rule

The schema enforces:

IF data.kind IN ["synthetic", "hybrid"]
THEN synthetic section REQUIRED
  WITH generation_method REQUIRED
  AND share_of_dataset REQUIRED

If you set data.kind to “synthetic” or “hybrid” without providing these fields, validation will fail.
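
The conditional rule can be checked before submitting a card to the validator. The following is a minimal Python sketch of the rule as described above, not the schema's actual implementation: the function name check_synthetic_rule and the error messages are hypothetical, and the 0 to 1 range check on share_of_dataset is an assumption based on the field description rather than a confirmed schema constraint.

```python
from typing import Any, Dict, List

SYNTHETIC_KINDS = {"synthetic", "hybrid"}
REQUIRED_FIELDS = ("generation_method", "share_of_dataset")


def check_synthetic_rule(card: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the rule is satisfied."""
    kind = card.get("data", {}).get("kind")
    if kind not in SYNTHETIC_KINDS:
        return []  # The rule does not apply to other kinds.

    section = card.get("synthetic")
    if not isinstance(section, dict):
        return [f"data.kind is {kind!r} but the synthetic section is missing"]

    errors: List[str] = []
    for name in REQUIRED_FIELDS:
        if name not in section:
            errors.append(f"synthetic.{name} is required when data.kind is {kind!r}")

    share = section.get("share_of_dataset")
    if share is not None:
        # Assumption: the value must be a number in [0, 1].
        # bool is a subclass of int in Python, so it is excluded explicitly.
        is_number = isinstance(share, (int, float)) and not isinstance(share, bool)
        if not is_number or not 0 <= share <= 1:
            errors.append("synthetic.share_of_dataset must be a number between 0 and 1")
    return errors
```

Returning a list of messages instead of raising on the first problem lets a caller report every missing field in one pass.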

See Also