Synthetic Section

The synthetic section documents how the synthetic portion of a dataset was generated. It is required when data.kind is set to “synthetic” or “hybrid”.

Required Fields (when data.kind is synthetic or hybrid)

Field	Type	Description
`generation_method`	string	How the synthetic data was generated
`share_of_dataset`	number	Approximate fraction of examples that are synthetic (0-1)

Optional Fields

Field	Type	Description
`generation_model`	string	Model or system used for generation
`seed_data`	object	Information about seed data used
`privacy_controls`	array	Privacy-related controls applied
`quality_checks`	array	Quality checks applied to synthetic data

seed_data Object

Field	Type	Description
`dataset_refs`	array	References to datasets used as seed data, in dataset-id:version form
`description`	string	Description of seed data and usage

Examples

Fully Synthetic Dataset

{
  "data": {
    "kind": "synthetic",
    "modalities": ["text"],
    "languages": ["en"],
    "size": { "examples": 100000 },
    "domains": ["general"],
    "record_format": "chat-messages"
  },
  "synthetic": {
    "share_of_dataset": 1.0,
    "generation_method": "prompted-dialogue",
    "generation_model": "Claude 3 Opus",
    "seed_data": {
      "description": "Topics and personas derived from manually curated seed list of 500 conversation starters."
    },
    "privacy_controls": ["no-real-names", "no-pii-in-prompts"],
    "quality_checks": ["manual-spot-check-5%", "toxicity-filter-threshold-0.1", "coherence-score-min-0.8"]
  }
}

Hybrid Dataset

{
  "data": {
    "kind": "hybrid",
    "modalities": ["text"],
    "languages": ["ha-Latn-NG", "en"],
    "size": { "examples": 75000 },
    "domains": ["news"],
    "record_format": "plain-text"
  },
  "synthetic": {
    "share_of_dataset": 0.4,
    "generation_method": "backtranslation",
    "generation_model": "NLLB-200-3.3B",
    "seed_data": {
      "dataset_refs": ["hausa-news-corpus:2024-01"],
      "description": "40% of examples generated by round-trip translation (Hausa->English->Hausa) of real news articles to augment training data."
    },
    "quality_checks": ["bleu-threshold-30", "human-fluency-check-1%"]
  }
}
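
Because share_of_dataset is a fraction of data.size.examples, the two fields together give a rough count of synthetic examples. The following is a minimal sketch of that arithmetic for the hybrid card above. The helper name `estimated_synthetic_examples` is hypothetical and not part of the schema, and the card is trimmed to the fields the calculation needs.

```python
def estimated_synthetic_examples(card: dict) -> int:
    """Estimate how many examples are synthetic from the card's own fields."""
    total = card["data"]["size"]["examples"]
    share = card["synthetic"]["share_of_dataset"]
    if not 0 <= share <= 1:
        raise ValueError(f"share_of_dataset must be within 0-1, got {share}")
    # share_of_dataset is approximate, so round to a whole number of examples.
    return round(total * share)


# Trimmed copy of the hybrid example: only the fields used here.
hybrid_card = {
    "data": {"kind": "hybrid", "size": {"examples": 75000}},
    "synthetic": {"share_of_dataset": 0.4, "generation_method": "backtranslation"},
}

print(estimated_synthetic_examples(hybrid_card))  # 30000
```

The result matches the "40% of examples" wording in the seed_data description of the same card.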

Distillation Dataset

{
  "data": {
    "kind": "synthetic",
    "modalities": ["text"],
    "languages": ["en"],
    "size": { "examples": 500000 },
    "domains": ["general"],
    "record_format": "prompt-completion",
    "task_types": ["distillation"]
  },
  "synthetic": {
    "share_of_dataset": 1.0,
    "generation_method": "teacher-model-distillation",
    "generation_model": "GPT-4-turbo-2024-04-09",
    "seed_data": {
      "dataset_refs": ["open-prompts-v2:1.0.0"],
      "description": "Responses generated from teacher model on curated prompt set."
    },
    "privacy_controls": ["prompt-filtering-pii", "response-pii-scrubbing"],
    "quality_checks": ["format-validation", "response-length-bounds", "refusal-rate-monitoring"]
  }
}

Field Details

generation_method

Common generation methods include:

Method	Description
`prompted-dialogue`	LLM generates conversations from prompts
`backtranslation`	Round-trip translation for augmentation
`paraphrase`	LLM rewrites existing content
`templated-augmentation`	Rule-based generation from templates
`teacher-model-distillation`	Larger model generates training data for smaller model
`self-instruct`	Model generates its own instruction-following examples
`data-to-text`	Generating text from structured data
`text-to-data`	Generating structured data from text

generation_model

Specify the model name and, where one exists, the exact version for reproducibility. Examples:

Claude 3 Opus
GPT-4-turbo-2024-04-09
Llama-3-70B-Instruct
NLLB-200-3.3B
Custom fine-tuned model (internal)

seed_data

Document the data used as input to the generation process:

{
  "seed_data": {
    "dataset_refs": ["common-crawl:2024-05", "wikipedia-en:20240101"],
    "description": "Seed prompts extracted from web text. Personal information and copyrighted content filtered before use as generation seeds."
  }
}

Use dataset_refs to link to other data cards when possible. Format: dataset-id:version, for example common-crawl:2024-05.
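
A consumer of data cards may want to split each reference into its ID and version. The sketch below shows one way to do that. `DatasetRef` and `parse_dataset_ref` are hypothetical names, and splitting on the last colon is an assumption, since the schema text only gives the dataset-id:version shape.

```python
from typing import NamedTuple


class DatasetRef(NamedTuple):
    dataset_id: str
    version: str


def parse_dataset_ref(ref: str) -> DatasetRef:
    """Split a 'dataset-id:version' string into its two parts."""
    # Assumption: the version is everything after the LAST colon.
    dataset_id, sep, version = ref.rpartition(":")
    if not sep or not dataset_id or not version:
        raise ValueError(f"expected 'dataset-id:version', got {ref!r}")
    return DatasetRef(dataset_id, version)


for ref in ["common-crawl:2024-05", "wikipedia-en:20240101"]:
    parsed = parse_dataset_ref(ref)
    print(parsed.dataset_id, parsed.version)
```

References with no colon, or with an empty ID or version, raise ValueError rather than being silently accepted.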

privacy_controls

Document safeguards against data leakage:

no-real-names - No real personal names in generation
no-pii-in-prompts - PII filtered from seed/prompt data
response-pii-scrubbing - PII removed from generated output
differential-privacy - DP techniques applied
prompt-filtering-pii - PII removed before prompting
synthetic-entities-only - All named entities are synthetic

quality_checks

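Document quality assurance steps. The labels below are common patterns, with N and X standing for the value used. The list is not exhaustive: the examples above also use labels such as bleu-threshold-30 and refusal-rate-monitoring.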

manual-spot-check-N% - Human review of N% sample
toxicity-filter-threshold-X - Filtering on toxicity score, with threshold X
coherence-score-min-X - Minimum coherence score of X
format-validation - Output format verification
factuality-check - Fact verification process
deduplication - Duplicate removal
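
Several of these labels carry a trailing value (N% or X). If tooling needs that value, it can be separated from the label name. The sketch below is one possible approach and not a parsing rule defined by the schema. The function name `split_check_label` and the "name, hyphen, number, optional percent sign" pattern are assumptions based only on the labels shown in this section.

```python
import re
from typing import Optional, Tuple

# Assumed pattern: <name>-<number>, optionally followed by "%".
_PARAM = re.compile(r"(?P<name>.+?)-(?P<value>\d+(?:\.\d+)?)(?P<percent>%?)")


def split_check_label(label: str) -> Tuple[str, Optional[float], bool]:
    """Return (name, value, is_percent); value is None for plain labels."""
    match = _PARAM.fullmatch(label)
    if match is None:
        return label, None, False
    return match["name"], float(match["value"]), match["percent"] == "%"


for label in ["manual-spot-check-5%", "coherence-score-min-0.8", "format-validation"]:
    print(split_check_label(label))
```

Labels without a numeric suffix, such as format-validation, are returned unchanged with a value of None.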

Conditional Rule

The schema enforces:

IF data.kind IN ["synthetic", "hybrid"]
THEN synthetic section REQUIRED
  WITH generation_method REQUIRED
  AND share_of_dataset REQUIRED

If you set data.kind to “synthetic” or “hybrid” without providing these fields, validation will fail.
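
The rule can be checked before running full schema validation. The following is a minimal sketch of the conditional rule as stated above, not the schema's actual validator. The function name `check_synthetic_rule` and the wording of the messages are hypothetical.

```python
from typing import List

SYNTHETIC_KINDS = {"synthetic", "hybrid"}
REQUIRED_SYNTHETIC_FIELDS = ("generation_method", "share_of_dataset")


def check_synthetic_rule(card: dict) -> List[str]:
    """Return violations of the conditional rule; an empty list means it passes."""
    kind = card.get("data", {}).get("kind")
    if kind not in SYNTHETIC_KINDS:
        return []  # The rule only applies to synthetic and hybrid datasets.
    synthetic = card.get("synthetic")
    if not isinstance(synthetic, dict):
        return [f"data.kind is {kind!r} but the synthetic section is missing"]
    return [
        f"synthetic.{field} is required when data.kind is {kind!r}"
        for field in REQUIRED_SYNTHETIC_FIELDS
        if field not in synthetic
    ]


incomplete = {
    "data": {"kind": "hybrid"},
    "synthetic": {"generation_method": "backtranslation"},
}
for problem in check_synthetic_rule(incomplete):
    print(problem)
```

This sketch checks only that the fields are present. Type and range checks, such as share_of_dataset being a number between 0 and 1, are left to the schema itself.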