Synthetic & Hybrid Data

This example shows how to document datasets containing synthetic (AI-generated) data, including the synthetic section that is required for both fully synthetic and hybrid datasets.

When to Use This Pattern

Use this structure when:

  • Your dataset contains AI-generated content
  • You’re mixing real data with synthetic augmentation
  • You need to document the generation methodology

Complete Example

{
  "schema_version": "llm-datacard/v1.0",
  "core": {
    "id": "synthetic-conversations",
    "version": "1.0.0",
    "title": "Synthetic Multi-Turn Conversations",
    "summary": "A fully synthetic dataset of multi-turn conversations generated for instruction-following and dialogue system research.",
    "maintainer": "AI Research Lab",
    "contact": "synthetic-data@ailab.org"
  },
  "data": {
    "kind": "synthetic",
    "modalities": ["text"],
    "languages": ["en"],
    "size": {
      "examples": 100000,
      "tokens": 25000000,
      "tokenizer": "cl100k_base"
    },
    "domains": ["conversation", "general"],
    "structures": ["conversation", "instruction-response"],
    "task_types": ["supervised-finetuning"],
    "record_format": "json-structured"
  },
  "rights": {
    "license": "Apache-2.0",
    "allows_commercial_use": true,
    "contains_personal_data": "none"
  },
  "provenance": {
    "source_types": ["synthetic-generation"],
    "collection_start_date": "2024-09-01",
    "collection_end_date": "2024-11-30",
    "collection_notes": "Conversations generated using a multi-agent simulation framework with Claude and GPT-4 as conversation participants."
  },
  "access": {
    "availability": "public-download",
    "url": "https://datapass.meetkai.ai/registry/synthetic-conversations/1.0.0"
  },
  "synthetic": {
    "share_of_dataset": 1.0,
    "generation_method": "Multi-agent conversation simulation using prompted LLMs. Each conversation involves 2-4 simulated personas with distinct traits and objectives.",
    "generation_model": "claude-3-sonnet-20240229, gpt-4-turbo-2024-04-09",
    "seed_data": {
      "description": "Persona templates and conversation scenarios derived from public dialogue datasets and manually crafted prompts."
    },
    "privacy_controls": [
      "No real user data used",
      "Generated names and details are fictional",
      "PII detection filtering applied post-generation"
    ],
    "quality_checks": ["human-evaluation", "automated-filtering", "deduplication", "toxicity-filtering"]
  },
  "use": {
    "intended_uses": ["Dialogue system training", "Instruction-following research", "Conversation quality evaluation"],
    "out_of_scope_uses": ["Impersonating real individuals", "Generating deceptive content"]
  },
  "safety": {
    "content_risk_level": "low",
    "known_risky_categories": [],
    "mitigations": "All generated content filtered for toxicity, PII, and harmful instructions using automated classifiers and human spot-checks."
  }
}

The Synthetic Section

When It’s Required

The synthetic section is required when:

  • data.kind is "synthetic" (100% generated)
  • data.kind is "hybrid" (mix of real and synthetic)

"data": {
  "kind": "synthetic"  // or "hybrid"
}
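
For reference, here is a minimal Python sketch (illustrative only, not part of the schema tooling) that applies this rule to a card loaded as plain JSON:

def requires_synthetic_section(card: dict) -> bool:
    # The synthetic section is mandatory whenever the data is fully
    # or partially generated.
    return card.get("data", {}).get("kind") in ("synthetic", "hybrid")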

Required Synthetic Fields

At minimum, you must provide:

"synthetic": {
  "share_of_dataset": 1.0,
  "generation_method": "Description of how data was generated"
}
  • share_of_dataset (number) - Fraction of synthetic data (0.0 to 1.0)
  • generation_method (string) - How the data was generated
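
As a quick sanity check, an illustrative Python sketch (not part of the spec; field names follow the examples above) that verifies both fields are present and the share is a valid fraction:

def validate_synthetic_fields(card: dict) -> list[str]:
    # Collect human-readable problems rather than raising, so all
    # issues can be reported at once.
    errors = []
    synthetic = card.get("synthetic", {})
    share = synthetic.get("share_of_dataset")
    if not isinstance(share, (int, float)) or not 0.0 <= share <= 1.0:
        errors.append("share_of_dataset must be a number between 0.0 and 1.0")
    if not synthetic.get("generation_method"):
        errors.append("generation_method must be a non-empty string")
    return errors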

Share of Dataset

For fully synthetic datasets:

"share_of_dataset": 1.0

For hybrid datasets (e.g., 35% synthetic):

"share_of_dataset": 0.35

Generation Model

Specify the model(s) used for generation:

"generation_model": "claude-3-sonnet-20240229, gpt-4-turbo-2024-04-09"

Include model versions when possible for reproducibility.

Seed Data

Document any real data used to seed generation:

"seed_data": {
  "dataset_refs": ["original-dataset-id"],
  "description": "Topics and formats derived from the original dataset"
}

Quality Controls

Document filtering and validation steps:

"quality_checks": [
  "human-evaluation",
  "automated-filtering",
  "deduplication",
  "toxicity-filtering"
]

Common quality checks:

  • human-evaluation - Manual review by annotators
  • automated-filtering - Algorithmic quality filters
  • deduplication - Removal of duplicates
  • toxicity-filtering - Harmful content removal
  • language-detection - Language verification
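
As a sketch of one of these checks, an exact-match deduplication pass could hash normalized text and keep only the first occurrence; real pipelines often layer near-duplicate detection on top of this:

import hashlib

def deduplicate(records: list[dict], text_key: str = "text") -> list[dict]:
    # Keep the first occurrence of each normalized text; exact matches only.
    seen = set()
    unique = []
    for record in records:
        normalized = " ".join(record.get(text_key, "").lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique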

Privacy Controls

For synthetic data generated from real sources:

"privacy_controls": [
  "differential-privacy",
  "data-masking",
  "pseudonymization"
]
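
For example, a naive post-generation masking pass might replace email addresses with placeholders. This is only a sketch; production pipelines typically rely on dedicated PII classifiers rather than regexes:

import re

EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def mask_emails(text: str) -> str:
    # Replace anything that looks like an email address with a placeholder.
    return EMAIL_PATTERN.sub("[EMAIL]", text)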

Hybrid Dataset Example

For a dataset that’s partially synthetic:

{
  "data": {
    "kind": "hybrid"
  },
  "synthetic": {
    "share_of_dataset": 0.35,
    "generation_method": "GPT-4 was used to generate answers for crowdsourced questions, with human validation and correction.",
    "generation_model": "gpt-4-0613",
    "quality_checks": ["human-evaluation", "automated-filtering", "deduplication"]
  },
  "provenance": {
    "source_types": ["crowdsourced", "synthetic-generation"]
  }
}

Note how provenance.source_types includes both real (crowdsourced) and synthetic (synthetic-generation) sources.
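
A small cross-field consistency check (illustrative; the schema itself may not enforce this) that ties data.kind to share_of_dataset:

def check_kind_consistency(card: dict) -> list[str]:
    # Fully synthetic cards should declare a share of 1.0; hybrid cards
    # should declare a share strictly between 0 and 1.
    kind = card.get("data", {}).get("kind")
    share = card.get("synthetic", {}).get("share_of_dataset")
    errors = []
    if kind == "synthetic" and share != 1.0:
        errors.append("kind is 'synthetic' but share_of_dataset is not 1.0")
    if kind == "hybrid" and (share is None or not 0.0 < share < 1.0):
        errors.append("kind is 'hybrid' but share_of_dataset is not strictly between 0 and 1")
    return errors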

Best Practices

  1. Be Specific About Methods - Don’t just say “AI generated”; describe the process
  2. Document Models - Include model names and versions
  3. Explain Quality Measures - How did you ensure the synthetic data is useful?
  4. Disclose Seed Data - If generation was informed by real data, say so
  5. Consider Safety - Synthetic data can still contain harmful content
