Synthetic & Hybrid Data
This example shows how to document datasets containing synthetic (AI-generated) data, including the synthetic section that is required for both fully synthetic and hybrid datasets.
When to Use This Pattern
Use this structure when:
- Your dataset contains AI-generated content
- You’re mixing real data with synthetic augmentation
- You need to document the generation methodology
Complete Example
{
"schema_version": "llm-datacard/v1.0",
"core": {
"id": "synthetic-conversations",
"version": "1.0.0",
"title": "Synthetic Multi-Turn Conversations",
"summary": "A fully synthetic dataset of multi-turn conversations generated for instruction-following and dialogue system research.",
"maintainer": "AI Research Lab",
"contact": "synthetic-data@ailab.org"
},
"data": {
"kind": "synthetic",
"modalities": ["text"],
"languages": ["en"],
"size": {
"examples": 100000,
"tokens": 25000000,
"tokenizer": "cl100k_base"
},
"domains": ["conversation", "general"],
"structures": ["conversation", "instruction-response"],
"task_types": ["supervised-finetuning"],
"record_format": "json-structured"
},
"rights": {
"license": "Apache-2.0",
"allows_commercial_use": true,
"contains_personal_data": "none"
},
"provenance": {
"source_types": ["synthetic-generation"],
"collection_start_date": "2024-09-01",
"collection_end_date": "2024-11-30",
"collection_notes": "Conversations generated using a multi-agent simulation framework with Claude and GPT-4 as conversation participants."
},
"access": {
"availability": "public-download",
"url": "https://datapass.meetkai.ai/registry/synthetic-conversations/1.0.0"
},
"synthetic": {
"share_of_dataset": 1.0,
"generation_method": "Multi-agent conversation simulation using prompted LLMs. Each conversation involves 2-4 simulated personas with distinct traits and objectives.",
"generation_model": "claude-3-sonnet-20240229, gpt-4-turbo-2024-04-09",
"seed_data": {
"description": "Persona templates and conversation scenarios derived from public dialogue datasets and manually crafted prompts."
},
"privacy_controls": [
"No real user data used",
"Generated names and details are fictional",
"PII detection filtering applied post-generation"
],
"quality_checks": ["human-evaluation", "automated-filtering", "deduplication", "toxicity-filtering"]
},
"use": {
"intended_uses": ["Dialogue system training", "Instruction-following research", "Conversation quality evaluation"],
"out_of_scope_uses": ["Impersonating real individuals", "Generating deceptive content"]
},
"safety": {
"content_risk_level": "low",
"known_risky_categories": [],
"mitigations": "All generated content filtered for toxicity, PII, and harmful instructions using automated classifiers and human spot-checks."
}
}
The Synthetic Section
When It’s Required
The synthetic section is required when:
- data.kind is "synthetic" (100% generated)
- data.kind is "hybrid" (mix of real and synthetic)
"data": {
"kind": "synthetic" // or "hybrid"
}
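The rule above can be expressed as a small check. This is a minimal sketch, assuming the data card has already been loaded into a Python dict (for example with `json.load`); the helper names `requires_synthetic_section` and `missing_synthetic_section` are hypothetical and not part of any official tooling.

```python
def requires_synthetic_section(card: dict) -> bool:
    """Return True when data.kind calls for a synthetic section."""
    kind = card.get("data", {}).get("kind")
    return kind in {"synthetic", "hybrid"}


def missing_synthetic_section(card: dict) -> bool:
    """Return True when the section is required but absent from the card."""
    return requires_synthetic_section(card) and "synthetic" not in card


# A hybrid card with no synthetic section fails the check.
card = {"data": {"kind": "hybrid"}}
print(missing_synthetic_section(card))  # True
```

A check like this only tests for the section's presence; the contents of the section still need to be validated against the schema.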
Required Synthetic Fields
At minimum, you must provide:
"synthetic": {
"share_of_dataset": 1.0,
"generation_method": "Description of how data was generated"
}
| Field | Type | Description |
|---|---|---|
| share_of_dataset | number | Fraction of synthetic data (0.0 to 1.0) |
| generation_method | string | How the data was generated |
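As an illustration of the two required fields, the sketch below checks their types and the 0.0 to 1.0 range from the table. The function name `check_required_synthetic_fields` is hypothetical, and treating an empty generation_method as an error is an assumption of this sketch, not something the schema text above states.

```python
def check_required_synthetic_fields(synthetic: dict) -> list[str]:
    """Collect human-readable problems with the two required fields."""
    problems = []

    share = synthetic.get("share_of_dataset")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(share, bool) or not isinstance(share, (int, float)):
        problems.append("share_of_dataset must be a number")
    elif not 0.0 <= share <= 1.0:
        problems.append("share_of_dataset must be between 0.0 and 1.0")

    method = synthetic.get("generation_method")
    # Assumption: a blank description is treated as missing.
    if not isinstance(method, str) or not method.strip():
        problems.append("generation_method must be a non-empty string")

    return problems


ok = {"share_of_dataset": 1.0, "generation_method": "Prompted LLM simulation"}
print(check_required_synthetic_fields(ok))  # []

# A percentage (35) instead of a fraction (0.35), and no method given.
for problem in check_required_synthetic_fields({"share_of_dataset": 35}):
    print(problem)
```

Returning a list of problems rather than raising on the first one lets a card author fix everything in a single pass.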
Share of Dataset
For fully synthetic datasets:
"share_of_dataset": 1.0
For hybrid datasets (e.g., 35% synthetic):
"share_of_dataset": 0.35
Generation Model
Specify the model(s) used for generation:
"generation_model": "claude-3-sonnet-20240229, gpt-4-turbo-2024-04-09"
Include model versions when possible for reproducibility.
Seed Data
Document any real data used to seed generation:
"seed_data": {
"dataset_refs": ["original-dataset-id"],
"description": "Topics and formats derived from the original dataset"
}
Quality Controls
Document filtering and validation steps:
"quality_checks": [
"human-evaluation",
"automated-filtering",
"deduplication",
"toxicity-filtering"
]
Common quality checks:
- human-evaluation - Manual review by annotators
- automated-filtering - Algorithmic quality filters
- deduplication - Removal of duplicates
- toxicity-filtering - Harmful content removal
- language-detection - Language verification
Privacy Controls
For synthetic data generated from real sources, list the privacy techniques that were applied:
"privacy_controls": [
"differential-privacy",
"data-masking",
"pseudonymization"
]
Hybrid Dataset Example
For a dataset that’s partially synthetic:
{
"data": {
"kind": "hybrid"
},
"synthetic": {
"share_of_dataset": 0.35,
"generation_method": "GPT-4 was used to generate answers for crowdsourced questions, with human validation and correction.",
"generation_model": "gpt-4-0613",
"quality_checks": ["human-evaluation", "automated-filtering", "deduplication"]
},
"provenance": {
"source_types": ["crowdsourced", "synthetic-generation"]
}
}
Note how provenance.source_types includes both real (crowdsourced) and synthetic (synthetic-generation) sources.
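The relationship between data.kind, share_of_dataset, and provenance.source_types in a hybrid card can be linted with a few lines of Python. This is only a sketch: the function name `hybrid_consistency_warnings` is hypothetical, and the three checks are editorial assumptions about what a consistent hybrid card looks like, not rules taken from the schema.

```python
SYNTHETIC_SOURCE = "synthetic-generation"


def hybrid_consistency_warnings(card: dict) -> list[str]:
    """Return warnings for a hybrid card whose sections disagree."""
    if card.get("data", {}).get("kind") != "hybrid":
        return []  # Only hybrid cards are checked here.

    warnings = []
    source_types = card.get("provenance", {}).get("source_types", [])
    if SYNTHETIC_SOURCE not in source_types:
        warnings.append("source_types does not list synthetic-generation")
    if all(s == SYNTHETIC_SOURCE for s in source_types):
        warnings.append("source_types lists no non-synthetic source")

    # Assumption: a hybrid dataset is neither 0% nor 100% synthetic.
    share = card.get("synthetic", {}).get("share_of_dataset")
    if isinstance(share, (int, float)) and not 0.0 < share < 1.0:
        warnings.append("share_of_dataset should be strictly between 0 and 1")

    return warnings


card = {
    "data": {"kind": "hybrid"},
    "synthetic": {"share_of_dataset": 0.35},
    "provenance": {"source_types": ["crowdsourced", "synthetic-generation"]},
}
print(hybrid_consistency_warnings(card))  # []
```

These are warnings rather than hard errors because a card that trips them may still be valid against the schema.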
Best Practices
- Be Specific About Methods - Don’t just say “AI generated”; describe the process
- Document Models - Include model names and versions
- Explain Quality Measures - How did you ensure the synthetic data is useful?
- Disclose Seed Data - If generation was informed by real data, say so
- Consider Safety - Synthetic data can still contain harmful content
Related Guides
- Synthetic Data Reference - All synthetic fields
- Data Reference - Data kind options