Synthetic Section
The synthetic section documents how synthetic data was generated. This section becomes required when data.kind is set to “synthetic” or “hybrid”.
Required Fields (when synthetic/hybrid)
| Field | Type | Description |
|---|---|---|
| generation_method | string | How the synthetic data was generated |
| share_of_dataset | number | Approximate fraction of examples that are synthetic (0-1) |
Optional Fields
| Field | Type | Description |
|---|---|---|
| generation_model | string | Model or system used for generation |
| seed_data | object | Information about seed data used |
| privacy_controls | array | Privacy-related controls applied |
| quality_checks | array | Quality checks applied to synthetic data |
seed_data Object
| Field | Type | Description |
|---|---|---|
| dataset_refs | array | IDs of datasets used as seed |
| description | string | Description of seed data and usage |
Examples
Fully Synthetic Dataset
{
  "data": {
    "kind": "synthetic",
    "modalities": ["text"],
    "languages": ["en"],
    "size": { "examples": 100000 },
    "domains": ["general"],
    "record_format": "chat-messages"
  },
  "synthetic": {
    "share_of_dataset": 1.0,
    "generation_method": "prompted-dialogue",
    "generation_model": "Claude 3 Opus",
    "seed_data": {
      "description": "Topics and personas derived from manually curated seed list of 500 conversation starters."
    },
    "privacy_controls": ["no-real-names", "no-pii-in-prompts"],
    "quality_checks": ["manual-spot-check-5%", "toxicity-filter-threshold-0.1", "coherence-score-min-0.8"]
  }
}
Hybrid Dataset
{
  "data": {
    "kind": "hybrid",
    "modalities": ["text"],
    "languages": ["ha-Latn-NG", "en"],
    "size": { "examples": 75000 },
    "domains": ["news"],
    "record_format": "plain-text"
  },
  "synthetic": {
    "share_of_dataset": 0.4,
    "generation_method": "backtranslation",
    "generation_model": "NLLB-200-3.3B",
    "seed_data": {
      "dataset_refs": ["hausa-news-corpus:2024-01"],
      "description": "40% of examples generated by round-trip translation (Hausa->English->Hausa) of real news articles to augment training data."
    },
    "quality_checks": ["bleu-threshold-30", "human-fluency-check-1%"]
  }
}
Distillation Dataset
{
  "data": {
    "kind": "synthetic",
    "modalities": ["text"],
    "languages": ["en"],
    "size": { "examples": 500000 },
    "domains": ["general"],
    "record_format": "prompt-completion",
    "task_types": ["distillation"]
  },
  "synthetic": {
    "share_of_dataset": 1.0,
    "generation_method": "teacher-model-distillation",
    "generation_model": "GPT-4-turbo-2024-04-09",
    "seed_data": {
      "dataset_refs": ["open-prompts-v2:1.0.0"],
      "description": "Responses generated from teacher model on curated prompt set."
    },
    "privacy_controls": ["prompt-filtering-pii", "response-pii-scrubbing"],
    "quality_checks": ["format-validation", "response-length-bounds", "refusal-rate-monitoring"]
  }
}
Field Details
generation_method
Common generation methods include:
| Method | Description |
|---|---|
| prompted-dialogue | LLM generates conversations from prompts |
| backtranslation | Round-trip translation for augmentation |
| paraphrase | LLM rewrites existing content |
| templated-augmentation | Rule-based generation from templates |
| teacher-model-distillation | Larger model generates training data for smaller model |
| self-instruct | Model generates its own instruction-following examples |
| data-to-text | Generating text from structured data |
| text-to-data | Generating structured data from text |
generation_model
Specify the model name and version for reproducibility:
- Claude 3 Opus
- GPT-4-turbo-2024-04-09
- Llama-3-70B-Instruct
- NLLB-200-3.3B
- Custom fine-tuned model (internal)
seed_data
Document the data used as input to the generation process:
{
  "seed_data": {
    "dataset_refs": ["common-crawl:2024-05", "wikipedia-en:20240101"],
    "description": "Seed prompts extracted from web text. Personal information and copyrighted content filtered before use as generation seeds."
  }
}
Use dataset_refs to link to other data cards when possible. Format: dataset-id:version.
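As a rough illustration of the dataset-id:version format, the sketch below splits a reference into its two parts. The names `DatasetRef` and `parse_dataset_ref` are hypothetical and not part of the schema, and splitting on the first colon is an assumption: the format description above does not say whether a dataset ID may itself contain a colon.

```python
from typing import NamedTuple


class DatasetRef(NamedTuple):
    """Hypothetical container for one entry of seed_data.dataset_refs."""
    dataset_id: str
    version: str


def parse_dataset_ref(ref: str) -> DatasetRef:
    """Split a 'dataset-id:version' string into its two parts.

    Assumption: the first colon separates the ID from the version.
    """
    dataset_id, sep, version = ref.partition(":")
    if not sep or not dataset_id or not version:
        raise ValueError(f"expected 'dataset-id:version', got {ref!r}")
    return DatasetRef(dataset_id, version)


refs = ["common-crawl:2024-05", "wikipedia-en:20240101"]
parsed = [parse_dataset_ref(r) for r in refs]
```

A reference without a version, such as "wikipedia-en", raises `ValueError` in this sketch, which makes unversioned links easy to catch before a card is published.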
privacy_controls
Document safeguards against data leakage:
- no-real-names - No real personal names in generation
- no-pii-in-prompts - PII filtered from seed/prompt data
- response-pii-scrubbing - PII removed from generated output
- differential-privacy - DP techniques applied
- prompt-filtering-pii - PII removed before prompting
- synthetic-entities-only - All named entities are synthetic
quality_checks
Document quality assurance steps:
- manual-spot-check-N% - Human review of N% sample
- toxicity-filter-threshold-X - Toxicity score filtering
- coherence-score-min-X - Minimum coherence threshold
- format-validation - Output format verification
- factuality-check - Fact verification process
- deduplication - Duplicate removal
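Several of these tags carry a trailing parameter (N% or X). The sketch below shows one possible way a consumer might separate the tag name from that parameter. The schema itself treats these values as plain strings, so the function name `parse_check` and the regular expression are assumptions made for illustration only.

```python
import re
from typing import Optional, Tuple

# Assumed convention: an optional trailing "-<number>" or "-<number>%" is the parameter.
_PARAM = re.compile(r"^(?P<name>.+?)-(?P<value>\d+(?:\.\d+)?)(?P<pct>%)?$")


def parse_check(tag: str) -> Tuple[str, Optional[float], bool]:
    """Return (name, value, is_percent); value is None for unparameterized tags."""
    match = _PARAM.match(tag)
    if match is None:
        return tag, None, False
    return match["name"], float(match["value"]), match["pct"] is not None


checks = ["manual-spot-check-5%", "toxicity-filter-threshold-0.1", "format-validation"]
parsed_checks = [parse_check(c) for c in checks]
```

Tags without a numeric suffix, such as format-validation, pass through unchanged with a value of `None`.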
Conditional Rule
The schema enforces:
IF data.kind IN ["synthetic", "hybrid"]
THEN synthetic section REQUIRED
WITH generation_method REQUIRED
AND share_of_dataset REQUIRED
If you set data.kind to “synthetic” or “hybrid” without providing these fields, validation will fail.
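The rule above can be approximated in a few lines of Python. This is a minimal sketch for pre-checking a card before running the real validator, not the schema's actual implementation: the function name `check_synthetic_rule` and the error messages are hypothetical, and the 0-1 range check is taken from the share_of_dataset field description.

```python
from typing import Any, Dict, List

SYNTHETIC_KINDS = {"synthetic", "hybrid"}
REQUIRED_FIELDS = ("generation_method", "share_of_dataset")


def check_synthetic_rule(card: Dict[str, Any]) -> List[str]:
    """Return error messages; an empty list means the conditional rule is satisfied."""
    kind = card.get("data", {}).get("kind")
    if kind not in SYNTHETIC_KINDS:
        return []  # the rule only applies to synthetic and hybrid datasets

    synthetic = card.get("synthetic")
    if not isinstance(synthetic, dict):
        return [f"data.kind is {kind!r}, so the synthetic section is required"]

    errors = [
        f"synthetic.{field} is required"
        for field in REQUIRED_FIELDS
        if field not in synthetic
    ]

    # Range check based on the field description (fraction between 0 and 1).
    share = synthetic.get("share_of_dataset")
    if share is not None:
        is_number = isinstance(share, (int, float)) and not isinstance(share, bool)
        if not is_number or not 0 <= share <= 1:
            errors.append("synthetic.share_of_dataset must be a number between 0 and 1")
    return errors


card = {
    "data": {"kind": "hybrid"},
    "synthetic": {"generation_method": "backtranslation"},
}
problems = check_synthetic_rule(card)
```

Here `problems` contains a single message about the missing share_of_dataset field, mirroring the validation failure described above.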
See Also
- Data Section - The kind field triggers this requirement
- Provenance Section - Source types including synthetic-generation
- Quality Section - General quality documentation
- Rules Guide - All conditional validation rules