Stats Section
The stats section holds optional numeric or structured statistics about the dataset. It is a free-form object: the schema places no constraints on its keys or values, so you can record whatever statistics are relevant to your dataset.
Schema
{
"stats": {
"type": "object",
"description": "Optional numeric or structured statistics about the dataset.",
"additionalProperties": true
}
}
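Because the schema only requires `stats` to be an object and allows any properties, a check can be very small. The sketch below uses only the Python standard library. The function name `check_stats_section` and the decision to treat a missing `stats` key as valid are assumptions made for illustration, not part of the specification.

```python
import json


def check_stats_section(card: dict) -> bool:
    """Return True if `stats` is absent or is a JSON object (a Python dict)."""
    # Assumption: the section is optional, so a missing key is accepted.
    if "stats" not in card:
        return True
    # "type": "object" corresponds to dict after json.loads; with
    # "additionalProperties": true, any keys and values are allowed inside.
    return isinstance(card["stats"], dict)


card = json.loads('{"stats": {"total_examples": 150000}}')
print(check_stats_section(card))
```

A full JSON Schema validator would perform the same check here, since the schema declares nothing beyond the object type.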
Example
{
"stats": {
"total_examples": 150000,
"total_tokens": 45000000,
"vocabulary_size": 125000,
"avg_tokens_per_example": 300,
"median_tokens_per_example": 245,
"token_distribution": {
"min": 10,
"max": 8192,
"p25": 120,
"p50": 245,
"p75": 410,
"p95": 890,
"p99": 2100
},
"language_distribution": {
"ha-Latn-NG": 0.65,
"en": 0.3,
"ha-Arab": 0.05
},
"domain_distribution": {
"news": 0.45,
"social-media": 0.3,
"government": 0.15,
"literature": 0.1
},
"temporal_distribution": {
"2020": 0.15,
"2021": 0.2,
"2022": 0.25,
"2023": 0.25,
"2024": 0.15
},
"quality_metrics": {
"perplexity_mean": 45.2,
"perplexity_std": 12.8,
"lang_id_confidence_mean": 0.92
}
}
}
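One way to produce a `token_distribution` object like the one in the example is to compute percentiles from per-example token counts. This is a minimal sketch using the standard-library `statistics` module. The function name `token_distribution` and the input data are hypothetical, and the "inclusive" percentile method is one reasonable choice among several.

```python
import statistics


def token_distribution(token_counts):
    """Build a token_distribution-style object from per-example token counts."""
    # quantiles(n=100) returns 99 cut points; index p - 1 is the p-th percentile.
    cuts = statistics.quantiles(token_counts, n=100, method="inclusive")
    dist = {"min": min(token_counts), "max": max(token_counts)}
    for p in (25, 50, 75, 95, 99):
        dist[f"p{p}"] = cuts[p - 1]
    return dist


# Hypothetical per-example token counts: 1, 2, ..., 101.
counts = list(range(1, 102))
print(token_distribution(counts))
```

Different percentile methods give slightly different values on small samples, so it may be worth noting the method used next to the numbers.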
Common Statistics
Size Metrics
{
"stats": {
"total_examples": 500000,
"total_tokens": 150000000,
"total_characters": 750000000,
"total_bytes": 1200000000,
"vocabulary_size": 200000,
"unique_documents": 125000
}
}
Distribution Statistics
{
"stats": {
"tokens_per_example": {
"min": 5,
"max": 16384,
"mean": 312.5,
"median": 256,
"std": 245.3
},
"sentences_per_document": {
"mean": 15.2,
"median": 12,
"std": 8.7
}
}
}
Language Statistics
{
"stats": {
"language_distribution": {
"en": 0.45,
"ha": 0.35,
"yo": 0.12,
"ig": 0.08
},
"multilingual_examples": 0.15,
"code_switched_examples": 0.08
}
}
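A `language_distribution` object can be derived from per-example language tags by counting and normalizing. The sketch below assumes each example carries exactly one tag; the function name `language_distribution`, the four-decimal rounding, and the sample tags are all illustrative choices.

```python
from collections import Counter


def language_distribution(tags):
    """Return the share of examples per language tag, as fractions of 1."""
    counts = Counter(tags)
    total = sum(counts.values())
    if total == 0:
        return {}  # no examples, so there is nothing to report
    return {lang: round(n / total, 4) for lang, n in counts.items()}


# Hypothetical tags for 20 examples.
tags = ["en"] * 9 + ["ha"] * 7 + ["yo"] * 4
dist = language_distribution(tags)
print(dist)

# The shares should sum to 1 up to rounding error.
assert abs(sum(dist.values()) - 1.0) < 1e-3
```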
Quality Metrics
{
"stats": {
"lang_id_confidence": {
"mean": 0.94,
"below_0.8": 0.03
},
"toxicity_scores": {
"mean": 0.05,
"above_0.5": 0.02
},
"perplexity": {
"mean": 52.3,
"median": 45.1
}
}
}
Annotation Statistics
{
"stats": {
"annotation_coverage": 0.35,
"labels_per_example": {
"mean": 2.1,
"median": 2
},
"label_distribution": {
"positive": 0.42,
"negative": 0.38,
"neutral": 0.2
},
"inter_annotator_agreement": {
"metric": "Cohen's Kappa",
"value": 0.78
}
}
}
Audio/Speech Statistics
{
"stats": {
"total_hours": 1250.5,
"total_utterances": 450000,
"avg_utterance_duration_sec": 10.0,
"speakers": {
"total": 850,
"male": 420,
"female": 430
},
"sample_rates": {
"16000": 0.85,
"44100": 0.15
}
}
}
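Derived fields such as `avg_utterance_duration_sec` can be cross-checked against the totals they summarize. The sketch below is one possible check, assuming `total_hours` counts only utterance audio (no silence or untranscribed segments). The function names, the 1% tolerance, and the sample values are hypothetical.

```python
import math


def implied_hours(total_utterances, avg_utterance_duration_sec):
    """Total duration in hours implied by utterance count and mean length."""
    return total_utterances * avg_utterance_duration_sec / 3600


def durations_consistent(stats, rel_tol=0.01):
    """Check that total_hours matches utterances x average duration."""
    implied = implied_hours(
        stats["total_utterances"], stats["avg_utterance_duration_sec"]
    )
    return math.isclose(implied, stats["total_hours"], rel_tol=rel_tol)


# Hypothetical values: 9000 utterances of 4 s each is exactly 10 hours.
sample = {
    "total_hours": 10.0,
    "total_utterances": 9000,
    "avg_utterance_duration_sec": 4.0,
}
print(durations_consistent(sample))
```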
Image/Video Statistics
{
"stats": {
"total_images": 250000,
"resolution_distribution": {
"below_512": 0.1,
"512_to_1024": 0.45,
"1024_to_2048": 0.35,
"above_2048": 0.1
},
"aspect_ratios": {
"landscape": 0.55,
"portrait": 0.3,
"square": 0.15
}
}
}
Best Practices
Be Consistent
Use consistent units and naming, and put the unit in the key name where one applies:
{
"stats": {
"size_bytes": 1073741824,
"duration_hours": 125.5,
"count_examples": 50000
}
}
Include Distributions
A mean alone can be misleading for skewed data, so report the median, spread, and range alongside it:
{
"stats": {
"tokens": {
"total": 45000000,
"mean": 300,
"median": 245,
"std": 180,
"min": 10,
"max": 8192
}
}
}
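To illustrate why the mean is not enough, the sketch below summarizes a small, deliberately skewed sample using the standard-library `statistics` module. The function name `summarize` and the sample values are hypothetical, and the population standard deviation (`pstdev`) is an assumption about which definition of `std` is intended.

```python
import statistics


def summarize(values):
    """Return a summary object with the same keys as the example above."""
    return {
        "total": sum(values),
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "std": statistics.pstdev(values),  # population standard deviation
        "min": min(values),
        "max": max(values),
    }


# Hypothetical token counts: four short examples and one very long one.
summary = summarize([10, 20, 30, 40, 900])
print(summary["mean"], summary["median"])
```

Here the single long example pulls the mean to 200 while the median stays at 30, which is the kind of gap that the extra fields make visible.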
Document Computation
If stats were computed with specific tools, record the tool next to the value it produced:
{
"stats": {
"token_counts": {
"tokenizer": "tiktoken-cl100k_base",
"total": 45000000
},
"perplexity": {
"model": "gpt2",
"mean": 52.3
}
}
}
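One way to keep the tool name from drifting apart from the number is to emit both from the same function. The sketch below is illustrative only: `token_count_stats` is a hypothetical helper, and a plain whitespace split stands in for a real tokenizer such as the one named in the example above.

```python
def token_count_stats(texts, tokenizer_name, tokenize):
    """Count tokens and record which tokenizer produced the count."""
    total = sum(len(tokenize(text)) for text in texts)
    return {"token_counts": {"tokenizer": tokenizer_name, "total": total}}


# Stand-in tokenizer: str.split on whitespace (not a real subword tokenizer).
stats = token_count_stats(
    ["sannu duniya", "hello world again"],
    tokenizer_name="whitespace-split",
    tokenize=str.split,
)
print(stats)
```

Token totals from different tokenizers are not comparable, so a count without its tokenizer name is hard to interpret.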
See Also
- Data Section - Size field for primary metrics
- Quality Section - Quality-related measurements
- Governance Section - Link to detailed documentation