Stats Section

The stats section holds optional numeric or structured statistics about the dataset. It is a free-form object: the schema sets additionalProperties to true and defines no required keys, so you can record whichever statistics are relevant to your dataset.

Schema

{
  "stats": {
    "type": "object",
    "description": "Optional numeric or structured statistics about the dataset.",
    "additionalProperties": true
  }
}
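
Because the schema only constrains the type of stats, validating it amounts to a type check. The following is a minimal sketch using only the Python standard library. The function name check_stats is hypothetical, and the sketch assumes the whole section may be omitted, which the schema fragment above does not state explicitly.

```python
import json


def check_stats(metadata: dict) -> bool:
    """Return True if `stats` is absent or is a JSON object (a dict)."""
    if "stats" not in metadata:
        return True  # assumption: the section itself is optional
    return isinstance(metadata["stats"], dict)


record = json.loads('{"stats": {"total_examples": 150000}}')
is_valid = check_stats(record)
```

Any keys inside the object pass, since additionalProperties is true.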

Example

{
  "stats": {
    "total_examples": 150000,
    "total_tokens": 45000000,
    "vocabulary_size": 125000,
    "avg_tokens_per_example": 300,
    "median_tokens_per_example": 245,
    "token_distribution": {
      "min": 10,
      "max": 8192,
      "p25": 120,
      "p50": 245,
      "p75": 410,
      "p95": 890,
      "p99": 2100
    },
    "language_distribution": {
      "ha-Latn-NG": 0.65,
      "en": 0.3,
      "ha-Arab": 0.05
    },
    "domain_distribution": {
      "news": 0.45,
      "social-media": 0.3,
      "government": 0.15,
      "literature": 0.1
    },
    "temporal_distribution": {
      "2020": 0.15,
      "2021": 0.2,
      "2022": 0.25,
      "2023": 0.25,
      "2024": 0.15
    },
    "quality_metrics": {
      "perplexity_mean": 45.2,
      "perplexity_std": 12.8,
      "lang_id_confidence_mean": 0.92
    }
  }
}
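
Values in an example like this should agree with each other: the shares in each distribution add up to 1, and the average equals total tokens divided by total examples. A rough sketch of such checks follows. The helper names and tolerances are assumptions chosen for illustration, not part of the schema.

```python
import math


def distribution_sums_to_one(dist: dict, tol: float = 1e-6) -> bool:
    """True if the shares in a distribution object add up to 1 within `tol`."""
    return math.isclose(sum(dist.values()), 1.0, abs_tol=tol)


def average_matches_totals(stats: dict, tol: float = 0.5) -> bool:
    """True if avg_tokens_per_example agrees with total_tokens / total_examples."""
    expected = stats["total_tokens"] / stats["total_examples"]
    return abs(stats["avg_tokens_per_example"] - expected) <= tol


# A subset of the example above.
stats = {
    "total_examples": 150000,
    "total_tokens": 45000000,
    "avg_tokens_per_example": 300,
    "language_distribution": {"ha-Latn-NG": 0.65, "en": 0.3, "ha-Arab": 0.05},
}
shares_ok = distribution_sums_to_one(stats["language_distribution"])
average_ok = average_matches_totals(stats)
```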

Common Statistics

Size Metrics

{
  "stats": {
    "total_examples": 500000,
    "total_tokens": 150000000,
    "total_characters": 750000000,
    "total_bytes": 1200000000,
    "vocabulary_size": 200000,
    "unique_documents": 125000
  }
}

Distribution Statistics

{
  "stats": {
    "tokens_per_example": {
      "min": 5,
      "max": 16384,
      "mean": 312.5,
      "median": 256,
      "std": 245.3
    },
    "sentences_per_document": {
      "mean": 15.2,
      "median": 12,
      "std": 8.7
    }
  }
}
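
A summary object of this shape can be derived from per-example counts with Python's statistics module. This is a sketch under assumptions: the function name summarize is hypothetical, and it uses the population standard deviation, since the schema does not say which variant std refers to.

```python
import statistics


def summarize(values: list) -> dict:
    """Build a min/max/mean/median/std summary for a non-empty list of numbers."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std": statistics.pstdev(values),  # population std; an assumption
    }


# Hypothetical token counts for five examples.
token_counts = [5, 120, 256, 300, 900]
summary = summarize(token_counts)
```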

Language Statistics

{
  "stats": {
    "language_distribution": {
      "en": 0.45,
      "ha": 0.35,
      "yo": 0.12,
      "ig": 0.08
    },
    "multilingual_examples": 0.15,
    "code_switched_examples": 0.08
  }
}
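
A language_distribution object can be built from per-example language labels. The sketch below assumes shares are fractions of examples rather than of tokens; the schema does not specify which. The function name share_by_label and the sample labels are hypothetical.

```python
from collections import Counter


def share_by_label(labels: list) -> dict:
    """Map each label to its fraction of all examples, rounded to 2 places."""
    total = len(labels)
    if total == 0:
        return {}
    counts = Counter(labels)
    return {label: round(n / total, 2) for label, n in counts.most_common()}


# Hypothetical labels for 20 examples.
labels = ["en"] * 9 + ["ha"] * 7 + ["yo"] * 3 + ["ig"] * 1
language_distribution = share_by_label(labels)
```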

Quality Metrics

{
  "stats": {
    "lang_id_confidence": {
      "mean": 0.94,
      "below_0.8": 0.03
    },
    "toxicity_scores": {
      "mean": 0.05,
      "above_0.5": 0.02
    },
    "perplexity": {
      "mean": 52.3,
      "median": 45.1
    }
  }
}

Annotation Statistics

{
  "stats": {
    "annotation_coverage": 0.35,
    "labels_per_example": {
      "mean": 2.1,
      "median": 2
    },
    "label_distribution": {
      "positive": 0.42,
      "negative": 0.38,
      "neutral": 0.2
    },
    "inter_annotator_agreement": {
      "metric": "Cohen's Kappa",
      "value": 0.78
    }
  }
}
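
For reference, Cohen's Kappa compares the observed agreement between two annotators, p_o, with the agreement expected by chance, p_e, as (p_o - p_e) / (1 - p_e). A minimal sketch for two annotators follows; the function name and sample labels are hypothetical, and returning 1.0 when p_e equals 1 is a convention chosen here, since kappa is undefined in that case.

```python
from collections import Counter


def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: sum over labels of the product of marginal frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # undefined case; 1.0 is a convention assumed here
    return (p_o - p_e) / (1 - p_e)


rater_a = ["positive", "positive", "negative", "negative"]
rater_b = ["positive", "negative", "negative", "negative"]
kappa = cohens_kappa(rater_a, rater_b)  # p_o = 0.75, p_e = 0.5, kappa = 0.5
```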

Audio/Speech Statistics

{
  "stats": {
    "total_hours": 1250.5,
    "total_utterances": 450000,
    "avg_utterance_duration_sec": 10.0,
    "speakers": {
      "total": 850,
      "male": 420,
      "female": 430
    },
    "sample_rates": {
      "16000": 0.85,
      "44100": 0.15
    }
  }
}

Image/Video Statistics

{
  "stats": {
    "total_images": 250000,
    "resolution_distribution": {
      "below_512": 0.1,
      "512_to_1024": 0.45,
      "1024_to_2048": 0.35,
      "above_2048": 0.1
    },
    "aspect_ratios": {
      "landscape": 0.55,
      "portrait": 0.3,
      "square": 0.15
    }
  }
}

Best Practices

Be Consistent

Put the unit in the key name and use one naming pattern throughout:

{
  "stats": {
    "size_bytes": 1073741824,
    "duration_hours": 125.5,
    "count_examples": 50000
  }
}

Include Distributions

A mean alone can hide skew, so report the median, spread, and range alongside it:

{
  "stats": {
    "tokens": {
      "total": 45000000,
      "mean": 300,
      "median": 245,
      "std": 180,
      "min": 10,
      "max": 8192
    }
  }
}
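
To see why, consider a skewed sample in which one very long example dominates. The numbers below are hypothetical and chosen only to illustrate the effect.

```python
import statistics

# Hypothetical sample: nine short examples and one very long one.
token_counts = [100] * 9 + [8192]

mean = statistics.mean(token_counts)      # 909.2
median = statistics.median(token_counts)  # 100.0
```

The mean is roughly nine times the median here, so a reader who sees only the mean would badly overestimate the length of a typical example.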

Document Computation

If a statistic depends on a specific tool, such as a tokenizer or a language model, record that tool alongside the value:

{
  "stats": {
    "token_counts": {
      "tokenizer": "tiktoken-cl100k_base",
      "total": 45000000
    },
    "perplexity": {
      "model": "gpt2",
      "mean": 52.3
    }
  }
}
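
One way to keep the tool name attached to the number is to produce both in the same step. The sketch below uses whitespace splitting as a stand-in tokenizer so that it runs with the standard library alone. It is not the tiktoken tokenizer named above, and the function name count_tokens is hypothetical.

```python
def count_tokens(texts, tokenize, tokenizer_name):
    """Count tokens and keep the tokenizer name next to the total."""
    total = sum(len(tokenize(text)) for text in texts)
    return {"tokenizer": tokenizer_name, "total": total}


# Stand-in tokenizer: str.split splits on whitespace.
texts = ["ina kwana", "good morning to you"]
token_counts = count_tokens(texts, str.split, "whitespace")
```

Swapping in a different tokenizer changes both fields together, so the recorded name cannot drift away from the count it describes.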