Quality Section

The quality section documents how dataset quality was assessed: language identification methods, noise estimates, manual spot checks, and known issues.

Fields

All fields in the quality section are optional.

| Field | Type | Description |
| --- | --- | --- |
| language_id_method | string | How language identification was performed |
| estimated_noise_level | enum | Subjective noise level estimate |
| manual_spotcheck | object | Manual spot check details |
| known_issues | array | Known issues or limitations |

Manual Spotcheck Object

| Field | Type | Description |
| --- | --- | --- |
| performed | boolean | Whether the spot check was performed |
| sample_size | integer | Number of examples reviewed |
| sampling_method | string | How samples were chosen |
| findings | string | Summary of findings |

Enum Values

estimated_noise_level

| Value | Description |
| --- | --- |
| low | Minimal noise, high-quality data |
| medium | Some noise present but manageable |
| high | Significant noise, may need additional filtering |
| unknown | Noise level not assessed |

Example

{
  "quality": {
    "language_id_method": "fastText lid.176.bin with threshold 0.8",
    "estimated_noise_level": "medium",
    "manual_spotcheck": {
      "performed": true,
      "sample_size": 500,
      "sampling_method": "stratified-by-source",
      "findings": "~5% mislabeled language, ~3% truncated text, ~2% encoding artifacts. Higher noise in social media subset."
    },
    "known_issues": [
      "Social media subset contains informal language and code-switching",
      "News subset over-represents political content from election years",
      "Some OCR errors in scanned document sources",
      "Occasional HTML artifacts in web-scraped content"
    ]
  }
}
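
Since every field is optional, tooling that consumes this section should check types defensively. A minimal validation sketch in Python, assuming the object has already been parsed from JSON (the function name and error messages are illustrative, not part of the schema):

NOISE_LEVELS = {"low", "medium", "high", "unknown"}

def validate_quality(quality: dict) -> list[str]:
    # All fields are optional, so validate only what is present.
    errors = []
    level = quality.get("estimated_noise_level")
    if level is not None and level not in NOISE_LEVELS:
        errors.append("estimated_noise_level must be one of: " + ", ".join(sorted(NOISE_LEVELS)))
    spotcheck = quality.get("manual_spotcheck", {})
    if "performed" in spotcheck and not isinstance(spotcheck["performed"], bool):
        errors.append("manual_spotcheck.performed must be a boolean")
    if "sample_size" in spotcheck and not isinstance(spotcheck["sample_size"], int):
        errors.append("manual_spotcheck.sample_size must be an integer")
    issues = quality.get("known_issues")
    if issues is not None and not (isinstance(issues, list) and all(isinstance(i, str) for i in issues)):
        errors.append("known_issues must be an array of strings")
    return errors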

Field Details

language_id_method

Document the language identification approach:

{
  "language_id_method": "CLD3 with minimum confidence 0.9 and byte threshold 50"
}

Common tools:

  • fastText lid.176.bin - Facebook’s language identification model
  • CLD3 - Compact Language Detector 3
  • langdetect - Python langdetect library
  • lingua - High-accuracy language detector
  • manual - Human language identification

Include threshold values when applicable.
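
A minimal sketch of threshold-based identification using fastText's lid.176.bin (the 0.8 cutoff and the choice to return None for low-confidence text are illustrative assumptions, not schema requirements):

import fasttext

model = fasttext.load_model("lid.176.bin")  # pre-trained model, downloaded separately

def identify_language(text: str, threshold: float = 0.8):
    # fastText expects single-line input; labels come back as "__label__xx".
    labels, probs = model.predict(text.replace("\n", " "))
    lang = labels[0].removeprefix("__label__")
    return lang if probs[0] >= threshold else None  # None = below confidence

Whatever tool you use, record the model file, version, and threshold in language_id_method so the result is reproducible.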

estimated_noise_level

Provide honest noise estimates:

| Level | Typical Characteristics |
| --- | --- |
| low | Professional content, editorial review, minimal errors |
| medium | Standard-quality web content; some errors acceptable |
| high | User-generated content, OCR output, minimal curation |
| unknown | Not systematically assessed |
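
The schema defines no numeric cutoffs for these levels. If you derive the level from a measured error rate (for example, from the manual spot check), document the mapping you used; the thresholds in this sketch are purely hypothetical:

def noise_level_from_error_rate(error_rate):
    # Hypothetical cutoffs: under 2% -> low, under 10% -> medium, else high.
    if error_rate is None:
        return "unknown"
    if error_rate < 0.02:
        return "low"
    if error_rate < 0.10:
        return "medium"
    return "high"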

manual_spotcheck

Manual review adds credibility to quality claims:

{
  "manual_spotcheck": {
    "performed": true,
    "sample_size": 1000,
    "sampling_method": "random-per-language-stratum",
    "findings": "Overall quality acceptable. Hausa subset: 92% correct language, 85% grammatically correct. English subset: 98% correct language, 95% grammatically correct. Common issues: informal abbreviations, missing diacritics."
  }
}

Sampling methods:

  • random - Simple random sample
  • stratified-by-source - Proportional to source distribution
  • stratified-by-language - Proportional to language distribution
  • random-per-language-stratum - Fixed sample per language (sketched after this list)
  • targeted-edge-cases - Focus on potential problem areas
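
As an illustration, here is a minimal sketch of the random-per-language-stratum method. It assumes each example is a dict with a "language" key; that field name is hypothetical, not part of the schema:

import random
from collections import defaultdict

def sample_per_language_stratum(examples, per_language=100, seed=0):
    # Group examples by language, then draw a fixed-size sample from each stratum.
    strata = defaultdict(list)
    for example in examples:
        strata[example["language"]].append(example)  # "language" key is hypothetical
    rng = random.Random(seed)  # fixed seed so the spot check is reproducible
    picked = []
    for items in strata.values():
        picked.extend(rng.sample(items, min(per_language, len(items))))
    return picked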

known_issues

Be transparent about limitations:

{
  "known_issues": [
    "Web scrape subset may contain navigation text fragments",
    "Dates before 2020 have inconsistent formatting",
    "Code-switched examples may be mislabeled by language",
    "Some parallel text pairs are machine-translated rather than human",
    "Audio transcripts contain speaker diarization errors in multi-speaker segments"
  ]
}
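
Where possible, quantify an issue before listing it. As one example, a rough heuristic for the HTML-artifact issue above (the regex is a crude tag matcher, not an HTML parser):

import re

HTML_TAG = re.compile(r"</?[a-zA-Z][^>]*>")  # crude tag matcher

def flag_html_artifacts(texts):
    # Return the indices of examples that still contain HTML-like tags.
    return [i for i, t in enumerate(texts) if HTML_TAG.search(t)]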

Quality Assessment Checklist

When documenting quality, consider:

  1. Language accuracy - Is the language correctly identified?
  2. Text completeness - Are examples complete or truncated?
  3. Encoding quality - Any mojibake or encoding errors?
  4. Content relevance - Does content match stated domains?
  5. Annotation accuracy - For labeled data, are labels correct?
  6. Temporal consistency - Are dates/timestamps accurate?
  7. Duplicate content - How much exact or near-duplicate content is present? (see the sketch after this list)
  8. Bias indicators - Any systematic content biases?
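
Items 3 and 7 lend themselves to quick automated passes before manual review. A rough sketch, with two stated assumptions: the replacement character U+FFFD and the character "Ã" only catch common signs of UTF-8 mishandling, and hashing only catches exact duplicates after whitespace normalization:

import hashlib
import re

def quick_checks(texts):
    # Item 3: count texts showing common signs of encoding damage.
    suspect_encoding = sum(1 for t in texts if "\ufffd" in t or "Ã" in t)
    # Item 7: count exact duplicates after lowercasing and collapsing whitespace.
    digests = [
        hashlib.sha1(re.sub(r"\s+", " ", t).strip().lower().encode("utf-8")).hexdigest()
        for t in texts
    ]
    exact_duplicates = len(digests) - len(set(digests))
    return {"suspect_encoding": suspect_encoding, "exact_duplicates": exact_duplicates}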

See Also