Quality Section

The quality section documents how dataset quality was assessed: language identification methods, noise estimates, manual spot checks, and known issues.

Fields

All fields in the quality section are optional.

| Field | Type | Description |
| --- | --- | --- |
| language_id_method | string | How language identification was performed |
| estimated_noise_level | enum | Subjective noise level estimate |
| manual_spotcheck | object | Manual spot check details |
| known_issues | array | Known issues or limitations |

Manual Spotcheck Object

| Field | Type | Description |
| --- | --- | --- |
| performed | boolean | Whether the spot check was performed |
| sample_size | integer | Number of examples reviewed |
| sampling_method | string | How samples were chosen |
| findings | string | Summary of findings |

Enum Values

estimated_noise_level

| Value | Description |
| --- | --- |
| low | Minimal noise, high-quality data |
| medium | Some noise present but manageable |
| high | Significant noise, may need additional filtering |
| unknown | Noise level not assessed |

Example

{
  "quality": {
    "language_id_method": "fastText lid.176.bin with threshold 0.8",
    "estimated_noise_level": "medium",
    "manual_spotcheck": {
      "performed": true,
      "sample_size": 500,
      "sampling_method": "stratified-by-source",
      "findings": "~5% mislabeled language, ~3% truncated text, ~2% encoding artifacts. Higher noise in social media subset."
    },
    "known_issues": [
      "Social media subset contains informal language and code-switching",
      "News subset over-represents political content from election years",
      "Some OCR errors in scanned document sources",
      "Occasional HTML artifacts in web-scraped content"
    ]
  }
}
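
Since every field is optional, tooling that consumes this section should check types defensively. A minimal validation sketch in Python, assuming the object has already been parsed from JSON (the function name and error messages are illustrative, not part of the schema):

NOISE_LEVELS = {"low", "medium", "high", "unknown"}

def validate_quality(quality: dict) -> list[str]:
    # All fields are optional, so validate only what is present.
    errors = []
    level = quality.get("estimated_noise_level")
    if level is not None and level not in NOISE_LEVELS:
        errors.append("estimated_noise_level must be one of: " + ", ".join(sorted(NOISE_LEVELS)))
    spotcheck = quality.get("manual_spotcheck", {})
    if "performed" in spotcheck and not isinstance(spotcheck["performed"], bool):
        errors.append("manual_spotcheck.performed must be a boolean")
    if "sample_size" in spotcheck and not isinstance(spotcheck["sample_size"], int):
        errors.append("manual_spotcheck.sample_size must be an integer")
    issues = quality.get("known_issues")
    if issues is not None and not (isinstance(issues, list) and all(isinstance(i, str) for i in issues)):
        errors.append("known_issues must be an array of strings")
    return errors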

Field Details

language_id_method

Document the language identification approach:

{
  "language_id_method": "CLD3 with minimum confidence 0.9 and byte threshold 50"
}

Common tools:

  • fastText lid.176.bin - Facebook’s language identification model
  • CLD3 - Compact Language Detector 3
  • langdetect - Python langdetect library
  • lingua - High-accuracy language detector
  • manual - Human language identification

Include threshold values when applicable.
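
A minimal sketch of threshold-based identification using fastText's lid.176.bin (the 0.8 cutoff and the choice to return None for low-confidence text are illustrative assumptions, not schema requirements):

import fasttext

model = fasttext.load_model("lid.176.bin")  # pre-trained model, downloaded separately

def identify_language(text: str, threshold: float = 0.8):
    # fastText expects single-line input; labels come back as "__label__xx".
    labels, probs = model.predict(text.replace("\n", " "))
    lang = labels[0].removeprefix("__label__")
    return lang if probs[0] >= threshold else None  # None = below confidence

Whatever tool you use, record the model file, version, and threshold in language_id_method so the result is reproducible.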

estimated_noise_level

Provide honest noise estimates:

| Level | Typical Characteristics |
| --- | --- |
| low | Professional content, editorial review, minimal errors |
| medium | Standard-quality web content; some errors acceptable |
| high | User-generated content, OCR output, minimal curation |
| unknown | Not systematically assessed |
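
The schema defines no numeric cutoffs for these levels. If you derive the level from a measured error rate (for example, from the manual spot check), document the mapping you used; the thresholds in this sketch are purely hypothetical:

def noise_level_from_error_rate(error_rate):
    # Hypothetical cutoffs: under 2% -> low, under 10% -> medium, else high.
    if error_rate is None:
        return "unknown"
    if error_rate < 0.02:
        return "low"
    if error_rate < 0.10:
        return "medium"
    return "high"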

manual_spotcheck

Manual review adds credibility to quality claims:

{
  "manual_spotcheck": {
    "performed": true,
    "sample_size": 1000,
    "sampling_method": "random-per-language-stratum",
    "findings": "Overall quality acceptable. Hausa subset: 92% correct language, 85% grammatically correct. English subset: 98% correct language, 95% grammatically correct. Common issues: informal abbreviations, missing diacritics."
  }
}

Sampling methods:

  • random - Simple random sample
  • stratified-by-source - Proportional to source distribution
  • stratified-by-language - Proportional to language distribution
  • random-per-language-stratum - Fixed sample per language (sketched after this list)
  • targeted-edge-cases - Focus on potential problem areas
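
As an illustration, here is a minimal sketch of the random-per-language-stratum method. It assumes each example is a dict with a "language" key; that field name is hypothetical, not part of the schema:

import random
from collections import defaultdict

def sample_per_language_stratum(examples, per_language=100, seed=0):
    # Group examples by language, then draw a fixed-size sample from each stratum.
    strata = defaultdict(list)
    for example in examples:
        strata[example["language"]].append(example)  # "language" key is hypothetical
    rng = random.Random(seed)  # fixed seed so the spot check is reproducible
    picked = []
    for items in strata.values():
        picked.extend(rng.sample(items, min(per_language, len(items))))
    return picked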

known_issues

Be transparent about limitations:

{
  "known_issues": [
    "Web scrape subset may contain navigation text fragments",
    "Dates before 2020 have inconsistent formatting",
    "Code-switched examples may be mislabeled by language",
    "Some parallel text pairs are machine-translated rather than human",
    "Audio transcripts contain speaker diarization errors in multi-speaker segments"
  ]
}
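
Where possible, quantify an issue before listing it. As one example, a rough heuristic for the HTML-artifact issue above (the regex is a crude tag matcher, not an HTML parser):

import re

HTML_TAG = re.compile(r"</?[a-zA-Z][^>]*>")  # crude tag matcher

def flag_html_artifacts(texts):
    # Return the indices of examples that still contain HTML-like tags.
    return [i for i, t in enumerate(texts) if HTML_TAG.search(t)]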

Quality Assessment Checklist

When documenting quality, consider:

  1. Language accuracy - Is the language correctly identified?
  2. Text completeness - Are examples complete or truncated?
  3. Encoding quality - Any mojibake or encoding errors?
  4. Content relevance - Does content match stated domains?
  5. Annotation accuracy - For labeled data, are labels correct?
  6. Temporal consistency - Are dates/timestamps accurate?
  7. Duplicate content - How much exact or near-duplicate content is present? (see the sketch after this list)
  8. Bias indicators - Any systematic content biases?
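
Items 3 and 7 lend themselves to quick automated passes before manual review. A rough sketch, with two stated assumptions: the replacement character U+FFFD and the character "Ã" only catch common signs of UTF-8 mishandling, and hashing only catches exact duplicates after whitespace normalization:

import hashlib
import re

def quick_checks(texts):
    # Item 3: count texts showing common signs of encoding damage.
    suspect_encoding = sum(1 for t in texts if "\ufffd" in t or "Ã" in t)
    # Item 7: count exact duplicates after lowercasing and collapsing whitespace.
    digests = [
        hashlib.sha1(re.sub(r"\s+", " ", t).strip().lower().encode("utf-8")).hexdigest()
        for t in texts
    ]
    exact_duplicates = len(digests) - len(set(digests))
    return {"suspect_encoding": suspect_encoding, "exact_duplicates": exact_duplicates}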

See Also