# Quality Section

The `quality` section documents quality-related measurements: language identification methods, noise estimates, manual spot-check results, and known issues with the dataset.
## Fields

All fields in the quality section are optional.

| Field | Type | Description |
|---|---|---|
| `language_id_method` | string | How language identification was performed |
| `estimated_noise_level` | enum | Subjective noise level estimate |
| `manual_spotcheck` | object | Manual spot check details |
| `known_issues` | array | Known issues or limitations |
### Manual Spotcheck Object

| Field | Type | Description |
|---|---|---|
| `performed` | boolean | Whether spot check was performed |
| `sample_size` | integer | Number of examples reviewed |
| `sampling_method` | string | How samples were chosen |
| `findings` | string | Summary of findings |
## Enum Values

### estimated_noise_level

| Value | Description |
|---|---|
| `low` | Minimal noise, high-quality data |
| `medium` | Some noise present but manageable |
| `high` | Significant noise, may need additional filtering |
| `unknown` | Noise level not assessed |
## Example

```json
{
  "quality": {
    "language_id_method": "fastText lid.176.bin with threshold 0.8",
    "estimated_noise_level": "medium",
    "manual_spotcheck": {
      "performed": true,
      "sample_size": 500,
      "sampling_method": "stratified-by-source",
      "findings": "~5% mislabeled language, ~3% truncated text, ~2% encoding artifacts. Higher noise in social media subset."
    },
    "known_issues": [
      "Social media subset contains informal language and code-switching",
      "News subset over-represents political content from election years",
      "Some OCR errors in scanned document sources",
      "Occasional HTML artifacts in web-scraped content"
    ]
  }
}
```
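Because every field is optional, consumers should validate defensively. The sketch below is a minimal, hypothetical validator for the quality section; the schema does not prescribe any validation code, so the function name and error messages are illustrative assumptions.

```python
# Hypothetical validator for a quality section dict (illustrative only;
# the schema itself does not define validation code).

NOISE_LEVELS = {"low", "medium", "high", "unknown"}

def validate_quality(quality: dict) -> list[str]:
    """Return a list of problems found in a quality section dict."""
    problems = []

    # Enum check for estimated_noise_level.
    level = quality.get("estimated_noise_level")
    if level is not None and level not in NOISE_LEVELS:
        problems.append(f"estimated_noise_level must be one of {sorted(NOISE_LEVELS)}")

    # Type check for language_id_method.
    method = quality.get("language_id_method")
    if method is not None and not isinstance(method, str):
        problems.append("language_id_method must be a string")

    # Type checks for the manual_spotcheck object.
    spotcheck = quality.get("manual_spotcheck")
    if spotcheck is not None:
        if not isinstance(spotcheck.get("performed", False), bool):
            problems.append("manual_spotcheck.performed must be a boolean")
        size = spotcheck.get("sample_size")
        if size is not None and (not isinstance(size, int) or size < 1):
            problems.append("manual_spotcheck.sample_size must be a positive integer")

    # known_issues must be an array of strings.
    issues = quality.get("known_issues")
    if issues is not None and not all(isinstance(i, str) for i in issues):
        problems.append("known_issues must be an array of strings")

    return problems
```

Running `validate_quality` on the example above would return an empty list, since all fields are well-formed.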
## Field Details

### language_id_method

Document the language identification approach:

```json
{
  "language_id_method": "CLD3 with minimum confidence 0.9 and byte threshold 50"
}
```

Common tools:

- `fastText lid.176.bin` - Facebook's language identification model
- `CLD3` - Compact Language Detector 3
- `langdetect` - Python langdetect library
- `lingua` - High-accuracy language detector
- `manual` - Human language identification

Include threshold values when applicable.
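As a concrete illustration, a pipeline matching the earlier example string `"fastText lid.176.bin with threshold 0.8"` might look like the sketch below. The `unknown` fallback and newline handling are assumptions for illustration, not part of the schema.

```python
# Sketch: language ID with fastText's lid.176.bin model, assuming the
# model file has been downloaded locally. Requires the `fasttext` package.
import fasttext

model = fasttext.load_model("lid.176.bin")

def identify_language(text: str, threshold: float = 0.8) -> str:
    """Return a language code, or 'unknown' below the confidence threshold."""
    # fastText expects single-line input; predict returns labels like '__label__en'.
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    if probs[0] < threshold:
        return "unknown"
    return labels[0].replace("__label__", "")
```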
### estimated_noise_level

Provide honest noise estimates:

| Level | Typical Characteristics |
|---|---|
| `low` | Professional content, editorial review, minimal errors |
| `medium` | Web content with standard quality, some errors acceptable |
| `high` | User-generated content, OCR output, minimal curation |
| `unknown` | Not systematically assessed |
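The levels are subjective, but one way to keep an estimate honest is to derive it from the error rate measured in a spot check. The cutoffs below (2% and 15%) are hypothetical choices for illustration; the schema does not define any such mapping.

```python
# Hypothetical mapping from a measured error rate to a noise level.
# The 2% / 15% cutoffs are illustrative assumptions, not schema-defined.
def estimate_noise_level(error_rate):
    """Map a measured error rate in [0, 1] to a noise level; None means unknown."""
    if error_rate is None:
        return "unknown"
    if error_rate < 0.02:
        return "low"
    if error_rate < 0.15:
        return "medium"
    return "high"

# The earlier example's spot check found roughly 10% total issues:
print(estimate_noise_level(0.10))  # -> "medium"
```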
### manual_spotcheck

Manual review adds credibility to quality claims:

```json
{
  "manual_spotcheck": {
    "performed": true,
    "sample_size": 1000,
    "sampling_method": "random-per-language-stratum",
    "findings": "Overall quality acceptable. Hausa subset: 92% correct language, 85% grammatically correct. English subset: 98% correct language, 95% grammatically correct. Common issues: informal abbreviations, missing diacritics."
  }
}
```

Sampling methods:

- `random` - Simple random sample
- `stratified-by-source` - Proportional to source distribution
- `stratified-by-language` - Proportional to language distribution
- `random-per-language-stratum` - Fixed sample per language (see the sketch after this list)
- `targeted-edge-cases` - Focus on potential problem areas
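A minimal sketch of `random-per-language-stratum` sampling, assuming each example is a dict with a `"language"` key (the key name and data shape are assumptions for illustration):

```python
# Sketch: draw a fixed-size random sample from each language stratum.
# Assumes examples are dicts with a "language" key (illustrative shape).
import random
from collections import defaultdict

def sample_per_language(examples, per_language=100, seed=0):
    """Return a fixed-size random sample from each language stratum."""
    strata = defaultdict(list)
    for ex in examples:
        strata[ex["language"]].append(ex)

    rng = random.Random(seed)  # fixed seed keeps the spot check reproducible
    sample = []
    for language, items in sorted(strata.items()):
        k = min(per_language, len(items))  # small strata contribute everything
        sample.extend(rng.sample(items, k))
    return sample
```

Fixing the random seed, as above, makes it possible to re-review the same sample later and compare findings.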
### known_issues

Be transparent about limitations:

```json
{
  "known_issues": [
    "Web scrape subset may contain navigation text fragments",
    "Dates before 2020 have inconsistent formatting",
    "Code-switched examples may be mislabeled by language",
    "Some parallel text pairs are machine-translated rather than human-translated",
    "Audio transcripts contain speaker diarization errors in multi-speaker segments"
  ]
}
```
## Quality Assessment Checklist

When documenting quality, consider:

- **Language accuracy** - Is the language correctly identified?
- **Text completeness** - Are examples complete or truncated?
- **Encoding quality** - Any mojibake or encoding errors?
- **Content relevance** - Does content match stated domains?
- **Annotation accuracy** - For labeled data, are labels correct?
- **Temporal consistency** - Are dates/timestamps accurate?
- **Duplicate content** - What is the level of near-duplicate content?
- **Bias indicators** - Any systematic content biases?
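Some of these checks can be scripted. The sketch below automates two of them, encoding quality and duplicate content, using simple standard-library heuristics; the mojibake markers and normalized-hash approach are assumptions chosen for illustration, not prescribed checks.

```python
# Sketch: stdlib heuristics for two checklist items (encoding quality
# and duplicate content). Markers and hashing are illustrative choices.
import hashlib

# Replacement character plus common UTF-8-decoded-as-Latin-1 artifacts.
MOJIBAKE_MARKERS = ("\ufffd", "Ã©", "â€™", "â€œ")

def count_encoding_issues(texts):
    """Count texts containing likely mojibake or replacement characters."""
    return sum(1 for t in texts if any(m in t for m in MOJIBAKE_MARKERS))

def near_duplicate_rate(texts):
    """Fraction of texts whose whitespace/case-normalized form repeats."""
    seen, dupes = set(), 0
    for t in texts:
        key = hashlib.sha1(" ".join(t.lower().split()).encode()).hexdigest()
        if key in seen:
            dupes += 1
        seen.add(key)
    return dupes / len(texts) if texts else 0.0
```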
## See Also

- Processing Section - Filtering and cleaning steps
- Data Section - Annotation details in `annotation_details`
- Community Section - Community review status