Processing Section
The processing section documents the transformations applied to convert raw data into the released dataset. This includes normalization, filtering, deduplication, decontamination, and PII handling.
Fields
All fields in the processing section are optional.
| Field | Type | Description |
|---|---|---|
| normalization | array | Normalization steps applied |
| filters | array | Filters applied to remove examples |
| deduplication | object | Deduplication details |
| decontamination | object | Benchmark decontamination steps |
| pii_redaction | object | PII redaction/anonymization steps |
Deduplication Object
| Field | Type | Description |
|---|---|---|
| performed | boolean | Whether deduplication was performed |
| method | string | Method used (e.g., "exact-hash", "minhash-lsh", "simhash") |
| approx_removed_fraction | number | Approximate fraction of examples removed (between 0 and 1) |
| notes | string | Additional notes |
Decontamination Object
| Field | Type | Description |
|---|---|---|
| performed | boolean | Whether decontamination was performed |
| targets | array | Benchmarks checked against |
| method | string | Decontamination method |
| notes | string | Additional notes |
PII Redaction Object
| Field | Type | Description |
|---|---|---|
| applied | boolean | Whether PII redaction was applied |
| method | string | Redaction method |
| notes | string | Additional notes |
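The field tables above can be turned into a simple type check. The following is a minimal sketch, not an official validator: the function name `validate_processing` and the error message wording are hypothetical, and it assumes that the nested fields are optional in the same way as the top-level ones.

```python
# Expected Python types for each top-level field (JSON array -> list, object -> dict).
TOP_LEVEL = {
    "normalization": list,
    "filters": list,
    "deduplication": dict,
    "decontamination": dict,
    "pii_redaction": dict,
}

# Expected types for the fields of each nested object, taken from the tables above.
NESTED = {
    "deduplication": {
        "performed": bool,
        "method": str,
        "approx_removed_fraction": (int, float),
        "notes": str,
    },
    "decontamination": {"performed": bool, "targets": list, "method": str, "notes": str},
    "pii_redaction": {"applied": bool, "method": str, "notes": str},
}


def _matches(value, expected):
    """Type check that does not let booleans pass as numbers."""
    if expected is bool:
        return isinstance(value, bool)
    return isinstance(value, expected) and not isinstance(value, bool)


def validate_processing(section):
    """Return a list of error strings; an empty list means no problems were found."""
    errors = []
    for name, expected in TOP_LEVEL.items():
        if name not in section:
            continue  # every field is optional
        value = section[name]
        if not _matches(value, expected):
            errors.append(f"{name}: expected {expected.__name__}")
            continue
        for key, key_type in NESTED.get(name, {}).items():
            if key in value and not _matches(value[key], key_type):
                errors.append(f"{name}.{key}: wrong type")
    # Range check (assumption: the fraction is inclusive on both ends).
    dedup = section.get("deduplication")
    if isinstance(dedup, dict):
        fraction = dedup.get("approx_removed_fraction")
        if _matches(fraction, (int, float)) and not 0 <= fraction <= 1:
            errors.append("deduplication.approx_removed_fraction: must be between 0 and 1")
    return errors
```

Calling `validate_processing` on the value of the `processing` key returns an empty list when every present field has the documented type.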
Example
{
  "processing": {
    "normalization": ["unicode-nfc", "whitespace-normalize", "lowercase-metadata-fields"],
    "filters": [
      "language-id-threshold-0.8",
      "min-length-50-chars",
      "max-length-50000-chars",
      "profanity-filter-high-confidence"
    ],
    "deduplication": {
      "performed": true,
      "method": "minhash-lsh",
      "approx_removed_fraction": 0.15,
      "notes": "Used 128 hash functions with Jaccard threshold 0.8"
    },
    "decontamination": {
      "performed": true,
      "targets": ["MMLU", "HellaSwag", "ARC", "WinoGrande"],
      "method": "13-gram-overlap",
      "notes": "Removed examples with >50% 13-gram overlap with test sets"
    },
    "pii_redaction": {
      "applied": true,
      "method": "presidio-analyzer",
      "notes": "Email, phone, SSN, and credit card patterns replaced with [REDACTED]. Names retained for news articles about public figures."
    }
  }
}
Field Details
normalization
Common normalization steps:
| Step | Description |
|---|---|
| unicode-nfc | Unicode NFC normalization |
| unicode-nfkc | Unicode NFKC normalization |
| whitespace-normalize | Collapse multiple spaces, normalize line endings |
| lowercase | Convert to lowercase |
| strip-html | Remove HTML tags |
| fix-encoding | Repair encoding issues |
| diacritics-preserve | Preserve diacritical marks (important for some languages) |
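As an illustration of two of the steps above, here is a minimal sketch of `unicode-nfc` followed by `whitespace-normalize` using only the Python standard library. The function name `normalize_text`, the order of the steps, and the choice to trim leading and trailing whitespace are assumptions made for the example; a real pipeline may differ.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Apply unicode-nfc, then whitespace-normalize (illustrative ordering)."""
    # unicode-nfc: compose base characters and combining marks into canonical form.
    text = unicodedata.normalize("NFC", text)
    # Normalize line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Assumption: leading and trailing whitespace is also trimmed.
    return text.strip()
```

NFC keeps compatibility characters such as the ligature "ﬁ" intact, while NFKC would rewrite it as "fi". Recording which form was used therefore tells users whether such distinctions survive in the released text.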
filters
Common filtering criteria:
| Filter | Description |
|---|---|
| language-id-threshold-X | Keep examples with language confidence >= X |
| min-length-N-chars | Remove examples shorter than N characters |
| max-length-N-chars | Remove examples longer than N characters |
| min-words-N | Remove examples with fewer than N words |
| profanity-filter | Remove high-toxicity content |
| quality-score-threshold-X | Keep examples with quality score >= X |
| duplicate-url-removal | Remove duplicate source URLs |
| boilerplate-removal | Remove navigation, footers, ads |
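The parameterized names in the table above (N and X) can be read back by a program. The sketch below interprets only the length and word-count filters; the function name `passes_filters` is hypothetical, words are assumed to be whitespace-separated, and filters that need an external model (language ID, quality score, profanity) are deliberately skipped rather than implemented.

```python
import re


def passes_filters(text: str, filters: list[str]) -> bool:
    """Return True if `text` survives the length and word-count filters listed."""
    for name in filters:
        match = re.fullmatch(r"(min|max)-length-(\d+)-chars", name)
        if match:
            bound, n = match.group(1), int(match.group(2))
            if bound == "min" and len(text) < n:
                return False
            if bound == "max" and len(text) > n:
                return False
            continue
        match = re.fullmatch(r"min-words-(\d+)", name)
        if match:
            # Assumption: a "word" is any whitespace-separated token.
            if len(text.split()) < int(match.group(1)):
                return False
            continue
        # Any other filter name is ignored in this sketch.
    return True
```

For example, `passes_filters(text, ["min-length-50-chars", "max-length-50000-chars"])` keeps only texts between 50 and 50,000 characters long.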
deduplication
Document deduplication to help users understand data diversity:
{
  "deduplication": {
    "performed": true,
    "method": "exact-hash",
    "approx_removed_fraction": 0.05,
    "notes": "SHA-256 hash of normalized text. Exact duplicates only."
  }
}
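A minimal sketch of the exact-hash approach described in the notes above, which also shows how a value for `approx_removed_fraction` could be computed. The function name `dedup_exact` is hypothetical, and whitespace collapsing stands in for whatever normalization a real pipeline applies before hashing.

```python
import hashlib


def dedup_exact(examples):
    """Keep the first occurrence of each example; return (kept, removed_fraction)."""
    seen = set()
    kept = []
    for text in examples:
        # Assumption: normalization here is just whitespace collapsing.
        normalized = " ".join(text.split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    removed_fraction = 1 - len(kept) / len(examples) if examples else 0.0
    return kept, removed_fraction
```

The second return value is the share of input examples that were dropped, which is the quantity `approx_removed_fraction` is meant to report.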
Common methods:
- exact-hash - Remove identical examples
- minhash-lsh - MinHash with Locality Sensitive Hashing
- simhash - SimHash for near-duplicate detection
- suffix-array - Suffix array deduplication
- bloom-filter - Bloom filter for approximate deduplication
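To give a feel for how the near-duplicate methods differ from exact hashing, here is a rough sketch of the MinHash part of `minhash-lsh`: each document is reduced to a fixed-length signature, and the fraction of matching signature positions estimates the Jaccard similarity of the two documents' shingle sets. The function names, the character 5-gram shingles, and the use of seeded SHA-256 as the hash family are all assumptions for illustration. The LSH banding step that avoids comparing every pair is not shown.

```python
import hashlib


def shingles(text, k=5):
    """Set of character k-grams; short texts yield a single shingle."""
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}


def minhash_signature(items, num_hashes=128):
    """For each seeded hash function, keep the minimum hash value over all items."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()[:8], "big")
            for item in items
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where the two signatures agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)
```

A pipeline built on this idea would treat two examples as near-duplicates when the estimate exceeds a chosen threshold, and could record that threshold and the number of hash functions in `notes`.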
decontamination
Decontamination removes examples that overlap with benchmark test sets, which helps prevent benchmark data from leaking into training:
{
  "decontamination": {
    "performed": true,
    "targets": ["MMLU", "HellaSwag", "ARC-Challenge", "WinoGrande", "TruthfulQA", "GSM8K"],
    "method": "13-gram-overlap",
    "notes": "Following EleutherAI methodology"
  }
}
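One way an n-gram overlap check such as `13-gram-overlap` could work is sketched below. This is an illustration under stated assumptions, not a reproduction of any particular published methodology: tokens are lowercased whitespace-separated words, the function names are hypothetical, and the 0.5 threshold simply mirrors the ">50%" wording used in the earlier example.

```python
def ngrams(text, n=13):
    """Set of word n-grams; texts with fewer than n tokens yield an empty set."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(example, benchmark_ngrams, n=13):
    """Share of the example's n-grams that also occur in the benchmark set."""
    grams = ngrams(example, n)
    if not grams:
        return 0.0
    return len(grams & benchmark_ngrams) / len(grams)


def is_contaminated(example, benchmark_ngrams, n=13, threshold=0.5):
    """Flag the example when more than `threshold` of its n-grams overlap."""
    return overlap_fraction(example, benchmark_ngrams, n) > threshold
```

Here `benchmark_ngrams` would be the union of `ngrams(...)` over every test-set item of the benchmarks listed in `targets`. Because examples shorter than n tokens are never flagged in this sketch, a real pipeline might need a separate rule for short texts.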
pii_redaction
Document PII handling for compliance and trust:
{
  "pii_redaction": {
    "applied": true,
    "method": "custom-regex-plus-ner",
    "notes": "Emails and phones replaced with [EMAIL] and [PHONE]. Names of private individuals replaced with synthetic names. Public figure names retained."
  }
}
Common methods:
- presidio - Microsoft Presidio
- spacy-ner - spaCy named entity recognition
- regex-patterns - Regular expression matching
- manual-review - Human review and redaction
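As a rough illustration of the `regex-patterns` method, the sketch below replaces email addresses and phone numbers with the `[EMAIL]` and `[PHONE]` placeholders mentioned in the example notes. Both patterns and the function name `redact` are simplified assumptions: they will miss some valid formats and may match non-PII digit sequences, so they are not a substitute for a reviewed redaction pipeline.

```python
import re

# Simplified patterns for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text
```

Whatever patterns are used, the `notes` field is a good place to state which PII types were targeted and which placeholders replaced them.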
See Also
- Quality Section - Quality assessments
- Rights Section - Personal data declarations
- PII & Consent Guide - Detailed PII handling guidance