Processing Section

The processing section documents the transformations applied to convert raw data into the released dataset. This includes normalization, filtering, deduplication, decontamination, and PII handling.

Fields

All fields in the processing section are optional.

| Field | Type | Description |
| --- | --- | --- |
| normalization | array | Normalization steps applied |
| filters | array | Filters applied to remove examples |
| deduplication | object | Deduplication details |
| decontamination | object | Benchmark decontamination steps |
| pii_redaction | object | PII redaction/anonymization steps |

Deduplication Object

| Field | Type | Description |
| --- | --- | --- |
| performed | boolean | Whether deduplication was performed |
| method | string | Method used (e.g., "exact-hash", "minhash-lsh", "simhash") |
| approx_removed_fraction | number | Approximate fraction of examples removed, between 0 and 1 |
| notes | string | Additional notes |

Decontamination Object

| Field | Type | Description |
| --- | --- | --- |
| performed | boolean | Whether decontamination was performed |
| targets | array | Benchmarks checked against |
| method | string | Decontamination method |
| notes | string | Additional notes |

PII Redaction Object

| Field | Type | Description |
| --- | --- | --- |
| applied | boolean | Whether PII redaction was applied |
| method | string | Redaction method |
| notes | string | Additional notes |

Example

{
  "processing": {
    "normalization": ["unicode-nfc", "whitespace-normalize", "lowercase-metadata-fields"],
    "filters": [
      "language-id-threshold-0.8",
      "min-length-50-chars",
      "max-length-50000-chars",
      "profanity-filter-high-confidence"
    ],
    "deduplication": {
      "performed": true,
      "method": "minhash-lsh",
      "approx_removed_fraction": 0.15,
      "notes": "Used 128 hash functions with Jaccard threshold 0.8"
    },
    "decontamination": {
      "performed": true,
      "targets": ["MMLU", "HellaSwag", "ARC", "WinoGrande"],
      "method": "13-gram-overlap",
      "notes": "Removed examples with >50% 13-gram overlap with test sets"
    },
    "pii_redaction": {
      "applied": true,
      "method": "presidio-analyzer",
      "notes": "Email, phone, SSN, and credit card patterns replaced with [REDACTED]. Names retained for news articles about public figures."
    }
  }
}
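
A section like the one above can be sanity-checked before release. The following is a minimal sketch, not part of the specification: the `check_processing` helper and the `EXPECTED_TYPES` mapping are hypothetical names, and the expected types are simply read off the field tables above (JSON arrays map to Python lists, JSON objects to dicts).

```python
# Hypothetical helper; expected types mirror the field tables in this section.
EXPECTED_TYPES = {
    "normalization": list,
    "filters": list,
    "deduplication": dict,
    "decontamination": dict,
    "pii_redaction": dict,
}


def check_processing(processing):
    """Return a list of problems; an empty list means no issues were found."""
    problems = []
    # Every field is optional, so only fields that are present get checked.
    for name, expected in EXPECTED_TYPES.items():
        if name in processing and not isinstance(processing[name], expected):
            found = type(processing[name]).__name__
            problems.append(f"{name}: expected {expected.__name__}, got {found}")

    # approx_removed_fraction is documented as a number between 0 and 1.
    dedup = processing.get("deduplication")
    if isinstance(dedup, dict):
        fraction = dedup.get("approx_removed_fraction")
        if fraction is not None:
            is_number = isinstance(fraction, (int, float))
            if not is_number or not 0 <= fraction <= 1:
                problems.append("deduplication.approx_removed_fraction: expected a number in [0, 1]")
    return problems
```

Calling `check_processing(card["processing"])` on a parsed dataset card (where `card` is whatever dict your JSON loader returns) yields an empty list when the types match.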

Field Details

normalization

Common normalization steps:

| Step | Description |
| --- | --- |
| unicode-nfc | Unicode NFC normalization |
| unicode-nfkc | Unicode NFKC normalization |
| whitespace-normalize | Collapse multiple spaces, normalize line endings |
| lowercase | Convert to lowercase |
| strip-html | Remove HTML tags |
| fix-encoding | Repair encoding issues |
| diacritics-preserve | Preserve diacritical marks (important for some languages) |
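
To make two of these labels concrete, here is a minimal sketch of what `unicode-nfc` followed by `whitespace-normalize` might look like. The `normalize_text` function is a hypothetical name, and the exact whitespace rules (collapsing runs of spaces and tabs, converting line endings to `\n`) are assumptions; a real pipeline may define these steps differently.

```python
import re
import unicodedata


def normalize_text(text):
    """Apply NFC normalization, then a simple whitespace cleanup."""
    # unicode-nfc: compose characters, e.g. "e" + combining accent into one code point.
    text = unicodedata.normalize("NFC", text)
    # whitespace-normalize (assumed rules): unify line endings to "\n" ...
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # ... and collapse runs of spaces or tabs into a single space.
    return re.sub(r"[ \t]+", " ", text)
```

Listing the steps in the order they were applied makes the `normalization` array easier to reproduce.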

filters

Common filtering criteria:

| Filter | Description |
| --- | --- |
| language-id-threshold-X | Keep examples with language confidence >= X |
| min-length-N-chars | Remove examples shorter than N characters |
| max-length-N-chars | Remove examples longer than N characters |
| min-words-N | Remove examples with fewer than N words |
| profanity-filter | Remove high-toxicity content |
| quality-score-threshold-X | Keep examples with quality score >= X |
| duplicate-url-removal | Remove duplicate source URLs |
| boilerplate-removal | Remove navigation, footers, ads |
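
The threshold semantics in the table can be sketched as a single predicate. This is an illustration only: `keep_example` and its parameters are hypothetical, the defaults are borrowed from the example earlier in this section, and the language confidence is assumed to come from whatever language-identification model the pipeline uses.

```python
def keep_example(text, lang_confidence, min_chars=50, max_chars=50000, lang_threshold=0.8):
    """Return True if the example passes the length and language filters."""
    # language-id-threshold-X: keep examples with confidence >= X.
    if lang_confidence < lang_threshold:
        return False
    # min-length-N-chars / max-length-N-chars: drop examples outside [min, max].
    return min_chars <= len(text) <= max_chars
```

Note that both length bounds are inclusive here, matching "shorter than N" and "longer than N" in the table.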

deduplication

Record how deduplication was performed so that users can judge the diversity of the data:

{
  "deduplication": {
    "performed": true,
    "method": "exact-hash",
    "approx_removed_fraction": 0.05,
    "notes": "SHA-256 hash of normalized text. Exact duplicates only."
  }
}

Common methods:

  • exact-hash - Remove identical examples
  • minhash-lsh - MinHash with Locality Sensitive Hashing
  • simhash - SimHash for near-duplicate detection
  • suffix-array - Suffix array deduplication
  • bloom-filter - Bloom filter for approximate deduplication
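
As an illustration of the `exact-hash` method and of how `approx_removed_fraction` could be computed, here is a minimal sketch. The `dedup_exact` function is a hypothetical name, and the normalization applied before hashing (collapsing whitespace) is an assumption; the example above only states that a SHA-256 hash of normalized text was used.

```python
import hashlib


def dedup_exact(texts):
    """Drop exact duplicates; return (kept_texts, removed_fraction)."""
    seen = set()
    kept = []
    for text in texts:
        # Assumed normalization: collapse all whitespace runs to single spaces.
        normalized = " ".join(text.split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(text)  # keep the first occurrence, in original form
    removed_fraction = 1 - len(kept) / len(texts) if texts else 0.0
    return kept, removed_fraction
```

The second return value is the kind of number that belongs in `approx_removed_fraction`, rounded to a sensible precision.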

decontamination

Decontamination removes examples that overlap with benchmark test sets, which reduces the risk of benchmark data leaking into training data:

{
  "decontamination": {
    "performed": true,
    "targets": ["MMLU", "HellaSwag", "ARC-Challenge", "WinoGrande", "TruthfulQA", "GSM8K"],
    "method": "13-gram-overlap",
    "notes": "Following EleutherAI methodology"
  }
}
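
An n-gram overlap check of the kind named in the `method` field can be sketched as follows. This is a simplified illustration, not the EleutherAI procedure: the function names are hypothetical, tokenization is assumed to be lowercased whitespace splitting, and the 50% threshold is taken from the notes in the example earlier in this section.

```python
def ngrams(text, n=13):
    """Return the set of word-level n-grams in text (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(example, benchmark_ngrams, n=13):
    """Fraction of the example's n-grams that also occur in the benchmark set."""
    grams = ngrams(example, n)
    if not grams:
        return 0.0  # example is shorter than n tokens
    return len(grams & benchmark_ngrams) / len(grams)


def is_contaminated(example, benchmark_ngrams, n=13, threshold=0.5):
    """Flag examples whose overlap exceeds the threshold (assumed: > 50%)."""
    return overlap_fraction(example, benchmark_ngrams, n) > threshold
```

In practice `benchmark_ngrams` would be the union of `ngrams(...)` over every test item of every benchmark listed in `targets`.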

pii_redaction

Record how PII was handled to support compliance reviews and user trust:

{
  "pii_redaction": {
    "applied": true,
    "method": "custom-regex-plus-ner",
    "notes": "Emails and phones replaced with [EMAIL] and [PHONE]. Names of private individuals replaced with synthetic names. Public figure names retained."
  }
}

Common methods:

  • presidio - Microsoft Presidio
  • spacy-ner - spaCy named entity recognition
  • regex-patterns - Regular expression matching
  • manual-review - Human review and redaction
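
The `regex-patterns` approach can be sketched as below, using the `[EMAIL]` and `[PHONE]` placeholders from the example above. The patterns and the `redact` function are illustrative assumptions: the phone pattern only covers one US-style format, and neither pattern is thorough enough for production use without an additional method such as NER or manual review.

```python
import re

# Illustrative patterns only; real pipelines need broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")  # e.g. 555-123-4567


def redact(text):
    """Replace email addresses and phone numbers with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text
```

Whatever placeholders are used, stating them in `notes` lets downstream users recognize redacted spans.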

See Also