Extensions Section

The extensions section holds vendor-specific or project-specific metadata that doesn’t fit in the standard schema. Because it accepts arbitrary properties, custom tooling can store its own fields here without breaking schema validation.

Schema

{
  "extensions": {
    "type": "object",
    "description": "Optional vendor- or project-specific extensions.",
    "additionalProperties": true
  }
}
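
The schema imposes a single constraint: `extensions`, when present, must be a JSON object. A minimal Python sketch of that check follows. The function name `check_extensions` is hypothetical, and treating a missing section as valid is an assumption based on the word "Optional" in the description.

```python
import json


def check_extensions(card: dict) -> bool:
    """Return True if `extensions` is absent or is a JSON object."""
    if "extensions" not in card:
        # Assumption: the section is optional, per the schema description.
        return True
    # Any keys are allowed inside, since additionalProperties is true.
    return isinstance(card["extensions"], dict)


card = json.loads('{"extensions": {"internal": {"project_code": "LANG-2024-042"}}}')
print(check_extensions(card))                                     # True
print(check_extensions({"extensions": ["not", "an", "object"]}))  # False
```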

Example

{
  "extensions": {
    "huggingface": {
      "dataset_id": "example-org/hausa-news-corpus",
      "config_name": "default",
      "features": {
        "text": "string",
        "source": "string",
        "date": "string"
      },
      "splits": ["train", "validation", "test"]
    },
    "internal": {
      "project_code": "LANG-2024-042",
      "cost_center": "research-nlp",
      "review_ticket": "REVIEW-1234"
    },
    "mlflow": {
      "experiment_id": "exp-hausa-pretrain",
      "run_id": "run-20250115-001"
    }
  }
}
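
Since every namespace in the example above is optional, a consumer will likely want to read extension values defensively. The following is a sketch only; `get_extension` is a hypothetical helper, not part of any published tooling.

```python
from typing import Any


def get_extension(card: dict, namespace: str, key: str, default: Any = None) -> Any:
    """Look up extensions[namespace][key], returning `default` if any level is missing."""
    extensions = card.get("extensions")
    if not isinstance(extensions, dict):
        return default
    section = extensions.get(namespace)
    if not isinstance(section, dict):
        return default
    return section.get(key, default)


card = {
    "extensions": {
        "huggingface": {"dataset_id": "example-org/hausa-news-corpus"},
        "mlflow": {"run_id": "run-20250115-001"},
    }
}
print(get_extension(card, "huggingface", "dataset_id"))   # example-org/hausa-news-corpus
print(get_extension(card, "kaggle", "competition_id"))    # None
```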

Common Extension Patterns

Platform-Specific Metadata

{
  "extensions": {
    "huggingface": {
      "dataset_id": "org/dataset-name",
      "viewer_enabled": true,
      "size_category": "10K<n<100K"
    },
    "kaggle": {
      "competition_id": "hausa-nlp-challenge",
      "kernel_count": 45
    },
    "paperswithcode": {
      "dataset_url": "https://paperswithcode.com/dataset/hausa-news"
    }
  }
}

Internal Tracking

{
  "extensions": {
    "internal": {
      "project_id": "PRJ-2025-001",
      "budget_code": "R&D-NLP",
      "approved_by": "data-governance-team",
      "approval_date": "2025-01-10",
      "retention_policy": "7-years",
      "data_classification": "internal-use"
    }
  }
}

Experiment Tracking

{
  "extensions": {
    "wandb": {
      "entity": "research-team",
      "project": "multilingual-lm",
      "artifact_name": "hausa-corpus:v2"
    },
    "mlflow": {
      "tracking_uri": "https://mlflow.example.org",
      "experiment_name": "hausa-pretrain",
      "registered_model": "hausa-lm-base"
    }
  }
}

Custom Quality Metrics

{
  "extensions": {
    "quality_extended": {
      "custom_scorer_version": "1.2.0",
      "coherence_score": 0.85,
      "factuality_sample_check": {
        "sample_size": 100,
        "accuracy": 0.92
      },
      "bias_audit": {
        "performed": true,
        "report_url": "https://example.org/bias-report.pdf"
      }
    }
  }
}

Dataset Lineage

{
  "extensions": {
    "lineage": {
      "parent_datasets": ["common-crawl:CC-MAIN-2024-05", "wikipedia:20240101"],
      "processing_pipeline": "pipeline-v3.2",
      "pipeline_commit": "abc123def",
      "derived_datasets": ["hausa-news-cleaned:1.1", "hausa-news-translated:1.0"]
    }
  }
}

Regional Compliance

{
  "extensions": {
    "compliance": {
      "gdpr": {
        "applicable": false,
        "reason": "No EU personal data"
      },
      "ccpa": {
        "applicable": false,
        "reason": "No California resident data"
      },
      "ndpa": {
        "applicable": true,
        "compliance_status": "compliant",
        "review_date": "2025-01-15"
      }
    }
  }
}

Best Practices

Use Namespaces

Group related extensions under descriptive keys:

{
  "extensions": {
    "huggingface": { "..." },
    "internal": { "..." },
    "quality": { "..." }
  }
}

Avoid flat structures:

{
  "extensions": {
    "hf_dataset_id": "...",
    "internal_project_id": "...",
    "quality_score": "..."
  }
}
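
One way to enforce the namespace convention is a small lint step that reports top-level extension keys whose values are not objects. This is a sketch under assumptions: `find_flat_keys` is a hypothetical name, and skipping underscore-prefixed keys (such as `_schema`) is a choice made for this illustration, not a rule stated by the schema.

```python
def find_flat_keys(extensions: dict) -> list:
    """Return top-level extension keys whose values are not JSON objects."""
    return sorted(
        key
        for key, value in extensions.items()
        # Assumption: underscore-prefixed keys such as "_schema" are
        # documentation pointers and may hold plain values.
        if not key.startswith("_") and not isinstance(value, dict)
    )


flat = {"hf_dataset_id": "...", "internal_project_id": "...", "quality_score": "..."}
namespaced = {"huggingface": {}, "internal": {}, "quality": {}}
print(find_flat_keys(flat))        # ['hf_dataset_id', 'internal_project_id', 'quality_score']
print(find_flat_keys(namespaced))  # []
```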

Document Your Extensions

If you use custom extensions, document them by linking to their schema and documentation:

{
  "extensions": {
    "_schema": "https://example.org/datacard-extensions/v1.json",
    "_docs": "https://example.org/datacard-extensions/docs",
    "custom_field": "value"
  }
}

Version Your Extensions

Include a version field so that consumers can detect changes to a custom schema:

{
  "extensions": {
    "acme_corp": {
      "_version": "2.0",
      "department": "research",
      "cost_tracking_id": "CT-2025-001"
    }
  }
}
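
A consumer could use the `_version` field to decide whether it understands a namespace. The sketch below assumes a "major.minor" version string as in the example above. `major_version` and `SUPPORTED_MAJOR` are hypothetical names introduced for illustration.

```python
from typing import Optional

SUPPORTED_MAJOR = 2  # hypothetical: the major version this consumer understands


def major_version(namespace: dict) -> Optional[int]:
    """Return the major component of `_version`, or None if missing or malformed."""
    version = namespace.get("_version")
    if not isinstance(version, str):
        return None
    try:
        return int(version.split(".", 1)[0])
    except ValueError:
        return None


acme = {"_version": "2.0", "department": "research", "cost_tracking_id": "CT-2025-001"}
print(major_version(acme) == SUPPORTED_MAJOR)  # True
```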

Interoperability

Don’t Duplicate Core Fields

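Put standard metadata in standard fields. The following shows the duplication to avoid: `extensions.internal_id` repeats the value already stored in `core.id`.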

{
  "core": {
    "id": "my-dataset"
  },
  "extensions": {
    "internal_id": "my-dataset"
  }
}
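
Duplicated values like this can be flagged automatically. The following sketch walks the extensions object and reports any string value that also appears in `core`. `find_core_duplicates` is a hypothetical helper. It compares exact string values only and does not descend into arrays, so treat it as a heuristic, not a complete check.

```python
def find_core_duplicates(card: dict) -> list:
    """Return dotted paths under `extensions` whose string value also appears in `core`."""
    core = card.get("core", {})
    core_values = {value for value in core.values() if isinstance(value, str)}
    hits = []

    def walk(node, path):
        if isinstance(node, dict):
            for key, value in node.items():
                walk(value, path + [key])
        elif isinstance(node, str) and node in core_values:
            hits.append(".".join(path))

    walk(card.get("extensions", {}), ["extensions"])
    return hits


card = {"core": {"id": "my-dataset"}, "extensions": {"internal_id": "my-dataset"}}
print(find_core_duplicates(card))  # ['extensions.internal_id']
```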

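Use Extensions for Platform-Specific Features

Keep platform-neutral access details in standard fields, and put settings that only one platform understands under that platform’s namespace: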

{
  "access": {
    "availability": "public-download",
    "url": "https://huggingface.co/datasets/org/dataset"
  },
  "extensions": {
    "huggingface": {
      "gated": false,
      "viewer": true,
      "library": "datasets"
    }
  }
}

See Also