Sources Section

The sources section provides detailed per-source breakdowns for datasets that aggregate multiple origins or contributors. This is useful for understanding the composition of complex datasets.

Array of Source Objects

Each source object has:

Required Fields

Field	Type	Description
`id`	string	Machine-friendly source identifier
`label`	string	Human-readable source label

Optional Fields

Field	Type	Description
`share_of_dataset`	number	Approximate fraction of examples (0-1)
`data_kind`	enum	Whether this source is “real” or “synthetic”
`origin`	enum	Origin type for this source
`contributor`	string	Key contributor name
`rights`	object	Source-level rights info

Rights Object (per-source)

Field	Type	Description
`license`	string	License if different from dataset-level
`contains_personal_data`	enum	Personal data level for this source

Enum Values

data_kind

real - Real-world data
synthetic - Synthetically generated

origin

web-scrape - Scraped from websites
official-open-data - Government/institutional data
user-upload - User-submitted content
partner-license - Licensed from partners
fieldwork - Collected through fieldwork
synthetic-generation - Generated synthetically
other - Other origin

Example

{
  "sources": [
    {
      "id": "hausa-news-scrape",
      "label": "Nigerian News Websites",
      "share_of_dataset": 0.45,
      "data_kind": "real",
      "origin": "web-scrape",
      "rights": {
        "license": "CC-BY-4.0",
        "contains_personal_data": "de_minimis"
      }
    },
    {
      "id": "govt-publications",
      "label": "Nigerian Government Publications",
      "share_of_dataset": 0.2,
      "data_kind": "real",
      "origin": "official-open-data",
      "contributor": "Nigerian National Archives",
      "rights": {
        "license": "public-domain",
        "contains_personal_data": "none"
      }
    },
    {
      "id": "community-contributions",
      "label": "Community Volunteer Submissions",
      "share_of_dataset": 0.25,
      "data_kind": "real",
      "origin": "user-upload",
      "contributor": "Hausa Language Community Network",
      "rights": {
        "license": "CC-BY-SA-4.0",
        "contains_personal_data": "pseudonymous"
      }
    },
    {
      "id": "backtranslation-augment",
      "label": "Backtranslation Augmentation",
      "share_of_dataset": 0.1,
      "data_kind": "synthetic",
      "origin": "synthetic-generation",
      "rights": {
        "license": "CC-BY-4.0",
        "contains_personal_data": "none"
      }
    }
  ]
}

Field Details

id

Use consistent, machine-friendly identifiers:

{
  "id": "news-scrape-2024"
}

Good patterns:

source-type-year: news-2024, web-2023
contributor-id: partner-acme, community-west
descriptive: legal-docs, social-media

share_of_dataset

Document composition:

{
  "share_of_dataset": 0.35
}

Per-Source Rights

When sources have different licenses:

{
  "sources": [
    {
      "id": "open-source",
      "label": "Open Source Component",
      "rights": {
        "license": "Apache-2.0"
      }
    },
    {
      "id": "cc-content",
      "label": "Creative Commons Content",
      "rights": {
        "license": "CC-BY-4.0"
      }
    }
  ]
}

The overall dataset license (in rights.license) should be the most restrictive applicable license, or a custom license addressing all components.

When to Use Sources

Good Use Cases

Multi-contributor datasets

{
  "sources": [
    { "id": "contributor-a", "contributor": "University A" },
    { "id": "contributor-b", "contributor": "University B" }
  ]
}

Mixed real/synthetic data

{
  "sources": [
    { "id": "real-data", "data_kind": "real" },
    { "id": "augmented", "data_kind": "synthetic" }
  ]
}

Multiple licenses

{
  "sources": [
    { "id": "open", "rights": { "license": "CC0-1.0" } },
    { "id": "attributed", "rights": { "license": "CC-BY-4.0" } }
  ]
}

Different personal data levels

{
  "sources": [
    { "id": "public-docs", "rights": { "contains_personal_data": "none" } },
    { "id": "user-content", "rights": { "contains_personal_data": "pseudonymous" } }
  ]
}

When Not Needed

Skip sources when:

Single homogeneous source
All data has same license and characteristics
Source breakdown would add no useful information

Relationship with Provenance

The provenance section provides overall dataset provenance. The sources section provides per-component detail:

{
  "provenance": {
    "source_types": ["web-scrape", "official-open-data", "user-upload"],
    "geography": ["NG", "GH"],
    "collection_notes": "Aggregated from three source types..."
  },
  "sources": [
    {
      "id": "web",
      "origin": "web-scrape",
      "share_of_dataset": 0.5
    },
    {
      "id": "govt",
      "origin": "official-open-data",
      "share_of_dataset": 0.3
    },
    {
      "id": "community",
      "origin": "user-upload",
      "share_of_dataset": 0.2
    }
  ]
}