Sources Section

The sources section provides detailed per-source breakdowns for datasets that aggregate multiple origins or contributors. This is useful for understanding the composition of complex datasets.

Array of Source Objects

Each source object has:

Required Fields

| Field | Type | Description |
| --- | --- | --- |
| id | string | Machine-friendly source identifier |
| label | string | Human-readable source label |

Optional Fields

| Field | Type | Description |
| --- | --- | --- |
| share_of_dataset | number | Approximate fraction of examples (0 to 1) |
| data_kind | enum | Whether this source is “real” or “synthetic” |
| origin | enum | Origin type for this source |
| contributor | string | Key contributor name |
| rights | object | Source-level rights info |

Rights Object (per-source)

| Field | Type | Description |
| --- | --- | --- |
| license | string | License, if different from the dataset-level license |
| contains_personal_data | enum | Personal data level for this source |

Enum Values

data_kind

  • real - Real-world data
  • synthetic - Synthetically generated

origin

  • web-scrape - Scraped from websites
  • official-open-data - Government/institutional data
  • user-upload - User-submitted content
  • partner-license - Licensed from partners
  • fieldwork - Collected through fieldwork
  • synthetic-generation - Generated synthetically
  • other - Other origin
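The field and enum definitions above can be checked mechanically. The following is a minimal sketch in Python, not part of the specification: the function name validate_source is hypothetical, and treating an empty string as a missing value is an assumption.

```python
# Enum values copied from the lists above.
DATA_KINDS = {"real", "synthetic"}
ORIGINS = {
    "web-scrape", "official-open-data", "user-upload", "partner-license",
    "fieldwork", "synthetic-generation", "other",
}
REQUIRED_FIELDS = ("id", "label")


def validate_source(source: dict) -> list[str]:
    """Return a list of problems found in one source object (empty if none)."""
    problems = []
    # Required fields must be present as non-empty strings (assumption:
    # an empty string counts as missing).
    for field in REQUIRED_FIELDS:
        if not isinstance(source.get(field), str) or not source[field]:
            problems.append(f"missing or empty required field: {field}")
    # Optional enums are only checked when present.
    if "data_kind" in source and source["data_kind"] not in DATA_KINDS:
        problems.append(f"unknown data_kind: {source['data_kind']!r}")
    if "origin" in source and source["origin"] not in ORIGINS:
        problems.append(f"unknown origin: {source['origin']!r}")
    # share_of_dataset is a fraction between 0 and 1.
    share = source.get("share_of_dataset")
    if share is not None:
        is_number = isinstance(share, (int, float)) and not isinstance(share, bool)
        if not is_number or not 0 <= share <= 1:
            problems.append("share_of_dataset must be a number between 0 and 1")
    return problems
```

An empty list means the object passed every check covered here. The rights object is not validated, because the allowed contains_personal_data values are not listed in this section.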

Example

{
  "sources": [
    {
      "id": "hausa-news-scrape",
      "label": "Nigerian News Websites",
      "share_of_dataset": 0.45,
      "data_kind": "real",
      "origin": "web-scrape",
      "rights": {
        "license": "CC-BY-4.0",
        "contains_personal_data": "de_minimis"
      }
    },
    {
      "id": "govt-publications",
      "label": "Nigerian Government Publications",
      "share_of_dataset": 0.2,
      "data_kind": "real",
      "origin": "official-open-data",
      "contributor": "Nigerian National Archives",
      "rights": {
        "license": "public-domain",
        "contains_personal_data": "none"
      }
    },
    {
      "id": "community-contributions",
      "label": "Community Volunteer Submissions",
      "share_of_dataset": 0.25,
      "data_kind": "real",
      "origin": "user-upload",
      "contributor": "Hausa Language Community Network",
      "rights": {
        "license": "CC-BY-SA-4.0",
        "contains_personal_data": "pseudonymous"
      }
    },
    {
      "id": "backtranslation-augment",
      "label": "Backtranslation Augmentation",
      "share_of_dataset": 0.1,
      "data_kind": "synthetic",
      "origin": "synthetic-generation",
      "rights": {
        "license": "CC-BY-4.0",
        "contains_personal_data": "none"
      }
    }
  ]
}

Field Details

id

Use consistent, machine-friendly identifiers:

{
  "id": "news-scrape-2024"
}

Good patterns:

  • source-type-year: news-2024, web-2023
  • contributor-id: partner-acme, community-west
  • descriptive: legal-docs, social-media
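The patterns above all use lowercase words joined by hyphens. As a rough sketch, assuming that convention is the intended one and that ids should be unique within a dataset (neither is stated as a rule here), ids could be checked like this. The names ID_PATTERN, is_machine_friendly_id and duplicate_ids are hypothetical.

```python
import re
from collections import Counter

# Assumed convention: lowercase alphanumeric words joined by single hyphens.
ID_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_machine_friendly_id(source_id: str) -> bool:
    """True if the whole id matches the assumed lowercase-hyphen convention."""
    return ID_PATTERN.fullmatch(source_id) is not None


def duplicate_ids(sources: list[dict]) -> list[str]:
    """Return ids that appear on more than one source object, sorted."""
    counts = Counter(s["id"] for s in sources if "id" in s)
    return sorted(source_id for source_id, n in counts.items() if n > 1)
```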

share_of_dataset

Record the approximate fraction of the dataset's examples that come from this source, as a number from 0 to 1:

{
  "share_of_dataset": 0.35
}
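When every source declares a share, the shares can be summed as a sanity check. The sketch below is illustrative only: the helper names and the 0.05 tolerance are assumptions, chosen because the shares are described as approximate.

```python
def share_total(sources):
    """Sum the declared shares; return None if any source omits share_of_dataset."""
    shares = [s.get("share_of_dataset") for s in sources]
    if any(share is None for share in shares):
        return None
    return sum(shares)


def shares_look_complete(sources, tolerance=0.05):
    """True if all sources declare a share and the total is close to 1."""
    total = share_total(sources)
    return total is not None and abs(total - 1.0) <= tolerance
```

Applied to the four sources in the example earlier in this section (0.45, 0.2, 0.25 and 0.1), shares_look_complete returns True.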

Per-Source Rights

When sources carry different licenses, record each license in that source's rights object:

{
  "sources": [
    {
      "id": "open-source",
      "label": "Open Source Component",
      "rights": {
        "license": "Apache-2.0"
      }
    },
    {
      "id": "cc-content",
      "label": "Creative Commons Content",
      "rights": {
        "license": "CC-BY-4.0"
      }
    }
  ]
}

The overall dataset license (in rights.license) should be the most restrictive applicable license, or a custom license addressing all components.
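Because a per-source license is only given when it differs from the dataset-level one, a consumer may want to resolve the license that applies to each source. The sketch below assumes that a source without its own license inherits rights.license from the dataset; the function names are hypothetical. It does not try to rank licenses by restrictiveness, which needs legal judgment rather than code.

```python
def effective_license(source, dataset_rights):
    """Per-source license if given, otherwise the dataset-level license."""
    own = source.get("rights", {}).get("license")
    return own or dataset_rights.get("license")


def licenses_in_use(sources, dataset_rights):
    """Sorted list of distinct licenses that apply across all sources."""
    found = set()
    for source in sources:
        lic = effective_license(source, dataset_rights)
        if lic:
            found.add(lic)
    return sorted(found)
```

A result with more than one entry is a prompt to check that the dataset-level license covers every component.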

When to Use Sources

Good Use Cases

The snippets below show only the fields relevant to each case. A complete source object also needs the required id and label fields.

  1. Multi-contributor datasets

    {
      "sources": [
        { "id": "contributor-a", "contributor": "University A" },
        { "id": "contributor-b", "contributor": "University B" }
      ]
    }
  2. Mixed real/synthetic data

    {
      "sources": [
        { "id": "real-data", "data_kind": "real" },
        { "id": "augmented", "data_kind": "synthetic" }
      ]
    }
  3. Multiple licenses

    {
      "sources": [
        { "id": "open", "rights": { "license": "CC0-1.0" } },
        { "id": "attributed", "rights": { "license": "CC-BY-4.0" } }
      ]
    }
  4. Different personal data levels

    {
      "sources": [
        { "id": "public-docs", "rights": { "contains_personal_data": "none" } },
        { "id": "user-content", "rights": { "contains_personal_data": "pseudonymous" } }
      ]
    }

When Not Needed

Skip the sources section when:

  • The dataset comes from a single homogeneous source
  • All data shares the same license and characteristics
  • A per-source breakdown would add no useful information
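The criteria above can be approximated with a simple heuristic. This is a sketch under assumptions, not a rule from the specification: the function name is hypothetical, and "characteristics" is taken to mean the data_kind, origin, contributor, license and contains_personal_data fields.

```python
def sources_add_information(sources):
    """Heuristic: a breakdown is worthwhile if the sources differ in some field."""
    if len(sources) < 2:
        return False  # a single source needs no breakdown

    def profile(source):
        # Collect the fields that distinguish one source from another.
        rights = source.get("rights", {})
        return (
            source.get("data_kind"),
            source.get("origin"),
            source.get("contributor"),
            rights.get("license"),
            rights.get("contains_personal_data"),
        )

    # More than one distinct profile means the sources are not homogeneous.
    return len({profile(s) for s in sources}) > 1
```

Sources that differ only in share_of_dataset are treated as homogeneous here. Whether that split is still worth documenting is a judgment call.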

Relationship with Provenance

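The provenance section describes the provenance of the dataset as a whole, while the sources section gives per-component detail. In the example below, each origin value in sources also appears in provenance.source_types (the required label field is omitted for brevity):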

{
  "provenance": {
    "source_types": ["web-scrape", "official-open-data", "user-upload"],
    "geography": ["NG", "GH"],
    "collection_notes": "Aggregated from three source types..."
  },
  "sources": [
    {
      "id": "web",
      "origin": "web-scrape",
      "share_of_dataset": 0.5
    },
    {
      "id": "govt",
      "origin": "official-open-data",
      "share_of_dataset": 0.3
    },
    {
      "id": "community",
      "origin": "user-upload",
      "share_of_dataset": 0.2
    }
  ]
}
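In the example above, the origin values in sources match provenance.source_types exactly. The specification text here does not state that they must agree, so the following consistency check is only a sketch of one possible lint. The function name origin_mismatches is hypothetical.

```python
def origin_mismatches(metadata):
    """Compare provenance.source_types with the origins used in sources."""
    declared = set(metadata.get("provenance", {}).get("source_types", []))
    used = {s["origin"] for s in metadata.get("sources", []) if "origin" in s}
    return {
        # Origins used by a source but not declared at dataset level.
        "missing_from_provenance": sorted(used - declared),
        # Declared source types that no source object accounts for.
        "missing_from_sources": sorted(declared - used),
    }
```

For the example above, both lists come back empty.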

See Also