Provenance Section

The provenance section documents the origins of your dataset. Understanding data provenance is crucial for assessing quality, potential biases, and appropriateness for specific use cases.

Required Fields

Field          Type    Description
source_types   array   Types of sources used to construct the dataset

Optional Fields

Field                   Type            Description
geography               array           Countries primarily represented (ISO 3166-1 alpha-2 codes)
collection_start_date   string (date)   When data collection started (ISO 8601)
collection_end_date     string (date)   When data collection ended (ISO 8601)
collection_notes        string          Description of collection process, incentives, and known biases

Enum Values

source_types

Value                  Description
web-scrape             Data scraped from websites
official-open-data     Government or institutional open data
user-upload            Data voluntarily submitted by users
partner-license        Data licensed from partners or vendors
fieldwork              Data collected through fieldwork or surveys
synthetic-generation   Synthetically generated data
other                  Other source types

Example

{
  "provenance": {
    "source_types": ["web-scrape", "official-open-data"],
    "geography": ["NG", "GH", "KE"],
    "collection_start_date": "2022-01-01",
    "collection_end_date": "2024-12-31",
    "collection_notes": "Data collected from publicly available news websites and government portals. Sampling prioritized regional diversity and topic coverage. Known bias toward urban content due to greater online presence."
  }
}
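
The field tables above can be turned into a quick structural check. The following is a minimal sketch, not an official validator: the helper name check_provenance_fields is hypothetical, and it assumes the JSON types map to Python types in the usual way (array to list, string to str).

```python
# Expected Python type for each documented field (array -> list, string -> str).
FIELD_TYPES = {
    "source_types": list,
    "geography": list,
    "collection_start_date": str,
    "collection_end_date": str,
    "collection_notes": str,
}
# Only source_types is listed under Required Fields.
REQUIRED_FIELDS = ("source_types",)


def check_provenance_fields(provenance):
    """Return a list of problems found in a provenance object (hypothetical helper)."""
    problems = []
    for name in REQUIRED_FIELDS:
        if name not in provenance:
            problems.append(f"missing required field: {name}")
    for name, expected in FIELD_TYPES.items():
        if name in provenance and not isinstance(provenance[name], expected):
            actual = type(provenance[name]).__name__
            problems.append(f"{name} should be {expected.__name__}, got {actual}")
    return problems
```

An empty list means the object has the required field and every documented field present has the expected type. The sketch does not check field contents.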

Field Details

source_types

Most datasets combine multiple source types. List all that apply:

{
  "source_types": ["web-scrape", "user-upload", "synthetic-generation"]
}
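
One way to catch misspelled values is to compare each entry against the enum listed under Enum Values. This is a small sketch; the function name invalid_source_types is an assumption made for illustration.

```python
# The seven values documented under "Enum Values" for source_types.
ALLOWED_SOURCE_TYPES = frozenset({
    "web-scrape",
    "official-open-data",
    "user-upload",
    "partner-license",
    "fieldwork",
    "synthetic-generation",
    "other",
})


def invalid_source_types(source_types):
    """Return the entries that are not documented enum values (hypothetical helper)."""
    return [value for value in source_types if value not in ALLOWED_SOURCE_TYPES]
```

For the example above the function returns an empty list; a value such as "scraped" would be returned as invalid.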

geography

Use ISO 3166-1 alpha-2 country codes. Common examples:

  • US - United States
  • GB - United Kingdom
  • NG - Nigeria
  • KE - Kenya
  • IN - India
  • BR - Brazil

{
  "geography": ["NG", "GH", "CM", "SN"]
}

If the geographic origin is unknown or global, you can omit this field.
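
A lightweight shape check can flag entries that cannot be alpha-2 codes, such as lowercase or three-letter codes. The sketch below only tests for two uppercase ASCII letters; it does not confirm that a code is actually assigned in ISO 3166-1, and the helper name malformed_country_codes is hypothetical.

```python
import re

# Shape of an ISO 3166-1 alpha-2 code: exactly two uppercase ASCII letters.
# This checks format only, not whether the code is officially assigned.
ALPHA2_SHAPE = re.compile(r"[A-Z]{2}")


def malformed_country_codes(geography):
    """Return entries that do not look like alpha-2 codes (hypothetical helper)."""
    return [
        code
        for code in geography
        if not isinstance(code, str) or not ALPHA2_SHAPE.fullmatch(code)
    ]
```

To verify membership as well, the shape check could be paired with a maintained list of assigned codes.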

collection_notes

Use this field to document:

  1. Collection methodology - How was data gathered?
  2. Sampling strategy - How were sources selected?
  3. Incentive structure - Were contributors paid or incentivized?
  4. Community involvement - Were local communities consulted?
  5. Known biases - What systematic biases exist?

{
  "collection_notes": "Data collected through community volunteer network across 12 Nigerian states. Contributors received mobile airtime credits per verified submission. Over-representation of Kano and Lagos states due to volunteer density. Under-representation of rural communities due to limited internet access."
}

Temporal Coverage

The collection_start_date and collection_end_date fields help users understand:

  • Dataset currency (how recent is the data?)
  • Temporal scope (what time period does it cover?)
  • Update frequency (is this a snapshot or ongoing collection?)

{
  "collection_start_date": "2020-01-01",
  "collection_end_date": "2024-06-30"
}
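
Both dates can be parsed with Python's standard library to confirm they are valid calendar dates and that the window is not inverted. This is a sketch under the assumption that dates use the YYYY-MM-DD form shown in the examples; check_collection_window is a hypothetical name.

```python
from datetime import date

DATE_FIELDS = ("collection_start_date", "collection_end_date")


def check_collection_window(provenance):
    """Return problems with the collection dates (hypothetical helper)."""
    problems = []
    parsed = {}
    for name in DATE_FIELDS:
        value = provenance.get(name)
        if value is None:
            continue  # both date fields are optional
        try:
            parsed[name] = date.fromisoformat(value)
        except (TypeError, ValueError):
            problems.append(f"{name} is not a valid ISO 8601 date: {value!r}")
    if len(parsed) == 2 and parsed[DATE_FIELDS[0]] > parsed[DATE_FIELDS[1]]:
        problems.append("collection_start_date is after collection_end_date")
    return problems
```

The ordering check runs only when both dates are present and parse cleanly.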

Multi-Source Datasets

For datasets that aggregate multiple sources with different provenance characteristics, consider using the sources section for per-source breakdowns:

{
  "provenance": {
    "source_types": ["web-scrape", "official-open-data", "fieldwork"],
    "geography": ["NG", "GH"],
    "collection_notes": "Aggregated from three sources with different collection methods."
  },
  "sources": [
    {
      "id": "news-scrape",
      "label": "News Website Scrape",
      "share_of_dataset": 0.6,
      "origin": "web-scrape"
    },
    {
      "id": "govt-data",
      "label": "Government Portal",
      "share_of_dataset": 0.25,
      "origin": "official-open-data"
    },
    {
      "id": "field-recordings",
      "label": "Field Recordings",
      "share_of_dataset": 0.15,
      "origin": "fieldwork"
    }
  ]
}
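
A consistency check between the two sections might look like the sketch below. It rests on two assumptions inferred from the example rather than stated by the schema: that each source's origin should appear in provenance.source_types, and that the share_of_dataset values should sum to 1.0. The function name check_sources and the tolerance parameter are hypothetical.

```python
import math


def check_sources(document, tolerance=1e-6):
    """Cross-check "sources" against "provenance" (hypothetical helper).

    Assumes origins should be declared in source_types and shares should
    sum to 1.0; neither rule is confirmed by the schema text.
    """
    problems = []
    declared = set(document.get("provenance", {}).get("source_types", []))
    sources = document.get("sources", [])
    for source in sources:
        origin = source.get("origin")
        if origin not in declared:
            problems.append(
                f"{source.get('id')}: origin {origin!r} not listed in provenance.source_types"
            )
    total = sum(source.get("share_of_dataset", 0.0) for source in sources)
    if sources and not math.isclose(total, 1.0, abs_tol=tolerance):
        problems.append(f"share_of_dataset values sum to {total:.3f}, expected 1.0")
    return problems
```

Applied to the example above, the function finds no problems: all three origins are declared and the shares 0.6, 0.25, and 0.15 add up to 1.0 within the tolerance.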

See Also