Provenance Section

The provenance section documents the origins of your dataset. Understanding data provenance is crucial for assessing quality, potential biases, and appropriateness for specific use cases.

Required Fields

Field	Type	Description
`source_types`	array	Types of sources used to construct the dataset

Optional Fields

Field	Type	Description
`geography`	array	Countries primarily represented (ISO 3166-1 alpha-2 codes)
`collection_start_date`	string (date)	Date data collection started (ISO 8601 calendar date, e.g. `2022-01-01`)
`collection_end_date`	string (date)	Date data collection ended (ISO 8601 calendar date, e.g. `2024-12-31`)
`collection_notes`	string	Description of collection process, incentives, and known biases

Enum Values

source_types

Value	Description
`web-scrape`	Data scraped from websites
`official-open-data`	Government or institutional open data
`user-upload`	Data voluntarily submitted by users
`partner-license`	Data licensed from partners or vendors
`fieldwork`	Data collected through fieldwork or surveys
`synthetic-generation`	Synthetically generated data
`other`	Other source types

Example

{
  "provenance": {
    "source_types": ["web-scrape", "official-open-data"],
    "geography": ["NG", "GH", "KE"],
    "collection_start_date": "2022-01-01",
    "collection_end_date": "2024-12-31",
    "collection_notes": "Data collected from publicly available news websites and government portals. Sampling prioritized regional diversity and topic coverage. Known bias toward urban content due to greater online presence."
  }
}

Field Details

source_types

Most datasets combine multiple source types. List all that apply:

{
  "source_types": ["web-scrape", "user-upload", "synthetic-generation"]
}
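
A consumer of this section might want to check `source_types` against the enum values listed above. The following is a minimal sketch, not official tooling; the function name `invalid_source_types` is hypothetical, and the allowed set is copied from the enum table in this document.

```python
# Allowed values, copied from the source_types enum table in this document.
ALLOWED_SOURCE_TYPES = {
    "web-scrape",
    "official-open-data",
    "user-upload",
    "partner-license",
    "fieldwork",
    "synthetic-generation",
    "other",
}


def invalid_source_types(provenance: dict) -> list:
    """Return the entries of source_types that are not in the enum.

    Raises ValueError if the required field is missing or is not a list.
    (Hypothetical helper, for illustration only.)
    """
    values = provenance.get("source_types")
    if not isinstance(values, list):
        raise ValueError("source_types is required and must be an array")
    # Non-string entries are reported as invalid rather than raising.
    return [v for v in values
            if not (isinstance(v, str) and v in ALLOWED_SOURCE_TYPES)]


provenance = {"source_types": ["web-scrape", "user-upload", "synthetic-generation"]}
print(invalid_source_types(provenance))  # prints []
```

An empty result means every listed value is a recognized source type.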

geography

Use ISO 3166-1 alpha-2 country codes. Common examples:

US - United States
GB - United Kingdom
NG - Nigeria
KE - Kenya
IN - India
BR - Brazil

{
  "geography": ["NG", "GH", "CM", "SN"]
}

If the geographic origin is unknown or global, you can omit this field.
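
As a rough illustration, the shape of each `geography` entry can be checked with a regular expression. This sketch only confirms that a value looks like an alpha-2 code (two uppercase ASCII letters); it does not confirm that the code is actually assigned in ISO 3166-1. The helper name `malformed_country_codes` is hypothetical.

```python
import re

# Two uppercase ASCII letters: the shape of an ISO 3166-1 alpha-2 code.
# This does NOT verify that the code is assigned to a country.
ALPHA2_SHAPE = re.compile(r"[A-Z]{2}")


def malformed_country_codes(provenance: dict) -> list:
    """Return geography entries that do not look like alpha-2 codes.

    The field is optional, so a missing key yields an empty list.
    (Hypothetical helper, for illustration only.)
    """
    codes = provenance.get("geography", [])
    return [c for c in codes
            if not (isinstance(c, str) and ALPHA2_SHAPE.fullmatch(c))]


print(malformed_country_codes({"geography": ["NG", "GH", "CM", "SN"]}))  # prints []
print(malformed_country_codes({"geography": ["NG", "gh", "KEN"]}))       # prints ['gh', 'KEN']
```

Lowercase codes and alpha-3 codes such as `KEN` are flagged, which catches the most common formatting mistakes.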

collection_notes

Use this field to document:

Collection methodology - How was data gathered?
Sampling strategy - How were sources selected?
Incentive structure - Were contributors paid or incentivized?
Community involvement - Were local communities consulted?
Known biases - What systematic biases exist?

{
  "collection_notes": "Data collected through community volunteer network across 12 Nigerian states. Contributors received mobile airtime credits per verified submission. Over-representation of Kano and Lagos states due to volunteer density. Under-representation of rural communities due to limited internet access."
}

Temporal Coverage

The `collection_start_date` and `collection_end_date` fields help users understand:

Dataset currency (how recent is the data?)
Temporal scope (what time period does it cover?)
Update frequency (is this a snapshot or ongoing collection?)

{
  "collection_start_date": "2020-01-01",
  "collection_end_date": "2024-06-30"
}
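
One way to sanity-check the two dates is to parse them with Python's standard library and confirm that the end date does not precede the start date. This is a sketch under the assumption that both values are plain `YYYY-MM-DD` strings, as in the examples here; the helper name `collection_period_days` is hypothetical.

```python
from datetime import date


def collection_period_days(provenance: dict):
    """Return the number of days between start and end, or None.

    Returns None when either optional date is absent. Raises ValueError
    if a date is malformed or the end date precedes the start date.
    (Hypothetical helper, for illustration only.)
    """
    start_raw = provenance.get("collection_start_date")
    end_raw = provenance.get("collection_end_date")
    if start_raw is None or end_raw is None:
        return None

    # date.fromisoformat raises ValueError on strings it cannot parse.
    start = date.fromisoformat(start_raw)
    end = date.fromisoformat(end_raw)
    if end < start:
        raise ValueError("collection_end_date is earlier than collection_start_date")
    return (end - start).days


period = collection_period_days({
    "collection_start_date": "2020-01-01",
    "collection_end_date": "2024-06-30",
})
print(period)
```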

Multi-Source Datasets

For datasets that aggregate multiple sources with different provenance characteristics, consider using the `sources` section for per-source breakdowns:

{
  "provenance": {
    "source_types": ["web-scrape", "official-open-data", "fieldwork"],
    "geography": ["NG", "GH"],
    "collection_notes": "Aggregated from three sources with different collection methods."
  },
  "sources": [
    {
      "id": "news-scrape",
      "label": "News Website Scrape",
      "share_of_dataset": 0.6,
      "origin": "web-scrape"
    },
    {
      "id": "govt-data",
      "label": "Government Portal",
      "share_of_dataset": 0.25,
      "origin": "official-open-data"
    },
    {
      "id": "field-recordings",
      "label": "Field Recordings",
      "share_of_dataset": 0.15,
      "origin": "fieldwork"
    }
  ]
}
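
In the example above, each `origin` appears in `provenance.source_types` and the `share_of_dataset` values add up to 1.0. The sketch below checks both properties. Note that these are assumptions drawn from the example, not rules stated by this document, and the function name `check_sources` is hypothetical.

```python
import math


def check_sources(record: dict) -> list:
    """Return a list of human-readable consistency problems.

    Assumed rules (inferred from the example, not from a stated spec):
      1. every source's origin is listed in provenance.source_types
      2. share_of_dataset values sum to 1.0
    """
    problems = []
    declared = set(record.get("provenance", {}).get("source_types", []))
    sources = record.get("sources", [])

    for src in sources:
        origin = src.get("origin")
        if origin not in declared:
            problems.append(
                f"{src.get('id')}: origin {origin!r} not in provenance.source_types"
            )

    total = sum(src.get("share_of_dataset", 0) for src in sources)
    # Compare with a tolerance, since shares are floating-point values.
    if sources and not math.isclose(total, 1.0, abs_tol=1e-6):
        problems.append(f"share_of_dataset values sum to {total}, expected 1.0")
    return problems


record = {
    "provenance": {"source_types": ["web-scrape", "official-open-data", "fieldwork"]},
    "sources": [
        {"id": "news-scrape", "share_of_dataset": 0.6, "origin": "web-scrape"},
        {"id": "govt-data", "share_of_dataset": 0.25, "origin": "official-open-data"},
        {"id": "field-recordings", "share_of_dataset": 0.15, "origin": "fieldwork"},
    ],
}
print(check_sources(record))  # prints []
```

The tolerance in `math.isclose` avoids false alarms from floating-point rounding when the shares are summed.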
