Provenance Section
The provenance section documents the origins of your dataset. Understanding data provenance is crucial for assessing quality, potential biases, and appropriateness for specific use cases.
Required Fields
| Field | Type | Description |
|---|---|---|
| source_types | array | Types of sources used to construct the dataset |
Optional Fields
| Field | Type | Description |
|---|---|---|
| geography | array | Countries primarily represented (ISO 3166-1 alpha-2 codes) |
| collection_start_date | string (date) | When data collection started (ISO 8601) |
| collection_end_date | string (date) | When data collection ended (ISO 8601) |
| collection_notes | string | Description of collection process, incentives, and known biases |
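The two tables above can be checked mechanically. The following is a minimal sketch, not an official validator: the function name `check_provenance_fields` is hypothetical, and treating unlisted keys as something to report is an assumption, since this page does not say whether extra fields are allowed.

```python
# Field names are taken from the Required Fields and Optional Fields tables.
REQUIRED_FIELDS = {"source_types"}
OPTIONAL_FIELDS = {
    "geography",
    "collection_start_date",
    "collection_end_date",
    "collection_notes",
}


def check_provenance_fields(provenance: dict) -> list[str]:
    """Return human-readable problems found in a provenance object.

    Hypothetical helper: reports missing required fields and, as an
    assumption, any key that appears in neither table.
    """
    present = set(provenance)
    problems = []
    for name in sorted(REQUIRED_FIELDS - present):
        problems.append(f"missing required field: {name}")
    for name in sorted(present - REQUIRED_FIELDS - OPTIONAL_FIELDS):
        problems.append(f"unrecognized field: {name}")
    return problems


if __name__ == "__main__":
    print(check_provenance_fields({"geography": ["NG"]}))
```

An empty list means the object has every required field and no unrecognized ones.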
Enum Values
source_types
| Value | Description |
|---|---|
| web-scrape | Data scraped from websites |
| official-open-data | Government or institutional open data |
| user-upload | Data voluntarily submitted by users |
| partner-license | Data licensed from partners or vendors |
| fieldwork | Data collected through fieldwork or surveys |
| synthetic-generation | Synthetically generated data |
| other | Other source types |
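A typo such as `webscrape` is easy to miss by eye. As a rough sketch, the enum table can be turned into a lookup set; the name `invalid_source_types` is hypothetical, and the check assumes values are matched exactly, including case and hyphens.

```python
# Allowed values copied from the source_types enum table.
ALLOWED_SOURCE_TYPES = frozenset({
    "web-scrape",
    "official-open-data",
    "user-upload",
    "partner-license",
    "fieldwork",
    "synthetic-generation",
    "other",
})


def invalid_source_types(source_types: list[str]) -> list[str]:
    """Return the entries not found in the enum, preserving input order."""
    return [value for value in source_types if value not in ALLOWED_SOURCE_TYPES]


if __name__ == "__main__":
    print(invalid_source_types(["web-scrape", "webscrape"]))
```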
Example
{
  "provenance": {
    "source_types": ["web-scrape", "official-open-data"],
    "geography": ["NG", "GH", "KE"],
    "collection_start_date": "2022-01-01",
    "collection_end_date": "2024-12-31",
    "collection_notes": "Data collected from publicly available news websites and government portals. Sampling prioritized regional diversity and topic coverage. Known bias toward urban content due to greater online presence."
  }
}
Field Details
source_types
Most datasets combine multiple source types. List all that apply:
{
  "source_types": ["web-scrape", "user-upload", "synthetic-generation"]
}
geography
Use ISO 3166-1 alpha-2 country codes. Common examples:
- US - United States
- GB - United Kingdom
- NG - Nigeria
- KE - Kenya
- IN - India
- BR - Brazil
{
  "geography": ["NG", "GH", "CM", "SN"]
}
If the geographic origin is unknown or global, you can omit this field.
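A quick format check catches common mistakes such as lowercase codes or three-letter alpha-3 codes. The sketch below is illustrative only: `malformed_country_codes` is a hypothetical name, and it tests only the shape of each code (two uppercase ASCII letters), not whether the code is actually assigned in ISO 3166-1. Confirming that would require a country list, which this example deliberately leaves out.

```python
import re

# Shape of an ISO 3166-1 alpha-2 code: exactly two uppercase ASCII letters.
ALPHA2_PATTERN = re.compile(r"[A-Z]{2}")


def malformed_country_codes(geography: list) -> list:
    """Return entries that do not look like alpha-2 codes, in input order."""
    return [
        code
        for code in geography
        if not isinstance(code, str) or not ALPHA2_PATTERN.fullmatch(code)
    ]


if __name__ == "__main__":
    print(malformed_country_codes(["ng", "NGA", "KE"]))
```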
collection_notes
Use this field to document:
- Collection methodology - How was data gathered?
- Sampling strategy - How were sources selected?
- Incentive structure - Were contributors paid or incentivized?
- Community involvement - Were local communities consulted?
- Known biases - What systematic biases exist?
{
  "collection_notes": "Data collected through community volunteer network across 12 Nigerian states. Contributors received mobile airtime credits per verified submission. Over-representation of Kano and Lagos states due to volunteer density. Under-representation of rural communities due to limited internet access."
}
Temporal Coverage
The collection_start_date and collection_end_date fields help users understand:
- Dataset currency (how recent is the data?)
- Temporal scope (what time period does it cover?)
- Update frequency (is this a snapshot or ongoing collection?)
{
  "collection_start_date": "2020-01-01",
  "collection_end_date": "2024-06-30"
}
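Because both fields are ISO 8601 calendar dates, they can be parsed with the Python standard library. The sketch below uses a hypothetical helper, `collection_span_days`, and makes two assumptions this page does not state: that the end date must not precede the start date, and that both endpoints count toward the collection window.

```python
from datetime import date


def collection_span_days(start: str, end: str) -> int:
    """Return the number of days covered, counting both endpoints.

    date.fromisoformat raises ValueError for strings that are not valid
    ISO 8601 dates such as "2024-06-30".
    """
    start_date = date.fromisoformat(start)
    end_date = date.fromisoformat(end)
    if end_date < start_date:
        # Assumption: a window that ends before it starts is an error.
        raise ValueError("collection_end_date is earlier than collection_start_date")
    return (end_date - start_date).days + 1


if __name__ == "__main__":
    print(collection_span_days("2020-01-01", "2024-06-30"))
```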
Multi-Source Datasets
For datasets that aggregate multiple sources with different provenance characteristics, consider using the sources section for per-source breakdowns:
{
  "provenance": {
    "source_types": ["web-scrape", "official-open-data", "fieldwork"],
    "geography": ["NG", "GH"],
    "collection_notes": "Aggregated from three sources with different collection methods."
  },
  "sources": [
    {
      "id": "news-scrape",
      "label": "News Website Scrape",
      "share_of_dataset": 0.6,
      "origin": "web-scrape"
    },
    {
      "id": "govt-data",
      "label": "Government Portal",
      "share_of_dataset": 0.25,
      "origin": "official-open-data"
    },
    {
      "id": "field-recordings",
      "label": "Field Recordings",
      "share_of_dataset": 0.15,
      "origin": "fieldwork"
    }
  ]
}
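In the example above, the `share_of_dataset` values add up to 1.0 and every `origin` also appears in `provenance.source_types`. This page does not state that either property is required, so the following sketch treats both as assumptions and simply reports where a document departs from them. The function name `check_sources` is hypothetical.

```python
import math


def check_sources(document: dict) -> list[str]:
    """Report inconsistencies between `sources` and `provenance`.

    Assumptions (not confirmed by the schema): shares should sum to 1.0,
    and each source's origin should be listed in provenance.source_types.
    """
    problems = []
    declared = set(document.get("provenance", {}).get("source_types", []))
    sources = document.get("sources", [])

    total = sum(source["share_of_dataset"] for source in sources)
    # Compare with a tolerance because the shares are floating-point values.
    if sources and not math.isclose(total, 1.0, abs_tol=1e-6):
        problems.append(f"shares sum to {total:.3f}, expected 1.000")

    for source in sources:
        if source["origin"] not in declared:
            problems.append(
                f"source {source['id']!r} has origin {source['origin']!r} "
                "that is not listed in source_types"
            )
    return problems


if __name__ == "__main__":
    example = {
        "provenance": {"source_types": ["web-scrape", "fieldwork"]},
        "sources": [
            {"id": "news-scrape", "share_of_dataset": 0.6, "origin": "web-scrape"},
            {"id": "field-recordings", "share_of_dataset": 0.4, "origin": "fieldwork"},
        ],
    }
    print(check_sources(example))
```

An empty list means the per-source breakdown is consistent with the top-level provenance under these assumptions.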
See Also
- Sources Section - Per-source breakdown for multi-origin datasets
- Community Section - Local community involvement details
- Synthetic Section - Details for synthetic data generation