Sources Section
The sources section provides detailed per-source breakdowns for datasets that aggregate multiple origins or contributors. This is useful for understanding the composition of complex datasets.
Array of Source Objects
Each source object has:
Required Fields
| Field | Type | Description |
|---|---|---|
| id | string | Machine-friendly source identifier |
| label | string | Human-readable source label |
Optional Fields
| Field | Type | Description |
|---|---|---|
| share_of_dataset | number | Approximate fraction of examples (0-1) |
| data_kind | enum | Whether this source is “real” or “synthetic” |
| origin | enum | Origin type for this source |
| contributor | string | Key contributor name |
| rights | object | Source-level rights info |
Rights Object (per-source)
| Field | Type | Description |
|---|---|---|
| license | string | License if different from dataset-level |
| contains_personal_data | enum | Personal data level for this source |
Enum Values
data_kind
- real - Real-world data
- synthetic - Synthetically generated
origin
- web-scrape - Scraped from websites
- official-open-data - Government/institutional data
- user-upload - User-submitted content
- partner-license - Licensed from partners
- fieldwork - Collected through fieldwork
- synthetic-generation - Generated synthetically
- other - Other origin
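The field tables and enum values above can be checked mechanically. The following is a minimal Python sketch, not an official validator: the function name `validate_source` and the wording of the messages are assumptions made for illustration, and the check covers only the required fields, the two enums listed here, and the 0-1 range of share_of_dataset.

```python
# Illustrative sketch only; names and messages are hypothetical.
REQUIRED_FIELDS = ("id", "label")
DATA_KINDS = ("real", "synthetic")
ORIGINS = (
    "web-scrape", "official-open-data", "user-upload", "partner-license",
    "fieldwork", "synthetic-generation", "other",
)

def validate_source(source: dict) -> list:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    # Required fields must be present and be non-empty strings.
    for field in REQUIRED_FIELDS:
        value = source.get(field)
        if not isinstance(value, str) or not value:
            problems.append(f"missing or non-string required field: {field}")
    # Optional enum fields are checked only when present.
    if "data_kind" in source and source["data_kind"] not in DATA_KINDS:
        problems.append(f"unknown data_kind: {source['data_kind']!r}")
    if "origin" in source and source["origin"] not in ORIGINS:
        problems.append(f"unknown origin: {source['origin']!r}")
    # share_of_dataset is a fraction between 0 and 1 (booleans are rejected).
    share = source.get("share_of_dataset")
    if share is not None:
        is_number = isinstance(share, (int, float)) and not isinstance(share, bool)
        if not is_number or not 0 <= share <= 1:
            problems.append("share_of_dataset must be a number between 0 and 1")
    return problems
```

Calling `validate_source` on each entry of the sources array and collecting the results gives a quick pre-publication check; the contains_personal_data enum is not checked here because its values are not listed in this section.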
Example
{
  "sources": [
    {
      "id": "hausa-news-scrape",
      "label": "Nigerian News Websites",
      "share_of_dataset": 0.45,
      "data_kind": "real",
      "origin": "web-scrape",
      "rights": {
        "license": "CC-BY-4.0",
        "contains_personal_data": "de_minimis"
      }
    },
    {
      "id": "govt-publications",
      "label": "Nigerian Government Publications",
      "share_of_dataset": 0.2,
      "data_kind": "real",
      "origin": "official-open-data",
      "contributor": "Nigerian National Archives",
      "rights": {
        "license": "public-domain",
        "contains_personal_data": "none"
      }
    },
    {
      "id": "community-contributions",
      "label": "Community Volunteer Submissions",
      "share_of_dataset": 0.25,
      "data_kind": "real",
      "origin": "user-upload",
      "contributor": "Hausa Language Community Network",
      "rights": {
        "license": "CC-BY-SA-4.0",
        "contains_personal_data": "pseudonymous"
      }
    },
    {
      "id": "backtranslation-augment",
      "label": "Backtranslation Augmentation",
      "share_of_dataset": 0.1,
      "data_kind": "synthetic",
      "origin": "synthetic-generation",
      "rights": {
        "license": "CC-BY-4.0",
        "contains_personal_data": "none"
      }
    }
  ]
}
Field Details
id
Use consistent, machine-friendly identifiers:
{
  "id": "news-scrape-2024"
}
Good patterns:
- source-type-year: news-2024, web-2023
- contributor-id: partner-acme, community-west
- descriptive: legal-docs, social-media
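The naming patterns above all use lowercase words and digits joined by hyphens. The schema text does not state a formal rule for id, so the regular expression in this Python sketch is an assumption inferred from the examples, and `is_machine_friendly` is a hypothetical helper name.

```python
import re

# Assumed convention (not stated by the schema): lowercase letters and digits
# in one or more segments separated by single hyphens.
ID_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")

def is_machine_friendly(source_id: str) -> bool:
    """Return True if the whole id matches the assumed hyphenated pattern."""
    return ID_PATTERN.fullmatch(source_id) is not None
```

Under this assumption, ids such as news-scrape-2024 and partner-acme pass, while ids containing spaces, uppercase letters, or leading hyphens do not.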
share_of_dataset
Record the approximate fraction of the dataset's examples that come from this source, as a number between 0 and 1:
{
  "share_of_dataset": 0.35
}
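Because each share is a fraction of the same dataset, the shares across all sources should not add up to noticeably more than 1. The schema describes the values as approximate and does not state a sum rule, so the tolerance and both function names in this Python sketch are assumptions.

```python
import math

def total_share(sources: list) -> float:
    """Sum the declared shares; sources without share_of_dataset count as 0."""
    return math.fsum(s.get("share_of_dataset", 0.0) for s in sources)

def shares_look_consistent(sources: list, tolerance: float = 0.05) -> bool:
    """Return True if the declared shares do not exceed 1 by more than tolerance.

    The default tolerance is an arbitrary allowance for rounding in
    approximate values, not something the schema defines.
    """
    return total_share(sources) <= 1.0 + tolerance
```

A total well below 1 is not flagged, since some sources may simply omit share_of_dataset.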
Per-Source Rights
When sources have different licenses:
{
  "sources": [
    {
      "id": "open-source",
      "label": "Open Source Component",
      "rights": {
        "license": "Apache-2.0"
      }
    },
    {
      "id": "cc-content",
      "label": "Creative Commons Content",
      "rights": {
        "license": "CC-BY-4.0"
      }
    }
  ]
}
The overall dataset license (in rights.license) should be the most restrictive applicable license, or a custom license addressing all components.
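Since a per-source license is only given when it differs from the dataset-level one, a consumer can resolve the license that applies to each source by falling back to rights.license. This Python sketch assumes that fallback behaviour (the schema implies it but does not spell it out), and `effective_license` and `licenses_in_use` are hypothetical helper names.

```python
from typing import Optional

def effective_license(source: dict, dataset_rights: dict) -> Optional[str]:
    """Return the source's own license, or the dataset-level one if none is set."""
    source_license = source.get("rights", {}).get("license")
    if source_license is not None:
        return source_license
    return dataset_rights.get("license")

def licenses_in_use(record: dict) -> set:
    """Collect the distinct effective licenses across all sources of a record."""
    dataset_rights = record.get("rights", {})
    return {effective_license(s, dataset_rights) for s in record.get("sources", [])}
```

Listing the distinct licenses this way shows which components the dataset-level license has to cover; choosing the most restrictive one still requires human or legal judgment.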
When to Use Sources
Good Use Cases
- Multi-contributor datasets
  {
    "sources": [
      { "id": "contributor-a", "contributor": "University A" },
      { "id": "contributor-b", "contributor": "University B" }
    ]
  }
- Mixed real/synthetic data
  {
    "sources": [
      { "id": "real-data", "data_kind": "real" },
      { "id": "augmented", "data_kind": "synthetic" }
    ]
  }
- Multiple licenses
  {
    "sources": [
      { "id": "open", "rights": { "license": "CC0-1.0" } },
      { "id": "attributed", "rights": { "license": "CC-BY-4.0" } }
    ]
  }
- Different personal data levels
  {
    "sources": [
      { "id": "public-docs", "rights": { "contains_personal_data": "none" } },
      { "id": "user-content", "rights": { "contains_personal_data": "pseudonymous" } }
    ]
  }
When Not Needed
Skip sources when:
- Single homogeneous source
- All data has same license and characteristics
- Source breakdown would add no useful information
Relationship with Provenance
The provenance section provides overall dataset provenance. The sources section provides per-component detail:
{
  "provenance": {
    "source_types": ["web-scrape", "official-open-data", "user-upload"],
    "geography": ["NG", "GH"],
    "collection_notes": "Aggregated from three source types..."
  },
  "sources": [
    {
      "id": "web",
      "origin": "web-scrape",
      "share_of_dataset": 0.5
    },
    {
      "id": "govt",
      "origin": "official-open-data",
      "share_of_dataset": 0.3
    },
    {
      "id": "community",
      "origin": "user-upload",
      "share_of_dataset": 0.2
    }
  ]
}
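In the example above, every origin used in sources also appears in provenance.source_types, and vice versa. The schema does not say that the two must agree, so the following Python sketch treats agreement as an assumed convention and only reports differences; `origin_mismatches` is a hypothetical helper name.

```python
def origin_mismatches(record: dict) -> dict:
    """Compare provenance.source_types with the origins declared per source."""
    declared = set(record.get("provenance", {}).get("source_types", []))
    used = {s["origin"] for s in record.get("sources", []) if "origin" in s}
    return {
        # Origins used by a source but absent from the provenance summary.
        "missing_from_provenance": sorted(used - declared),
        # Source types summarised in provenance but not used by any source.
        "missing_from_sources": sorted(declared - used),
    }
```

Two empty lists mean the summary and the per-source detail tell the same story; a non-empty list is a prompt to review the record, not necessarily an error.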
See Also
- Provenance Section - Overall data origin
- Rights Section - Dataset-level licensing
- Synthetic Section - Synthetic data details