A hybrid question-answering dataset combining real-world questions with synthetic answers, covering 10 languages with emphasis on factual knowledge and reading comprehension.
Submit a data access request form at the dataset URL. Academic researchers are typically approved within 48 hours. Commercial entities require a separate license agreement.
Provenance
Source Types
crowdsourced synthetic-generation
Geography
global
Collection Period
2024-01-01 - 2024-10-15
Notes
Questions were crowdsourced from native speakers. A subset of answers was generated using GPT-4 and validated by human annotators.
Intended Use
Intended Uses
Multilingual QA system training
Reading comprehension research
Cross-lingual transfer learning
Out of Scope
Medical or legal advice systems
Production systems without human oversight
Data Characteristics
Size
Examples
250,000
Tokens
45,000,000
Format
Record Format
jsonl
Data Kind
hybrid
Languages
en es fr de zh-Hans ja ar pt ru hi
Modalities
text
Domains
general scientific technical
Task Types
question-answering text-generation
Synthetic Data
Share of Dataset
35%
Generation Method
GPT-4 was used to generate answers for crowdsourced questions, with human validation and correction.
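The card gives the record format as jsonl and the synthetic share as 35%, but it does not document the record schema. The sketch below shows one way a user might check the synthetic share of a local copy. It assumes each record carries a boolean field named `is_synthetic`; that field name, the `question` and `answer` keys, and the sample records are all hypothetical and are not taken from the dataset.

```python
import json
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def synthetic_share(lines: Iterable[str], flag_field: str = "is_synthetic") -> float:
    """Return the fraction of records whose flag field is truthy.

    `flag_field` is a hypothetical name; replace it with whatever
    field the real schema uses to mark synthetic answers.
    """
    total = 0
    synthetic = 0
    for record in iter_records(lines):
        total += 1
        if record.get(flag_field):
            synthetic += 1
    return synthetic / total if total else 0.0


# Made-up sample records, for illustration only.
sample = [
    '{"question": "q1", "answer": "a1", "is_synthetic": true}',
    '{"question": "q2", "answer": "a2", "is_synthetic": false}',
    '{"question": "q3", "answer": "a3", "is_synthetic": false}',
    '{"question": "q4", "answer": "a4", "is_synthetic": true}',
]
share = synthetic_share(sample)
```

For a file on disk, an open file handle can be passed in place of the list, since iterating a text file yields one line at a time.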