Personal Data & Consent
This guide explains how to properly document personal data in your dataset, including privacy levels, consent requirements, and best practices for compliance.
Personal Data Levels
The contains_personal_data field categorizes the level of personal information in your dataset:
| Level | Description | Consent Required? |
|---|---|---|
| none | No personal data | No |
| de_minimis | Minimal, incidental personal data | No |
| pseudonymous | Data about individuals with direct identifiers removed or replaced | Yes |
| direct | Directly identifying information | Yes |
None
Use none when your dataset contains no personal information:
{
  "rights": {
    "contains_personal_data": "none"
  }
}
Examples:
- Synthetic data with fictional entities
- Technical documentation
- Code repositories (without author info)
- Aggregated statistics
De Minimis
Use de_minimis for minimal, incidental personal data that poses negligible privacy risk:
{
  "rights": {
    "contains_personal_data": "de_minimis"
  }
}
Examples:
- Public figures mentioned in news articles
- Historical figures in educational content
- Incidental mentions of public officials
- Widely published public information
Pseudonymous
Use pseudonymous when the data relates to individuals but direct identifiers have been removed or replaced (for example, with random IDs or hashes):
{
  "rights": {
    "contains_personal_data": "pseudonymous",
    "consent_mechanism": "Data collected under IRB approval with informed consent. All identifiers replaced with random IDs. Re-identification risk assessed as low."
  }
}
Examples:
- De-identified medical records
- User behavior data with hashed IDs
- Survey responses with demographics
- Social media data with usernames removed
Direct
Use direct when data directly identifies individuals:
{
  "rights": {
    "contains_personal_data": "direct",
    "consent_mechanism": "All participants provided written informed consent for data collection and research use. Consent forms available upon request to approved researchers."
  }
}
Examples:
- Named interview transcripts
- Biometric data (faces, voices)
- Contact information
- Government ID numbers
Consent Mechanism
When contains_personal_data is pseudonymous or direct, you must provide a consent_mechanism explaining how consent was obtained:
{
  "rights": {
    "contains_personal_data": "pseudonymous",
    "consent_mechanism": "Description of how consent was obtained..."
  }
}
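The rule above can be checked mechanically. The following is a minimal sketch, assuming the metadata has been loaded into a Python dict and treating the "Consent Required?" column of the levels table as the source of truth. The function name `check_consent` is hypothetical and not part of any official tooling.

```python
# Hypothetical helper for illustration; not an official validator.
# Levels marked "Yes" in the "Consent Required?" column of the levels table.
CONSENT_REQUIRED_LEVELS = {"pseudonymous", "direct"}
VALID_LEVELS = {"none", "de_minimis"} | CONSENT_REQUIRED_LEVELS


def check_consent(metadata: dict) -> list[str]:
    """Return a list of problems found in the rights section (empty if none)."""
    rights = metadata.get("rights", {})
    level = rights.get("contains_personal_data")
    problems = []
    if level not in VALID_LEVELS:
        problems.append(f"unknown contains_personal_data level: {level!r}")
    elif level in CONSENT_REQUIRED_LEVELS:
        mechanism = rights.get("consent_mechanism")
        # A missing, non-string, or whitespace-only description counts as absent.
        if not isinstance(mechanism, str) or not mechanism.strip():
            problems.append(
                f"consent_mechanism is required when level is {level!r}"
            )
    return problems
```

An empty list means the rights section passes this particular check; it says nothing about whether the description itself is adequate.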
What to Include
A good consent mechanism description should address:
Collection Method
How was consent obtained? (opt-in, terms acceptance, written form)
Scope of Consent
What uses did subjects consent to? (research, commercial, etc.)
Review Process
Was collection reviewed? (IRB, ethics board, legal review)
Data Subject Rights
Can subjects withdraw? How to contact for removal?
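One way to make sure all four points are covered is to assemble the description from separate parts. The sketch below is illustrative only: the `ConsentSummary` class and its field names are assumptions that mirror the four questions above, not part of the schema.

```python
from dataclasses import dataclass


@dataclass
class ConsentSummary:
    # Illustrative field names; one per question in "What to Include".
    collection_method: str
    scope: str
    review_process: str
    subject_rights: str

    def to_text(self) -> str:
        """Join the four parts into a single consent_mechanism string."""
        parts = {
            "collection_method": self.collection_method,
            "scope": self.scope,
            "review_process": self.review_process,
            "subject_rights": self.subject_rights,
        }
        missing = [name for name, text in parts.items() if not text.strip()]
        if missing:
            raise ValueError(f"missing consent details: {', '.join(missing)}")
        # Normalize each part to end with exactly one period.
        return " ".join(text.strip().rstrip(".") + "." for text in parts.values())
```

The resulting string can be placed directly in the consent_mechanism field.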
Consent Mechanism Examples
IRB-Approved Research
"consent_mechanism": "Data collected under IRB protocol #2024-001 at Example University. All participants provided written informed consent. Participants were informed of research purposes, data retention policies, and their right to withdraw. IRB approval documentation available to verified researchers."
Terms of Service
"consent_mechanism": "Users consented to data collection via platform Terms of Service (Section 4.2) and Privacy Policy. Users were notified that anonymized data may be used for research. Opt-out mechanism available at datapass.meetkai.ai/legal/privacy."
Public Domain Collection
"consent_mechanism": "Data sourced from public domain materials with no reasonable expectation of privacy. Collection limited to publicly posted content. No private communications included."
Licensed Dataset
"consent_mechanism": "Data licensed from Example Corp under Data License Agreement #DLA-2024-789. Original consent obtained by licensor. Sublicensing terms permit research use with attribution."
Child Data
The child_data field indicates whether your dataset contains data from minors (under 18):
{
  "rights": {
    "contains_personal_data": "direct",
    "child_data": true,
    "consent_mechanism": "Parental consent obtained for all participants under 18. COPPA-compliant collection process with verified parental authorization."
  }
}
Child Data Considerations
- Requires parental or guardian consent
- Subject to stricter regulations (COPPA, GDPR-K)
- Should document age verification methods
- Consider additional access restrictions
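As a rough illustration of the first consideration, a lint-style check could warn when child_data is true but the consent description never mentions a parent or guardian. This is a keyword heuristic under assumed names (`child_data_warnings`, `GUARDIAN_KEYWORDS`), not an official rule, and it cannot judge whether the consent process was actually adequate.

```python
# Heuristic only: the keyword list is an assumption, not part of the schema.
GUARDIAN_KEYWORDS = ("parent", "guardian")


def child_data_warnings(rights: dict) -> list[str]:
    """Return warnings for a rights section that declares child_data: true."""
    if not rights.get("child_data"):
        return []
    mechanism = str(rights.get("consent_mechanism", "")).lower()
    warnings = []
    if not any(word in mechanism for word in GUARDIAN_KEYWORDS):
        warnings.append(
            "child_data is true but consent_mechanism does not mention "
            "parental or guardian consent"
        )
    return warnings
```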
Restricted Uses
Document uses that are prohibited due to privacy concerns:
{
  "rights": {
    "contains_personal_data": "pseudonymous",
    "consent_mechanism": "...",
    "restricted_uses": [
      "Re-identification of individuals",
      "Profiling or surveillance",
      "Marketing to individuals",
      "Sharing with third parties without approval"
    ]
  }
}
Compliance Frameworks
GDPR (European Union)
For GDPR compliance, document the lawful basis, retention period, and data subject rights:
{
  "rights": {
    "contains_personal_data": "pseudonymous",
    "consent_mechanism": "Data processed under GDPR Article 6(1)(a) with explicit consent. Data subjects informed of: processing purposes, retention period (5 years), right to erasure, and right to lodge complaints with supervisory authority. DPO contact: dpo@example.org"
  }
}
Key GDPR requirements:
- Explicit consent for sensitive data
- Right to erasure (deletion requests)
- Data minimization
- Purpose limitation
- Documented lawful basis
HIPAA (United States Healthcare)
For HIPAA-covered health data:
{
  "rights": {
    "contains_personal_data": "pseudonymous",
    "consent_mechanism": "Data de-identified per HIPAA Safe Harbor method (45 CFR 164.514). All 18 PHI identifiers removed. Expert determination available upon request. Original data collected under patient authorization for research use."
  }
}
CCPA (California)
For California consumer data:
{
  "rights": {
    "contains_personal_data": "pseudonymous",
    "consent_mechanism": "Data collected with CCPA-compliant notice at collection. Consumers provided right to know, delete, and opt-out of sale. No personal information sold. Privacy policy: datapass.meetkai.ai/legal/privacy"
  }
}
De-Identification Best Practices
When documenting de-identified data:
1. State the method - HIPAA Safe Harbor, k-anonymity, differential privacy
2. List removed identifiers - Names, addresses, dates, IDs, etc.
3. Assess re-identification risk - Low, medium, or describe assessment
4. Note any retained quasi-identifiers - Demographics, locations, dates
{
  "rights": {
    "contains_personal_data": "pseudonymous",
    "consent_mechanism": "De-identification performed using k-anonymity (k=5). Removed: names, exact addresses, dates of birth, SSN, phone numbers, email. Retained: age ranges (5-year bins), state-level geography, gender. Re-identification risk assessed by privacy team as low."
  },
  "safety": {
    "content_risk_level": "medium",
    "known_risky_categories": ["personal-information"],
    "mitigations": "Quasi-identifiers generalized to prevent linkage attacks. Annual privacy audit scheduled."
  }
}
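A claim such as "k-anonymity (k=5)" can be spot-checked before it is written into the metadata. The sketch below assumes records are held as a list of dicts and that the retained quasi-identifiers are known by column name; the function name and the column names in the usage example are illustrative, not prescribed by the schema.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k=5):
    """Check that every combination of quasi-identifier values occurs at least k times."""
    # Group records by their tuple of quasi-identifier values.
    group_sizes = Counter(
        tuple(record[name] for name in quasi_identifiers) for record in records
    )
    # An empty dataset has no groups and is treated as trivially k-anonymous.
    return all(size >= k for size in group_sizes.values())


# Usage with assumed column names matching the retained fields in the example.
sample = [{"age_range": "30-34", "state": "CA", "gender": "F"} for _ in range(5)]
print(is_k_anonymous(sample, ["age_range", "state", "gender"], k=5))
```

A passing check covers only the listed quasi-identifiers; it does not by itself establish that re-identification risk is low.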
Decision Tree
Use this guide to determine your personal data level:
Does the data contain any information about identifiable individuals?
├── No → contains_personal_data: "none"
└── Yes → Is it incidental mentions of public figures only?
    ├── Yes → contains_personal_data: "de_minimis"
    └── No → Has identifying information been removed?
        ├── Yes → contains_personal_data: "pseudonymous"
        │         (requires consent_mechanism)
        └── No → contains_personal_data: "direct"
                  (requires consent_mechanism)
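The same decision tree can be written as a small function. This is a sketch for illustration; the function and parameter names are hypothetical, and the three yes/no answers still have to come from a human review of the dataset.

```python
def personal_data_level(
    has_identifiable_info: bool,
    only_incidental_public_figures: bool,
    identifiers_removed: bool,
) -> str:
    """Map the three decision-tree answers to a contains_personal_data value."""
    if not has_identifiable_info:
        return "none"
    if only_incidental_public_figures:
        return "de_minimis"
    if identifiers_removed:
        return "pseudonymous"  # requires consent_mechanism
    return "direct"  # requires consent_mechanism
```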
Related Documentation
- Rights Reference - All rights section fields
- Safety Reference - Risk assessment fields
- Restricted Access Example - Medical imaging with PII
- Conditional Rules - All if/then requirements