Personal Data & Legal Basis
This guide explains how to properly document personal data in your dataset, including privacy levels, lawful bases (consent, ToS, licenses, etc.), and best practices for compliance.
Personal Data Levels
The contains_personal_data field categorizes the level of personal information in your dataset:
| Level | Description | Legal Basis Required? |
|---|---|---|
| none | No personal data | No |
| de_minimis | Minimal, incidental personal data | Yes |
| pseudonymous | Data about individuals with direct identifiers removed or replaced | Yes |
| direct | Directly identifying information | Yes |
None
Use none when your dataset contains no personal information:
{
  "rights": {
    "contains_personal_data": "none"
  }
}
Examples:
- Synthetic data with fictional entities
- Technical documentation
- Code repositories (without author info)
- Aggregated statistics
De Minimis
Use de_minimis for minimal, incidental personal data that poses negligible privacy risk:
{
  "rights": {
    "contains_personal_data": "de_minimis",
    "legal_basis": "publicly-available"
  }
}
Examples:
- Public figures mentioned in news articles
- Historical figures in educational content
- Incidental mentions of public officials
- Widely published public information
Pseudonymous
Use pseudonymous when the data relates to individuals but direct identifiers have been removed or replaced, so that re-identification would require additional information:
{
  "rights": {
    "contains_personal_data": "pseudonymous",
    "legal_basis": "explicit-consent",
    "legal_basis_notes": "Data collected under IRB approval with informed consent. All identifiers replaced with random IDs. Re-identification risk assessed as low."
  }
}
Examples:
- De-identified medical records
- User behavior data with hashed IDs
- Survey responses with demographics
- Social media data with usernames removed
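For the "hashed IDs" case above, a plain unsalted hash of a low-entropy identifier (an email address or a sequential user ID) can often be reversed by hashing candidate values and comparing. One common alternative is a keyed hash. The sketch below is an illustration, not a method prescribed by this guide; the function name `pseudonymize_id` and the key handling are assumptions.

```python
import hashlib
import hmac


def pseudonymize_id(user_id: str, secret_key: bytes) -> str:
    """Map a user ID to a stable pseudonym using HMAC-SHA256.

    Assumption: secret_key is stored separately from the dataset.
    The same (user_id, secret_key) pair always yields the same pseudonym,
    so records belonging to one user can still be joined.
    """
    digest = hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()
```

Anyone holding the key can recompute the mapping, which is why data treated this way is pseudonymous rather than anonymous. The key custody arrangement is worth recording in legal_basis_notes.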
Direct
Use direct when data directly identifies individuals:
{
  "rights": {
    "contains_personal_data": "direct",
    "legal_basis": "explicit-consent",
    "legal_basis_notes": "All participants provided written informed consent for data collection and research use. Consent forms available upon request to approved researchers."
  }
}
Examples:
- Named interview transcripts
- Biometric data (faces, voices)
- Contact information
- Government ID numbers
Legal Basis
When contains_personal_data is not none or child_data is true, you must provide legal_basis. Pick the enum
value that matches your situation and add legal_basis_notes for details (required when legal_basis is other):
{
  "rights": {
    "contains_personal_data": "pseudonymous",
    "legal_basis": "explicit-consent",
    "legal_basis_notes": "Description of how consent was obtained, what uses were disclosed, and withdrawal rights."
  }
}
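The conditional rule in this section can be checked before publishing. The following is a minimal sketch, not the project's own validator: the function name `missing_rights_fields` is hypothetical, and treating an absent contains_personal_data as none is an assumption. Only the field names and the other value come from this guide.

```python
def missing_rights_fields(rights: dict) -> list[str]:
    """Return the rights fields that the rules above require but that are absent or empty."""
    missing = []
    # Assumption: a missing contains_personal_data is treated as "none".
    level = rights.get("contains_personal_data", "none")
    has_child_data = rights.get("child_data", False) is True

    # legal_basis is required when the level is not "none" or child_data is true.
    if (level != "none" or has_child_data) and not rights.get("legal_basis"):
        missing.append("legal_basis")

    # legal_basis_notes is required when legal_basis is "other".
    if rights.get("legal_basis") == "other" and not rights.get("legal_basis_notes"):
        missing.append("legal_basis_notes")

    return missing
```

An empty list means the two rules described here are satisfied. It does not confirm that the chosen legal_basis value is a valid enum member.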
What to Include
Use legal_basis_notes to provide concise evidence:
Collection Method
How was data collected? (explicit consent, ToS acceptance, public sources, license)
Scope of Use
Which uses are permitted (research, commercial training, redistribution)?
Review Process
Oversight or approvals (IRB, ethics review, legal review)
Rights & Withdrawal
How subjects can withdraw, retention period, deletion process
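If notes are assembled programmatically, the four parts above can be joined into a single string. This is only a sketch under stated assumptions: the helper `build_legal_basis_notes` and its parameter names are hypothetical and simply mirror the headings in this section.

```python
def build_legal_basis_notes(
    collection_method: str,
    scope_of_use: str,
    review_process: str = "",
    rights_and_withdrawal: str = "",
) -> str:
    """Join the non-empty parts into one notes string, one sentence per part."""
    sentences = []
    for part in (collection_method, scope_of_use, review_process, rights_and_withdrawal):
        text = part.strip()
        if not text:
            continue  # skip parts that were not provided
        if not text.endswith("."):
            text += "."
        sentences.append(text)
    return " ".join(sentences)
```

Keeping the parts separate until the last step makes it easier to spot a record where, for example, the withdrawal process was never filled in.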
Legal Basis Examples
IRB-Approved Research
"legal_basis": "explicit-consent",
"legal_basis_notes": "Data collected under IRB protocol #2024-001. Participants provided written consent covering research and model training. Withdrawal honored within 30 days."
Terms of Service
"legal_basis": "terms-of-service",
"legal_basis_notes": "Users accepted ToS (Section 4.2) permitting anonymized data use for research and model training. Opt-out available at datapass.meetkai.ai/legal/privacy."
Public Domain Collection
"legal_basis": "public-domain",
"legal_basis_notes": "Data sourced from public domain materials with no reasonable expectation of privacy. Collection limited to publicly posted content."
Licensed Dataset
"legal_basis": "licensed",
"legal_basis_notes": "Data licensed from Example Corp under DLA-2024-789. Original consent handled by licensor; sublicense permits research/model training with attribution."
Child Data
The child_data field indicates whether your dataset contains data from minors (under 18):
{
  "rights": {
    "contains_personal_data": "direct",
    "child_data": true,
    "legal_basis": "explicit-consent",
    "legal_basis_notes": "Parental consent obtained for all participants under 18. COPPA-compliant collection process with verified parental authorization."
  }
}
Child Data Considerations
- Requires parental or guardian consent
- Subject to stricter regulations (for example COPPA in the United States and GDPR Article 8, often called GDPR-K)
- Should document age verification methods
- Consider additional access restrictions
Restricted Uses
Document uses that are prohibited due to privacy concerns:
{
  "rights": {
    "contains_personal_data": "pseudonymous",
    "legal_basis": "explicit-consent",
    "legal_basis_notes": "...",
    "restricted_uses": [
      "Re-identification of individuals",
      "Profiling or surveillance",
      "Marketing to individuals",
      "Sharing with third parties without approval"
    ]
  }
}
Compliance Frameworks
GDPR (European Union)
For GDPR compliance, document the lawful basis, processing purposes, retention period, and data subject rights:
{
  "rights": {
    "contains_personal_data": "pseudonymous",
    "legal_basis": "explicit-consent",
    "legal_basis_notes": "Processed under GDPR Article 6(1)(a) with explicit consent. Subjects informed of purposes, 5-year retention, erasure rights, and complaint process. DPO: dpo@example.org."
  }
}
Key GDPR requirements:
- Explicit consent for sensitive data
- Right to erasure (deletion requests)
- Data minimization
- Purpose limitation
- Documented lawful basis
HIPAA (United States Healthcare)
For HIPAA-covered health data:
{
  "rights": {
    "contains_personal_data": "pseudonymous",
    "legal_basis": "explicit-consent",
    "legal_basis_notes": "Data de-identified per HIPAA Safe Harbor (45 CFR 164.514(b)(2)). All 18 PHI identifiers removed. Expert determination available upon request. Original data collected under patient authorization for research use."
  }
}
CCPA (California)
For California consumer data:
{
  "rights": {
    "contains_personal_data": "pseudonymous",
    "legal_basis": "terms-of-service",
    "legal_basis_notes": "CCPA-compliant notice at collection. Consumers provided right to know, delete, and opt-out of sale. No personal information sold. Privacy policy: datapass.meetkai.ai/legal/privacy"
  }
}
De-Identification Best Practices
When documenting de-identified data:
1. State the method - HIPAA Safe Harbor, k-anonymity, differential privacy
2. List removed identifiers - Names, addresses, dates, IDs, etc.
3. Assess re-identification risk - Low, medium, or describe assessment
4. Note any retained quasi-identifiers - Demographics, locations, dates
{
  "rights": {
    "contains_personal_data": "pseudonymous",
    "legal_basis": "explicit-consent",
    "legal_basis_notes": "De-identification performed using k-anonymity (k=5). Removed: names, exact addresses, dates of birth, SSN, phone numbers, email. Retained: age ranges (5-year bins), state-level geography, gender. Re-identification risk assessed as low."
  },
  "safety": {
    "content_risk_level": "medium",
    "known_risky_categories": ["other"],
    "risky_categories_notes": "Residual privacy risk from quasi-identifiers.",
    "mitigations": "Quasi-identifiers generalized to prevent linkage attacks. Annual privacy audit scheduled."
  }
}
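A claim such as "k-anonymity (k=5)" in the example above can be spot-checked by counting how many records share each combination of retained quasi-identifiers. The sketch below assumes records are plain dictionaries. The function names and the column names used in the usage lines are hypothetical, and it checks only the basic k-anonymity property, not stronger variants.

```python
from collections import Counter


def age_bin(age: int, width: int = 5) -> str:
    """Generalize an exact age into a fixed-width range, e.g. 32 becomes '30-34'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def smallest_group_size(records: list[dict], quasi_identifiers: list[str]) -> int:
    """Return the size of the smallest group of records sharing the same
    quasi-identifier values (0 for an empty dataset)."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values()) if groups else 0


def is_k_anonymous(records: list[dict], quasi_identifiers: list[str], k: int = 5) -> bool:
    """True when every quasi-identifier combination appears in at least k records."""
    return smallest_group_size(records, quasi_identifiers) >= k


# Usage with hypothetical column names matching the retained fields above.
records = [
    {"age_range": age_bin(a), "state": "CA", "gender": "F"}
    for a in (30, 31, 32, 33, 34)
]
ok = is_k_anonymous(records, ["age_range", "state", "gender"], k=5)
```

Reporting the smallest group size alongside the chosen k gives reviewers a concrete figure behind a "low" risk assessment.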
Decision Tree
Use this guide to determine your personal data level:
Does the data contain any information about identifiable individuals?
├── No → contains_personal_data: "none"
└── Yes → Is it incidental mentions of public figures only?
    ├── Yes → contains_personal_data: "de_minimis"
    │         (requires legal_basis)
    └── No → Has identifying information been removed?
        ├── Yes → contains_personal_data: "pseudonymous"
        │         (requires legal_basis)
        └── No → contains_personal_data: "direct"
                 (requires legal_basis)
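The decision tree above translates directly into a small function. This is an illustrative sketch: the function and parameter names are hypothetical, and the three boolean answers still have to come from a human review of the dataset.

```python
def personal_data_level(
    has_identifiable_info: bool,
    only_incidental_public_figures: bool = False,
    identifiers_removed: bool = False,
) -> str:
    """Walk the decision tree and return a contains_personal_data value."""
    if not has_identifiable_info:
        return "none"
    if only_incidental_public_figures:
        return "de_minimis"
    if identifiers_removed:
        return "pseudonymous"
    return "direct"


def requires_legal_basis(level: str) -> bool:
    """Every level except "none" requires legal_basis, per the table in this guide."""
    return level != "none"
```

The questions are evaluated in the same order as the tree, so a dataset that mentions only public figures in passing is classified as de_minimis even if some identifiers were also removed.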
Related Documentation
- Rights Reference - All rights section fields
- Safety Reference - Risk assessment fields
- Restricted Access Example - Medical imaging with PII
- Conditional Rules - All if/then requirements