Personal Data & Legal Basis

This guide explains how to document personal data in your dataset: the personal data level, the legal basis for using it (consent, terms of service, license, and so on), and best practices for compliance.

Personal Data Levels

The contains_personal_data field categorizes the level of personal information in your dataset:

Level	Description	Legal Basis Required?
`none`	No personal data	No
`de_minimis`	Minimal, incidental personal data	Yes
`pseudonymous`	Direct identifiers removed or replaced; re-identification still possible	Yes
`direct`	Directly identifying information	Yes

None

Use none when your dataset contains no personal information:

{
  "rights": {
    "contains_personal_data": "none"
  }
}

Examples:

Synthetic data with fictional entities
Technical documentation
Code repositories (without author info)
Aggregated statistics

De Minimis

Use de_minimis for minimal, incidental personal data that poses negligible privacy risk:

{
  "rights": {
    "contains_personal_data": "de_minimis",
    "legal_basis": "publicly-available"
  }
}

Examples:

Public figures mentioned in news articles
Historical figures in educational content
Incidental mentions of public officials
Widely published public information

Pseudonymous

Use pseudonymous when direct identifiers have been removed or replaced, but the remaining data could still be linked back to individuals:

{
  "rights": {
    "contains_personal_data": "pseudonymous",
    "legal_basis": "explicit-consent",
    "legal_basis_notes": "Data collected under IRB approval with informed consent. All identifiers replaced with random IDs. Re-identification risk assessed as low."
  }
}

Examples:

De-identified medical records
User behavior data with hashed IDs
Survey responses with demographics
Social media data with usernames removed
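
The notes in the example above mention replacing identifiers with random IDs. A minimal sketch of that step in Python follows; the function name `pseudonymize` and the `user_id` field are hypothetical and not part of the schema. Random IDs are used here rather than plain hashes, since unsalted hashes of guessable values (emails, phone numbers) can be reversed by brute force.

```python
import secrets


def pseudonymize(records, id_field="user_id"):
    """Replace each distinct identifier with a random ID, consistently.

    Returns the rewritten records and the original-to-random mapping.
    The mapping is a re-identification key: store it separately under
    access control, or destroy it.
    """
    mapping = {}
    rewritten = []
    for record in records:
        original = record[id_field]
        if original not in mapping:
            # 8 random bytes -> 16 hex characters, unrelated to the original value
            mapping[original] = secrets.token_hex(8)
        rewritten.append({**record, id_field: mapping[original]})
    return rewritten, mapping


records = [
    {"user_id": "alice", "clicks": 3},
    {"user_id": "bob", "clicks": 1},
    {"user_id": "alice", "clicks": 7},
]
rewritten, mapping = pseudonymize(records)
```

The same person keeps the same random ID across rows, so the data stays useful for analysis while the original identifier no longer appears.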

Direct

Use direct when data directly identifies individuals:

{
  "rights": {
    "contains_personal_data": "direct",
    "legal_basis": "explicit-consent",
    "legal_basis_notes": "All participants provided written informed consent for data collection and research use. Consent forms available upon request to approved researchers."
  }
}

Examples:

Named interview transcripts
Biometric data (faces, voices)
Contact information
Government ID numbers

Legal Basis

When contains_personal_data is anything other than none, or child_data is true, you must provide legal_basis. Pick the enum value that matches your situation and use legal_basis_notes for the details (the notes are required when legal_basis is other):

{
  "rights": {
    "contains_personal_data": "pseudonymous",
    "legal_basis": "explicit-consent",
    "legal_basis_notes": "Description of how consent was obtained, what uses were disclosed, and withdrawal rights."
  }
}
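
The conditional rule above can be checked mechanically before you publish. The following Python sketch is illustrative only and is not the official validator: the function name `check_legal_basis` and its messages are hypothetical, and treating a missing contains_personal_data as none is an assumption. The field names and the rule itself come from this guide.

```python
def check_legal_basis(rights: dict) -> list[str]:
    """Return a list of problems; an empty list means the rule is satisfied."""
    problems = []
    # Assumption: a missing level is treated as "none".
    level = rights.get("contains_personal_data", "none")
    needs_basis = level != "none" or rights.get("child_data") is True
    basis = rights.get("legal_basis")

    if needs_basis and not basis:
        problems.append("legal_basis is required")
    if basis == "other" and not rights.get("legal_basis_notes"):
        problems.append("legal_basis_notes is required when legal_basis is 'other'")
    return problems


# Personal data present but no legal basis given
print(check_legal_basis({"contains_personal_data": "direct"}))
```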

What to Include

Use legal_basis_notes to provide concise evidence:

Collection Method

How was data collected? (explicit consent, ToS acceptance, public sources, license)

Scope of Use

Which uses are permitted (research, commercial training, redistribution)?

Review Process

Oversight or approvals (IRB, ethics review, legal review)

Rights & Withdrawal

How subjects can withdraw, retention period, deletion process
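
One way to keep notes consistent across datasets is to assemble them from the four parts above. This is a small Python sketch under that assumption; the helper `build_legal_basis_notes` is hypothetical and the schema only requires the final string.

```python
def build_legal_basis_notes(collection: str, scope: str, review: str, withdrawal: str) -> str:
    """Join the non-empty parts into one short, sentence-per-part note."""
    parts = [collection, scope, review, withdrawal]
    # Normalize so each part ends with exactly one period; skip empty parts.
    sentences = [p.strip().rstrip(".") + "." for p in parts if p and p.strip()]
    return " ".join(sentences)


notes = build_legal_basis_notes(
    collection="Data collected under IRB protocol #2024-001",
    scope="Consent covers research and model training.",
    review="",
    withdrawal="Withdrawal honored within 30 days",
)
```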

Legal Basis Examples

IRB-Approved Research

"legal_basis": "explicit-consent",
"legal_basis_notes": "Data collected under IRB protocol #2024-001. Participants provided written consent covering research and model training. Withdrawal honored within 30 days."

Terms of Service

"legal_basis": "terms-of-service",
"legal_basis_notes": "Users accepted ToS (Section 4.2) permitting anonymized data use for research and model training. Opt-out available at datapass.meetkai.ai/legal/privacy."

Public Domain Collection

"legal_basis": "public-domain",
"legal_basis_notes": "Data sourced from public domain materials with no reasonable expectation of privacy. Collection limited to publicly posted content."

Licensed Dataset

"legal_basis": "licensed",
"legal_basis_notes": "Data licensed from Example Corp under DLA-2024-789. Original consent handled by licensor; sublicense permits research/model training with attribution."

Child Data

The child_data field indicates whether your dataset contains data from minors (under 18):

{
  "rights": {
    "contains_personal_data": "direct",
    "child_data": true,
    "legal_basis": "explicit-consent",
    "legal_basis_notes": "Parental consent obtained for all participants under 18. COPPA-compliant collection process with verified parental authorization."
  }
}

Child Data Considerations

Requires parental or guardian consent
Subject to stricter regulations (COPPA; GDPR Article 8, sometimes called GDPR-K)
Should document age verification methods
Consider additional access restrictions

Restricted Uses

Document uses that are prohibited due to privacy concerns:

{
  "rights": {
    "contains_personal_data": "pseudonymous",
    "legal_basis": "explicit-consent",
    "legal_basis_notes": "...",
    "restricted_uses": [
      "Re-identification of individuals",
      "Profiling or surveillance",
      "Marketing to individuals",
      "Sharing with third parties without approval"
    ]
  }
}

Compliance Frameworks

GDPR (European Union)

For GDPR compliance, document:

{
  "rights": {
    "contains_personal_data": "pseudonymous",
    "legal_basis": "explicit-consent",
    "legal_basis_notes": "Processed under GDPR Article 6(1)(a) with explicit consent. Subjects informed of purposes, 5-year retention, erasure rights, and complaint process. DPO: dpo@example.org."
  }
}

Key GDPR requirements:

Explicit consent for sensitive data
Right to erasure (deletion requests)
Data minimization
Purpose limitation
Documented lawful basis

HIPAA (United States Healthcare)

For HIPAA-covered health data:

{
  "rights": {
    "contains_personal_data": "pseudonymous",
    "legal_basis": "explicit-consent",
    "legal_basis_notes": "Data de-identified per HIPAA Safe Harbor (45 CFR 164.514). All 18 PHI identifiers removed. Expert determination available upon request. Original data collected under patient authorization for research use."
  }
}

CCPA (California)

For California consumer data:

{
  "rights": {
    "contains_personal_data": "pseudonymous",
    "legal_basis": "terms-of-service",
    "legal_basis_notes": "CCPA-compliant notice at collection. Consumers provided right to know, delete, and opt-out of sale. No personal information sold. Privacy policy: datapass.meetkai.ai/legal/privacy"
  }
}

De-Identification Best Practices

When documenting de-identified data:

1. State the method - HIPAA Safe Harbor, k-anonymity, differential privacy
2. List removed identifiers - Names, addresses, dates, IDs, etc.
3. Assess re-identification risk - Low, medium, or describe assessment
4. Note any retained quasi-identifiers - Demographics, locations, dates

{
  "rights": {
    "contains_personal_data": "pseudonymous",
    "legal_basis": "explicit-consent",
    "legal_basis_notes": "De-identification performed using k-anonymity (k=5). Removed: names, exact addresses, dates of birth, SSN, phone numbers, email. Retained: age ranges (5-year bins), state-level geography, gender. Re-identification risk assessed as low."
  },
  "safety": {
    "content_risk_level": "medium",
    "known_risky_categories": ["other"],
    "risky_categories_notes": "Residual privacy risk from quasi-identifiers.",
    "mitigations": "Quasi-identifiers generalized to prevent linkage attacks. Annual privacy audit scheduled."
  }
}
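
The example above reports k-anonymity with k=5 over generalized quasi-identifiers. As a rough illustration of how such a figure could be measured, here is a Python sketch; the helper names `k_anonymity` and `age_bin` and the column names are assumptions for this example, not part of the schema.

```python
from collections import Counter


def age_bin(age: int, width: int = 5) -> str:
    """Generalize an exact age into a fixed-width range, e.g. 37 -> '35-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def k_anonymity(rows, quasi_identifiers) -> int:
    """Return the size of the smallest group of rows that share the same
    quasi-identifier values. A dataset is k-anonymous for any k up to
    this number. Returns 0 for an empty dataset."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values()) if groups else 0


rows = [
    {"age_range": age_bin(36), "state": "CA", "gender": "F"},
    {"age_range": age_bin(39), "state": "CA", "gender": "F"},
    {"age_range": age_bin(41), "state": "NY", "gender": "M"},
]
k = k_anonymity(rows, ["age_range", "state", "gender"])
```

Here the third row is unique on its quasi-identifiers, so k is 1 and the row would need further generalization or suppression before a claim like k=5 holds.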

Decision Tree

Use this decision tree to determine your personal data level:

Does the data contain any information about identifiable individuals?
├── No → contains_personal_data: "none"
└── Yes → Is it incidental mentions of public figures only?
    ├── Yes → contains_personal_data: "de_minimis"
    │         (requires legal_basis)
    └── No → Has identifying information been removed?
        ├── Yes → contains_personal_data: "pseudonymous"
        │         (requires legal_basis)
        └── No → contains_personal_data: "direct"
                  (requires legal_basis)
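
The tree can also be written as a small function, which is handy in data pipelines that fill in the field automatically. This is a sketch only; the function and parameter names are hypothetical, and answering the three questions still takes human judgment.

```python
def personal_data_level(
    has_identifiable_info: bool,
    only_incidental_public_figures: bool = False,
    identifiers_removed: bool = False,
) -> str:
    """Walk the decision tree top to bottom and return the level."""
    if not has_identifiable_info:
        return "none"
    if only_incidental_public_figures:
        return "de_minimis"
    if identifiers_removed:
        return "pseudonymous"
    return "direct"


# Survey data with names stripped but demographics retained
level = personal_data_level(True, identifiers_removed=True)
```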

Related Documentation

Rights Reference - All rights section fields
Safety Reference - Risk assessment fields
Restricted Access Example - Medical imaging with PII
Conditional Rules - All if/then requirements
