Personal Data & Legal Basis

This guide explains how to properly document personal data in your dataset, including privacy levels, lawful bases (consent, ToS, licenses, etc.), and best practices for compliance.

Personal Data Levels

The contains_personal_data field categorizes the level of personal information in your dataset:

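| Level        | Description                                                              | Legal Basis Required? |
|--------------|--------------------------------------------------------------------------|-----------------------|
| none         | No personal data                                                         | No                    |
| de_minimis   | Minimal, incidental personal data                                        | Yes                   |
| pseudonymous | Direct identifiers removed or replaced; re-identification still possible | Yes                   |
| direct       | Directly identifying information                                         | Yes                   |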

None

Use none when your dataset contains no personal information:

{
  "rights": {
    "contains_personal_data": "none"
  }
}

Examples:

  • Synthetic data with fictional entities
  • Technical documentation
  • Code repositories (without author info)
  • Aggregated statistics

De Minimis

Use de_minimis for minimal, incidental personal data that poses negligible privacy risk:

{
  "rights": {
    "contains_personal_data": "de_minimis",
    "legal_basis": "publicly-available"
  }
}

Examples:

  • Public figures mentioned in news articles
  • Historical figures in educational content
  • Incidental mentions of public officials
  • Widely published public information

Pseudonymous

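Use pseudonymous when direct identifiers have been removed or replaced (for example, with random IDs or hashes) but individuals could still be re-identified through linkage or retained attributes: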

{
  "rights": {
    "contains_personal_data": "pseudonymous",
    "legal_basis": "explicit-consent",
    "legal_basis_notes": "Data collected under IRB approval with informed consent. All identifiers replaced with random IDs. Re-identification risk assessed as low."
  }
}

Examples:

  • De-identified medical records
  • User behavior data with hashed IDs
  • Survey responses with demographics
  • Social media data with usernames removed

Direct

Use direct when data directly identifies individuals:

{
  "rights": {
    "contains_personal_data": "direct",
    "legal_basis": "explicit-consent",
    "legal_basis_notes": "All participants provided written informed consent for data collection and research use. Consent forms available upon request to approved researchers."
  }
}

Examples:

  • Named interview transcripts
  • Biometric data (faces, voices)
  • Contact information
  • Government ID numbers

Legal Basis

When contains_personal_data is anything other than none, or child_data is true, you must provide legal_basis. Pick the enum value that matches your situation and put the details in legal_basis_notes, which is required when legal_basis is other:

{
  "rights": {
    "contains_personal_data": "pseudonymous",
    "legal_basis": "explicit-consent",
    "legal_basis_notes": "Description of how consent was obtained, what uses were disclosed, and withdrawal rights."
  }
}
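The requirement above can be checked mechanically before a dataset is published. The following is a minimal sketch, not official tooling: the function name check_legal_basis and the constant PERSONAL_DATA_LEVELS are hypothetical, and the rules encoded are only the ones stated in this guide.

```python
from typing import Any

# The four levels described in this guide.
PERSONAL_DATA_LEVELS = {"none", "de_minimis", "pseudonymous", "direct"}


def check_legal_basis(rights: dict[str, Any]) -> list[str]:
    """Return a list of problems found in a `rights` object (empty if none)."""
    problems: list[str] = []

    level = rights.get("contains_personal_data")
    if level not in PERSONAL_DATA_LEVELS:
        problems.append(f"unknown contains_personal_data value: {level!r}")

    # legal_basis is required when personal data is present or child_data is true.
    has_personal_data = level in PERSONAL_DATA_LEVELS and level != "none"
    needs_basis = has_personal_data or rights.get("child_data") is True
    basis = rights.get("legal_basis")
    if needs_basis and not basis:
        problems.append("legal_basis is required")

    # legal_basis_notes is required when legal_basis is "other".
    notes = (rights.get("legal_basis_notes") or "").strip()
    if basis == "other" and not notes:
        problems.append('legal_basis_notes is required when legal_basis is "other"')

    return problems


if __name__ == "__main__":
    example = {"contains_personal_data": "pseudonymous"}
    for problem in check_legal_basis(example):
        print(problem)
```

Returning a list of problems rather than raising on the first one lets a caller report every gap in a single pass.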

What to Include

Use legal_basis_notes to give concise evidence for the legal basis you selected. Cover these four points:

Collection Method

How was data collected? (explicit consent, ToS acceptance, public sources, license)

Scope of Use

Which uses are permitted (research, commercial training, redistribution)?

Review Process

Oversight or approvals (IRB, ethics review, legal review)

Rights & Withdrawal

How subjects can withdraw, retention period, deletion process
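
One way to make sure all four points are covered is to assemble legal_basis_notes from separate fields. This is an illustrative sketch only: the LegalBasisEvidence class and its field names are hypothetical, and the sample values are placeholders rather than recommended wording.

```python
from dataclasses import dataclass


@dataclass
class LegalBasisEvidence:
    collection_method: str      # how the data was collected
    scope_of_use: str           # which uses are permitted
    review_process: str         # oversight or approvals
    rights_and_withdrawal: str  # withdrawal, retention, deletion

    def to_notes(self) -> str:
        """Join the non-empty parts into one legal_basis_notes string."""
        parts = [
            self.collection_method,
            self.scope_of_use,
            self.review_process,
            self.rights_and_withdrawal,
        ]
        sentences = []
        for part in parts:
            text = part.strip()
            if not text:
                continue  # skip points that do not apply
            if not text.endswith("."):
                text += "."
            sentences.append(text)
        return " ".join(sentences)


# Placeholder values for illustration.
evidence = LegalBasisEvidence(
    collection_method="Collected with written informed consent",
    scope_of_use="Consent covers research and model training",
    review_process="Reviewed under an IRB protocol",
    rights_and_withdrawal="Withdrawal requests honored within 30 days",
)
notes = evidence.to_notes()

if __name__ == "__main__":
    print(notes)
```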

IRB-Approved Research

"legal_basis": "explicit-consent",
"legal_basis_notes": "Data collected under IRB protocol #2024-001. Participants provided written consent covering research and model training. Withdrawal honored within 30 days."

Terms of Service

"legal_basis": "terms-of-service",
"legal_basis_notes": "Users accepted ToS (Section 4.2) permitting anonymized data use for research and model training. Opt-out available at datapass.meetkai.ai/legal/privacy."

Public Domain Collection

"legal_basis": "public-domain",
"legal_basis_notes": "Data sourced from public domain materials with no reasonable expectation of privacy. Collection limited to publicly posted content."

Licensed Dataset

"legal_basis": "licensed",
"legal_basis_notes": "Data licensed from Example Corp under DLA-2024-789. Original consent handled by licensor; sublicense permits research/model training with attribution."

Child Data

The child_data field indicates whether your dataset contains data from minors (under 18):

{
  "rights": {
    "contains_personal_data": "direct",
    "child_data": true,
    "legal_basis": "explicit-consent",
    "legal_basis_notes": "Parental consent obtained for all participants under 18. COPPA-compliant collection process with verified parental authorization."
  }
}

Child Data Considerations

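  • Typically requires parental or guardian consent
  • Is subject to stricter regulations (for example, COPPA in the United States for children under 13, and GDPR Article 8 in the EU)
  • Should come with documented age verification methods
  • May warrant additional access restrictions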

Restricted Uses

Document uses that are prohibited due to privacy concerns:

{
  "rights": {
    "contains_personal_data": "pseudonymous",
    "legal_basis": "explicit-consent",
    "legal_basis_notes": "...",
    "restricted_uses": [
      "Re-identification of individuals",
      "Profiling or surveillance",
      "Marketing to individuals",
      "Sharing with third parties without approval"
    ]
  }
}

Compliance Frameworks

GDPR (European Union)

For GDPR compliance, document the lawful basis, retention period, and data subject rights in legal_basis_notes:

{
  "rights": {
    "contains_personal_data": "pseudonymous",
    "legal_basis": "explicit-consent",
    "legal_basis_notes": "Processed under GDPR Article 6(1)(a) with explicit consent. Subjects informed of purposes, 5-year retention, erasure rights, and complaint process. DPO: dpo@example.org."
  }
}

Key GDPR requirements:

  • Explicit consent (or another Article 9 condition) for special-category data
  • Right to erasure (deletion requests)
  • Data minimization
  • Purpose limitation
  • Documented lawful basis

HIPAA (United States Healthcare)

For HIPAA-covered health data:

{
  "rights": {
    "contains_personal_data": "pseudonymous",
    "legal_basis": "explicit-consent",
    "legal_basis_notes": "Data de-identified per HIPAA Safe Harbor (45 CFR 164.514). All 18 PHI identifiers removed. Expert determination available upon request. Original data collected under patient authorization for research use."
  }
}

CCPA (California)

For California consumer data:

{
  "rights": {
    "contains_personal_data": "pseudonymous",
    "legal_basis": "terms-of-service",
    "legal_basis_notes": "CCPA-compliant notice at collection. Consumers provided right to know, delete, and opt-out of sale. No personal information sold. Privacy policy: datapass.meetkai.ai/legal/privacy"
  }
}

De-Identification Best Practices

When documenting de-identified data:

  1. State the method: HIPAA Safe Harbor, k-anonymity, or differential privacy
  2. List removed identifiers: names, addresses, dates, IDs, etc.
  3. Assess re-identification risk: low, medium, or a description of the assessment
  4. Note any retained quasi-identifiers: demographics, locations, dates

{
  "rights": {
    "contains_personal_data": "pseudonymous",
    "legal_basis": "explicit-consent",
    "legal_basis_notes": "De-identification performed using k-anonymity (k=5). Removed: names, exact addresses, dates of birth, SSN, phone numbers, email. Retained: age ranges (5-year bins), state-level geography, gender. Re-identification risk assessed as low."
  },
  "safety": {
    "content_risk_level": "medium",
    "known_risky_categories": ["other"],
    "risky_categories_notes": "Residual privacy risk from quasi-identifiers.",
    "mitigations": "Quasi-identifiers generalized to prevent linkage attacks. Annual privacy audit scheduled."
  }
}
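
The k-anonymity claim in the notes above (k=5 with ages in 5-year bins) is something you can measure before documenting it. The sketch below shows one way to generalize ages and compute k over a set of quasi-identifier columns. The function names age_bin and k_anonymity and the sample records are hypothetical and serve only to illustrate the technique.

```python
from collections import Counter
from typing import Iterable, Mapping, Sequence


def age_bin(age: int, width: int = 5) -> str:
    """Generalize an exact age into a range such as '30-34'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def k_anonymity(
    records: Iterable[Mapping[str, object]],
    quasi_identifiers: Sequence[str],
) -> int:
    """Return the size of the smallest group sharing the same quasi-identifier values.

    The data is k-anonymous for these columns if the result is >= k.
    Returns 0 for an empty dataset.
    """
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values(), default=0)


# Made-up records for illustration.
records = [
    {"age": 31, "state": "CA", "gender": "F"},
    {"age": 33, "state": "CA", "gender": "F"},
    {"age": 47, "state": "NY", "gender": "M"},
    {"age": 45, "state": "NY", "gender": "M"},
    {"age": 49, "state": "NY", "gender": "M"},
]
columns = ["age", "state", "gender"]

# Exact ages make every record unique; binning merges them into larger groups.
generalized = [{**r, "age": age_bin(r["age"])} for r in records]
k_raw = k_anonymity(records, columns)
k_generalized = k_anonymity(generalized, columns)

if __name__ == "__main__":
    print(k_raw, k_generalized)  # 1 2
```

In this sample the generalized data reaches only k=2, so it would not support a k=5 claim without more records or coarser bins. Report the measured value in legal_basis_notes.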

Decision Tree

Use this guide to determine your personal data level:

Does the data contain any information about identifiable individuals?
├── No → contains_personal_data: "none"
└── Yes → Is it incidental mentions of public figures only?
    ├── Yes → contains_personal_data: "de_minimis"
    │         (requires legal_basis)
    └── No → Has identifying information been removed?
        ├── Yes → contains_personal_data: "pseudonymous"
        │         (requires legal_basis)
        └── No → contains_personal_data: "direct"
                  (requires legal_basis)
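
The same decision tree can be written as a small function, which is handy when levels are assigned in a script. This is a sketch: the function name personal_data_level and its parameter names are hypothetical, and each parameter stands for the answer to one question in the tree.

```python
def personal_data_level(
    has_identifiable_info: bool,
    only_incidental_public_figures: bool = False,
    identifiers_removed: bool = False,
) -> str:
    """Map answers to the decision tree onto a contains_personal_data value."""
    if not has_identifiable_info:
        return "none"
    if only_incidental_public_figures:
        return "de_minimis"
    if identifiers_removed:
        return "pseudonymous"
    return "direct"


if __name__ == "__main__":
    # A survey dataset whose names were replaced with random IDs.
    print(personal_data_level(True, identifiers_removed=True))  # pseudonymous
```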