Personal Data & Consent

This guide explains how to properly document personal data in your dataset, including privacy levels, consent requirements, and best practices for compliance.

Personal Data Levels

The contains_personal_data field categorizes the level of personal information in your dataset:

Level         Description                                                Consent Required?
none          No personal data                                           No
de_minimis    Minimal, incidental personal data                          No
pseudonymous  Personal data with direct identifiers removed or replaced  Yes
direct        Directly identifying information                           Yes

None

Use none when your dataset contains no personal information:

{
  "rights": {
    "contains_personal_data": "none"
  }
}

Examples:

  • Synthetic data with fictional entities
  • Technical documentation
  • Code repositories (without author info)
  • Aggregated statistics

De Minimis

Use de_minimis for minimal, incidental personal data that poses negligible privacy risk:

{
  "rights": {
    "contains_personal_data": "de_minimis"
  }
}

Examples:

  • Public figures mentioned in news articles
  • Historical figures in educational content
  • Incidental mentions of public officials
  • Widely published public information

Pseudonymous

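Use pseudonymous when the data relates to individuals but direct identifiers have been removed or replaced (for example, with random or hashed IDs). Such data is not fully anonymous, because re-identification may still be possible: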

{
  "rights": {
    "contains_personal_data": "pseudonymous",
    "consent_mechanism": "Data collected under IRB approval with informed consent. All identifiers replaced with random IDs. Re-identification risk assessed as low."
  }
}

Examples:

  • De-identified medical records
  • User behavior data with hashed IDs
  • Survey responses with demographics
  • Social media data with usernames removed
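
The consent_mechanism example above mentions replacing identifiers with random IDs. As a rough illustration of what that step can look like, here is a minimal Python sketch. The function name pseudonymize, the user_id field, and the record layout are assumptions made for this example, not part of the schema.

```python
import secrets


def pseudonymize(records, id_field="user_id"):
    """Replace each distinct identifier with a random ID.

    Returns the rewritten records and the lookup table. The lookup table
    is what keeps the data pseudonymous rather than anonymous: whoever
    holds it can re-identify people, so store it separately or destroy it.
    """
    mapping = {}
    rewritten = []
    for record in records:
        original = record[id_field]
        if original not in mapping:
            # 16 hex characters of cryptographically strong randomness.
            mapping[original] = secrets.token_hex(8)
        # Copy the record so the caller's data is left untouched.
        rewritten.append({**record, id_field: mapping[original]})
    return rewritten, mapping
```

The same person always maps to the same random ID, so per-user analysis still works. For that reason the output is still personal data and still needs a consent_mechanism.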

Direct

Use direct when data directly identifies individuals:

{
  "rights": {
    "contains_personal_data": "direct",
    "consent_mechanism": "All participants provided written informed consent for data collection and research use. Consent forms available upon request to approved researchers."
  }
}

Examples:

  • Named interview transcripts
  • Biometric data (faces, voices)
  • Contact information
  • Government ID numbers


Consent Mechanism

When contains_personal_data is pseudonymous or direct, you must provide a consent_mechanism explaining how consent was obtained:

{
  "rights": {
    "contains_personal_data": "pseudonymous",
    "consent_mechanism": "Description of how consent was obtained..."
  }
}
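
A check like the following can catch a missing consent_mechanism before a dataset is published. This is a sketch, not an official validator: check_rights and CONSENT_REQUIRED are hypothetical names, and the mapping simply mirrors the "Consent Required?" column of the levels table above.

```python
# Assumption: mirrors the "Consent Required?" column of the levels table.
CONSENT_REQUIRED = {
    "none": False,
    "de_minimis": False,
    "pseudonymous": True,
    "direct": True,
}


def check_rights(rights):
    """Return a list of problems found in a `rights` object (empty if none)."""
    problems = []
    level = rights.get("contains_personal_data")
    if level not in CONSENT_REQUIRED:
        problems.append(f"unknown contains_personal_data level: {level!r}")
        return problems
    consent = rights.get("consent_mechanism") or ""
    if CONSENT_REQUIRED[level] and not str(consent).strip():
        problems.append(f"consent_mechanism is required when level is {level!r}")
    return problems
```

Returning a list of messages rather than raising on the first issue makes it easy to report every problem in one pass.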

What to Include

A good consent mechanism description should address:

Collection Method

How was consent obtained? (opt-in, terms acceptance, written form)

Scope of Consent

What uses did subjects consent to? (research, commercial, etc.)

Review Process

Was collection reviewed? (IRB, ethics board, legal review)

Data Subject Rights

Can subjects withdraw? How to contact for removal?
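
If consent descriptions are generated by a script, it can help to force all four topics to be filled in before the string is assembled. The helper below is a hypothetical sketch: consent_mechanism is free text, and describe_consent and its parameter names are invented for this example.

```python
def describe_consent(collection_method, scope, review, subject_rights):
    """Join the four recommended topics into one consent_mechanism string."""
    topics = {
        "collection_method": collection_method,
        "scope": scope,
        "review": review,
        "subject_rights": subject_rights,
    }
    missing = [name for name, text in topics.items() if not text.strip()]
    if missing:
        raise ValueError(f"missing consent details: {', '.join(missing)}")
    # Normalize each part to end with exactly one period before joining.
    return " ".join(text.strip().rstrip(".") + "." for text in topics.values())
```

The returned string can be used directly as the consent_mechanism value.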

IRB-Approved Research

"consent_mechanism": "Data collected under IRB protocol #2024-001 at Example University. All participants provided written informed consent. Participants were informed of research purposes, data retention policies, and their right to withdraw. IRB approval documentation available to verified researchers."

Terms of Service

"consent_mechanism": "Users consented to data collection via platform Terms of Service (Section 4.2) and Privacy Policy. Users were notified that anonymized data may be used for research. Opt-out mechanism available at datapass.meetkai.ai/legal/privacy."

Public Domain Collection

"consent_mechanism": "Data sourced from public domain materials with no reasonable expectation of privacy. Collection limited to publicly posted content. No private communications included."

Licensed Dataset

"consent_mechanism": "Data licensed from Example Corp under Data License Agreement #DLA-2024-789. Original consent obtained by licensor. Sublicensing terms permit research use with attribution."

Child Data

The child_data field indicates whether your dataset contains data from minors (under 18):

{
  "rights": {
    "contains_personal_data": "direct",
    "child_data": true,
    "consent_mechanism": "Parental consent obtained for all participants under 18. COPPA-compliant collection process with verified parental authorization."
  }
}

Child Data Considerations

  • Requires parental or guardian consent
  • Subject to stricter regulations (COPPA in the US for children under 13; GDPR Article 8, sometimes called GDPR-K, in the EU)
  • Should document age verification methods
  • Consider additional access restrictions

Restricted Uses

Document uses that are prohibited due to privacy concerns:

{
  "rights": {
    "contains_personal_data": "pseudonymous",
    "consent_mechanism": "...",
    "restricted_uses": [
      "Re-identification of individuals",
      "Profiling or surveillance",
      "Marketing to individuals",
      "Sharing with third parties without approval"
    ]
  }
}

Compliance Frameworks

GDPR (European Union)

For GDPR compliance, document:

{
  "rights": {
    "contains_personal_data": "pseudonymous",
    "consent_mechanism": "Data processed under GDPR Article 6(1)(a) with explicit consent. Data subjects informed of: processing purposes, retention period (5 years), right to erasure, and right to lodge complaints with supervisory authority. DPO contact: dpo@example.org"
  }
}

Key GDPR requirements:

  • Explicit consent for sensitive data
  • Right to erasure (deletion requests)
  • Data minimization
  • Purpose limitation
  • Documented lawful basis

HIPAA (United States Healthcare)

For HIPAA-covered health data:

{
  "rights": {
    "contains_personal_data": "pseudonymous",
    "consent_mechanism": "Data de-identified per HIPAA Safe Harbor method (45 CFR 164.514(b)(2)). All 18 Safe Harbor identifier types removed. De-identification documentation available upon request. Original data collected under patient authorization for research use."
  }
}

CCPA (California)

For California consumer data:

{
  "rights": {
    "contains_personal_data": "pseudonymous",
    "consent_mechanism": "Data collected with CCPA-compliant notice at collection. Consumers provided right to know, delete, and opt-out of sale. No personal information sold. Privacy policy: datapass.meetkai.ai/legal/privacy"
  }
}

De-Identification Best Practices

When documenting de-identified data:

  1. State the method - HIPAA Safe Harbor, k-anonymity, differential privacy
  2. List removed identifiers - Names, addresses, dates, IDs, etc.
  3. Assess re-identification risk - Low, medium, or describe assessment
  4. Note any retained quasi-identifiers - Demographics, locations, dates

{
  "rights": {
    "contains_personal_data": "pseudonymous",
    "consent_mechanism": "De-identification performed using k-anonymity (k=5). Removed: names, exact addresses, dates of birth, SSN, phone numbers, email. Retained: age ranges (5-year bins), state-level geography, gender. Re-identification risk assessed by privacy team as low."
  },
  "safety": {
    "content_risk_level": "medium",
    "known_risky_categories": ["personal-information"],
    "mitigations": "Quasi-identifiers generalized to prevent linkage attacks. Annual privacy audit scheduled."
  }
}
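
The example above reports k-anonymity with k=5 over generalized quasi-identifiers. One way to measure k is to find the smallest group of records that share the same quasi-identifier values. The sketch below assumes records are plain dictionaries; age_bin, k_anonymity, and the field names are illustrative.

```python
from collections import Counter


def age_bin(age, width=5):
    """Generalize an exact age into a range such as '35-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def k_anonymity(records, quasi_identifiers):
    """Return k: the size of the smallest group of records that share
    the same values for every quasi-identifier (0 for an empty dataset)."""
    groups = Counter(
        tuple(record[field] for field in quasi_identifiers) for record in records
    )
    return min(groups.values()) if groups else 0
```

A dataset satisfies k=5 only if k_anonymity returns at least 5 for the retained quasi-identifiers. Smaller values point to groups that need further generalization or suppression.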

Decision Tree

Use this decision tree to choose the contains_personal_data level for your dataset:

Does the data contain any information about identifiable individuals?
├── No → contains_personal_data: "none"
└── Yes → Is it incidental mentions of public figures only?
    ├── Yes → contains_personal_data: "de_minimis"
    └── No → Has identifying information been removed?
        ├── Yes → contains_personal_data: "pseudonymous"
        │         (requires consent_mechanism)
        └── No → contains_personal_data: "direct"
                  (requires consent_mechanism)
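
The same tree can be written as a small function, which is handy when levels are assigned by a script. The function and parameter names are hypothetical. Each parameter corresponds to one question in the tree, asked in the same order.

```python
def personal_data_level(
    has_identifiable_info,
    only_incidental_public_figures=False,
    identifiers_removed=False,
):
    """Walk the decision tree and return a contains_personal_data value."""
    if not has_identifiable_info:
        return "none"
    if only_incidental_public_figures:
        return "de_minimis"
    if identifiers_removed:
        return "pseudonymous"  # requires consent_mechanism
    return "direct"  # requires consent_mechanism
```

For example, personal_data_level(True, identifiers_removed=True) returns "pseudonymous".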