Validation Rules

The LLM Data Card schema includes conditional validation rules that require certain fields based on the values of other fields. Understanding these rules helps you create valid data cards and avoid validation errors.

How Conditional Rules Work

The schema uses JSON Schema if/then constructs:

{
  "if": { "properties": { "fieldA": { "const": "value" } } },
  "then": { "required": ["fieldB"] }
}

This means: IF fieldA equals "value", THEN fieldB is required.
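
As a rough illustration, the documented meaning of this pattern can be mirrored in a few lines of Python. The helper name missing_field_b is hypothetical and this is not the validator's own code. One caveat from standard JSON Schema semantics: "properties" only constrains fields that are present, so the "if" block as written also matches when fieldA is absent, unless it additionally lists fieldA under "required". The sketch follows the documented intent and triggers only when fieldA is present and equal to "value".

```python
def missing_field_b(instance: dict) -> bool:
    """Hypothetical helper: True when the documented if/then rule is violated."""
    # "if": fieldA is present and equals "value" (the documented intent)
    condition_holds = instance.get("fieldA") == "value"
    # "then": fieldB must be present
    return condition_holds and "fieldB" not in instance


print(missing_field_b({"fieldA": "value"}))               # rule violated
print(missing_field_b({"fieldA": "value", "fieldB": 1}))  # rule satisfied
print(missing_field_b({"fieldA": "something-else"}))      # rule does not apply
```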

All Validation Rules

Rule 1: Synthetic Data Requires synthetic Section

Synthetic/Hybrid Data

IF data.kind is "synthetic" or "hybrid"

THEN the synthetic section is required with:

  • generation_method (required)
  • share_of_dataset (required)

Valid Example:

{
  "data": {
    "kind": "hybrid",
    "modalities": ["text"],
    "languages": ["en"],
    "size": { "examples": 100000 },
    "domains": ["general"],
    "record_format": "plain-text"
  },
  "synthetic": {
    "generation_method": "backtranslation",
    "share_of_dataset": 0.3
  }
}

Invalid Example (data.kind is "synthetic" but the synthetic section is missing):

{
  "data": {
    "kind": "synthetic",
    "modalities": ["text"],
    "languages": ["en"],
    "size": { "examples": 100000 },
    "domains": ["general"],
    "record_format": "plain-text"
  }
}
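
The synthetic-data rule can be approximated with a small pre-check before running the real validator. This is a minimal sketch, assuming the card is already loaded as a Python dict; check_synthetic_rule is a hypothetical name, not part of any published tooling.

```python
def check_synthetic_rule(card: dict) -> list[str]:
    """Return the field paths the synthetic-data rule requires but the card lacks."""
    kind = card.get("data", {}).get("kind")
    if kind not in ("synthetic", "hybrid"):
        return []  # rule does not apply
    synthetic = card.get("synthetic")
    if not isinstance(synthetic, dict):
        return ["synthetic"]  # whole section is missing
    required = ("generation_method", "share_of_dataset")
    return [f"synthetic.{name}" for name in required if name not in synthetic]


valid = {
    "data": {"kind": "hybrid"},
    "synthetic": {"generation_method": "backtranslation", "share_of_dataset": 0.3},
}
invalid = {"data": {"kind": "synthetic"}}

print(check_synthetic_rule(valid))    # []
print(check_synthetic_rule(invalid))  # ['synthetic']
```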

Rule 2: Other Record Format Requires Notes

Custom Record Format

IF data.record_format is "other"

THEN data.record_format_notes is required

Valid Example:

{
  "data": {
    "record_format": "other",
    "record_format_notes": "Custom binary format with header, see documentation for parser"
  }
}

Rule 3: Human Annotations Require Label Types

Human Annotations

IF data.has_human_annotations is true

THEN data.label_types is required

Valid Example:

{
  "data": {
    "has_human_annotations": true,
    "label_types": ["sentiment", "topic", "toxicity"]
  }
}

Rule 4: Translation Pairs Require Multiple Languages

Translation Pairs

IF data.structures contains "translation-pairs"

THEN data.languages must have at least 2 items

Valid Example:

{
  "data": {
    "structures": ["translation-pairs"],
    "languages": ["en", "ha-Latn-NG"]
  }
}

Invalid Example (translation pairs declared with only one language):

{
  "data": {
    "structures": ["translation-pairs"],
    "languages": ["en"]
  }
}
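
Unlike the "field is required" rules, this one constrains the length of an existing array. A minimal sketch of the same check in Python, with the hypothetical name check_translation_pairs:

```python
def check_translation_pairs(card: dict) -> bool:
    """True when the card satisfies the translation-pairs rule or the rule does not apply."""
    data = card.get("data", {})
    if "translation-pairs" not in data.get("structures", []):
        return True  # rule does not apply
    # Equivalent of minItems: 2 on data.languages
    return len(data.get("languages", [])) >= 2


ok = {"data": {"structures": ["translation-pairs"], "languages": ["en", "ha-Latn-NG"]}}
bad = {"data": {"structures": ["translation-pairs"], "languages": ["en"]}}

print(check_translation_pairs(ok))   # True
print(check_translation_pairs(bad))  # False
```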

Rule 5: Personal Data Requires Consent Mechanism

Personal Data Present

IF rights.contains_personal_data is "de_minimis", "pseudonymous", or "direct"

THEN rights.consent_mechanism is required

Valid Example:

{
  "rights": {
    "license": "CC-BY-4.0",
    "allows_commercial_use": true,
    "contains_personal_data": "pseudonymous",
    "consent_mechanism": "explicit-opt-in"
  }
}

Rule 6: Child Data Requires Consent Mechanism

Child Data

IF rights.child_data is true

THEN rights.consent_mechanism is required

Valid Example:

{
  "rights": {
    "license": "CC-BY-4.0",
    "allows_commercial_use": false,
    "contains_personal_data": "none",
    "child_data": true,
    "consent_mechanism": "parental-consent-obtained"
  }
}
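
The personal-data and child-data rules (5 and 6 in the quick reference table) share the same consequence, so either trigger alone is enough to require rights.consent_mechanism. A rough Python sketch of the combined check follows; the function name is hypothetical.

```python
# Values of rights.contains_personal_data that trigger the rule, as listed above.
PERSONAL_DATA_LEVELS = ("de_minimis", "pseudonymous", "direct")


def check_consent(card: dict) -> list[str]:
    """Return ['rights.consent_mechanism'] when it is required but absent."""
    rights = card.get("rights", {})
    has_personal_data = rights.get("contains_personal_data") in PERSONAL_DATA_LEVELS
    has_child_data = rights.get("child_data") is True
    if (has_personal_data or has_child_data) and "consent_mechanism" not in rights:
        return ["rights.consent_mechanism"]
    return []


# Child data alone triggers the requirement, even with no personal data.
print(check_consent({"rights": {"contains_personal_data": "none", "child_data": True}}))
```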

Rule 7: Not Available Requires Reason

Unavailable Dataset

IF access.availability is "not-available"

THEN access.not_available_reason is required

Valid Example:

{
  "access": {
    "availability": "not-available",
    "not_available_reason": "Pending legal review, expected Q2 2025"
  }
}

Rule 8: Restricted/On-Request Requires Instructions or URL

Restricted Access

IF access.availability is "restricted" or "on-request"

THEN either access.request_instructions or access.url is required

Valid Example (with instructions):

{
  "access": {
    "availability": "on-request",
    "request_instructions": "Email data-access@example.org with research proposal"
  }
}

Valid Example (with URL):

{
  "access": {
    "availability": "restricted",
    "url": "https://example.org/apply-for-access"
  }
}
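
This rule accepts either of two fields. The sketch below assumes the alternatives are inclusive (supplying both request_instructions and url is also fine); the text does not say whether the schema enforces exactly one. The function name is hypothetical.

```python
def check_restricted_access(card: dict) -> bool:
    """True when the restricted/on-request rule is satisfied or does not apply."""
    access = card.get("access", {})
    if access.get("availability") not in ("restricted", "on-request"):
        return True  # rule does not apply
    # Assumption: at least one of the two fields is enough.
    return "request_instructions" in access or "url" in access


print(check_restricted_access({"access": {"availability": "on-request"}}))  # False
```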

Rule 9: Public Download Requires Download Location

Public Download

IF access.availability is "public-download"

THEN one of the following is required:

  • access.url, OR
  • artifacts.base_uri, OR
  • artifacts.files[].uri

Valid Example (with access.url):

{
  "access": {
    "availability": "public-download",
    "url": "https://huggingface.co/datasets/example/corpus"
  }
}

Valid Example (with artifacts):

{
  "access": {
    "availability": "public-download"
  },
  "artifacts": {
    "base_uri": "https://cdn.example.org/datasets/v1/",
    "files": [
      {
        "path": "train.jsonl",
        "sha256": "abc123...",
        "size_bytes": 1000000,
        "split": "train"
      }
    ]
  }
}
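
This is the only rule whose alternatives span two top-level sections (access and artifacts). A minimal Python sketch, assuming that a single file entry with a uri is enough to satisfy the artifacts.files[].uri alternative; check_public_download is a hypothetical name.

```python
def check_public_download(card: dict) -> bool:
    """True when a public-download card names at least one download location."""
    access = card.get("access", {})
    if access.get("availability") != "public-download":
        return True  # rule does not apply
    artifacts = card.get("artifacts", {})
    if "url" in access or "base_uri" in artifacts:
        return True
    # Assumption: one file with its own uri is sufficient.
    return any("uri" in entry for entry in artifacts.get("files", []))


card = {
    "access": {"availability": "public-download"},
    "artifacts": {"files": [{"path": "train.jsonl"}]},
}
print(check_public_download(card))  # False: no url, base_uri, or file uri
```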

Quick Reference Table

Rule | IF condition | THEN required
1 | data.kind in ["synthetic", "hybrid"] | synthetic.generation_method, synthetic.share_of_dataset
2 | data.record_format = "other" | data.record_format_notes
3 | data.has_human_annotations = true | data.label_types
4 | data.structures contains "translation-pairs" | data.languages minItems: 2
5 | rights.contains_personal_data != "none" | rights.consent_mechanism
6 | rights.child_data = true | rights.consent_mechanism
7 | access.availability = "not-available" | access.not_available_reason
8 | access.availability in ["restricted", "on-request"] | access.request_instructions OR access.url
9 | access.availability = "public-download" | access.url OR artifacts.base_uri OR artifacts.files[].uri

Validation Error Messages

When a conditional rule fails, the validator reports which required field is missing. Common error patterns:

Using the Validator

Test your data card against these rules using the Validator tool. The validator will report all conditional rule violations at once.
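
To illustrate what reporting all violations at once means, the sketch below runs every rule and collects the failures instead of stopping at the first. It covers only the three simple "equals X, then field Y is required" rules (2, 3, and 7). The names and the message format are hypothetical and will not match the Validator's actual output.

```python
# Each entry: (rule label, condition on the card, required field path).
SIMPLE_RULES = [
    ("Rule 2", lambda c: c.get("data", {}).get("record_format") == "other",
     "data.record_format_notes"),
    ("Rule 3", lambda c: c.get("data", {}).get("has_human_annotations") is True,
     "data.label_types"),
    ("Rule 7", lambda c: c.get("access", {}).get("availability") == "not-available",
     "access.not_available_reason"),
]


def has_path(card: dict, dotted: str) -> bool:
    """True when every key along a dotted path exists in the nested dict."""
    node = card
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return False
        node = node[key]
    return True


def collect_violations(card: dict) -> list[str]:
    """Check every rule and return all violations, not just the first one."""
    return [
        f"{label}: missing {path}"
        for label, applies, path in SIMPLE_RULES
        if applies(card) and not has_path(card, path)
    ]


card = {
    "data": {"record_format": "other", "has_human_annotations": True},
    "access": {"availability": "not-available"},
}
for message in collect_violations(card):
    print(message)
```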

See Also