Validation Rules
The LLM Data Card schema includes conditional validation rules that require certain fields based on the values of other fields. Understanding these rules helps you create valid data cards and avoid validation errors.
How Conditional Rules Work
The schema uses JSON Schema if/then constructs:
{
"if": { "properties": { "fieldA": { "const": "value" } } },
"then": { "required": ["fieldB"] }
}
This means: IF fieldA equals "value", THEN fieldB is required.
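The semantics of that construct can be sketched in a few lines of plain Python. This is an illustrative evaluator that handles only `const` conditions and `required` consequences; the function names `condition_matches` and `missing_required` are hypothetical and are not part of the schema or the validator.

```python
def condition_matches(instance: dict, if_schema: dict) -> bool:
    """Return True when every `const` constraint in the `if` clause holds."""
    for name, constraint in if_schema.get("properties", {}).items():
        # In JSON Schema, `properties` only constrains fields that are present,
        # so an absent field cannot make the condition fail.
        if name in instance and "const" in constraint:
            if instance[name] != constraint["const"]:
                return False
    return True


def missing_required(instance: dict, rule: dict) -> list[str]:
    """Return the fields listed in `then.required` that are absent."""
    if not condition_matches(instance, rule["if"]):
        return []  # condition not met, so the `then` clause is ignored
    return [f for f in rule["then"].get("required", []) if f not in instance]


rule = {
    "if": {"properties": {"fieldA": {"const": "value"}}},
    "then": {"required": ["fieldB"]},
}

print(missing_required({"fieldA": "value"}, rule))               # ['fieldB']
print(missing_required({"fieldA": "value", "fieldB": 1}, rule))  # []
print(missing_required({"fieldA": "other"}, rule))               # []
```

Note that with the snippet exactly as written, an object that omits fieldA also satisfies the `if` clause, which is why schemas often add a `required` entry inside `if` as well.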
All Validation Rules
Rule 1: Synthetic Data Requires synthetic Section
Synthetic/Hybrid Data
IF data.kind is "synthetic" or "hybrid"
THEN the synthetic section is required with:
- generation_method (required)
- share_of_dataset (required)
Valid Example:
{
"data": {
"kind": "hybrid",
"modalities": ["text"],
"languages": ["en"],
"size": { "examples": 100000 },
"domains": ["general"],
"record_format": "plain-text"
},
"synthetic": {
"generation_method": "backtranslation",
"share_of_dataset": 0.3
}
}
Invalid Example (data.kind is "synthetic" but the synthetic section is missing):
{
"data": {
"kind": "synthetic",
"modalities": ["text"],
"languages": ["en"],
"size": { "examples": 100000 },
"domains": ["general"],
"record_format": "plain-text"
}
}
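To make Rule 1 concrete, here is a minimal Python sketch of the same check, applied to a data card loaded as a dict. The function name `check_synthetic_rule` and the error wording are assumptions made for illustration; the real validator's messages may differ.

```python
def check_synthetic_rule(card: dict) -> list[str]:
    """Return Rule 1 violations for a data card given as a dict."""
    kind = card.get("data", {}).get("kind")
    if kind not in ("synthetic", "hybrid"):
        return []  # the rule does not apply to other kinds of data

    synthetic = card.get("synthetic")
    if not isinstance(synthetic, dict):
        return ["synthetic section is required"]

    # Both fields must be present inside the synthetic section.
    return [
        f"synthetic.{field} is required"
        for field in ("generation_method", "share_of_dataset")
        if field not in synthetic
    ]


hybrid_card = {
    "data": {"kind": "hybrid"},
    "synthetic": {"generation_method": "backtranslation", "share_of_dataset": 0.3},
}
synthetic_card = {"data": {"kind": "synthetic"}}  # no synthetic section

print(check_synthetic_rule(hybrid_card))     # []
print(check_synthetic_rule(synthetic_card))  # ['synthetic section is required']
```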
Rule 2: Other Record Format Requires Notes
Custom Record Format
IF data.record_format is "other"
THEN data.record_format_notes is required
Valid Example:
{
"data": {
"record_format": "other",
"record_format_notes": "Custom binary format with header, see documentation for parser"
}
}
Rule 3: Human Annotations Require Label Types
Human Annotations
IF data.has_human_annotations is true
THEN data.label_types is required
Valid Example:
{
"data": {
"has_human_annotations": true,
"label_types": ["sentiment", "topic", "toxicity"]
}
}
Rule 4: Translation Pairs Require Multiple Languages
Translation Pairs
IF data.structures contains "translation-pairs"
THEN data.languages must have at least 2 items
Valid Example:
{
"data": {
"structures": ["translation-pairs"],
"languages": ["en", "ha-Latn-NG"]
}
}
Invalid Example (only one language is listed):
{
"data": {
"structures": ["translation-pairs"],
"languages": ["en"]
}
}
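Rule 4 differs from the others because it constrains the length of an array instead of requiring a field. A rough Python equivalent, with the hypothetical function name `check_translation_pairs` and illustrative error text, might look like this:

```python
def check_translation_pairs(card: dict) -> list[str]:
    """Return Rule 4 violations for a data card given as a dict."""
    data = card.get("data", {})
    if "translation-pairs" not in data.get("structures", []):
        return []  # the rule only applies to translation-pair datasets

    languages = data.get("languages", [])
    if len(languages) < 2:
        return [f"data.languages must have at least 2 items, found {len(languages)}"]
    return []


valid = {"data": {"structures": ["translation-pairs"], "languages": ["en", "ha-Latn-NG"]}}
invalid = {"data": {"structures": ["translation-pairs"], "languages": ["en"]}}

print(check_translation_pairs(valid))    # []
print(check_translation_pairs(invalid))  # ['data.languages must have at least 2 items, found 1']
```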
Rule 5: Personal Data Requires Consent Mechanism
Personal Data Present
IF rights.contains_personal_data is "de_minimis", "pseudonymous", or "direct"
THEN rights.consent_mechanism is required
Valid Example:
{
"rights": {
"license": "CC-BY-4.0",
"allows_commercial_use": true,
"contains_personal_data": "pseudonymous",
"consent_mechanism": "explicit-opt-in"
}
}
Rule 6: Child Data Requires Consent Mechanism
Child Data
IF rights.child_data is true
THEN rights.consent_mechanism is required
Valid Example:
{
"rights": {
"license": "CC-BY-4.0",
"allows_commercial_use": false,
"contains_personal_data": "none",
"child_data": true,
"consent_mechanism": "parental-consent-obtained"
}
}
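Rules 5 and 6 lead to the same requirement, so a single check can cover both triggers. The sketch below assumes the card is already parsed into a dict; `check_consent` and the message format are hypothetical names chosen for this example.

```python
# Values of rights.contains_personal_data that trigger Rule 5.
PERSONAL_DATA_LEVELS = {"de_minimis", "pseudonymous", "direct"}


def check_consent(card: dict) -> list[str]:
    """Return Rule 5 / Rule 6 violations for a data card given as a dict."""
    rights = card.get("rights", {})

    triggers = []
    if rights.get("contains_personal_data") in PERSONAL_DATA_LEVELS:
        triggers.append("Rule 5: personal data present")
    if rights.get("child_data") is True:
        triggers.append("Rule 6: child data present")

    # Either trigger alone is enough to make consent_mechanism mandatory.
    if triggers and "consent_mechanism" not in rights:
        return [f"rights.consent_mechanism is required ({'; '.join(triggers)})"]
    return []


card = {"rights": {"contains_personal_data": "none", "child_data": True}}
print(check_consent(card))
# ['rights.consent_mechanism is required (Rule 6: child data present)']
```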
Rule 7: Not Available Requires Reason
Unavailable Dataset
IF access.availability is "not-available"
THEN access.not_available_reason is required
Valid Example:
{
"access": {
"availability": "not-available",
"not_available_reason": "Pending legal review, expected Q2 2025"
}
}
Rule 8: Restricted/On-Request Requires Instructions or URL
Restricted Access
IF access.availability is "restricted" or "on-request"
THEN either access.request_instructions or access.url is required
Valid Example (with instructions):
{
"access": {
"availability": "on-request",
"request_instructions": "Email data-access@example.org with research proposal"
}
}
Valid Example (with URL):
{
"access": {
"availability": "restricted",
"url": "https://example.org/apply-for-access"
}
}
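Rule 8 is an either/or requirement: one of the two fields is enough, and supplying both is also fine. A minimal Python sketch of that logic, using the hypothetical name `check_restricted_access`, could be:

```python
def check_restricted_access(card: dict) -> list[str]:
    """Return Rule 8 violations for a data card given as a dict."""
    access = card.get("access", {})
    if access.get("availability") not in ("restricted", "on-request"):
        return []  # the rule does not apply to other availability values

    # At least one way to request access must be documented.
    if "request_instructions" in access or "url" in access:
        return []
    return ["access.request_instructions or access.url is required"]


print(check_restricted_access({"access": {"availability": "on-request"}}))
# ['access.request_instructions or access.url is required']
print(check_restricted_access(
    {"access": {"availability": "restricted", "url": "https://example.org/apply-for-access"}}
))
# []
```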
Rule 9: Public Download Requires Download Location
Public Download
IF access.availability is "public-download"
THEN one of the following is required:
- access.url, OR
- artifacts.base_uri, OR
- artifacts.files[].uri
Valid Example (with access.url):
{
"access": {
"availability": "public-download",
"url": "https://huggingface.co/datasets/example/corpus"
}
}
Valid Example (with artifacts):
{
"access": {
"availability": "public-download"
},
"artifacts": {
"base_uri": "https://cdn.example.org/datasets/v1/",
"files": [
{
"path": "train.jsonl",
"sha256": "abc123...",
"size_bytes": 1000000,
"split": "train"
}
]
}
}
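Because Rule 9 spans two sections of the card, it helps to see the lookup spelled out. The sketch below is an approximation in Python: it assumes that `artifacts.files[].uri` is satisfied when at least one file entry carries a `uri` key, which is an interpretation and not something the schema text confirms. The function names are hypothetical.

```python
def download_locations(card: dict) -> list[str]:
    """List which of the three accepted download locations the card provides."""
    access = card.get("access", {})
    artifacts = card.get("artifacts", {})

    found = []
    if "url" in access:
        found.append("access.url")
    if "base_uri" in artifacts:
        found.append("artifacts.base_uri")
    # Assumption: one file entry with a `uri` key is enough.
    files = artifacts.get("files", [])
    if any(isinstance(entry, dict) and "uri" in entry for entry in files):
        found.append("artifacts.files[].uri")
    return found


def check_public_download(card: dict) -> list[str]:
    """Return Rule 9 violations for a data card given as a dict."""
    if card.get("access", {}).get("availability") != "public-download":
        return []
    if download_locations(card):
        return []
    return ["one of access.url, artifacts.base_uri, artifacts.files[].uri is required"]


card = {
    "access": {"availability": "public-download"},
    "artifacts": {
        "base_uri": "https://cdn.example.org/datasets/v1/",
        "files": [{"path": "train.jsonl", "split": "train"}],
    },
}
print(download_locations(card))     # ['artifacts.base_uri']
print(check_public_download(card))  # []
```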
Quick Reference Table
| Rule | IF condition | THEN required |
|---|---|---|
| 1 | data.kind in [“synthetic”, “hybrid”] | synthetic.generation_method, synthetic.share_of_dataset |
| 2 | data.record_format = “other” | data.record_format_notes |
| 3 | data.has_human_annotations = true | data.label_types |
| 4 | data.structures contains “translation-pairs” | data.languages minItems: 2 |
| 5 | rights.contains_personal_data != “none” | rights.consent_mechanism |
| 6 | rights.child_data = true | rights.consent_mechanism |
| 7 | access.availability = “not-available” | access.not_available_reason |
| 8 | access.availability in [“restricted”, “on-request”] | access.request_instructions OR access.url |
| 9 | access.availability = “public-download” | access.url OR artifacts.base_uri OR artifacts.files[].uri |
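The single-field rules in the table (2, 3, and 7) share one shape: a trigger value in a section makes another field in the same section mandatory. That shape lends itself to a table-driven check. The sketch below is illustrative only; `SIMPLE_RULES` and `check_simple_rules` are hypothetical names, and the error wording is not taken from the validator.

```python
# (rule number, section, trigger field, trigger value, required field)
SIMPLE_RULES = [
    (2, "data", "record_format", "other", "record_format_notes"),
    (3, "data", "has_human_annotations", True, "label_types"),
    (7, "access", "availability", "not-available", "not_available_reason"),
]


def check_simple_rules(card: dict) -> list[str]:
    """Return every violation of Rules 2, 3, and 7 found in the card."""
    errors = []
    for number, section, trigger, value, required in SIMPLE_RULES:
        fields = card.get(section, {})
        actual = fields.get(trigger)
        # Compare type as well as value so that 1 is not mistaken for true.
        triggered = type(actual) is type(value) and actual == value
        if triggered and required not in fields:
            errors.append(f"Rule {number}: {section}.{required} is required")
    return errors


card = {
    "data": {"record_format": "other", "has_human_annotations": False},
    "access": {"availability": "not-available"},
}
for message in check_simple_rules(card):
    print(message)
# Rule 2: data.record_format_notes is required
# Rule 7: access.not_available_reason is required
```

Collecting messages in a list, instead of stopping at the first failure, lets a single pass surface every violated rule.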
Validation Error Messages
When a conditional rule fails, the validator reports which required field is missing.
Using the Validator
Test your data card against these rules using the Validator tool. The validator will report all conditional rule violations at once.
See Also
- Validator Tool - Test your data card
- Quickstart - Minimal valid example
- Rights Section - Personal data requirements
- Access Section - Availability requirements
- Synthetic Section - Synthetic data requirements