Assets
The assets.jsonl file provides a centralized index for deduplicating media references across your dataset.
Why Use Assets?
When the same image, audio, or video appears in multiple records, referencing it by asset_id instead of repeating the full URI:
- Reduces file size - No repeated URIs
- Simplifies updates - Change URI in one place
- Enables validation - Centralized checksums
Configuration
Reference the asset index in metadata.json:
{
  "asset_index": {
    "path": "assets.jsonl",
    "format": "jsonl"
  }
}
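As a rough illustration of how a tool might consume this setting, the Python sketch below reads metadata.json and returns the location of the asset index. The helper name `asset_index_path` is hypothetical, and two behaviors are assumptions rather than documented rules: a relative `path` is resolved against the directory that holds metadata.json, and a missing `format` is treated as `jsonl`.

```python
import json
from pathlib import Path


def asset_index_path(metadata_file):
    """Return the asset index path named in metadata.json, or None if absent.

    Hypothetical helper. Assumes a relative path is resolved against the
    directory containing metadata.json.
    """
    metadata_file = Path(metadata_file)
    metadata = json.loads(metadata_file.read_text(encoding="utf-8"))
    entry = metadata.get("asset_index")
    if entry is None:
        return None  # dataset does not declare an asset index
    # Assumption: a missing "format" means "jsonl".
    if entry.get("format", "jsonl") != "jsonl":
        raise ValueError(f"unsupported asset index format: {entry['format']!r}")
    return metadata_file.parent / entry["path"]
```

Calling `asset_index_path("dataset/metadata.json")` with the configuration above would yield `dataset/assets.jsonl`.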
Asset Format
Each line in assets.jsonl is one asset:
{"asset_id":"img-001","uri":"s3://bucket/images/photo1.jpg","mime_type":"image/jpeg","sha256":"abc...","bytes":123456}
{"asset_id":"img-002","uri":"s3://bucket/images/photo2.png","mime_type":"image/png","sha256":"def...","bytes":234567}
{"asset_id":"aud-001","uri":"s3://bucket/audio/clip.wav","mime_type":"audio/wav","sha256":"ghi...","bytes":345678}
Asset Fields
| Field | Required | Description |
|---|---|---|
| asset_id | Yes | Unique identifier |
| uri | Yes | URI to the actual file |
| mime_type | No | MIME type |
| sha256 | No | SHA-256 hash |
| bytes | No | File size |
| metadata | No | Custom metadata |
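The field table can be turned into a small loader. The following Python sketch parses assets.jsonl lines into a dictionary keyed by asset_id and rejects entries that lack a required field or reuse an asset_id. The function name `load_asset_index` is hypothetical, and skipping blank lines is an assumption, not something the format description states.

```python
import json

REQUIRED_FIELDS = ("asset_id", "uri")  # the two fields marked "Yes" in the table


def load_asset_index(lines):
    """Parse assets.jsonl lines into a dict keyed by asset_id.

    Raises ValueError on a missing required field or a duplicate asset_id.
    """
    index = {}
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are tolerated
        asset = json.loads(line)
        for field in REQUIRED_FIELDS:
            if not asset.get(field):
                raise ValueError(f"line {lineno}: missing required field {field!r}")
        asset_id = asset["asset_id"]
        if asset_id in index:
            raise ValueError(f"line {lineno}: duplicate asset_id {asset_id!r}")
        index[asset_id] = asset
    return index


sample = [
    '{"asset_id":"img-001","uri":"s3://bucket/images/photo1.jpg","mime_type":"image/jpeg"}',
    '{"asset_id":"aud-001","uri":"s3://bucket/audio/clip.wav"}',
]
index = load_asset_index(sample)
print(index["img-001"]["uri"])  # s3://bucket/images/photo1.jpg
```

Because the function accepts any iterable of strings, an open file handle can be passed directly, for example `load_asset_index(open("assets.jsonl", encoding="utf-8"))`.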
Using Assets in Records
Reference assets by ID in content parts:
{
  "role": "user",
  "content": [
    { "type": "text", "text": "Describe this image:" },
    { "type": "image", "ref": { "asset_id": "img-001" } }
  ]
}
Direct URIs vs Asset References
You can also use direct URIs:
{ "type": "image", "ref": { "uri": "s3://bucket/one-off-image.jpg" } }
When to use each:
| Approach | Best For |
|---|---|
| asset_id | Repeated media, large datasets |
| Direct uri | One-off media, small datasets |
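A consumer usually needs a concrete URI whichever form a record uses. The Python sketch below handles both: it looks up an asset_id in an in-memory index (a dict keyed by asset_id) or returns a direct uri unchanged. The name `resolve_ref` is hypothetical, and checking asset_id before uri when both are present is an assumption made for this example.

```python
def resolve_ref(ref, asset_index):
    """Return the URI that a content part's ref points to.

    asset_index maps asset_id -> asset dict (one entry per assets.jsonl line).
    """
    if "asset_id" in ref:
        asset_id = ref["asset_id"]
        if asset_id not in asset_index:
            raise KeyError(f"unknown asset_id: {asset_id!r}")
        return asset_index[asset_id]["uri"]
    if "uri" in ref:
        return ref["uri"]  # direct URI, no index lookup needed
    raise ValueError("ref has neither asset_id nor uri")


asset_index = {
    "img-001": {"asset_id": "img-001", "uri": "s3://bucket/images/photo1.jpg"},
}
print(resolve_ref({"asset_id": "img-001"}, asset_index))
print(resolve_ref({"uri": "s3://bucket/one-off-image.jpg"}, asset_index))
```

Failing loudly on an unknown asset_id, rather than returning None, surfaces dangling references early.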
Validation
The schema requires content parts with media types (image, audio, video, document) to have a ref object with either asset_id or uri.
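The rule above can be approximated outside the schema with a few lines of Python. This is only a sketch of the stated requirement, not the schema itself: `check_content_part` is a hypothetical name, and the sketch accepts a ref that carries both asset_id and uri, which the schema may or may not allow.

```python
MEDIA_TYPES = {"image", "audio", "video", "document"}


def check_content_part(part):
    """Return a list of problems for one content part (empty if it passes)."""
    if part.get("type") not in MEDIA_TYPES:
        return []  # non-media parts such as text need no ref
    ref = part.get("ref")
    if not isinstance(ref, dict):
        return ["media part is missing a ref object"]
    if "asset_id" not in ref and "uri" not in ref:
        return ["ref must contain asset_id or uri"]
    return []


record = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe this image:"},
        {"type": "image", "ref": {"asset_id": "img-001"}},
        {"type": "audio"},
    ],
}
for position, part in enumerate(record["content"]):
    for problem in check_content_part(part):
        print(f"content[{position}]: {problem}")
```

In this example only the third part is reported, because it is a media part with no ref.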
Example assets.jsonl
{"asset_id":"img_000001","uri":"s3://my-bucket/images/cat.jpg","mime_type":"image/jpeg","sha256":"aaaa...","bytes":123456}
{"asset_id":"img_000002","uri":"s3://my-bucket/images/dog.png","mime_type":"image/png","sha256":"bbbb...","bytes":234567}
{"asset_id":"aud_000001","uri":"s3://my-bucket/audio/speech.wav","mime_type":"audio/wav","sha256":"cccc...","bytes":345678}
{"asset_id":"vid_000001","uri":"s3://my-bucket/video/demo.mp4","mime_type":"video/mp4","sha256":"dddd...","bytes":456789}
{"asset_id":"doc_000001","uri":"s3://my-bucket/docs/manual.pdf","mime_type":"application/pdf","sha256":"eeee...","bytes":567890}
Best Practices
- Consistent naming - Use prefixes like img_, aud_, vid_
- Include checksums - Enable integrity verification
- Track file sizes - Helps with storage planning
- Keep URIs stable - Avoid changing URIs after publishing
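To show what the checksum and size fields make possible, here is a minimal Python sketch that checks a local copy of a file against its asset entry. It assumes the file has already been downloaded to disk and that sha256 holds a hex-encoded digest; `verify_local_copy` is a hypothetical helper, and fields absent from the entry are simply skipped.

```python
import hashlib
import os


def verify_local_copy(path, asset, chunk_size=1 << 20):
    """Compare a local file with the asset's bytes and sha256, when present.

    Returns a list of problems; an empty list means every available check passed.
    """
    problems = []
    expected_size = asset.get("bytes")
    if expected_size is not None and os.path.getsize(path) != expected_size:
        problems.append("size mismatch")
    expected_hash = asset.get("sha256")
    if expected_hash is not None:
        digest = hashlib.sha256()
        with open(path, "rb") as fh:
            # Read in chunks so large media files are not loaded into memory at once.
            for chunk in iter(lambda: fh.read(chunk_size), b""):
                digest.update(chunk)
        # Assumption: the index stores the digest as a hex string.
        if digest.hexdigest() != expected_hash.lower():
            problems.append("sha256 mismatch")
    return problems
```

Checking the size first is cheap and catches truncated downloads; the hash then confirms the content itself.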