Assets

The assets.jsonl file provides a centralized index for deduplicating media references across your dataset.

Why Use Assets?

When the same image, audio, or video appears in multiple records, referencing it by asset_id instead of repeating the full URI:

  • Reduces file size - No repeated URIs
  • Simplifies updates - Change URI in one place
  • Enables validation - Centralized checksums

Configuration

Reference the asset index in metadata.json:

{
  "asset_index": {
    "path": "assets.jsonl",
    "format": "jsonl"
  }
}
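
As a rough illustration of how a loader might pick up this setting, the sketch below reads metadata.json and returns the location of the asset index. The function name asset_index_path is hypothetical. Two behaviours are assumptions rather than anything stated here: that a relative path is resolved against the directory containing metadata.json, and that a missing "format" defaults to "jsonl".

```python
import json
from pathlib import Path
from typing import Optional


def asset_index_path(metadata_file: str) -> Optional[Path]:
    """Return the asset index location declared in metadata.json, or None."""
    metadata_path = Path(metadata_file)
    metadata = json.loads(metadata_path.read_text(encoding="utf-8"))

    index = metadata.get("asset_index")
    if index is None:
        return None  # the dataset does not declare an asset index

    # Assumption: "jsonl" is the only format handled, and the default if omitted.
    fmt = index.get("format", "jsonl")
    if fmt != "jsonl":
        raise ValueError(f"unsupported asset index format: {fmt!r}")

    # Assumption: the path is relative to the directory holding metadata.json.
    return metadata_path.parent / index["path"]
```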

Asset Format

Each line in assets.jsonl is one asset:

{"asset_id":"img-001","uri":"s3://bucket/images/photo1.jpg","mime_type":"image/jpeg","sha256":"abc...","bytes":123456}
{"asset_id":"img-002","uri":"s3://bucket/images/photo2.png","mime_type":"image/png","sha256":"def...","bytes":234567}
{"asset_id":"aud-001","uri":"s3://bucket/audio/clip.wav","mime_type":"audio/wav","sha256":"ghi...","bytes":345678}

Asset Fields

Field      Required  Description
asset_id   Yes       Unique identifier
uri        Yes       URI to the actual file
mime_type  No        MIME type
sha256     No        SHA-256 hash
bytes      No        File size in bytes
metadata   No        Custom metadata
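
A minimal sketch of reading the index into memory, assuming nothing beyond the fields listed above. The helper name load_asset_index is hypothetical, and rejecting duplicate asset_id values and skipping blank lines are choices made for this example, not documented behaviour.

```python
import json
from typing import Dict, Iterable

# The two fields marked as required in the field list.
REQUIRED_FIELDS = ("asset_id", "uri")


def load_asset_index(lines: Iterable[str]) -> Dict[str, dict]:
    """Parse assets.jsonl lines into a dict keyed by asset_id."""
    index: Dict[str, dict] = {}
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are tolerated
        asset = json.loads(line)
        missing = [name for name in REQUIRED_FIELDS if name not in asset]
        if missing:
            raise ValueError(f"line {line_no}: missing required field(s) {missing}")
        asset_id = asset["asset_id"]
        if asset_id in index:
            # asset_id is described as unique, so a repeat is treated as an error.
            raise ValueError(f"line {line_no}: duplicate asset_id {asset_id!r}")
        index[asset_id] = asset
    return index
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, for example `load_asset_index(open("assets.jsonl", encoding="utf-8"))`.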

Using Assets in Records

Reference assets by ID in content parts:

{
  "role": "user",
  "content": [
    { "type": "text", "text": "Describe this image:" },
    { "type": "image", "ref": { "asset_id": "img-001" } }
  ]
}

Direct URIs vs Asset References

A content part can also point at a URI directly, without going through assets.jsonl:

{ "type": "image", "ref": { "uri": "s3://bucket/one-off-image.jpg" } }

When to use each:

Approach    Best For
asset_id    Repeated media, large datasets
Direct uri  One-off media, small datasets
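
Either reference form can be turned into a concrete URI by a single helper. The sketch below is illustrative only: resolve_uri is a hypothetical name, the index argument is assumed to be a dict mapping asset_id to the parsed asset entry, and preferring asset_id when a ref carries both keys is an assumption.

```python
from typing import Dict, Optional


def resolve_uri(part: dict, index: Dict[str, dict]) -> Optional[str]:
    """Return the URI a content part points to, or None if it has no ref."""
    ref = part.get("ref")
    if not ref:
        return None  # e.g. a text part

    # Assumption: asset_id takes precedence when both keys are present.
    if "asset_id" in ref:
        asset = index.get(ref["asset_id"])
        if asset is None:
            raise KeyError(f"unknown asset_id: {ref['asset_id']!r}")
        return asset["uri"]

    return ref.get("uri")


index = {"img-001": {"asset_id": "img-001", "uri": "s3://bucket/images/photo1.jpg"}}
by_id = {"type": "image", "ref": {"asset_id": "img-001"}}
direct = {"type": "image", "ref": {"uri": "s3://bucket/one-off-image.jpg"}}

print(resolve_uri(by_id, index))   # s3://bucket/images/photo1.jpg
print(resolve_uri(direct, index))  # s3://bucket/one-off-image.jpg
```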

Validation

The schema requires content parts with media types (image, audio, video, document) to have a ref object with either asset_id or uri.
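
The rule above could be checked before running full schema validation with something like the following sketch. The name check_part is hypothetical, and this is not the schema itself. It assumes that "either asset_id or uri" means at least one of the two, since the text does not say whether supplying both is an error.

```python
from typing import List

# Media part types named in the validation rule.
MEDIA_TYPES = {"image", "audio", "video", "document"}


def check_part(part: dict) -> List[str]:
    """Return a list of problems found in one content part (empty if none)."""
    if part.get("type") not in MEDIA_TYPES:
        return []  # the rule only applies to media parts

    ref = part.get("ref")
    if not isinstance(ref, dict):
        return ["media part is missing a ref object"]

    # Assumption: at least one of the two keys is enough; both is allowed.
    if "asset_id" not in ref and "uri" not in ref:
        return ["ref must contain asset_id or uri"]

    return []
```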

Example assets.jsonl

{"asset_id":"img_000001","uri":"s3://my-bucket/images/cat.jpg","mime_type":"image/jpeg","sha256":"aaaa...","bytes":123456}
{"asset_id":"img_000002","uri":"s3://my-bucket/images/dog.png","mime_type":"image/png","sha256":"bbbb...","bytes":234567}
{"asset_id":"aud_000001","uri":"s3://my-bucket/audio/speech.wav","mime_type":"audio/wav","sha256":"cccc...","bytes":345678}
{"asset_id":"vid_000001","uri":"s3://my-bucket/video/demo.mp4","mime_type":"video/mp4","sha256":"dddd...","bytes":456789}
{"asset_id":"doc_000001","uri":"s3://my-bucket/docs/manual.pdf","mime_type":"application/pdf","sha256":"eeee...","bytes":567890}

Best Practices

  1. Consistent naming - Use prefixes like img_, aud_, vid_
  2. Include checksums - Enable integrity verification
  3. Track file sizes - Helps with storage planning
  4. Keep URIs stable - Avoid changing URIs after publishing
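
Practices 2 and 3 pay off when a fetched file is compared against its index entry. The following is a sketch under stated assumptions: verify_asset is a hypothetical helper, the file has already been downloaded to a local path (fetching from s3:// is out of scope), and sha256 is assumed to be stored as a hex digest, which the field description does not specify.

```python
import hashlib
from pathlib import Path
from typing import List


def verify_asset(asset: dict, local_file: str) -> List[str]:
    """Compare a local copy of an asset with its bytes and sha256 fields."""
    problems: List[str] = []
    path = Path(local_file)

    # Both fields are optional, so each check runs only if the field is present.
    if "bytes" in asset:
        actual_size = path.stat().st_size
        if actual_size != asset["bytes"]:
            problems.append(f"size mismatch: expected {asset['bytes']}, got {actual_size}")

    if "sha256" in asset:
        digest = hashlib.sha256()
        with path.open("rb") as handle:
            # Read in 1 MiB chunks so large media files are not loaded at once.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
        # Assumption: the index stores a hex digest (case-insensitive).
        if digest.hexdigest() != asset["sha256"].lower():
            problems.append("sha256 mismatch")

    return problems
```

An empty list means every check that could run has passed. An entry with neither field also yields an empty list, so callers who need strict guarantees may want to treat missing checksums separately.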