Multimodal Data
LLM Training Data records can carry multimodal content (images, audio, video, and documents) as typed content parts alongside text.
Supported Media Types
| Type | Content Part | Common Formats |
|---|---|---|
| Images | image | JPEG, PNG, WebP, GIF |
| Audio | audio | WAV, MP3, FLAC, OGG |
| Video | video | MP4, WebM, AVI |
| Documents | document | PDF, DOCX |
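As a rough illustration of how the table above might be used during validation, the sketch below checks that a content part's type agrees with its MIME type. The `EXPECTED_MIME_PREFIX` table and the `mime_matches_type` helper are hypothetical names introduced here, and the prefix rule is an assumption rather than something the format prescribes.

```python
# Hypothetical lookup: content part type -> expected MIME top-level prefix.
# PDF and DOCX both use "application/..." MIME types.
EXPECTED_MIME_PREFIX = {
    "image": "image/",
    "audio": "audio/",
    "video": "video/",
    "document": "application/",
}


def mime_matches_type(part_type: str, mime_type: str) -> bool:
    """Return True if mime_type is plausible for the given content part type."""
    prefix = EXPECTED_MIME_PREFIX.get(part_type)
    return prefix is not None and mime_type.startswith(prefix)
```

A check like this only catches gross mismatches (for example an `audio` part labeled `video/mp4`); it does not inspect the file contents.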
Basic Multimodal Record
{
  "id": "mm-001",
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "What's in this image?" },
        { "type": "image", "ref": { "asset_id": "img-001" } }
      ]
    },
    {
      "role": "assistant",
      "content": [
        { "type": "text", "text": "The image shows a golden retriever playing in a park." }
      ]
    }
  ]
}
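To show how a consumer might walk such a record, here is a minimal Python sketch that yields every non-text content part together with the role of the message it belongs to. The `media_parts` helper is a hypothetical name, and the sketch assumes `content` is always a list of parts as in the record above.

```python
record = {
    "id": "mm-001",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image", "ref": {"asset_id": "img-001"}},
            ],
        },
        {
            "role": "assistant",
            "content": [
                {"type": "text", "text": "The image shows a golden retriever playing in a park."},
            ],
        },
    ],
}


def media_parts(record):
    """Yield (role, part) for every content part that is not plain text."""
    for message in record["messages"]:
        for part in message["content"]:
            if part["type"] != "text":
                yield message["role"], part


found = list(media_parts(record))
```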
Asset Management
For large datasets, list each media file once in assets.jsonl and reference it from records by asset_id, so repeated media is described only once:
assets.jsonl:
{"asset_id":"img-001","uri":"s3://bucket/images/dog.jpg","mime_type":"image/jpeg","sha256":"abc...","bytes":123456}
{"asset_id":"img-002","uri":"s3://bucket/images/cat.png","mime_type":"image/png","sha256":"def...","bytes":234567}
Record:
{
  "type": "image",
  "ref": { "asset_id": "img-001" }
}
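One way a loader might resolve these references is sketched below. `load_assets` and `resolve_ref` are hypothetical helper names; the sample lines are copied from the assets.jsonl example above, including its placeholder digests. The fallback for direct URIs assumes the metadata sits on the content part itself.

```python
import json

# Sample lines copied from the assets.jsonl example (digests are placeholders).
SAMPLE_ASSETS = [
    '{"asset_id":"img-001","uri":"s3://bucket/images/dog.jpg","mime_type":"image/jpeg","sha256":"abc...","bytes":123456}',
    '{"asset_id":"img-002","uri":"s3://bucket/images/cat.png","mime_type":"image/png","sha256":"def...","bytes":234567}',
]


def load_assets(lines):
    """Index assets.jsonl lines by asset_id, skipping blank lines."""
    assets = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        assets[entry["asset_id"]] = entry
    return assets


def resolve_ref(part, assets):
    """Return the metadata for a media part; raises KeyError for unknown asset_ids."""
    ref = part["ref"]
    if "asset_id" in ref:
        return assets[ref["asset_id"]]
    # Direct URI: collect whatever metadata the part carries inline.
    meta = {key: part[key] for key in ("mime_type", "sha256", "bytes") if key in part}
    return {"uri": ref["uri"], **meta}


assets = load_assets(SAMPLE_ASSETS)
dog = resolve_ref({"type": "image", "ref": {"asset_id": "img-001"}}, assets)
```

Letting an unknown `asset_id` raise `KeyError` makes dangling references visible early instead of silently dropping media.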
Direct URI References
For one-off media, reference the URI directly and put mime_type, sha256, and bytes on the content part itself:
{
  "type": "image",
  "ref": { "uri": "s3://bucket/unique-image.jpg" },
  "mime_type": "image/jpeg",
  "sha256": "ghi...",
  "bytes": 345678
}
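The `sha256` and `bytes` fields can be derived from the file contents. The sketch below is one possible way to build such a part in Python; `describe_media` is a hypothetical helper, and it assumes `sha256` is a lowercase hex digest and `bytes` is the file size in bytes, which the example above suggests but does not state.

```python
import hashlib


def describe_media(data: bytes, uri: str, mime_type: str, part_type: str = "image") -> dict:
    """Build a direct-URI content part with checksum and size computed from data."""
    return {
        "type": part_type,
        "ref": {"uri": uri},
        "mime_type": mime_type,
        "sha256": hashlib.sha256(data).hexdigest(),  # assumed: lowercase hex digest
        "bytes": len(data),                          # assumed: size in bytes
    }


# Usage with in-memory bytes; a real pipeline would read the file behind the URI.
part = describe_media(b"hello", "s3://bucket/unique-image.jpg", "image/jpeg")
```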
Audio Transcription Example
{
  "id": "audio-001",
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "Transcribe this audio clip:" },
        { "type": "audio", "ref": { "asset_id": "aud-001" }, "mime_type": "audio/wav" }
      ]
    },
    {
      "role": "assistant",
      "content": [
        { "type": "text", "text": "The speaker says: 'Hello and welcome to our podcast.'" }
      ]
    }
  ]
}
Video Understanding Example
{
  "id": "video-001",
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "Describe what happens in this video:" },
        { "type": "video", "ref": { "asset_id": "vid-001" }, "mime_type": "video/mp4" }
      ]
    },
    {
      "role": "assistant",
      "content": [
        { "type": "text", "text": "The video shows a chef preparing a salad. First, they wash the lettuce, then chop tomatoes and cucumbers, and finally toss everything with dressing." }
      ]
    }
  ]
}
Document Analysis Example
{
  "id": "doc-001",
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "What is the main topic of this document?" },
        { "type": "document", "ref": { "asset_id": "doc-001" }, "mime_type": "application/pdf" }
      ]
    },
    {
      "role": "assistant",
      "content": [
        { "type": "text", "text": "This document is a quarterly financial report covering revenue, expenses, and profit margins for Q3 2024." }
      ]
    }
  ]
}
Multiple Media in One Message
{
  "role": "user",
  "content": [
    { "type": "text", "text": "Compare these two images:" },
    { "type": "image", "ref": { "asset_id": "img-before" } },
    { "type": "image", "ref": { "asset_id": "img-after" } }
  ]
}
Interleaved Text and Media
{
  "role": "user",
  "content": [
    { "type": "text", "text": "Here's the first chart:" },
    { "type": "image", "ref": { "asset_id": "chart-1" } },
    { "type": "text", "text": "And here's the second:" },
    { "type": "image", "ref": { "asset_id": "chart-2" } },
    { "type": "text", "text": "What trends do you see?" }
  ]
}
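Because the order of parts carries meaning in an interleaved message, a consumer has to preserve it. The following sketch flattens a content list into a single string, replacing each media part with a placeholder at its original position. The `render_content` name and the `<type:id>` placeholder syntax are illustrative assumptions, not part of the format.

```python
def render_content(content):
    """Join content parts in order, substituting a placeholder for each media part."""
    pieces = []
    for part in content:
        if part["type"] == "text":
            pieces.append(part["text"])
        else:
            ref = part["ref"]
            target = ref.get("asset_id") or ref.get("uri")
            pieces.append(f"<{part['type']}:{target}>")  # hypothetical placeholder syntax
    return "\n".join(pieces)


message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Here's the first chart:"},
        {"type": "image", "ref": {"asset_id": "chart-1"}},
        {"type": "text", "text": "And here's the second:"},
        {"type": "image", "ref": {"asset_id": "chart-2"}},
        {"type": "text", "text": "What trends do you see?"},
    ],
}

rendered = render_content(message["content"])
```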
Best Practices
- Optimize file sizes - Resize and compress media to what the training task needs
- Use consistent formats - Standardize on a few formats per media type
- Include MIME types - Set mime_type so processors know how to decode each file
- Add checksums - Record a sha256 for each file so its integrity can be verified
- Deduplicate - List repeated media once in assets.jsonl and reference it by asset_id
- Balance modalities - Ensure good coverage across media types
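The checksum and deduplication practices can work together. As a minimal sketch, the function below groups direct-URI parts by `sha256` and rewrites them as `asset_id` references, so each distinct file gets a single asset entry. The `deduplicate` helper and the `asset-001` naming scheme are hypothetical, and the sketch assumes every input part already carries `mime_type`, `sha256`, and `bytes`.

```python
def deduplicate(parts):
    """Return (asset_entries, rewritten_parts), with one asset per distinct sha256."""
    assets = {}  # sha256 -> asset entry, in first-seen order
    rewritten = []
    for part in parts:
        digest = part["sha256"]
        if digest not in assets:
            assets[digest] = {
                "asset_id": f"asset-{len(assets) + 1:03d}",  # hypothetical naming scheme
                "uri": part["ref"]["uri"],
                "mime_type": part["mime_type"],
                "sha256": digest,
                "bytes": part["bytes"],
            }
        rewritten.append(
            {"type": part["type"], "ref": {"asset_id": assets[digest]["asset_id"]}}
        )
    return list(assets.values()), rewritten


# Usage: two parts share a digest, so they collapse into one asset entry.
parts = [
    {"type": "image", "ref": {"uri": "s3://bucket/a.jpg"}, "mime_type": "image/jpeg", "sha256": "aaa", "bytes": 10},
    {"type": "image", "ref": {"uri": "s3://bucket/copy-of-a.jpg"}, "mime_type": "image/jpeg", "sha256": "aaa", "bytes": 10},
    {"type": "image", "ref": {"uri": "s3://bucket/b.png"}, "mime_type": "image/png", "sha256": "bbb", "bytes": 20},
]
asset_entries, rewritten_parts = deduplicate(parts)
```

When duplicates are found, this sketch keeps the URI of the first occurrence; a real pipeline might prefer a canonical storage location instead.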