trinity.common.models.mm_utils module#
Utilities for processing multi-modal data (images/videos) for specific vision-language models.
Supported models:

- Qwen2.5-VL, Qwen3-VL series
- Kimi VL series
- GLM VL series
Provides functions to:

1. Parse prompts with media tags (<image>/<video>)
2. Validate multi-modal content in conversations
3. Preprocess media inputs for inference/training
4. Construct model-compatible message formats
Note
Only processors whose class names contain "Processor" together with one of "Qwen", "Kimi", or "Glm" are supported. Relies on qwen_vl_utils.process_vision_info for media extraction.
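The support rule above can be sketched as a simple class-name test. The helper and dummy classes below are hypothetical illustrations, not part of the module's public API:

```python
def is_supported_processor(processor: object) -> bool:
    """Sketch of the documented rule: the class name must contain
    "Processor" plus one of "Qwen", "Kimi", or "Glm"."""
    name = type(processor).__name__
    return "Processor" in name and any(
        family in name for family in ("Qwen", "Kimi", "Glm")
    )


class Qwen2_5_VLProcessor:  # dummy stand-in for a supported processor
    pass


class BertTokenizer:  # class name matches neither pattern
    pass
```

Processors failing this check cause the module's functions to raise NotImplementedError.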
- trinity.common.models.mm_utils.build_multi_modal_data(processor: Any, messages: List[Dict]) Dict[str, Any][source]#
Extract and preprocess vision inputs from multi-modal messages for vLLM inference.
Processes messages containing image/video placeholders using model-specific vision utilities. Returns structured media inputs compatible with vLLM’s multi-modal API.
- Parameters:
processor – Vision-language processor instance (class name must contain "Qwen", "Kimi", or "Glm", plus "Processor").
messages – List of conversation messages in model-expected format. Each message’s “content” may be a string or list of content items (text/image/video dictionaries).
- Returns:
Dictionary containing processed media inputs with keys:
"image": List of processed image objects (if images exist)
"video": List of processed video objects (if videos exist)
Keys are omitted when no corresponding media is present.
- Return type:
Dict[str, Any]
- Raises:
NotImplementedError – If processor class name doesn’t match supported patterns.
ImportError – If required qwen_vl_utils module is unavailable.
Example
>>> messages = [{"role": "user", "content": [
...     {"type": "image", "image": "img.jpg"},
...     {"type": "text", "text": "Describe this"}]}]
>>> build_multi_modal_data(processor, messages)
{"image": [processed_image]}
- trinity.common.models.mm_utils.build_mm_input_for_training(processor: Any, prompt: str, multi_modal_data: Dict[str, List]) Dict[str, Any][source]#
Tokenize prompt and integrate processed media inputs for model training.
Combines text prompt with preprocessed image/video data into model-ready tensor inputs. Handles padding and tensor conversion for training workflows.
- Parameters:
processor – Vision-language processor instance (class name must contain "Qwen", "Kimi", or "Glm", plus "Processor").
prompt – Plain text prompt WITHOUT media tags (e.g., "Describe this image"). Media placement is handled via multi_modal_data, not prompt tags.
multi_modal_data – Dictionary from build_multi_modal_data() containing {"image": [...], "video": [...]} (both keys optional).
- Returns:
Dictionary of model inputs including:
input_ids: Tokenized prompt IDs
attention_mask: Attention mask tensor
pixel_values: Processed image tensors (if images provided)
pixel_values_videos: Processed video tensors (if videos provided)
All tensors are returned in PyTorch format (return_tensors="pt").
- Return type:
Dict[str, Any]
- Raises:
NotImplementedError – If processor class name doesn’t match supported patterns.
ValueError – Raised by the underlying processor if media counts do not match the placeholders expected by the prompt.
Note
Prompt should NOT contain <image>/<video> tags here. Media association is managed through the structured multi_modal_data dictionary.
- trinity.common.models.mm_utils.build_mm_message(prompt: str, images: List[str | Any], videos: List[str | Any]) Dict[str, Any][source]#
Construct multi-modal message by injecting media references at tag positions in prompt.
Parses prompt for <image>/<video> tags, replaces them with corresponding media references, and handles surplus media items. Extra media (beyond tag count) is prepended to content.
- Parameters:
prompt – Text containing optional <image> and <video> tags as media placeholders. Example: "First <image> then <video> and finally <image>"
images – List of image references (file paths, URLs, or PIL images) in order of appearance.
videos – List of video references (file paths, URLs) in order of appearance.
- Returns:
Message dictionary formatted for VL models:
{
    "role": "user",
    "content": [
        {"type": "image", "image": ...},  # Surplus media first
        {"type": "video", "video": ...},
        {"type": "text", "text": "First "},
        {"type": "image", "image": ...},  # Tag-replaced media
        ...
    ]
}
- Return type:
Dict[str, Any]
- Raises:
ValueError – If prompt contains more <image> tags than provided images, or more <video> tags than provided videos.
- Behavior details:
Tags are case-sensitive and must be exact: "<image>", "<video>"
Empty text segments between tags are omitted
Surplus media (images/videos beyond tag count) appears at START of content list
Text segments preserve original prompt ordering around tags
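The behavior described above can be condensed into a self-contained sketch. This is an illustrative reimplementation under the documented rules (surplus media prepended, exact case-sensitive tags, empty text segments dropped), not the library source:

```python
import re
from typing import Any, Dict, List


def build_mm_message_sketch(
    prompt: str, images: List[Any], videos: List[Any]
) -> Dict[str, Any]:
    """Sketch of the documented tag-injection behavior."""
    n_img_tags = prompt.count("<image>")
    n_vid_tags = prompt.count("<video>")
    if n_img_tags > len(images):
        raise ValueError("more <image> tags than provided images")
    if n_vid_tags > len(videos):
        raise ValueError("more <video> tags than provided videos")

    content: List[Dict[str, Any]] = []
    # Surplus media (beyond tag count) goes at the START of content.
    for img in images[n_img_tags:]:
        content.append({"type": "image", "image": img})
    for vid in videos[n_vid_tags:]:
        content.append({"type": "video", "video": vid})

    img_iter = iter(images[:n_img_tags])
    vid_iter = iter(videos[:n_vid_tags])
    # Split on exact, case-sensitive tags; the capturing group keeps them.
    for piece in re.split(r"(<image>|<video>)", prompt):
        if piece == "<image>":
            content.append({"type": "image", "image": next(img_iter)})
        elif piece == "<video>":
            content.append({"type": "video", "video": next(vid_iter)})
        elif piece:  # drop empty text segments between adjacent tags
            content.append({"type": "text", "text": piece})
    return {"role": "user", "content": content}
```

For example, `build_mm_message_sketch("First <image> then <video>", ["a.jpg", "b.jpg"], ["v.mp4"])` places the surplus image `b.jpg` first, then interleaves text segments with `a.jpg` and `v.mp4` in prompt order.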
- trinity.common.models.mm_utils.has_multi_modal_content(messages: List[Dict]) bool[source]#
Check if any message contains non-text (image/video) content.
Inspects message content structure to detect multi-modal elements. Handles both:

- String content (text-only, returns False)
- List content (multi-modal candidates)
- Parameters:
messages – List of conversation messages. Each message must contain a "content" field. Content may be:
- str: Plain text message
- List[Dict]: Multi-modal content items (each with "type" key)
- Returns:
True if any message contains at least one non-text content item (type != "text"), False otherwise.
Example
>>> msg = [{"role": "user", "content": [
...     {"type": "text", "text": "Hi"},
...     {"type": "image", "image": "..."}]}]
>>> has_multi_modal_content(msg)
True
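The detection logic is small enough to sketch directly. The following is an illustrative reimplementation of the documented behavior, not the library source:

```python
from typing import Dict, List


def has_multi_modal_content_sketch(messages: List[Dict]) -> bool:
    """Return True if any message holds a non-text content item."""
    for message in messages:
        content = message["content"]
        if isinstance(content, str):
            continue  # plain-text message, nothing multi-modal
        # List content: any item whose "type" is not "text" counts.
        if any(item.get("type") != "text" for item in content):
            return True
    return False
```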