trinity.common.models.mm_utils module#

Utilities for processing multi-modal data (images/videos) for specific vision-language models.

Supported models:

  • Qwen2.5-VL, Qwen3-VL series

  • Kimi VL series

  • GLM VL series

Provides functions to:

  1. Parse prompts with media tags (<image>/<video>)

  2. Validate multi-modal content in conversations

  3. Preprocess media inputs for inference/training

  4. Construct model-compatible message formats

Note

Only processors whose class name contains “Processor” and one of “Qwen”, “Kimi”, or “Glm” are supported. Relies on qwen_vl_utils.process_vision_info for media extraction.

trinity.common.models.mm_utils.is_qwen_like_processor(processor: Any) bool[source]#
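The supported-processor check described in the note above amounts to a class-name match. A minimal sketch (looks_supported is a hypothetical stand-in for illustration; the real check is is_qwen_like_processor and may differ in detail):

```python
# Sketch of the class-name check described above. `looks_supported` is
# a hypothetical stand-in; the real is_qwen_like_processor in
# trinity.common.models.mm_utils may differ in detail.
def looks_supported(processor: object) -> bool:
    name = type(processor).__name__
    # Must contain "Processor" AND one of the supported family names.
    return "Processor" in name and any(
        family in name for family in ("Qwen", "Kimi", "Glm")
    )
```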
trinity.common.models.mm_utils.build_multi_modal_data(processor: Any, messages: List[Dict]) Dict[str, Any][source]#

Extract and preprocess vision inputs from multi-modal messages for vLLM inference.

Processes messages containing image/video placeholders using model-specific vision utilities. Returns structured media inputs compatible with vLLM’s multi-modal API.

Parameters:
  • processor – Vision-language processor instance; its class name must contain “Processor” and one of “Qwen”, “Kimi”, or “Glm”.

  • messages – List of conversation messages in model-expected format. Each message’s “content” may be a string or list of content items (text/image/video dictionaries).

Returns:

  Dictionary containing processed media inputs with keys:

  • “image”: List of processed image objects (if images exist)

  • “video”: List of processed video objects (if videos exist)

  Keys are omitted when no corresponding media is present.

Return type:

  Dict[str, Any]

Raises:
  • NotImplementedError – If processor class name doesn’t match supported patterns.

  • ImportError – If required qwen_vl_utils module is unavailable.

Example

>>> messages = [{"role": "user", "content": [{"type": "image", "image": "img.jpg"}, {"type": "text", "text": "Describe this"}]}]
>>> build_multi_modal_data(processor, messages)
{"image": [processed_image]}
trinity.common.models.mm_utils.build_mm_input_for_training(processor: Any, prompt: str, multi_modal_data: Dict[str, List]) Dict[str, Any][source]#

Tokenize prompt and integrate processed media inputs for model training.

Combines text prompt with preprocessed image/video data into model-ready tensor inputs. Handles padding and tensor conversion for training workflows.

Parameters:
  • processor – Vision-language processor instance; its class name must contain “Processor” and one of “Qwen”, “Kimi”, or “Glm”.

  • prompt – Plain text prompt WITHOUT media tags (e.g., “Describe this image”). Media placement is handled via multi_modal_data, not prompt tags.

  • multi_modal_data – Dictionary from build_multi_modal_data() containing: {“image”: […], “video”: […]} (keys optional)

Returns:

  Dictionary of model inputs including:

  • input_ids: Tokenized prompt IDs

  • attention_mask: Attention mask tensor

  • pixel_values: Processed image tensors (if images provided)

  • pixel_values_videos: Processed video tensors (if videos provided)

  All tensors are converted to PyTorch format (return_tensors="pt").

Return type:

  Dict[str, Any]

Raises:
  • NotImplementedError – If processor class name doesn’t match supported patterns.

  • ValueError – If the number of media items does not match what the prompt expects (raised internally by the processor).

Note

Prompt should NOT contain <image>/<video> tags here. Media association is managed through the structured multi_modal_data dictionary.

trinity.common.models.mm_utils.build_mm_message(prompt: str, images: List[str | Any], videos: List[str | Any]) Dict[str, Any][source]#

Construct multi-modal message by injecting media references at tag positions in prompt.

Parses prompt for <image>/<video> tags, replaces them with corresponding media references, and handles surplus media items. Extra media (beyond tag count) is prepended to content.

Parameters:
  • prompt – Text containing optional <image> and <video> tags as media placeholders. Example: “First <image> then <video> and finally <image>”

  • images – List of image references (file paths, URLs, or PIL images) in order of appearance.

  • videos – List of video references (file paths, URLs) in order of appearance.

Returns:

  Message dictionary formatted for VL models, e.g.:

  {
      "role": "user",
      "content": [
          {"type": "image", "image": ...},  # surplus media first
          {"type": "video", "video": ...},
          {"type": "text", "text": "First "},
          {"type": "image", "image": ...},  # tag-replaced media
          ...
      ],
  }

Return type:

  Dict[str, Any]

Raises:

ValueError – If prompt contains more <image> tags than provided images, or more <video> tags than provided videos.

Behavior details:
  • Tags are case-sensitive and must be exact: “<image>”, “<video>”

  • Empty text segments between tags are omitted

  • Surplus media (images/videos beyond tag count) appears at START of content list

  • Text segments preserve original prompt ordering around tags

trinity.common.models.mm_utils.has_multi_modal_content(messages: List[Dict]) bool[source]#

Check if any message contains non-text (image/video) content.

Inspects message content structure to detect multi-modal elements. Handles both:

  • String content (text-only, returns False)

  • List content (multi-modal candidates)

Parameters:

messages – List of conversation messages. Each message must contain a “content” field. Content may be:

  • str: Plain text message

  • List[Dict]: Multi-modal content items (each with “type” key)

Returns:

True if any message contains at least one non-text content item (type != “text”), False otherwise.

Example

>>> msg = [{"role": "user", "content": [{"type": "text", "text": "Hi"}, {"type": "image", "image": "..."}]}]
>>> has_multi_modal_content(msg)
True
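The check described above reduces to scanning content lists for any non-text item. A minimal sketch (has_multi_modal_content_sketch is a hypothetical stand-in; the real function may differ in detail):

```python
from typing import Dict, List

def has_multi_modal_content_sketch(messages: List[Dict]) -> bool:
    """Illustrative version of the documented check."""
    for message in messages:
        content = message["content"]
        if isinstance(content, str):
            continue  # plain-text message, nothing multi-modal
        for item in content:
            if item.get("type") != "text":
                return True  # found an image/video (non-text) item
    return False
```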