trinity.common.models.mm_utils module#

Utilities for processing multi-modal data (images/videos) for specific vision-language models.

Supported models:

  • Qwen2.5-VL, Qwen3-VL series

  • Kimi VL series

  • GLM VL series

Provides functions to:

  1. Parse prompts with media tags (<image>/<video>)

  2. Validate multi-modal content in conversations

  3. Preprocess media inputs for inference/training

  4. Construct model-compatible message formats

Note

Only processors whose class names contain "Processor" and one of "Qwen", "Kimi", or "Glm" are supported. Relies on qwen_vl_utils.process_vision_info for media extraction.

trinity.common.models.mm_utils.is_qwen_like_processor(processor: Any) bool[source]#
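
The class-name check described in the note above can be sketched as follows. This is a hypothetical re-implementation for illustration only (`is_qwen_like_processor_sketch` and the stand-in `Qwen2_5_VLProcessor` class are not part of the module):

```python
from typing import Any


def is_qwen_like_processor_sketch(processor: Any) -> bool:
    # Hypothetical sketch of the check: the class name must contain
    # "Processor" and one of "Qwen", "Kimi", or "Glm".
    name = type(processor).__name__
    return "Processor" in name and any(k in name for k in ("Qwen", "Kimi", "Glm"))


class Qwen2_5_VLProcessor:
    """Stand-in class; only its name matters for the check."""


accepted = is_qwen_like_processor_sketch(Qwen2_5_VLProcessor())
rejected = is_qwen_like_processor_sketch(object())
```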
trinity.common.models.mm_utils.build_multi_modal_data(processor: Any, messages: List[Dict]) Dict[str, Any][source]#

Extract and preprocess vision inputs from multi-modal messages for vLLM inference.

Processes messages containing image/video placeholders using model-specific vision utilities. Returns structured media inputs compatible with vLLM's multi-modal API.

Parameters:
  • processor -- Vision-language processor instance (class name must contain "Processor" and one of "Qwen", "Kimi", or "Glm").

  • messages -- List of conversation messages in model-expected format. Each message's "content" may be a string or list of content items (text/image/video dictionaries).

Returns:

  Dictionary containing processed media inputs with keys:

  • "image": List of processed image objects (if images exist)

  • "video": List of processed video objects (if videos exist)

  Keys are omitted when no corresponding media is present.

Return type:

  Dict[str, Any]

Raises:
  • NotImplementedError -- If processor class name doesn't match supported patterns.

  • ImportError -- If required qwen_vl_utils module is unavailable.

Example

>>> messages = [{"role": "user", "content": [{"type": "image", "image": "img.jpg"}, {"type": "text", "text": "Describe this"}]}]
>>> build_multi_modal_data(processor, messages)
{"image": [processed_image]}
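
The key-omission behavior described above can be sketched without the real extractor. In this hypothetical sketch, `extract` stands in for qwen_vl_utils.process_vision_info (which returns a pair of image and video inputs); the stub extractor and `build_multi_modal_data_sketch` are illustrative names, not the module's API:

```python
from typing import Any, Dict, List


def build_multi_modal_data_sketch(messages: List[Dict], extract) -> Dict[str, Any]:
    # Hypothetical sketch: `extract` plays the role of
    # qwen_vl_utils.process_vision_info, returning (image_inputs, video_inputs).
    image_inputs, video_inputs = extract(messages)
    data: Dict[str, Any] = {}
    if image_inputs:
        data["image"] = image_inputs  # key omitted entirely when no images
    if video_inputs:
        data["video"] = video_inputs  # key omitted entirely when no videos
    return data


def _stub_extract(messages):
    # Toy extractor so the sketch runs without qwen_vl_utils installed.
    images = []
    for m in messages:
        for item in m["content"]:
            if isinstance(item, dict) and item.get("type") == "image":
                images.append(item["image"])
    return images, None


messages = [{"role": "user", "content": [{"type": "image", "image": "img.jpg"},
                                         {"type": "text", "text": "Describe this"}]}]
result = build_multi_modal_data_sketch(messages, _stub_extract)
```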
trinity.common.models.mm_utils.build_mm_input_for_training(processor: Any, prompt: str, multi_modal_data: Dict[str, List]) Dict[str, Any][source]#

Tokenize prompt and integrate processed media inputs for model training.

Combines text prompt with preprocessed image/video data into model-ready tensor inputs. Handles padding and tensor conversion for training workflows.

Parameters:
  • processor -- Vision-language processor instance (class name must contain "Processor" and one of "Qwen", "Kimi", or "Glm").

  • prompt -- Plain text prompt WITHOUT media tags (e.g., "Describe this image"). Media placement is handled via multi_modal_data, not prompt tags.

  • multi_modal_data -- Dictionary from build_multi_modal_data() containing: {"image": [...], "video": [...]} (keys optional)

Returns:

  Dictionary of model inputs including:

  • input_ids: Tokenized prompt IDs

  • attention_mask: Attention mask tensor

  • pixel_values: Processed image tensors (if images provided)

  • pixel_values_videos: Processed video tensors (if videos provided)

  All tensors are converted to PyTorch format (return_tensors="pt").

Return type:

  Dict[str, Any]

Raises:
  • NotImplementedError -- If processor class name doesn't match supported patterns.

  • ValueError -- If media counts do not match prompt expectations (raised by the underlying processor).

Note

Prompt should NOT contain <image>/<video> tags here. Media association is managed through the structured multi_modal_data dictionary.
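
The flow described above can be sketched with a toy processor. This is a minimal sketch assuming an HF-style processor call signature (text/images/videos keyword arguments with return_tensors="pt"); `build_mm_input_for_training_sketch` and `_ToyProcessor` are hypothetical names, not the module's actual implementation:

```python
from typing import Any, Dict, List


def build_mm_input_for_training_sketch(
    processor: Any, prompt: str, multi_modal_data: Dict[str, List]
) -> Dict[str, Any]:
    # Hypothetical sketch: forward the plain-text prompt and the preprocessed
    # media from build_multi_modal_data() to the processor in one call.
    return processor(
        text=[prompt],
        images=multi_modal_data.get("image"),
        videos=multi_modal_data.get("video"),
        padding=True,
        return_tensors="pt",
    )


class _ToyProcessor:
    """Toy stand-in so the sketch runs without model weights."""

    def __call__(self, text, images=None, videos=None, padding=True,
                 return_tensors="pt"):
        out = {"input_ids": [[1, 2, 3]], "attention_mask": [[1, 1, 1]]}
        if images:
            out["pixel_values"] = ["<image tensor>"] * len(images)
        if videos:
            out["pixel_values_videos"] = ["<video tensor>"] * len(videos)
        return out


inputs = build_mm_input_for_training_sketch(
    _ToyProcessor(), "Describe this image", {"image": ["img.jpg"]}
)
```

Note that the prompt passed here carries no <image>/<video> tags; media association comes entirely from the multi_modal_data dictionary.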

trinity.common.models.mm_utils.build_mm_message(prompt: str, images: List[str | Any], videos: List[str | Any]) Dict[str, Any][source]#

Construct multi-modal message by injecting media references at tag positions in prompt.

Parses prompt for <image>/<video> tags, replaces them with corresponding media references, and handles surplus media items. Extra media (beyond tag count) is prepended to content.

Parameters:
  • prompt -- Text containing optional <image> and <video> tags as media placeholders. Example: "First <image> then <video> and finally <image>"

  • images -- List of image references (file paths, URLs, or PIL images) in order of appearance.

  • videos -- List of video references (file paths, URLs) in order of appearance.

Returns:

  Message dictionary formatted for VL models:

  {
      "role": "user",
      "content": [
          {"type": "image", "image": ...},  # Surplus media first
          {"type": "video", "video": ...},
          {"type": "text", "text": "First "},
          {"type": "image", "image": ...},  # Tag-replaced media
          ...
      ]
  }

Return type:

  Dict[str, Any]

Raises:

ValueError -- If prompt contains more <image> tags than provided images, or more <video> tags than provided videos.

Behavior details:
  • Tags are case-sensitive and must be exact: "<image>", "<video>"

  • Empty text segments between tags are omitted

  • Surplus media (images/videos beyond tag count) appears at START of content list

  • Text segments preserve original prompt ordering around tags
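
The behavior details above (tag replacement, surplus prepending, empty-segment omission) can be sketched as follows. This is a hypothetical re-implementation for illustration; `build_mm_message_sketch` is not the module's actual code:

```python
import re
from typing import Any, Dict, List


def build_mm_message_sketch(prompt: str, images: List[Any],
                            videos: List[Any]) -> Dict[str, Any]:
    # Hypothetical sketch of the documented behavior.
    parts = re.split(r"(<image>|<video>)", prompt)  # tags are exact and case-sensitive
    n_img, n_vid = parts.count("<image>"), parts.count("<video>")
    if n_img > len(images) or n_vid > len(videos):
        raise ValueError("prompt contains more media tags than provided media")
    content: List[Dict[str, Any]] = []
    # Surplus media (beyond tag count) goes at the START of the content list.
    content += [{"type": "image", "image": im} for im in images[n_img:]]
    content += [{"type": "video", "video": vd} for vd in videos[n_vid:]]
    img_iter, vid_iter = iter(images[:n_img]), iter(videos[:n_vid])
    for part in parts:
        if part == "<image>":
            content.append({"type": "image", "image": next(img_iter)})
        elif part == "<video>":
            content.append({"type": "video", "video": next(vid_iter)})
        elif part:  # empty text segments between tags are omitted
            content.append({"type": "text", "text": part})
    return {"role": "user", "content": content}


msg = build_mm_message_sketch("First <image> then", ["a.jpg", "b.jpg"], [])
```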

trinity.common.models.mm_utils.has_multi_modal_content(messages: List[Dict]) bool[source]#

Check if any message contains non-text (image/video) content.

Inspects message content structure to detect multi-modal elements. Handles both:

  • String content (text-only, returns False)

  • List content (multi-modal candidates)

Parameters:

messages -- List of conversation messages. Each message must contain a "content" field. Content may be:

  • str: Plain text message

  • List[Dict]: Multi-modal content items (each with "type" key)

Returns:

True if any message contains at least one non-text content item (type != "text"), False otherwise.

Example

>>> msg = [{"role": "user", "content": [{"type": "text", "text": "Hi"}, {"type": "image", "image": "..."}]}]
>>> has_multi_modal_content(msg)
True
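
The check described above can be sketched in a few lines. This is a hypothetical re-implementation (`has_multi_modal_content_sketch` is an illustrative name, not the module's code):

```python
from typing import Dict, List


def has_multi_modal_content_sketch(messages: List[Dict]) -> bool:
    # Hypothetical sketch: string content is text-only; list content is
    # scanned for any item whose type is not "text".
    for message in messages:
        content = message["content"]
        if isinstance(content, str):
            continue  # plain-text message, nothing multi-modal
        if any(item.get("type") != "text" for item in content):
            return True
    return False


mixed = [{"role": "user", "content": [{"type": "text", "text": "Hi"},
                                      {"type": "image", "image": "img.jpg"}]}]
text_only = [{"role": "user", "content": "Hi"}]
```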