trinity.common.models.mm_utils module#
Utilities for processing multi-modal data (images/videos) for specific vision-language models.
Supported models:

- Qwen2.5-VL, Qwen3-VL series
- Kimi VL series
- GLM VL series
Provides functions to:

1. Parse prompts with media tags (<image>/<video>)
2. Validate multi-modal content in conversations
3. Preprocess media inputs for inference/training
4. Construct model-compatible message formats
Note

Only processors whose class names contain "Processor" together with one of "Qwen", "Kimi", or "Glm" are supported. Relies on qwen_vl_utils.process_vision_info for media extraction.
- trinity.common.models.mm_utils.build_multi_modal_data(processor: Any, messages: List[Dict]) Dict[str, Any][source]#
Extract and preprocess vision inputs from multi-modal messages for vLLM inference.
Processes messages containing image/video placeholders using model-specific vision utilities. Returns structured media inputs compatible with vLLM's multi-modal API.
- Parameters:
processor -- Vision-language processor instance (class name must contain "Qwen", "Kimi", or "Glm" together with "Processor").
messages -- List of conversation messages in model-expected format. Each message's "content" may be a string or list of content items (text/image/video dictionaries).
- Returns:
"image": List of processed image objects (if images exist)
"video": List of processed video objects (if videos exist)
Keys are omitted when no corresponding media is present.
- Return type:
Dictionary containing processed media inputs with keys
- Raises:
NotImplementedError -- If processor class name doesn't match supported patterns.
ImportError -- If required qwen_vl_utils module is unavailable.
Example
>>> messages = [{"role": "user", "content": [
...     {"type": "image", "image": "img.jpg"},
...     {"type": "text", "text": "Describe this"}]}]
>>> build_multi_modal_data(processor, messages)
{"image": [processed_image]}
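The key-shaping behavior described above (keys omitted when no media is present) can be sketched as follows. This shows only the media-collection step over the message structure; the model-specific preprocessing via qwen_vl_utils.process_vision_info is left out, and the helper name `collect_media` is hypothetical:

```python
from typing import Any, Dict, List


def collect_media(messages: List[Dict]) -> Dict[str, Any]:
    """Gather raw image/video entries from message content lists,
    omitting keys for absent media (as build_multi_modal_data does)."""
    images, videos = [], []
    for msg in messages:
        content = msg.get("content")
        if not isinstance(content, list):
            continue  # plain string content carries no media
        for item in content:
            if item.get("type") == "image":
                images.append(item["image"])
            elif item.get("type") == "video":
                videos.append(item["video"])
    result: Dict[str, Any] = {}
    if images:
        result["image"] = images
    if videos:
        result["video"] = videos
    return result
```

For the example messages above, this returns `{"image": ["img.jpg"]}`; a text-only conversation yields an empty dictionary.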
- trinity.common.models.mm_utils.build_mm_input_for_training(processor: Any, prompt: str, multi_modal_data: Dict[str, List]) Dict[str, Any][source]#
Tokenize prompt and integrate processed media inputs for model training.
Combines text prompt with preprocessed image/video data into model-ready tensor inputs. Handles padding and tensor conversion for training workflows.
- Parameters:
processor -- Vision-language processor instance (class name must contain "Qwen", "Kimi", or "Glm" together with "Processor").
prompt -- Plain text prompt WITHOUT media tags (e.g., "Describe this image"). Media placement is handled via multi_modal_data, not prompt tags.
multi_modal_data -- Dictionary from build_multi_modal_data() containing: {"image": [...], "video": [...]} (keys optional)
- Returns:
input_ids: Tokenized prompt IDs
attention_mask: Attention mask tensor
pixel_values: Processed image tensors (if images provided)
pixel_values_videos: Processed video tensors (if videos provided)
All tensors converted to PyTorch format (return_tensors="pt").
- Return type:
Dictionary of model inputs including
- Raises:
NotImplementedError -- If processor class name doesn't match supported patterns.
ValueError -- If media counts mismatch prompt expectations (handled internally by processor).
Note
Prompt should NOT contain <image>/<video> tags here. Media association is managed through the structured multi_modal_data dictionary.
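A function like this typically just forwards the prompt and preprocessed media to the processor's `__call__` with PyTorch tensors requested. A minimal sketch, assuming the Hugging Face-style processor signature (`text`/`images`/`videos`/`padding`/`return_tensors`); the function name is hypothetical and this is not the module's actual implementation:

```python
from typing import Any, Dict, List


def build_mm_input_sketch(processor: Any, prompt: str,
                          multi_modal_data: Dict[str, List]) -> Dict[str, Any]:
    """Combine a tag-free prompt with preprocessed media into tensor inputs.

    Assumes a Hugging Face-style processor __call__; absent media keys are
    passed as None so the processor skips them.
    """
    return processor(
        text=[prompt],
        images=multi_modal_data.get("image"),   # None when no images
        videos=multi_modal_data.get("video"),   # None when no videos
        padding=True,
        return_tensors="pt",
    )
```

The processor then emits input_ids and attention_mask, plus pixel_values / pixel_values_videos only when the corresponding media was supplied.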
- trinity.common.models.mm_utils.build_mm_message(prompt: str, images: List[str | Any], videos: List[str | Any]) Dict[str, Any][source]#
Construct multi-modal message by injecting media references at tag positions in prompt.
Parses prompt for <image>/<video> tags, replaces them with corresponding media references, and handles surplus media items. Extra media (beyond tag count) is prepended to content.
- Parameters:
prompt -- Text containing optional <image> and <video> tags as media placeholders. Example: "First <image> then <video> and finally <image>"
images -- List of image references (file paths, URLs, or PIL images) in order of appearance.
videos -- List of video references (file paths, URLs) in order of appearance.
- Returns:
- {
    "role": "user",
    "content": [
        {"type": "image", "image": ...},  # Surplus media first
        {"type": "video", "video": ...},
        {"type": "text", "text": "First "},
        {"type": "image", "image": ...},  # Tag-replaced media
        ...
    ]
}
- Return type:
Message dictionary formatted for VL models
- Raises:
ValueError -- If prompt contains more <image> tags than provided images, or more <video> tags than provided videos.
- Behavior details:
Tags are case-sensitive and must be exact: "<image>", "<video>"
Empty text segments between tags are omitted
Surplus media (images/videos beyond tag count) appears at START of content list
Text segments preserve original prompt ordering around tags
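The behavior detailed above (exact-tag replacement, surplus-first ordering, empty-segment omission) can be sketched with a regex split. A minimal illustration, not the module's actual implementation; in particular, which media items count as "surplus" is an assumption here (the trailing ones, beyond the tag count):

```python
import re
from typing import Any, Dict, List


def build_mm_message_sketch(prompt: str, images: List[Any], videos: List[Any]) -> Dict[str, Any]:
    """Replace <image>/<video> tags with media items; surplus media is
    prepended and empty text segments between tags are dropped."""
    n_img = prompt.count("<image>")
    n_vid = prompt.count("<video>")
    if n_img > len(images) or n_vid > len(videos):
        raise ValueError("More media tags in prompt than media items provided")
    # Surplus media (beyond tag count) goes at the START of the content list.
    content: List[Dict] = [{"type": "image", "image": im} for im in images[n_img:]]
    content += [{"type": "video", "video": vd} for vd in videos[n_vid:]]
    img_iter, vid_iter = iter(images[:n_img]), iter(videos[:n_vid])
    # Capturing split keeps the tags themselves as separate segments.
    for part in re.split(r"(<image>|<video>)", prompt):
        if part == "<image>":
            content.append({"type": "image", "image": next(img_iter)})
        elif part == "<video>":
            content.append({"type": "video", "video": next(vid_iter)})
        elif part:  # skip empty segments between adjacent tags
            content.append({"type": "text", "text": part})
    return {"role": "user", "content": content}
```

For example, `build_mm_message_sketch("First <image> then", ["a.jpg", "b.jpg"], [])` prepends the surplus "b.jpg" and places "a.jpg" at the tag position.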
- trinity.common.models.mm_utils.has_multi_modal_content(messages: List[Dict]) bool[source]#
Check if any message contains non-text (image/video) content.
Inspects message content structure to detect multi-modal elements. Handles both:

- String content (text-only, returns False)
- List content (multi-modal candidates)
- Parameters:
messages -- List of conversation messages. Each message must contain a "content" field. Content may be:
- str: Plain text message
- List[Dict]: Multi-modal content items (each with "type" key)
- Returns:
True if any message contains at least one non-text content item (type != "text"), False otherwise.
Example
>>> msg = [{"role": "user", "content": [
...     {"type": "text", "text": "Hi"},
...     {"type": "image", "image": "..."}]}]
>>> has_multi_modal_content(msg)
True
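Given the content structure described above, the check reduces to a few lines. A faithful sketch (the function name is hypothetical):

```python
from typing import Dict, List


def has_multi_modal_content_sketch(messages: List[Dict]) -> bool:
    """Return True if any message carries a non-text content item."""
    for msg in messages:
        content = msg["content"]
        if isinstance(content, str):
            continue  # plain string content is text-only
        if any(item.get("type") != "text" for item in content):
            return True
    return False
```

A text-only conversation (string content, or a list containing only "text" items) returns False.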