# Multimodal Language Models
These models accept multimodal inputs (e.g., images or video together with text) and generate text output. They augment a language model with multimodal encoders.
## Example Launch Command
```bash
# /mnt/nas1/hf/Qwen-VL-Chat is an example local checkpoint path.
python3 -m rtp_llm.start_server \
    --checkpoint_path /mnt/nas1/hf/Qwen-VL-Chat \
    --model_type qwen_vl \
    --act_type bf16 \
    --port 8088
```
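
Once the server is running, you can send it a multimodal request over HTTP. The sketch below is a minimal illustration, assuming the server exposes an OpenAI-compatible chat completions route on the configured port and accepts image URLs in the message content; the route, payload schema, and image URL here are assumptions and may differ across versions.

```bash
# Minimal sketch (assumed endpoint and schema): query the server
# started above with a text prompt plus an example image URL.
curl -s http://127.0.0.1:8088/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{
          "role": "user",
          "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/demo.jpg"}}
          ]
        }]
      }'
```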
## Supported Models
The supported models are summarized in the table below.

If you are unsure whether a specific architecture is implemented, you can search for it on GitHub. For example, to search for Qwen2_5_VLForConditionalGeneration, use the expression:

```
repo:foundation_models/RTP-LLM path:/^rtp_llm\/models\// Qwen2_5_VLForConditionalGeneration
```

in the GitHub search bar.
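
Alternatively, with a local clone of the repository, a plain grep over the models directory answers the same question (the rtp_llm/models/ layout is taken from the search expression above):

```bash
# Search a local checkout for the architecture class name.
grep -rn "Qwen2_5_VLForConditionalGeneration" rtp_llm/models/
```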
| Model Family (Variants) | Example HuggingFace Identifier | Chat Template | Model Type | Description |
|---|---|---|---|---|
| Qwen3-VL (Qwen3 series) | | | | Alibaba’s vision-language extension of Qwen3MoE; for example, Qwen3-VL (7B and larger variants) can analyze and converse about image content. |
| Qwen-VL (Qwen2.5 series) | | | | Alibaba’s vision-language extension of Qwen; for example, Qwen2.5-VL (7B and larger variants) can analyze and converse about image content. |
| Qwen-VL (Qwen2 series) | | | | Alibaba’s vision-language extension of Qwen; for example, Qwen2-VL (7B and larger variants) can analyze and converse about image content. |
| Qwen-VL (Qwen series) | | | | Alibaba’s vision-language extension of Qwen; for example, Qwen-VL (7B and larger variants) can analyze and converse about image content. |
| DeepSeek-VL2 | | | | Vision-language variant of DeepSeek (with a dedicated image processor), enabling advanced multimodal reasoning on image and text inputs. |
| MiniCPM-V / MiniCPM-o | | | | MiniCPM-V (2.6, ~8B) supports image inputs, and MiniCPM-o adds audio/video; these multimodal LLMs are optimized for end-side deployment on mobile/edge devices. |
| Llama 3.2 Vision (11B) | | | | Vision-enabled variant of Llama 3 (11B) that accepts image inputs for visual question answering and other multimodal tasks. |
| LLaVA (v1.5 & v1.6) | e.g. | | | Open vision-chat models that add an image encoder to LLaMA/Vicuna (e.g. LLaMA2 13B) for following multimodal instruction prompts. |
| LLaVA-NeXT (8B, 72B) | | | | Improved LLaVA models (with an 8B Llama3 version and a 72B version) offering enhanced visual instruction-following and accuracy on multimodal benchmarks. |
| LLaVA-OneVision | | | | Enhanced LLaVA variant integrating Qwen as the backbone; supports multiple images (and even video frames) as inputs via an OpenAI Vision API-compatible format. |
| ChatGlmV4Vision | | | | GLM-4V is a multimodal language model with visual understanding capabilities. |
| InternVL | | | | A pioneering open-source alternative to GPT-4V. |