Overview

First RTP-LLM release: 0.2.0 (2025.09)

Features

Model family (variants)	Example HuggingFace identifier	Description	Supported GPU types
DeepSeek (v1, v2, v3/R1)	`deepseek-ai/DeepSeek-R1`	A family of advanced reasoning-optimized models trained with reinforcement learning (including the 671B MoE); excels at complex reasoning, math, and coding tasks. RTP-LLM provides specific optimizations for DeepSeek v3/R1 models.	NVIDIA ✅ AMD ✅
Kimi (Kimi-K2)	`moonshotai/Kimi-K2-Instruct`	Moonshot AI's MoE large language model with 1 trillion parameters; excels at agentic tasks.	NVIDIA ✅ AMD ✅
Qwen (v1, v1.5, v2, v2.5, v3, QWQ, Qwen3-Coder)	`Qwen/Qwen3-235B-A22B`	A series of advanced reasoning-optimized models with significantly improved performance on reasoning tasks (logical reasoning, mathematics, science, coding, and academic benchmarks that typically require human expertise), achieving state-of-the-art results among open-source thinking models. Markedly better general capabilities such as instruction following, tool usage, text generation, and alignment with human preferences, plus enhanced 256K long-context understanding.	NVIDIA ✅ AMD ✅
QwenVL (VL2, VL2.5, VL3)	`Qwen/Qwen2-VL-2B`	A family of advanced vision-language models based on Qwen2.5/Qwen3.	NVIDIA ✅ AMD ❌
Llama	`meta-llama/Llama-4-Scout-17B-16E-Instruct`	Meta's family of open large language models, ranging from 7B to 400B parameters (Llama 2, Llama 3, and the newer Llama 4), with widely recognized performance.	NVIDIA ✅ AMD ✅

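When using 3FS, deployments with a separated frontend architecture need more memory; alternatively, set FRONTEND_SERVER_COUNT=1 to reduce the memory used by frontend_server in P/D (prefill/decode) deployments.
Loading too many dynamic LoRAs requires a larger reserver_runtime_mem_mb.
MoE models are not supported on AMD.
MoE models without shared_experter cannot use enable-layer-micro-batch.
P/D disaggregation combined with EPLB and MTP step > 1 may cause Prefill to hang.
Embedding for VL models does not work correctly because the position IDs are wrong.
FlexLB: frequently switching a large number of machines degrades FlexLB performance.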
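
The FRONTEND_SERVER_COUNT workaround mentioned for 3FS can be applied from a launcher script. The following is a minimal sketch, assuming the variable is read from the process environment at startup; the helper name `frontend_env` is hypothetical, and the command that starts the server is not specified here, so it is left as a placeholder.

```python
import os


def frontend_env(base=None, frontend_server_count=1):
    """Return a copy of the environment with FRONTEND_SERVER_COUNT set.

    `frontend_env` is an illustrative helper, not part of RTP-LLM.
    """
    env = dict(os.environ if base is None else base)
    # Environment variable values must be strings.
    env["FRONTEND_SERVER_COUNT"] = str(frontend_server_count)
    return env


env = frontend_env()
# Pass `env` to whatever command starts the server, for example:
#   subprocess.run(your_start_command, env=env, check=True)
```

Building a copy of the environment rather than mutating `os.environ` keeps the setting scoped to the launched process.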