Overview

RTP-LLM first release: 0.2.0 (2025.09)

Features

Advanced Framework Features

New Models

| Model family (variants) | Example HuggingFace identifier | Description | Supported GPU types |
|---|---|---|---|
| DeepSeek (v1, v2, v3/R1) | deepseek-ai/DeepSeek-R1 | A family of advanced reasoning-optimized models trained with reinforcement learning (including the 671B MoE); strong on complex reasoning, math, and coding tasks. RTP-LLM provides specific optimizations for DeepSeek v3/R1 models. | NVIDIA ✅, AMD ✅ |
| Kimi (Kimi-K2) | moonshotai/Kimi-K2-Instruct | Moonshot AI's 1-trillion-parameter MoE large language model, strong on agentic tasks. | NVIDIA ✅, AMD ✅ |
| Qwen (v1, v1.5, v2, v2.5, v3, QWQ, Qwen3-Coder) | Qwen/Qwen3-235B-A22B | A family of advanced reasoning-optimized models with significantly improved performance on reasoning tasks, including logical reasoning, mathematics, science, coding, and academic benchmarks that typically require human expertise, achieving state-of-the-art results among open-source thinking models. Markedly better general capabilities, such as instruction following, tool usage, text generation, and alignment with human preferences. Enhanced 256K long-context understanding. | NVIDIA ✅, AMD ✅ |
| QwenVL (VL2, VL2.5, VL3) | Qwen/Qwen2-VL-2B | A family of advanced vision-language models built on Qwen2.5/Qwen3. | NVIDIA ✅, AMD ❌ |
| Llama | meta-llama/Llama-4-Scout-17B-16E-Instruct | Meta's open family of large language models, ranging from 7B to 400B parameters (Llama 2, Llama 3, and the newer Llama 4), with widely recognized performance. | NVIDIA ✅, AMD ✅ |

Bug Fixes

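  • Deadlock in the P/D disaggregated architecture caused by a request being cancelled or failing before remote execution started

  • Apparent hang caused by stop_words in raw request streaming

  • Several speculative decoding bugs

  • Warmup could produce NaN values that might affect the KV cache

  • Unsuccessful queries could leave bad KV cache entries, causing wrong answers

  • UseAllGather now takes effect automatically based on the DP/TP configuration

  • Core dump with UseAllGather and DeepGEMM caused by an incorrect topk type

  • FlexLb: excessive logging degraded performance

  • FlexLb now supports PD_FUSION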

Known Issues

  • With 3FS, running with a separate frontend requires more memory; alternatively, set FRONTEND_SERVER_COUNT=1 to reduce the memory used by frontend_server in P/D.

  • Loading too many dynamic LoRAs requires a larger reserver_runtime_mem_mb

  • MoE models are not supported on AMD

  • MoE models without shared_experter cannot use enable-layer-micro-batch

  • The P/D disaggregated architecture with EPLB and MTP step > 1 may cause Prefill to hang

  • Embeddings of VL models are incorrect because of wrong position IDs

  • FlexLb: frequently switching a large number of machines degrades FlexLb performance
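
As a minimal sketch of the FRONTEND_SERVER_COUNT=1 workaround from the 3FS item above: the snippet assumes the setting is read from the environment of the server process, which these notes do not state explicitly. The helper name `with_single_frontend` is hypothetical, and the launch command is not given here, so it is left out.

```python
import os


def with_single_frontend(base_env=None):
    """Return a copy of the environment with FRONTEND_SERVER_COUNT set to "1".

    Assumption: the server reads FRONTEND_SERVER_COUNT from its process
    environment. The input mapping is copied, never modified in place.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["FRONTEND_SERVER_COUNT"] = "1"
    return env


# Build the environment to hand to the server process.
env = with_single_frontend({"PATH": "/usr/bin"})
print(env["FRONTEND_SERVER_COUNT"])  # prints 1
```

The resulting mapping can be passed as the `env` argument of `subprocess.Popen` together with whatever launch command your deployment already uses.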

Performance

Compatibility