Overview#
RTP-LLM first release: version 0.2.0 (2025.09)
Features#
Framework Advanced Features#
New Models#
| Model Family (Variants) | Example HuggingFace Identifier | Description | Supported Card Types |
|---|---|---|---|
| DeepSeek (v1, v2, v3/R1) | | Series of advanced reasoning-optimized models (including a 671B MoE) trained with reinforcement learning | NV ✅ |
| Kimi (Kimi-K2) | | Moonshot's MoE LLMs with 1 trillion parameters, exceptional at agentic intelligence | NV ✅ |
| Qwen (v1, v1.5, v2, v2.5, v3, QwQ, Qwen3-Coder) | | Series of advanced reasoning-optimized models | NV ✅ |
| QwenVL (VL2, VL2.5, VL3) | | Vision-language model series based on Qwen2.5/Qwen3 | NV ✅ |
| Llama | | Meta's open LLM series, spanning 7B to 400B parameters (Llama 2, 3, and the new Llama 4) with well-recognized performance | NV ✅ |
Bug Fixes#
P/D disaggregation deadlock caused by requests being canceled or failing before remote execution starts
Raw request streaming with stop_words causing a fake hang
Several speculative decoding bugs
Warmup producing NaN values that could corrupt the KV cache
Failed queries polluting the KV cache and causing wrong answers
UseAllGather now takes effect automatically according to the DP/TP configuration
Coredump when UseAllGather is used with DeepGEMM, caused by a wrong topk type
FlexLB performance degradation caused by excessive logging
FlexLB now supports PD_FUSION
Known Issues#
In the 3FS case, P/D with frontend disaggregation needs more memory; set FRONTEND_SERVER_COUNT=1 to reduce frontend_server memory usage (see the sketch after this list)
Many dynamic LoRAs require a larger reserver_runtime_mem_mb (see the sketch after this list)
MoE models are not supported on AMD
MoE models without shared_expert cannot use enable-layer-micro-batch
P/D disaggregation with EPLB and MTP step > 1 may cause prefill to hang
VL model embeddings are incorrect because the position IDs are wrong
FlexLB: frequent switching of a large number of machines degrades FlexLB performance
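A minimal sketch of the two environment-based workarounds above. FRONTEND_SERVER_COUNT comes from the item itself; the uppercase env-var spelling of reserver_runtime_mem_mb, the example value, and the launch step are assumptions and may differ in your deployment:

```python
import os

# Known-issue workarounds, applied before starting the server.

# 3FS + frontend disaggregation in P/D: run a single frontend server
# to reduce frontend_server memory usage (from the known-issues list).
os.environ["FRONTEND_SERVER_COUNT"] = "1"

# Many dynamic LoRAs: reserve more runtime memory (in MB).
# Uppercase env-var spelling of reserver_runtime_mem_mb is an assumption,
# and 4096 is only an illustrative value, not a recommendation.
os.environ["RESERVER_RUNTIME_MEM_MB"] = "4096"

# ...then start the RTP-LLM server as usual for your deployment.
```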