Overview#
RTP-LLM first release: version 0.2.0 (2025.09)
Features#
Framework Advanced Features#
New Models#
| Model Family (Variants) | Example HuggingFace Identifier | Description | Supported Card Types |
|---|---|---|---|
| DeepSeek (v1, v2, v3/R1) | | Series of advanced reasoning-optimized models (including a 671B MoE) trained with reinforcement learning | NV ✅ |
| Kimi (Kimi-K2) | | Moonshot's MoE LLMs with 1 trillion parameters, exceptional at agentic intelligence | NV ✅ |
| Qwen (v1, v1.5, v2, v2.5, v3, QwQ, Qwen3-Coder) | | Series of advanced reasoning-optimized models | NV ✅ |
| QwenVL (VL2, VL2.5, VL3) | | Vision-language model series based on Qwen2.5/Qwen3 | NV ✅ |
| Llama | | Meta's open LLM series, spanning 7B to 400B parameters (Llama 2, 3, and the new Llama 4) with well-recognized performance | NV ✅ |
Bug Fixes#
P/D disaggregation deadlock caused by requests being canceled or failing before remote execution starts
Raw request streaming with stop_words causing a fake hang
Several speculative decoding bugs
Warmup producing NaN values that could corrupt the KV cache
Failed queries polluting the KV cache and causing wrong answers
UseAllGather now takes effect automatically according to the DP/TP configuration
Coredump when UseAllGather is used with DeepGEMM, caused by a wrong topk type
FlexLB performance degradation caused by excessive logging
FlexLB now supports PD_FUSION
Known Issues#
In the 3FS case, P/D with frontend disaggregation needs more memory; set FRONTEND_SERVER_COUNT=1 to reduce frontend_server memory usage (see the sketch after this list)
Many dynamic LoRAs require a larger reserver_runtime_mem_mb (see the sketch after this list)
MoE models are not supported on AMD
MoE models without shared_expert cannot use enable-layer-micro-batch
P/D disaggregation with EPLB and MTP step > 1 may cause prefill to hang
VL model embeddings are incorrect because the position IDs are wrong
FlexLB: frequent switching of a large number of machines degrades FlexLB performance
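A minimal sketch of the two environment-based workarounds above. FRONTEND_SERVER_COUNT comes from the item itself; the uppercase env-var spelling of reserver_runtime_mem_mb, the example value, and the launch step are assumptions and may differ in your deployment:

```python
import os

# Known-issue workarounds, applied before starting the server.

# 3FS + frontend disaggregation in P/D: run a single frontend server
# to reduce frontend_server memory usage (from the known-issues list).
os.environ["FRONTEND_SERVER_COUNT"] = "1"

# Many dynamic LoRAs: reserve more runtime memory (in MB).
# Uppercase env-var spelling of reserver_runtime_mem_mb is an assumption,
# and 4096 is only an illustrative value, not a recommendation.
os.environ["RESERVER_RUNTIME_MEM_MB"] = "4096"

# ...then start the RTP-LLM server as usual for your deployment.
```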