ServerArgs#
This page lists server arguments used to configure the behavior and performance of the language model server via command line. These parameters allow users to customize key server functionalities, including model selection, parallel strategies, memory management, and optimization techniques.
Parallelism and Distributed Setup Configuration#
| Arguments | Description | Defaults |
|---|---|---|
| | Specifies the tensor parallelism degree. | None |
| | Defines the number of model instances for expert parallelism. | None |
| | Sets the number of replicas or group size for data parallelism. | None |
| | Total number of GPUs used in distributed setup (WORLD_SIZE = TP_SIZE * DP_SIZE). | None |
| | Global unique ID of the current process/GPU in the distributed system. | None |
| | Number of GPU devices used on the current node. | None |
| | Enables FFN disaggregation feature to separate attention and feed-forward network computations for performance optimization. | None |
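The relation WORLD_SIZE = TP_SIZE * DP_SIZE from the table can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the rank-to-node mapping assumes that every node contributes the same number of GPUs and that ranks are assigned contiguously per node, which this page does not state.

```python
from __future__ import annotations


def world_size(tp_size: int, dp_size: int) -> int:
    """Total GPU count under the relation WORLD_SIZE = TP_SIZE * DP_SIZE."""
    if tp_size < 1 or dp_size < 1:
        raise ValueError("tp_size and dp_size must be positive")
    return tp_size * dp_size


def node_count(world: int, local_world_size: int) -> int:
    """Number of nodes, ASSUMING each node holds local_world_size GPUs."""
    if local_world_size < 1 or world % local_world_size != 0:
        raise ValueError("world must be a positive multiple of local_world_size")
    return world // local_world_size


def rank_to_node_and_device(world_rank: int, local_world_size: int) -> tuple[int, int]:
    """Map a global rank to (node index, local device index).

    ASSUMPTION: ranks are laid out contiguously, node by node.
    """
    return divmod(world_rank, local_world_size)


if __name__ == "__main__":
    world = world_size(tp_size=4, dp_size=2)
    print("world size:", world)
    print("nodes:", node_count(world, local_world_size=4))
    print("rank 5 lives on (node, device):", rank_to_node_and_device(5, 4))
```

For example, a tensor parallelism degree of 4 with 2 data-parallel replicas needs 8 GPUs, which fits on two nodes of 4 GPUs each.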
Concurrency Control#
| Arguments | Description | Defaults |
|---|---|---|
| | Controls blocking behavior for concurrent requests. | False |
| | Maximum number of concurrent requests allowed by the system. | 32 |
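This page does not spell out what "blocking behavior" means. One plausible reading is that a request arriving at the concurrency limit either waits for a free slot (blocking) or is rejected immediately (non-blocking). The sketch below illustrates that reading with a semaphore. `ConcurrencyLimiter` and its methods are hypothetical names, not part of the server.

```python
import threading


class ConcurrencyLimiter:
    """Hypothetical limiter: at most `limit` requests are in flight at once."""

    def __init__(self, limit: int = 32, block: bool = False) -> None:
        self._sem = threading.Semaphore(limit)
        self._block = block

    def try_enter(self) -> bool:
        # block=True waits for a free slot; block=False returns False
        # immediately when the limit is already reached.
        return self._sem.acquire(blocking=self._block)

    def leave(self) -> None:
        # Release the slot taken by a finished request.
        self._sem.release()


if __name__ == "__main__":
    limiter = ConcurrencyLimiter(limit=2, block=False)
    print([limiter.try_enter() for _ in range(3)])  # third call is rejected
    limiter.leave()
    print(limiter.try_enter())  # a slot is free again
```

With the defaults listed above (limit 32, blocking off), this reading would mean that the 33rd simultaneous request is turned away rather than queued.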
Attention Optimization#
| Arguments | Description | Defaults |
|---|---|---|
| | Enables Fused Multi-Head Attention (FMHA) feature. | True |
| | Enables TensorRT optimized FMHA feature. | True |
| | Enables Paged TensorRT FMHA. | True |
| | Enables open-source FMHA implementation. | True |
| | Enables Paged open-source FMHA implementation. | True |
| | Enables TRTv1-style FMHA. | True |
| | Enables NVTX performance profiling for FMHA. | False |
| | Displays FMHA parameter information. | False |
| | Disables FlashInfer Attention mechanism. | False |
| | Enables XQA feature (requires SM90+ GPU). | True |
KV Cache Configuration#
| Arguments | Description | Defaults |
|---|---|---|
| | Activates KV Cache reuse mechanism. | False |
| | Multi-task prompt file path. | None |
| | Multi-task prompt JSON string. | None |
Hardware/Kernel Optimization#
| Arguments | Description | Defaults |
|---|---|---|
| | Number of SMs used for DeepGEMM. | None |
| | Enables KleidiAI support for ARM GEMM. | False |
| | Enables stable scatter add operation. | False |
| | Enables multi-block mode for MMHA. | True |
| | hipBLASLt GEMM configuration file path. | gemm_config.csv |
| | Disables custom AllReduce implementation. | True |
Device Resource Management#
| Arguments | Description | Defaults |
|---|---|---|
| | Amount of GPU memory to reserve (bytes). | 0 |
| | Amount of CPU memory to reserve (bytes). | 4GB |
| | Number of SMs for compute-communication overlap optimization. | 0 |
| | Compute-communication overlap strategy type. | 0 |
| | M_SPLIT parameter for device operations. | 0 |
| | Enables compute-communication overlapping execution. | True |
| | Enables layer-level micro-batching. | 0 |
| | Do not use default CUDA stream. | False |
DeepEP Configuration#
| Arguments | Description | Defaults |
|---|---|---|
| | Enables DeepEP for MoE processing. Should be set to False for single-EP setups. | False |
| | Enables inter-node communication optimization. | False |
| | Enables DeepEP low-latency mode. | True |
| | Enables P2P low-latency mode. | False |
| | Number of SMs for DeepEPBuffer. | 0 |
EPLB Configuration#
| Arguments | Description | Defaults |
|---|---|---|
| | EPLB mode | “NONE” |
| | EPLB load balancing method. | “mix” |
| | Number of redundant experts. | 0 |
| | EPLB execution cycle. | 5000 |
| | Number of layers updated per EPLB update. | 1 |
| | Globally repack EPLB experts. | False |
| | EPLB statistics window size. | 10 |
| | (DEBUG) EPLB synchronization control parameter cycle. | 100 |
| | (DEBUG) Enables ExpertBalancer test mode | False |
| | (DEBUG) Enables expert pseudo-balancing mechanism. | False |
Sampling Configuration#
| Arguments | Description | Defaults |
|---|---|---|
| | Override system maximum batch size. | 0 |
| | Enables FlashInfer sampling kernel. | True |
Logging & Profiling#
| Arguments | Description | Defaults |
|---|---|---|
| | Enables NVTX performance profiling. | False |
| | Logs inference response content. | False |
| | Enables memory tracing. | False |
| | Enables malloc stack tracing. | False |
| | Collects device performance metrics. | False |
| | Generates core dump on exception. | False |
| | Log configuration file path. | None |
| | Log level (ERROR/WARN/INFO/DEBUG). | INFO |
| | Collects Timeline analysis data. | False |
| | Torch Profiler output directory. | “” |
Speculative Decoding#
| Arguments | Description | Defaults |
|---|---|---|
| | Specifies draft model type (e.g. “deepseek-v3-mtp”) | “” |
| | Controls speculative sampling type (“vanilla” disables, “mtp” enables) | “” |
| | Minimum token match length | 2 |
| | Maximum token match length | 2 |
| | Tree decode mapping configuration file | “” |
| | Maximum number of tokens generated per cycle | 1 |
| | Forces streaming sampling | False |
| | Forces context attention scoring | True |
RPC and Service Discovery#
| Arguments | Description | Defaults |
|---|---|---|
| | Uses local service discovery | False |
| | Remote RPC server address | None |
| | Decode service discovery configuration | None |
| | Remote ViT server address | None |
| | Multimodal service discovery configuration | None |
Cache Store#
| Arguments | Description | Defaults |
|---|---|---|
| | Enables RDMA mode | False |
| | WRR load balancing availability threshold | 80 |
| | WRR ranking factor (0=KV_CACHE usage, 1=in-flight requests) | 0 |
Scheduler Configuration#
| Arguments | Description | Defaults |
|---|---|---|
| | Enables batch decode scheduler | False |
| | Maximum context batch size | 1 |
| | Reserved resource percentage | 5 |
| | Enables long request chunking processing | False |
| | Chunk processing size | None |
| | Allows partial resource reclamation | False |
| | Decode batch size | 1 |
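Long request chunking, as listed in the table, can be pictured as splitting a long prompt into fixed-size pieces that are processed one after another. The sketch below shows only that splitting step. `split_into_chunks` is a hypothetical helper, and how the scheduler actually forms and interleaves chunks is not described on this page.

```python
from typing import List, Sequence


def split_into_chunks(token_ids: Sequence[int], chunk_size: int) -> List[List[int]]:
    """Split a token sequence into consecutive chunks of at most chunk_size."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")
    return [
        list(token_ids[start:start + chunk_size])
        for start in range(0, len(token_ids), chunk_size)
    ]


if __name__ == "__main__":
    prompt = list(range(10))  # stand-in for a tokenized long prompt
    for index, chunk in enumerate(split_into_chunks(prompt, chunk_size=4)):
        print(index, chunk)
```

The last chunk may be shorter than the chunk size when the prompt length is not an exact multiple of it.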
Load Balancing and Performance Optimization Configuration#
| Arguments | Description | Defaults |
|---|---|---|
| | Enables dynamic load balancing | False |
| | Performance record retention time window (microseconds) | 60000000 |
| | Maximum performance record count | 1000 |
| | Disables PDL feature | False |
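The retention window default of 60000000 microseconds is 60 seconds, and it is paired with a cap of 1000 records. A minimal sketch of how two such limits could interact is shown below, assuming records are dropped once they are older than the window or once the cap is exceeded, oldest first. `PerfRecordWindow` is a hypothetical class, and the server's real bookkeeping may differ.

```python
from collections import deque
from typing import Deque, Tuple


class PerfRecordWindow:
    """Hypothetical store bounded by record age and by record count."""

    def __init__(self, window_us: int = 60_000_000, max_records: int = 1000) -> None:
        self.window_us = window_us
        self.max_records = max_records
        self._records: Deque[Tuple[int, float]] = deque()

    def add(self, timestamp_us: int, value: float) -> None:
        # Timestamps are assumed to arrive in non-decreasing order.
        self._records.append((timestamp_us, value))
        self._evict(timestamp_us)

    def _evict(self, now_us: int) -> None:
        # Drop records that fell out of the time window.
        while self._records and now_us - self._records[0][0] > self.window_us:
            self._records.popleft()
        # Drop the oldest records beyond the count cap.
        while len(self._records) > self.max_records:
            self._records.popleft()

    def __len__(self) -> int:
        return len(self._records)


if __name__ == "__main__":
    window = PerfRecordWindow(window_us=100, max_records=3)
    for ts in (0, 50, 120, 130, 140):
        window.add(ts, value=1.0)
    print(len(window))
```

Whichever limit is hit first determines how much history is kept: under heavy traffic the count cap dominates, under light traffic the time window does.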
3FS Configuration#
| Arguments | Description | Defaults |
|---|---|---|
| | Enables 3FS for managing KVCache | False |
Model Adaptation#
| Arguments | Description | Defaults |
|---|---|---|
| | Maximum size limit for LoRA models | -1 |
System Debugging#
| Arguments | Description | Defaults |
|---|---|---|
| | Collects Timeline analysis data | False |
| | Torch Profiler output directory | “” |