ServerArgs#

This page lists server arguments used to configure the behavior and performance of the language model server via command line. These parameters allow users to customize key server functionalities, including model selection, parallel strategies, memory management, and optimization techniques.

Parallelism and Distributed Setup Configuration#

Arguments	Description	Defaults
`--tp-size`	Specifies the tensor parallelism degree.	None
`--ep-size`	Defines the number of model instances for expert parallelism.	None
`--dp-size`	Sets the number of replicas or group size for data parallelism.	None
`--world-size`	Total number of GPUs used in distributed setup (WORLD_SIZE = TP_SIZE * DP_SIZE).	None
`--world-rank`	Global unique ID of the current process/GPU in the distributed system.	None
`--local-world-size`	Number of GPU devices used on the current node.	None
`--enable_ffn_disaggregate`	Enables FFN disaggregation feature to separate attention and feed-forward network computations for performance optimization.	None

Concurrency Control#

Arguments	Description	Defaults
`--concurrency-with-block`	Controls blocking behavior for concurrent requests.	False
`--concurrency-limit`	Maximum number of concurrent requests allowed by the system.	32

Attention Optimization #

Arguments	Description	Defaults
`--enable-fmha`	Enables Fused Multi-Head Attention (FMHA) feature.	True
`--enable-trt-fmha`	Enables TensorRT optimized FMHA feature.	True
`--enable-paged-trt-fmha`	Enables Paged TensorRT FMHA.	True
`--enable-open-source-fmha`	Enables open-source FMHA implementation.	True
`--enable-paged-open-source-fmha`	Enables Paged open-source FMHA implementation.	True
`--enable-trtv1-fmha`	Enables TRTv1-style FMHA.	True
`--fmha-perf-instrument`	Enables NVTX performance profiling for FMHA.	False
`--fmha-show-params`	Displays FMHA parameter information.	False
`--disable-flash-infer`	Disables FlashInfer Attention mechanism.	False
`--enable-xqa`	Enables XQA feature (requires SM90+ GPU).	True

KV Cache Configuration#

Arguments	Description	Defaults
`--reuse-cache`	Activates KV Cache reuse mechanism.	False
`--multi-task-prompt`	Multi-task prompt file path.	None
`--multi-task-prompt-str`	Multi-task prompt JSON string.	None

Hardware/Kernel Optimization#

Arguments	Description	Defaults
`--deep-gemm-num-sm`	Number of SMs used for DeepGEMM.	None
`--arm-gemm-use-kai`	Enables KleidiAI support for ARM GEMM.	False
`--enable-stable-scatter-add`	Enables stable scatter add operation.	False
`--enable-multi-block-mode`	Enables multi-block mode for MMHA.	True
`--rocm-hipblaslt-config`	hipBLASLt GEMM configuration file path.	gemm_config.csv
`--ft-disable-custom-ar`	Disables custom AllReduce implementation.	True

Device Resource Management#

Arguments	Description	Defaults
`--device-reserve-memory-bytes`	Amount of GPU memory to reserve (bytes).	0
`--host-reserve-memory-bytes`	Amount of CPU memory to reserve (bytes).	4GB
`--overlap-math-sm-count`	Number of SMs for compute-communication overlap optimization.	0
`--overlap-comm-type`	Compute-communication overlap strategy type.	0
`--m-split`	M_SPLIT parameter for device operations.	0
`--enable-comm-overlap`	Enables compute-communication overlapping execution.	True
`--enable-layer-micro-batch`	Enables layer-level micro-batching.	0
`--not-use-default-stream`	Do not use default CUDA stream.	False

DeepEP Configuration#

Arguments	Description	Defaults
`--use-deepep-moe`	Enables DeepEP for MoE processing. Single EP shoude set be false	False
`--use-deepep-internode`	Enables inter-node communication optimization.	False
`--use-deepep-low-latency`	Enables DeepEP low-latency mode.	True
`--use-deepep-p2p-low-latency`	Enables P2P low-latency mode.	False
`--deep-ep-num-sm`	Number of SMs for DeepEPBuffer.	0

EPLB Configuration#

Arguments	Description	Defaults
`--eplb_mode`	EPLB mode	“NONE”
`--balance_method`	EPLB load balancing method.	“mix”
`--redundant_expert`	Number of redundant experts.	0
`--eplb_update_time`	EPLB execution cycle.	5000
`--eplb_balance_layer_per_step`	Number of layers updated per EPLB update.	1
`--eplb_force_repack`	Globally repack EPLB experts.	False
`--eplb_stats_window_size`	EPLB statistics window size.	10
`--eplb_control_step`	(DEBUG) EPLB synchronization control parameter cycle.	100
`--eplb_test_mode`	(DEBUG) Enables ExpertBalancer test mode	False
`--fake_balance_expert`	(DEBUG) Enables expert pseudo-balancing mechanism.	False

Sampling Configuration#

Arguments	Description	Defaults
`--max-batch-size`	Override system maximum batch size.	0
`--enable-flashinfer-sample-kernel`	Enables FlashInfer sampling kernel.	True

Logging & Profiling#

Arguments	Description	Defaults
`--ft-nvtx`	Enables NVTX performance profiling.	False
`--py-inference-log-response`	Logs inference response content.	False
`--trace-memory`	Enables memory tracing.	False
`--trace-malloc-stack`	Enables malloc stack tracing.	False
`--enable-device-perf`	Collects device performance metrics.	False
`--ft-core-dump-on-exception`	Generates core dump on exception.	False
`--ft-alog-conf-path`	Log configuration file path.	None
`--log-level`	Log level (ERROR/WARN/INFO/DEBUG).	INFO
`--gen-timeline-sync`	Collects Timeline analysis data.	False
`--torch-cuda-profiler-dir`	Torch Profiler output directory.	“”

Speculative Decoding#

Arguments	Description	Defaults
`--sp-model-type`	Specifies draft model type (e.g. “deepseek-v3-mtp”)	“”
`--sp-type`	Controls speculative sampling type (“vanilla” disables, “mtp” enables)	“”
`--sp-min-token-match`	Minimum token match length	2
`--sp-max-token-match`	Maximum token match length	2
`--tree-decode-config`	Tree decode mapping configuration file	“”
`--gen-num-per-cycle`	Maximum number of tokens generated per cycle	1
`--force-stream-sample`	Forces streaming sampling	False
`--force-score-context-attention`	Forces context attention scoring	True

RPC and Service Discovery#

Arguments	Description	Defaults
`--use-local`	Uses local service discovery	False
`--remote-rpc-server-ip`	Remote RPC server address	None
`--decode-cm2-config`	Decode service discovery configuration	None
`--remote-vit-server-ip`	Remote ViT server address	None
`--multimodal-part-cm2-config`	Multimodal service discovery configuration	None

Cache Store#

Arguments	Description	Defaults
`--cache-store-rdma-mode`	Enables RDMA mode	False
`--wrr-available-ratio`	WRR load balancing availability threshold	80
`--rank-factor`	WRR ranking factor (0=KV_CACHE usage, 1=in-flight requests)	0

Scheduler Configuration#

Arguments	Description	Defaults
`--use-batch-decode-scheduler`	Enables batch decode scheduler	False
`--max-context-batch-size`	Maximum context batch size	1
`--scheduler-reserve-resource-ratio`	Reserved resource percentage	5
`--enable-fast-gen`	Enables long request chunking processing	False
`--fast-gen-context-budget`	Chunk processing size	None
`--enable-partial-fallback`	Allows partial resource reclamation	False
`--batch-decode-scheduler-batch-size`	Decode batch size	1

Load Balancing and Performance Optimization Configuration#

Arguments	Description	Defaults
`--load-balance`	Enables dynamic load balancing	False
`--step-records-time-range`	Performance record retention time window (microseconds)	60000000
`--step-records-max-size`	Maximum performance record count	1000
`--disable-pdl`	Disables PDL feature	False

3FS Configuration#

Arguments	Description	Defaults
`--enable-3fs`	Enables 3FS for managing KVCache	False

Model Adaptation#

Arguments	Description	Defaults
`--max-lora-model-size`	Maximum size limit for LoRA models	-1

System Debugging#

Arguments	Description	Defaults
`--gen-timeline-sync`	Collects Timeline analysis data	False
`--torch-cuda-profiler-dir`	Torch Profiler output directory	“”

Load Config#

Arguments	Description	Defaults
`--load_method`	Specify the weight loading method. Options: auto, fastsafetensors, scratch	auto

ServerArgs

Contents