Quantization#
RTP-LLM currently supports weight-only quantization, including int8 and int4, which significantly reduces GPU memory footprint and accelerates the decoding phase.
Known issues: weight-only quantization may cause performance degradation for long sequences during the prefill phase. Currently, all quantization methods are supported on SM70 and above.
Supported Quantization Methods#
| Card Type | Int8WeightOnly | Int8W8A8 | BlockWiseFp8 | PerTensorFp8 | INT4 | PTPC |
|---|---|---|---|---|---|---|
| CUDA | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| AMD | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ |
GPTQ/AWQ#
Supports int4 and int8. Model weights need to be quantized in advance (using AutoGPTQForCausalLM or AutoAWQForCausalLM); illustrative sketches follow the config examples below.
The model config needs to contain the quantization-related config, including bits, group_size, and quant_method.
GPTQ config example:
"quantization_config": {
"bits": 4,
"group_size": 128,
"quant_method": "gptq"
}
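For reference, a minimal pre-quantization sketch with AutoGPTQ is shown below. The model path, calibration text, and output directory are placeholders, and a real run needs a representative calibration set.

```python
# Hypothetical AutoGPTQ pre-quantization sketch; paths and calibration data are placeholders.
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

pretrained_model_dir = "path/to/base_model"   # placeholder
quantized_model_dir = "path/to/gptq_int4_model"  # pass this directory as --checkpoint_path

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir)
# One calibration sample is shown for brevity; use a representative set in practice.
examples = [tokenizer("RTP-LLM is a high-performance LLM inference engine.")]

quantize_config = BaseQuantizeConfig(bits=4, group_size=128)  # matches the config above
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)

model.quantize(examples)
model.save_quantized(quantized_model_dir)
```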
Example AWQ config:
"quantization_config": {
"bits": 4,
"group_size": 128,
"quant_method": "awq"
}
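And a corresponding sketch with AutoAWQ (again, paths are placeholders and AutoAWQ's default calibration data is used):

```python
# Hypothetical AutoAWQ pre-quantization sketch; paths are placeholders.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "path/to/base_model"       # placeholder
quant_path = "path/to/awq_int4_model"   # pass this directory as --checkpoint_path
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)  # runs AWQ with a default calibration set

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```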
W8A8#
SmoothQuant and OmniQuant are supported. You need to either include a file named "smoothquant.ini" under the checkpoint path, or add a quantization config:
"quantization_config": {
"bits": 8,
"quant_method": "omni_quant"
}
Supports llama, qwen, and starcoder. For the tensor names expected in the checkpoint, refer to the corresponding model file.
BlockWiseFp8#
Supports Load Quant (quantize at load time) or pre-quantized weights.
To quantize at load time, set the quantization argument when starting the server. Example:
python3 -m rtp_llm.start_server --checkpoint_path XXXX --model_type qwen_3 --quantization fp8_per_block
Alternatively, you can provide pre-quantized model weights. The model config needs to contain the quantization-related config:
"quantization_config": {
"activation_scheme": "dynamic",
"fmt": "e4m3",
"quant_method": "fp8",
"weight_block_size": [
128,
128
]
}
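To illustrate what "weight_block_size": [128, 128] means, here is a rough PyTorch sketch of block-wise FP8 (e4m3) quantization with one scale per 128x128 weight tile. It is only illustrative and does not reproduce RTP-LLM's exact checkpoint layout; tensor shapes, scale layout, and names are assumptions.

```python
# Illustrative block-wise FP8 quantization: one scale per 128x128 tile of the weight.
import torch

def quantize_block_fp8(weight: torch.Tensor, block: int = 128):
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    rows, cols = weight.shape
    assert rows % block == 0 and cols % block == 0, "sketch assumes block-aligned shapes"
    # View the weight as (rows/block, block, cols/block, block) tiles.
    tiles = weight.reshape(rows // block, block, cols // block, block)
    amax = tiles.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12)
    scale = amax / fp8_max                        # one scale per tile
    q = (tiles / scale).to(torch.float8_e4m3fn)   # quantized weight
    return q.reshape(rows, cols), scale.reshape(rows // block, cols // block)

w = torch.randn(256, 512)
q, scales = quantize_block_fp8(w)
print(q.dtype, scales.shape)  # torch.float8_e4m3fn torch.Size([2, 4])
```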
PerTensorFp8#
Supports Load Quant (quantize at load time) or weights pre-quantized by TensorRT-LLM/TransformerEngine.
To quantize at load time, set the quantization argument when starting the server. Example:
python3 -m rtp_llm.start_server --checkpoint_path XXXX --model_type qwen_3 --quantization fp8
Alternatively, you can provide pre-quantized model weights. The model config needs to contain the quantization-related config:
"quantization_config": {
"quant_method": "FP8",
"bits": 8
}
Int8WeightOnly#
Supports Load Quant (quantize at load time). Set the quantization argument when starting the server. Example:
python3 -m rtp_llm.start_server --checkpoint_path XXXX --model_type qwen_3 --quantization int8