Speculative Decoding#
1. What is Speculative Sampling#
Speculative sampling is a lossless, general-purpose inference acceleration technique:

- A lightweight Propose Model generates several candidate tokens at once;
- The original Score Model (the large model) then verifies these tokens in parallel;
- This turns "verification" into a Prefill operation, improving the GPU's compute-to-memory-access ratio and reducing Decode latency.
2. Speculative Sampling Algorithms Supported by RTP-LLM#
| Name | Introduction |
| --- | --- |
| vanilla | The classic speculative sampling implementation |
| deterministic | Prompt-Lookup + Speculative Edit |
| mtp | Speculative sampling framework based on DeepSeek-V3 |
| eagle3 | EAGLE-3 |
3. Using Speculative Sampling in RTP-LLM#
Based on the original basic startup parameters, add the following startup arguments:
vanilla#
| Arguments | Value | Description |
| --- | --- | --- |
| `--sp_type` | vanilla | Speculative sampling strategy |
| `--sp_checkpoint_path` | `<path>` | Small model weight path |
| `--sp_model_type` | qwen | Small model architecture, same as the main model |
| `--sp_quantization` | FP8_PER_BLOCK/FP8 | Small model quantization method: FP8, FP8_PER_BLOCK, etc. |
| `--gen_num_per_cycle` | 5 | Number of tokens the small model proposes per cycle |
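For example, a complete launch might look like the sketch below. The entrypoint and model paths are assumptions standing in for your existing startup command; only the `--sp_*` and `--gen_num_per_cycle` flags come from the table above.

```bash
# Sketch only: the entrypoint, paths, and main-model flags stand in for your
# existing RTP-LLM launch command; the sp_* flags are the ones documented above.
python3 -m rtp_llm.start_server \
    --checkpoint_path /path/to/Qwen2.5-7B-Instruct \
    --model_type qwen \
    --sp_type vanilla \
    --sp_checkpoint_path /path/to/Qwen2.5-0.5B-Instruct \
    --sp_model_type qwen \
    --sp_quantization FP8 \
    --gen_num_per_cycle 5
```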
deterministic#
| Arguments | Value | Description |
| --- | --- | --- |
| `--sp_type` | deterministic | Speculative sampling strategy |
| `--gen_num_per_cycle` | 128 | Number of tokens the small model proposes per cycle |
| `--sp_min_token_match` | 2 | Minimum length of n-gram token matching |
| `--sp_max_token_match` | 2 | Maximum length of n-gram token matching |
Additionally, set the following in the request's `extra_config`:
```json
{
  "sp_advice_prompt": "<text you expect the LLM to continue generating>",
  "sp_edit": 0  // 0 = regular Prompt-Lookup; 1 = Speculative Edit
}
```
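A request sketch for an editing scenario follows. The host, port, and surrounding request shape are assumptions for illustration; only the `extra_config` fields come from this doc, and the prompt/advice values are hypothetical.

```bash
# Sketch: endpoint and overall request shape are assumptions; only the
# extra_config fields are documented above. sp_edit=1 enables Speculative
# Edit, with sp_advice_prompt holding the text the model is expected to reuse.
curl -s http://localhost:8088/ \
  -H 'Content-Type: application/json' \
  -d '{
        "prompt": "Rewrite the following function without changing its behavior: ...",
        "extra_config": {
          "sp_advice_prompt": "def quick_sort(arr):",
          "sp_edit": 1
        }
      }'
```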
mtp#
| Arguments | Value | Description |
| --- | --- | --- |
| `--sp_type` | mtp | Speculative sampling strategy |
| `--sp_checkpoint_path` | `<path>` | Small model weight path |
| `--sp_model_type` | qwen_2_mtp | MTP small model type |
| `--gen_num_per_cycle` | 5 | Number of tokens the small model proposes per cycle |
| `--sp_quantization` | FP8_PER_BLOCK/FP8 | Small model quantization method: FP8, FP8_PER_BLOCK, etc. |
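As with vanilla above, a launch sketch (entrypoint, paths, and the main `--model_type` value are illustrative assumptions; the `sp_*` flags are from the table):

```bash
# Sketch only: entrypoint, paths, and --model_type are placeholders; the sp_*
# flags are the ones documented above.
python3 -m rtp_llm.start_server \
    --checkpoint_path /path/to/main/model \
    --model_type qwen_2 \
    --sp_type mtp \
    --sp_checkpoint_path /path/to/mtp/draft \
    --sp_model_type qwen_2_mtp \
    --gen_num_per_cycle 5 \
    --sp_quantization FP8_PER_BLOCK
```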
eagle3#
| Arguments | Value | Description |
| --- | --- | --- |
| `--sp_type` | eagle3 | Speculative sampling strategy |
| `--sp_checkpoint_path` | `<path>` | Small model weight path |
| `--sp_model_type` | qwen_3_moe_eagle3 | EAGLE3 small model type |
| `--gen_num_per_cycle` | 5 | Number of tokens the small model proposes per cycle |
| `--sp_quantization` | FP8_PER_BLOCK/FP8 | Small model quantization method: FP8, FP8_PER_BLOCK, etc. |
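The corresponding launch sketch (same caveats as above: entrypoint, paths, and the main `--model_type` value are placeholders; the `sp_*` flags are from the table):

```bash
# Sketch only: entrypoint, paths, and --model_type are placeholders; the sp_*
# flags are the ones documented above.
python3 -m rtp_llm.start_server \
    --checkpoint_path /path/to/main/model \
    --model_type qwen_3_moe \
    --sp_type eagle3 \
    --sp_checkpoint_path /path/to/eagle3/draft \
    --sp_model_type qwen_3_moe_eagle3 \
    --gen_num_per_cycle 5 \
    --sp_quantization FP8_PER_BLOCK
```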
4. Performance Observation & Tuning#
4.1 Performance Observation#
Add `"aux_info": true` to the request body. Example response fields:

- `cost_time`: 123 ms (end-to-end latency)
- `output_len`: 60 (number of output tokens)
- `iter_count`: 12 (number of speculative rounds)
- `avg_tokens_per_iter` = output_len / iter_count = 5 (average tokens accepted per round)
The most important metric for speculative sampling is `avg_tokens_per_iter`; higher is better.
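For example, the metric can be read straight off a response (sketch: the host, port, and exact response layout are assumptions; the `aux_info` flag and the field names are from this section):

```bash
# Sketch: endpoint and response layout are assumptions; the aux_info flag and
# the cost_time / output_len / iter_count fields are the ones shown above.
curl -s http://localhost:8088/ \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "hello", "aux_info": true}' |
  jq '.aux_info | {cost_time, output_len, iter_count,
                   avg_tokens_per_iter: (.output_len / .iter_count)}'
```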
4.2 Tuning#
vanilla#
Model selection:

- Choose a smaller model from the same series (e.g., Qwen2.5-0.5B).
- Apply INT4 quantization to the small model whenever possible.

`gen_num_per_cycle`:

- The default is 5; it can be increased if the acceptance rate exceeds 40%.
deterministic#
| Parameter | Recommendation | Notes |
| --- | --- | --- |
| `sp_min/max_token_match` | 2 | n-gram length range |
| `gen_num_per_cycle` | 128 (batch=1) | Can be increased for long-sequence editing scenarios |
| `sp_edit` | Set to 1 for code/text editing, 0 otherwise | Controls the matching start point |
| `sp_advice_prompt` | Keep only suffixes that may actually appear | Reduces invalid matches |
mtp / eagle3#
Model training:

- Use https://github.com/SafeAILab/EAGLE to train small models for your specific business scenario.
- Ensure the 1st-token acceptance rate is above 80%, the 2nd-token rate above 60%, and the 3rd-token rate above 40%.

`gen_num_per_cycle`:

- The execution time of the MTP small model can be assumed to be about 1 ms. Given the main model's execution time and the acceptance rate, the optimal `gen_num_per_cycle` can be calculated; see the sketch below.

`sp_quantization`:

- On Hopper-series GPUs, enabling `sp_quantization=FP8_PER_BLOCK` is recommended.
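The calculation referenced above follows the standard speculative-decoding cost model (not an RTP-LLM-specific formula; $\alpha$, $k$, $t_{\mathrm{draft}}$, and $t_{\mathrm{verify}}$ are symbols introduced here for illustration). Assuming a roughly constant per-token acceptance rate $\alpha$ and $k =$ `gen_num_per_cycle` drafted tokens, one cycle costs $k\,t_{\mathrm{draft}} + t_{\mathrm{verify}}$ and yields in expectation

$$
\mathbb{E}[\text{tokens per cycle}] = \frac{1-\alpha^{k+1}}{1-\alpha},
\qquad
\text{throughput}(k) \approx \frac{1-\alpha^{k+1}}{(1-\alpha)\,\bigl(k\,t_{\mathrm{draft}} + t_{\mathrm{verify}}\bigr)}.
$$

Pick the $k$ that maximizes $\text{throughput}(k)$, with $t_{\mathrm{draft}} \approx 1\ \mathrm{ms}$ as noted above and $t_{\mathrm{verify}}$ measured as the main model's per-step latency.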