Speculative Decoding#
1. What is Speculative Sampling#
Speculative sampling is a lossless, general-purpose inference acceleration technique:

- A lightweight Propose Model generates several candidate tokens at once;
- The original Score Model (the large model) then verifies these tokens in parallel;
- This turns "verification" into a Prefill operation, improving the GPU's compute-to-memory-access ratio and reducing Decode latency.
2. Speculative Sampling Algorithms Supported by RTP-LLM#
| Name | Introduction |
| --- | --- |
| vanilla | The classic speculative sampling implementation |
| deterministic | Prompt-Lookup + Speculative Edit |
| mtp | Speculative sampling framework based on DeepSeek-V3 |
| eagle3 | EAGLE-3 |
3. Using Speculative Sampling in RTP-LLM#
Based on the original basic startup parameters, add the following startup arguments:
vanilla#
| Arguments | Value | Description |
| --- | --- | --- |
| `--sp_type` | vanilla | Speculative sampling strategy |
| `--sp_checkpoint_path` | `<path>` | Small model weight path |
| `--sp_model_type` | qwen | Small model architecture, same as the main model |
| `--sp_quantization` | FP8_PER_BLOCK/FP8 | Small model quantization method: FP8, FP8_PER_BLOCK, etc. |
| `--gen_num_per_cycle` | 5 | Number of tokens the small model proposes per cycle |
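For example, a complete launch might look like the sketch below. The entrypoint and model paths are assumptions standing in for your existing startup command; only the `--sp_*` and `--gen_num_per_cycle` flags come from the table above.

```bash
# Sketch only: the entrypoint, paths, and main-model flags stand in for your
# existing RTP-LLM launch command; the sp_* flags are the ones documented above.
python3 -m rtp_llm.start_server \
    --checkpoint_path /path/to/Qwen2.5-7B-Instruct \
    --model_type qwen \
    --sp_type vanilla \
    --sp_checkpoint_path /path/to/Qwen2.5-0.5B-Instruct \
    --sp_model_type qwen \
    --sp_quantization FP8 \
    --gen_num_per_cycle 5
```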
deterministic#
| Arguments | Value | Description |
| --- | --- | --- |
| `--sp_type` | deterministic | Speculative sampling strategy |
| `--gen_num_per_cycle` | 128 | Number of tokens the small model proposes per cycle |
| `--sp_min_token_match` | 2 | Minimum length of n-gram token matching |
| `--sp_max_token_match` | 2 | Maximum length of n-gram token matching |
Additionally, set the following in the request's `extra_config`:
```json
{
  "sp_advice_prompt": "<text you expect the LLM to continue generating>",
  "sp_edit": 0  // 0 = regular Prompt-Lookup; 1 = Speculative Edit
}
```
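A request sketch for an editing scenario follows. The host, port, and surrounding request shape are assumptions for illustration; only the `extra_config` fields come from this doc, and the prompt/advice values are hypothetical.

```bash
# Sketch: endpoint and overall request shape are assumptions; only the
# extra_config fields are documented above. sp_edit=1 enables Speculative
# Edit, with sp_advice_prompt holding the text the model is expected to reuse.
curl -s http://localhost:8088/ \
  -H 'Content-Type: application/json' \
  -d '{
        "prompt": "Rewrite the following function without changing its behavior: ...",
        "extra_config": {
          "sp_advice_prompt": "def quick_sort(arr):",
          "sp_edit": 1
        }
      }'
```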
mtp#
| Arguments | Value | Description |
| --- | --- | --- |
| `--sp_type` | mtp | Speculative sampling strategy |
| `--sp_checkpoint_path` | `<path>` | Small model weight path |
| `--sp_model_type` | qwen_2_mtp | MTP small model type |
| `--gen_num_per_cycle` | 5 | Number of tokens the small model proposes per cycle |
| `--sp_quantization` | FP8_PER_BLOCK/FP8 | Small model quantization method: FP8, FP8_PER_BLOCK, etc. |
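As with vanilla above, a launch sketch (entrypoint, paths, and the main `--model_type` value are illustrative assumptions; the `sp_*` flags are from the table):

```bash
# Sketch only: entrypoint, paths, and --model_type are placeholders; the sp_*
# flags are the ones documented above.
python3 -m rtp_llm.start_server \
    --checkpoint_path /path/to/main/model \
    --model_type qwen_2 \
    --sp_type mtp \
    --sp_checkpoint_path /path/to/mtp/draft \
    --sp_model_type qwen_2_mtp \
    --gen_num_per_cycle 5 \
    --sp_quantization FP8_PER_BLOCK
```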
eagle3#
| Arguments | Value | Description |
| --- | --- | --- |
| `--sp_type` | eagle3 | Speculative sampling strategy |
| `--sp_checkpoint_path` | `<path>` | Small model weight path |
| `--sp_model_type` | qwen_3_moe_eagle3 | EAGLE3 small model type |
| `--gen_num_per_cycle` | 5 | Number of tokens the small model proposes per cycle |
| `--sp_quantization` | FP8_PER_BLOCK/FP8 | Small model quantization method: FP8, FP8_PER_BLOCK, etc. |
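The corresponding launch sketch (same caveats as above: entrypoint, paths, and the main `--model_type` value are placeholders; the `sp_*` flags are from the table):

```bash
# Sketch only: entrypoint, paths, and --model_type are placeholders; the sp_*
# flags are the ones documented above.
python3 -m rtp_llm.start_server \
    --checkpoint_path /path/to/main/model \
    --model_type qwen_3_moe \
    --sp_type eagle3 \
    --sp_checkpoint_path /path/to/eagle3/draft \
    --sp_model_type qwen_3_moe_eagle3 \
    --gen_num_per_cycle 5 \
    --sp_quantization FP8_PER_BLOCK
```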
4. Performance Observation & Tuning#
4.1 Performance Observation#
Add `"aux_info": true` to the request body. Example response fields:

- `cost_time`: 123 ms (end-to-end latency)
- `output_len`: 60 (number of output tokens)
- `iter_count`: 12 (number of speculative rounds)
- `avg_tokens_per_iter` = output_len / iter_count = 5 (average tokens accepted per round)
The most important metric for speculative sampling is `avg_tokens_per_iter`; higher is better.
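For example, the metric can be read straight off a response (sketch: the host, port, and exact response layout are assumptions; the `aux_info` flag and the field names are from this section):

```bash
# Sketch: endpoint and response layout are assumptions; the aux_info flag and
# the cost_time / output_len / iter_count fields are the ones shown above.
curl -s http://localhost:8088/ \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "hello", "aux_info": true}' |
  jq '.aux_info | {cost_time, output_len, iter_count,
                   avg_tokens_per_iter: (.output_len / .iter_count)}'
```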
4.2 Tuning#
vanilla#
Model selection:

- Choose a smaller model from the same series (e.g., Qwen2.5-0.5B).
- Apply INT4 quantization to the small model whenever possible.

`gen_num_per_cycle`:

- The default is 5; it can be increased if the acceptance rate exceeds 40%.
deterministic#
| Parameter | Recommendation | Notes |
| --- | --- | --- |
| `sp_min/max_token_match` | 2 | n-gram length range |
| `gen_num_per_cycle` | 128 (batch=1) | Can be increased for long-sequence editing scenarios |
| `sp_edit` | Set to 1 for code/text editing, 0 otherwise | Controls the matching start point |
| `sp_advice_prompt` | Keep only suffixes that may actually appear | Reduces invalid matches |
mtp / eagle3#
Model training:

- Use https://github.com/SafeAILab/EAGLE to train small models for your specific business scenario.
- Ensure the 1st-token acceptance rate is above 80%, the 2nd-token rate above 60%, and the 3rd-token rate above 40%.

`gen_num_per_cycle`:

- The execution time of the MTP small model can be assumed to be about 1 ms. Given the main model's execution time and the acceptance rate, the optimal `gen_num_per_cycle` can be calculated; see the sketch below.

`sp_quantization`:

- On Hopper-series GPUs, enabling `sp_quantization=FP8_PER_BLOCK` is recommended.
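The calculation referenced above follows the standard speculative-decoding cost model (not an RTP-LLM-specific formula; $\alpha$, $k$, $t_{\mathrm{draft}}$, and $t_{\mathrm{verify}}$ are symbols introduced here for illustration). Assuming a roughly constant per-token acceptance rate $\alpha$ and $k =$ `gen_num_per_cycle` drafted tokens, one cycle costs $k\,t_{\mathrm{draft}} + t_{\mathrm{verify}}$ and yields in expectation

$$
\mathbb{E}[\text{tokens per cycle}] = \frac{1-\alpha^{k+1}}{1-\alpha},
\qquad
\text{throughput}(k) \approx \frac{1-\alpha^{k+1}}{(1-\alpha)\,\bigl(k\,t_{\mathrm{draft}} + t_{\mathrm{verify}}\bigr)}.
$$

Pick the $k$ that maximizes $\text{throughput}(k)$, with $t_{\mathrm{draft}} \approx 1\ \mathrm{ms}$ as noted above and $t_{\mathrm{verify}}$ measured as the main model's per-step latency.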