Sampling Config

With the OpenAI protocol, top_p and temperature are set as top-level fields of the request body, while all other sampling parameters go inside extra_configs. Example:

{
    "model":"Qwen_14B_pressure_test",
    "messages":[
        {
            "role":"system",
            "content":"You are a helpful assistant."
        },
        {
            "role":"user",
            "content":"Hello, what's the weather like in Hangzhou today"
        }
    ],
    "stream":true,
    "temperature":1,
    "max_tokens":1024,
    "top_p":0.8,
    "extra_configs" : {
        "top_k": 1
    }
}
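As a rough illustration of that split, the Python sketch below builds an OpenAI-protocol request body from a flat dictionary of sampling parameters. The helper `build_openai_payload` and the `OUTER_KEYS` set are hypothetical names introduced for this example only; treating max_tokens as a top-level field is an assumption based on the example request above.

```python
import json

# Fields that the example request places at the top level.
# Including max_tokens here is an assumption drawn from that example.
OUTER_KEYS = {"temperature", "top_p", "max_tokens"}


def build_openai_payload(model, messages, sampling, stream=False):
    """Split flat sampling parameters into top-level fields and extra_configs."""
    payload = {"model": model, "messages": messages, "stream": stream}
    extra = {}
    for key, value in sampling.items():
        if key in OUTER_KEYS:
            payload[key] = value
        else:
            extra[key] = value
    if extra:
        payload["extra_configs"] = extra
    return payload


payload = build_openai_payload(
    "Qwen_14B_pressure_test",
    [{"role": "user", "content": "Hello"}],
    {"temperature": 1, "top_p": 0.8, "top_k": 1},
)
print(json.dumps(payload, indent=4))
```

Here top_k ends up under extra_configs, while temperature and top_p stay at the top level.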

The raw protocol specifies sampling parameters through generate_config. Example:

{
    "model":"m6-13b-v1",
    "prompt":"Human: write a list for trip\n\nAssistant:",
    "generate_config":{
        "top_k": 1,
        "top_p": 0
    }
}

Basic Control Parameters#

Parameter Name	Core Function
temperature	Controls sampling randomness: → 0: deterministic (greedy) mode → 1: sample from the unscaled distribution
top_k	Candidate set truncation: → 0: disabled → N: keep only the N highest-probability tokens
top_p	Nucleus sampling: → 0.95: keep the smallest candidate set whose cumulative probability reaches 95%
max_new_tokens	Maximum number of newly generated tokens; the total sequence length is capped at MIN(input_length + max_new_tokens, MAX_SEQ_LEN)
min_new_tokens	Enforces a minimum number of generated tokens
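To make the interaction of temperature, top_k, and top_p concrete, here is a minimal pure-Python sketch of one common way to combine them. The function `filter_probs` is hypothetical, and the order of operations (temperature, then top_k, then top_p) is an assumption; the server's actual kernels may differ in detail.

```python
import math


def filter_probs(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return the renormalized candidate distribution as {token_id: prob}."""
    if temperature <= 0:
        # Treat temperature 0 as greedy decoding: keep only the argmax.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return {best: 1.0}

    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)

    # top_k: keep the N most probable tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]

    # top_p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for prob, idx in ranked:
        kept.append((prob, idx))
        cumulative += prob
        if cumulative >= top_p:
            break

    norm = sum(p for p, _ in kept)
    return {idx: p / norm for p, idx in kept}


candidates = filter_probs([2.0, 1.0, 0.5, -1.0], temperature=0.7, top_k=3, top_p=0.95)
```

Under this sketch, top_k=1 or top_p=0 both reduce the candidate set to the single most probable token, which is why the raw-protocol example above behaves deterministically.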

Advanced Control Parameters#

Parameter Name	Function Description
repetition_penalty	Repetition suppression factor: → >1.0 suppresses repetition → <1.0 encourages repetition
frequency_penalty	Discourages the model from repeating the same tokens too often. Under the usual OpenAI-style definition, the penalty is subtracted from a token's logit once for every time that token has already appeared in the generated text, so frequently repeated tokens are penalized more heavily. Higher values make repetition less likely.
presence_penalty	Encourages the model to use a more diverse range of tokens. Under the usual OpenAI-style definition, the penalty is subtracted once from the logit of every token that has already appeared in the generated text, regardless of how often. Higher values make the model more likely to generate tokens it has not used yet.
stop_words_list	Token ID stop words (better performance): `[[20490,25],[1024]]`
stop_words_str	String stop words (better compatibility): `["<end>","\nObservation"]`
random_seed	Random seed control: → None: True random → Fixed value: Reproducible generation
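The three repetition-related parameters can be sketched as follows. This assumes the widely used conventions (a multiplicative repetition penalty that divides positive logits and multiplies negative ones, plus OpenAI-style additive frequency and presence penalties); `apply_penalties` is a hypothetical helper, not the server's implementation.

```python
from collections import Counter


def apply_penalties(logits, generated_ids, repetition_penalty=1.0,
                    frequency_penalty=0.0, presence_penalty=0.0):
    """Return a new logits list with the three penalties applied."""
    counts = Counter(generated_ids)
    out = list(logits)
    for token_id, count in counts.items():
        value = out[token_id]
        # Multiplicative penalty: shrink positive logits, push negative ones lower.
        if value > 0:
            value /= repetition_penalty
        else:
            value *= repetition_penalty
        # Additive penalties: frequency scales with the count, presence is flat.
        value -= frequency_penalty * count
        value -= presence_penalty
        out[token_id] = value
    return out


penalized = apply_penalties([2.0, -1.0, 0.5], [0, 0, 1],
                            repetition_penalty=2.0,
                            frequency_penalty=0.5,
                            presence_penalty=0.25)
```

Tokens that have not been generated yet (index 2 here) keep their original logit.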

# Stop Words Configuration Example
{
    "stop_words_str": ["<|im_end|>", "\nObservation:"],
    "stop_words_list": [[20490, 25], [50256]]
}
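The difference between the two stop-word forms can be illustrated with a short sketch: token-ID stop words can be checked by comparing integers at the tail of the output, while string stop words require decoded text. Both helpers are hypothetical, and whether the server removes the stop string from the returned text is an assumption made here for illustration.

```python
def hits_stop_ids(output_ids, stop_words_list):
    """True if the generated IDs end with any stop sequence."""
    return any(
        len(seq) <= len(output_ids) and output_ids[-len(seq):] == seq
        for seq in stop_words_list
        if seq
    )


def truncate_at_stop_str(text, stop_words_str):
    """Cut the decoded text at the earliest stop string, if any."""
    positions = [text.find(s) for s in stop_words_str if s and s in text]
    return text[:min(positions)] if positions else text


print(hits_stop_ids([1, 20490, 25], [[20490, 25], [50256]]))
print(truncate_at_stop_str("Sunny.\nObservation: x", ["<|im_end|>", "\nObservation:"]))
```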

Return Control Parameters#

Parameter Name	Effect	Use Case
return_logits	Returns logits matrix for each position	Output analysis/post-processing
return_hidden_states	Returns hidden states of transformer layers	Model debugging/feature extraction
return_input_ids	Returns the token IDs of the encoded input sequence	Input validation
return_output_ids	Returns the token IDs of the generated output sequence	Output decoding validation

Special Mode Parameters#

Mode Name	Control Parameter	Function Description
Thinking Mode	in_think_mode=True	Agent-specific scenario: Control thinking phase length with max_thinking_tokens
Streaming Output	yield_generator=True	Enable chunked return mechanism
Parallel Decoding	pd_separation=True	Enable parallel decoding optimization (hardware support required)

Environment Variable Description#

# Force override stop words configuration
export FORCE_STOP_WORDS=true
export STOP_WORDS_STR="[\"</end>\",\"\\n\"]"

# Hybrid mode (default)
export FORCE_STOP_WORDS=false  # union of environment variables, model defaults, and configuration parameters
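The two modes can be modeled with a small function. This is only a sketch: `resolve_stop_words` is a hypothetical name, and parsing STOP_WORDS_STR as a JSON array is an assumption based on the format of the value shown above.

```python
import json


def resolve_stop_words(env, model_defaults, request_words):
    """Return the effective list of string stop words."""
    env_words = json.loads(env.get("STOP_WORDS_STR", "[]"))
    if env.get("FORCE_STOP_WORDS", "false").lower() == "true":
        # Forced mode: the environment value replaces every other source.
        return env_words
    # Hybrid mode: order-preserving union of all three sources.
    merged = []
    for word in [*env_words, *model_defaults, *request_words]:
        if word not in merged:
            merged.append(word)
    return merged


forced = resolve_stop_words(
    {"FORCE_STOP_WORDS": "true", "STOP_WORDS_STR": '["</end>","\\n"]'},
    ["<|im_end|>"],
    ["\nObservation:"],
)
```

In a real process the `env` argument would typically be `os.environ`; a plain dictionary is used here to keep the example self-contained.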

Sampling Strategy Description#

Combined Strategy Example#

{
    "temperature": 0.7,
    "top_k": 50,
    "top_p": 0.95,
    "repetition_penalty": 1.2,
    "top_p_decay": 0.9,        # 10% decay per token
    "top_p_min": 0.5           # Minimum decay threshold
}
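One plausible reading of top_p_decay and top_p_min is that the effective top_p is multiplied by the decay factor after every generated token and never drops below the minimum. The sketch below works through that reading with the values above; `effective_top_p` is a hypothetical helper, and the exact schedule (including whether the decay is ever reset) is an assumption.

```python
def effective_top_p(step, top_p=0.95, top_p_decay=0.9, top_p_min=0.5):
    """top_p used for the token generated at 0-based position `step`."""
    return max(top_p_min, top_p * top_p_decay ** step)


for step in range(8):
    print(step, round(effective_top_p(step), 4))
```

With these values the effective top_p starts at 0.95 and is clamped to 0.5 from the eighth token onward.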

Strategy Recommended Values#

Scenario	temperature	top_p	top_k
Code Generation	0.2-0.4	0.9	40
Creative Writing	0.7-1.0	0.95	100
Factual Q&A	0.1-0.3	0.8	20

Parameter Usage Notes#

Stop Words Selection Principles
- Performance priority: Use stop_words_list for high-frequency triggering scenarios
- Compatibility priority: Use stop_words_str for complex pattern matching

Length Limit Coordination

Actual maximum length = min(
    input_token_len + max_new_tokens,
    MAX_SEQ_LEN
)
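As a worked example of the formula above, the following sketch computes how many tokens can still be generated for a given prompt length. The function name `remaining_budget` and the example limit of 4096 are illustrative assumptions.

```python
def remaining_budget(input_token_len, max_new_tokens, max_seq_len):
    """Number of tokens that can still be generated under both limits."""
    total_cap = min(input_token_len + max_new_tokens, max_seq_len)
    return max(0, total_cap - input_token_len)


# A 3000-token prompt with max_new_tokens=2048 and MAX_SEQ_LEN=4096
# leaves room for 4096 - 3000 = 1096 new tokens.
print(remaining_budget(3000, 2048, 4096))
```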

Thinking Mode Special Constraints

# These parameters must be configured together
{
    "in_think_mode": True,
    "max_thinking_tokens": 512,  # Control thinking phase length
    "max_new_tokens": 2048       # Control total output length
}

Streaming Output Limitations
- Streaming requires yield_generator=True to be set as well
- return_logits returns the complete data only in non-streaming mode