Sampling Config#
For the OpenAI protocol, top_p and temperature are specified at the top level of the request body, while other sampling parameters go in extra_configs. Example:
{
  "model": "Qwen_14B_pressure_test",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello, what's the weather like in Hangzhou today"}
  ],
  "stream": true,
  "temperature": 1,
  "max_tokens": 1024,
  "top_p": 0.8,
  "extra_configs": {
    "top_k": 1
  }
}
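The same request body can be assembled programmatically. A minimal sketch (values mirror the example above; the endpoint path and HTTP client are deployment-specific and left out):

```python
import json

def build_openai_request(user_content, top_k=1):
    """Assemble an OpenAI-protocol request body: top_p and temperature sit
    at the top level, engine-specific knobs such as top_k go in extra_configs."""
    return {
        "model": "Qwen_14B_pressure_test",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_content},
        ],
        "stream": True,
        "temperature": 1,
        "max_tokens": 1024,
        "top_p": 0.8,
        "extra_configs": {"top_k": top_k},  # params outside the OpenAI spec
    }

payload = build_openai_request("Hello, what's the weather like in Hangzhou today")
body = json.dumps(payload, ensure_ascii=False)  # ready to POST to the chat endpoint
```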
The raw protocol specifies sampling parameters through generate_config. Example:
{
  "model": "m6-13b-v1",
  "prompt": "Human: write a list for trip\n\nAssistant:",
  "generate_config": {
    "top_k": 1,
    "top_p": 0
  }
}
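As code, the raw-protocol body is a plain dict with everything sampling-related nested under generate_config (a sketch; transport and endpoint path are deployment-specific):

```python
import json

# Raw protocol: every sampling parameter lives inside generate_config.
raw_payload = {
    "model": "m6-13b-v1",
    "prompt": "Human: write a list for trip\n\nAssistant:",
    "generate_config": {
        "top_k": 1,  # keep only the single most likely candidate
        "top_p": 0,
    },
}
raw_body = json.dumps(raw_payload)  # POST this to the raw serving endpoint
```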
Basic Control Parameters#
| Parameter Name | Core Function |
|---|---|
| temperature | Controls sampling randomness: lower values make output more deterministic, higher values more diverse |
| top_k | Candidate-set truncation strategy: keep only the k highest-probability tokens |
| top_p | Nucleus sampling strategy: keep the smallest candidate set whose cumulative probability reaches p |
| max_new_tokens | Maximum generation length: caps the number of newly generated tokens |
| min_new_tokens | Enforces a minimum generation length |
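How the basic knobs interact can be illustrated with a small standalone sampler (an illustrative sketch only, not the engine's implementation):

```python
import math, random

def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0, seed=None):
    """Sketch of temperature / top_k / top_p interaction.

    temperature: < 1 sharpens the distribution, > 1 flattens it
    top_k:       keep only the k highest-scoring candidates (0 = disabled)
    top_p:       keep the smallest prefix whose cumulative probability >= p
    """
    rng = random.Random(seed)
    # Temperature scaling.
    scaled = [l / max(temperature, 1e-6) for l in logits]
    # Rank candidates by scaled score, descending.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]
    # Softmax over the surviving candidates.
    m = max(scaled[i] for i in order)
    exps = [math.exp(scaled[i] - m) for i in order]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Nucleus (top_p) truncation.
    kept, cum = [], 0.0
    for idx, p in zip(order, probs):
        kept.append((idx, p))
        cum += p
        if cum >= top_p:
            break
    # Renormalise and draw.
    norm = sum(p for _, p in kept)
    r, acc = rng.random() * norm, 0.0
    for idx, p in kept:
        acc += p
        if acc >= r:
            return idx
    return kept[-1][0]

# top_k=1 degenerates to greedy decoding regardless of temperature:
sample_next_token([0.1, 2.0, -1.0], temperature=0.7, top_k=1)  # -> 1
```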
Advanced Control Parameters#
| Parameter Name | Function Description |
|---|---|
| repetition_penalty | Repetition suppression factor: values above 1 penalize tokens that have already appeared |
| stop_words_list | Token ID stop words (better performance): matched directly against generated token IDs |
| stop_words_str | String stop words (better compatibility): matched against the decoded output text |
| random_seed | Random seed control: fixes the sampling seed for reproducible output |
# Stop Words Configuration Example
{
  "stop_words_str": ["<|im_end|>", "\nObservation:"],
  "stop_words_list": [[20490, 25], [50256]]
}
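Both fields can be derived from the same string list given a tokenizer. A sketch with a hypothetical helper (`encode` stands in for any str-to-token-id callable, e.g. a tokenizer's encode without special tokens; the toy byte-level encoder below is for illustration only):

```python
def make_stop_words_config(stop_strings, encode=None):
    """Build stop_words_str always, and stop_words_list when an
    `encode` callable (str -> list of token ids) is supplied."""
    cfg = {"stop_words_str": list(stop_strings)}
    if encode is not None:
        cfg["stop_words_list"] = [encode(s) for s in stop_strings]
    return cfg

# Toy byte-level "tokenizer" so the example is self-contained:
toy_encode = lambda s: list(s.encode("utf-8"))
cfg = make_stop_words_config(["<|im_end|>", "\nObservation:"], toy_encode)
```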
Return Control Parameters#
| Parameter Name | Effect | Use Case |
|---|---|---|
| return_logits | Returns the logits matrix for each position | Output analysis / post-processing |
| return_hidden_states | Returns hidden states of the transformer layers | Model debugging / feature extraction |
| return_input_ids | Returns the encoded input token sequence | Input validation |
| return_output_ids | Returns the encoded output token sequence | Output decoding validation |
Special Mode Parameters#
| Mode Name | Control Parameter | Function Description |
|---|---|---|
| Thinking Mode | in_think_mode=True | Agent-specific scenario: enables a separate thinking phase before the final answer (see Thinking Mode Special Constraints below) |
| Streaming Output | yield_generator=True | Enables the chunked return mechanism |
| Parallel Decoding | pd_separation=True | Enables parallel decoding optimization (hardware support required) |
Environment Variable Description#
# Force override stop words configuration
export FORCE_STOP_WORDS=true
export STOP_WORDS_STR="[\"</end>\",\"\\n\"]"
# Hybrid mode (default)
export FORCE_STOP_WORDS=false # union of environment variables, model defaults, and configuration parameters
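The override-vs-union semantics described above can be sketched as follows (an assumption-level sketch of the merge logic, not the engine's code; the actual behaviour may differ in detail):

```python
import json, os

def resolve_stop_words(request_stop, model_default_stop):
    """FORCE_STOP_WORDS=true: the environment list wins outright.
    Otherwise (hybrid mode): order-preserving union of environment
    variables, model defaults, and request configuration parameters."""
    env_stop = json.loads(os.environ.get("STOP_WORDS_STR", "[]"))
    if os.environ.get("FORCE_STOP_WORDS", "false").lower() == "true":
        return env_stop
    merged = []
    for s in env_stop + model_default_stop + request_stop:
        if s not in merged:  # deduplicate while keeping order
            merged.append(s)
    return merged

os.environ["FORCE_STOP_WORDS"] = "false"
os.environ["STOP_WORDS_STR"] = '["</end>"]'
stops = resolve_stop_words(["<|im_end|>"], ["</end>"])  # -> ["</end>", "<|im_end|>"]
```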
Sampling Strategy Description#
Combined Strategy Example#
{
  "temperature": 0.7,
  "top_k": 50,
  "top_p": 0.95,
  "repetition_penalty": 1.2,
  "top_p_decay": 0.9,   # 10% decay per token
  "top_p_min": 0.5      # Minimum decay threshold
}
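The decay schedule implied by top_p_decay and top_p_min can be written out explicitly (a sketch of the schedule as described; engine internals may differ):

```python
def decayed_top_p(step, top_p=0.95, decay=0.9, top_p_min=0.5):
    """Effective top_p after `step` generated tokens: multiplied by the
    decay factor each token, but never dropping below top_p_min."""
    return max(top_p * decay ** step, top_p_min)

# top_p shrinks 10% per token until it hits the floor:
schedule = [round(decayed_top_p(s), 4) for s in range(8)]
```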
Strategy Recommended Values#
| Scenario | temperature | top_p | top_k |
|---|---|---|---|
| Code Generation | 0.2-0.4 | 0.9 | 40 |
| Creative Writing | 0.7-1.0 | 0.95 | 100 |
| Factual Q&A | 0.1-0.3 | 0.8 | 20 |
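One way to apply the table is a preset map feeding generate_config (a hypothetical helper; single values are picked from within each recommended range and should be tuned per deployment):

```python
# Presets derived from the recommended values above (assumption: one
# representative value chosen from each temperature range).
SAMPLING_PRESETS = {
    "code_generation":  {"temperature": 0.3,  "top_p": 0.9,  "top_k": 40},
    "creative_writing": {"temperature": 0.85, "top_p": 0.95, "top_k": 100},
    "factual_qa":       {"temperature": 0.2,  "top_p": 0.8,  "top_k": 20},
}

def generate_config_for(scenario, **overrides):
    """Return a generate_config fragment for the scenario, with optional
    per-request overrides layered on top."""
    cfg = dict(SAMPLING_PRESETS[scenario])
    cfg.update(overrides)
    return cfg
```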
Parameter Usage Notes#
Stop Words Selection Principles

- Performance priority: use stop_words_list for high-frequency triggering scenarios
- Compatibility priority: use stop_words_str for complex pattern matching
Length Limit Coordination
Actual maximum length = min(input_token_len + max_new_tokens, MAX_SEQ_LEN)
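In code, the coordination rule is a single clamp (names mirror the formula above):

```python
def effective_max_len(input_token_len, max_new_tokens, max_seq_len):
    """The request's generation budget, capped by the model's MAX_SEQ_LEN."""
    return min(input_token_len + max_new_tokens, max_seq_len)

effective_max_len(1800, 1024, 2048)  # -> 2048, capped by MAX_SEQ_LEN
```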
Thinking Mode Special Constraints
# Both limits need to be configured simultaneously
{
  "in_think_mode": True,
  "max_thinking_tokens": 512,  # Controls thinking-phase length
  "max_new_tokens": 2048       # Controls total output length
}
Streaming Output Limitations
- yield_generator=True must be enabled at the same time
- return_logits only returns complete data in non-streaming mode