PD Disaggregation#
What Is PD Disaggregation, and Why Use It?#
Large Language Models (LLMs) separate Prefill (context preparation) and Decode (token generation) phases to:
- **Boost Efficiency:** Prefill precomputes attention keys/values (KV caching) for the input sequence, enabling fast autoregressive decoding by reusing cached data and reducing compute from O(n²) to O(n) per token.
- **Optimize Memory:** Caching KV matrices during Prefill avoids redundant recomputation, slashing memory overhead during long-sequence generation.
- **Leverage Hardware:** Prefill exploits parallel processing over known inputs (full-sequence batched compute), while Decode optimizes latency-critical step-by-step generation.
- **Scale Applications:** Separation allows dynamic resource allocation (e.g., high-throughput Prefill for prompts plus low-latency Decode for streaming outputs), vital for real-time use cases like chatbots.
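The O(n²)-to-O(n) saving from KV caching can be illustrated with a toy operation count. This is a sketch for intuition only, not rtp_llm code; it counts key/value projections rather than real FLOPs:

```python
# Toy illustration: why caching K/V makes decode O(n) per token.
# Without a cache, generating token t re-projects K/V for the whole
# prefix; with a cache, each decode step adds exactly one K/V pair.

def decode_without_cache(prompt_len, new_tokens):
    """Count K/V projections when nothing is cached."""
    ops = 0
    for step in range(new_tokens):
        # every step re-projects K/V for the full sequence so far
        ops += prompt_len + step + 1
    return ops

def decode_with_cache(prompt_len, new_tokens):
    """Count K/V projections when Prefill fills the cache once."""
    ops = prompt_len   # prefill: project K/V for the whole prompt once
    ops += new_tokens  # decode: one new K/V projection per token
    return ops

print(decode_without_cache(1000, 100))  # → 105050 projections
print(decode_with_cache(1000, 100))     # → 1100 projections
```

For a 1000-token prompt and 100 generated tokens, caching cuts the projection count by roughly two orders of magnitude, which is the gap PD Disaggregation exploits by giving each phase its own hardware.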
Start#
Start Prefill#
[ ]:
import subprocess
from rtp_llm.utils.util import wait_sever_done, stop_server
prefill_port = 8090
decode_port = 27001
server_process = subprocess.Popen(
    [
        "/opt/conda310/bin/python", "-m", "rtp_llm.start_server",
        "--checkpoint_path=/mnt/nas1/hf/models--Qwen--Qwen1.5-0.5B-Chat/snapshots/6114e9c18dac0042fa90925f03b046734369472f/",
        "--model_type=qwen_2",
        "--role_type=PREFILL",
        f"--start_port={prefill_port}",
        "--use_local=1",
        f"--remote_rpc_server_ip=127.0.0.1:{decode_port}",
    ]
)
wait_sever_done(server_process, prefill_port)
Start Decode#
[ ]:
import subprocess
from rtp_llm.utils.util import wait_sever_done, stop_server
prefill_port = 8090
decode_port = 27001
server_process = subprocess.Popen(
    [
        "/opt/conda310/bin/python", "-m", "rtp_llm.start_server",
        "--checkpoint_path=/mnt/nas1/hf/models--Qwen--Qwen1.5-0.5B-Chat/snapshots/6114e9c18dac0042fa90925f03b046734369472f/",
        "--model_type=qwen_2",
        "--role_type=DECODE",
        f"--start_port={decode_port}",
        "--use_local=1",
        f"--remote_rpc_server_ip=127.0.0.1:{prefill_port}",
    ]
)
wait_sever_done(server_process, decode_port)
[ ]:
import openai
prefill_port = 8090
client = openai.Client(
    base_url=f"http://127.0.0.1:{prefill_port}/v1/chat/completions",
    api_key="None",
)
response = client.chat.completions.create(
    model="qwen/qwen2.5-0.5b-instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)
print(f"Response: {response}")
Advanced Configuration#
PD Disaggregation supports the following environment variables.
Prefill Server Configuration#
| Variable | Description | Default |
|---|---|---|
| PREFILL_RETRY_TIMES | Number of retries for the prefill process; 0 disables retries | |
| PREFILL_RETRY_TIMEOUT_MS | Total timeout for prefill retries (milliseconds) | |
| PREFILL_MAX_WAIT_TIMEOUT_MS | Maximum wait timeout for prefill execution (milliseconds) | |
| LOAD_CACHE_TIMEOUT_MS | Timeout for remote KVCache loading (milliseconds) | |
| DECODE_RETRY_TIMES | Number of retries for the decode process; 0 disables retries | |
| DECODE_RETRY_TIMEOUT_MS | Total timeout for decode process retries (milliseconds) | |
| RDMA_CONNECT_RETRY_TIMES | Number of retries for RDMA connection establishment | |
| DECODE_POLLING_KV_CACHE_STEP_MS | Polling interval for KV cache loading status (milliseconds) | |
| DECODE_ENTRANCE | Whether Decode serves as the traffic entry point | |
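As a sketch of how these variables might be applied, they can be placed in the environment of the server process before launch. The values below are illustrative assumptions for this example, not recommended defaults:

```python
import os
import subprocess  # same launch pattern as the cells above

# Illustrative values only -- tune them for your actual deployment.
pd_env = {
    "PREFILL_RETRY_TIMES": "2",          # retry prefill twice; "0" disables retries
    "PREFILL_RETRY_TIMEOUT_MS": "5000",  # total prefill retry budget: 5 s
    "LOAD_CACHE_TIMEOUT_MS": "3000",     # abort remote KVCache loads after 3 s
    "DECODE_ENTRANCE": "1",              # let Decode serve as the traffic entry point
}

# Inherit the current environment, then layer the PD settings on top.
env = {**os.environ, **pd_env}
# server_process = subprocess.Popen([...], env=env)  # same command as in "Start Prefill"
```

Passing `env=` to `subprocess.Popen` keeps the settings scoped to the launched server rather than mutating the notebook's own environment.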