PD Disaggregation#

Why and What is PD Disaggregation?#

Large Language Models (LLMs) separate the Prefill (context preparation) and Decode (token generation) phases to:

- Boost efficiency: Prefill precomputes attention keys/values (KV caching) for the input sequence, enabling fast autoregressive decoding by reusing cached data, reducing compute from O(n²) to O(n) per token.
- Optimize memory: Caching KV matrices during Prefill avoids redundant recomputation, slashing memory overhead during long-sequence generation.
- Leverage hardware: Prefill exploits parallel processing over known inputs (full-sequence batched compute), while Decode optimizes latency-critical step-by-step generation.
- Scale applications: Separation allows dynamic resource allocation (e.g., high-throughput Prefill for prompts plus low-latency Decode for streaming outputs), vital for real-time use cases like chatbots.
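The KV-caching idea behind the split can be sketched in a few lines of plain Python. This is a toy illustration only: `make_kv` stands in for the real key/value projection (which produces tensors, not integers), and `decode_step` counts attention operations to show the per-token cost staying linear.

```python
def make_kv(token):
    # Hypothetical stand-in for projecting a token to its key/value vectors.
    return (hash(token) % 97, hash(token) % 89)

def prefill(prompt_tokens):
    # Prefill: compute K/V for every prompt token once (batched in practice).
    return [make_kv(t) for t in prompt_tokens]

def decode_step(kv_cache, new_token):
    # Decode: compute K/V only for the newest token, then attend over the cache.
    kv_cache.append(make_kv(new_token))
    # One attention score per cached entry: O(n) per token, not O(n^2).
    return len(kv_cache)

cache = prefill(["List", "3", "countries"])
ops_per_step = [decode_step(cache, f"tok{i}") for i in range(4)]
print(ops_per_step)  # grows linearly: [4, 5, 6, 7]
```

Without the cache, every decode step would recompute K/V for the whole prefix, which is exactly the O(n²) cost the disaggregated Prefill phase amortizes away.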

start#

start prefill#

[ ]:

import subprocess

from rtp_llm.utils.util import wait_sever_done, stop_server

prefill_port = 8090
decode_port = 27001

# Launch the PREFILL server; it connects to the decode server over RPC.
server_process = subprocess.Popen(
    [
        "/opt/conda310/bin/python",
        "-m",
        "rtp_llm.start_server",
        "--checkpoint_path=/mnt/nas1/hf/models--Qwen--Qwen1.5-0.5B-Chat/snapshots/6114e9c18dac0042fa90925f03b046734369472f/",
        "--model_type=qwen_2",
        "--role_type=PREFILL",
        f"--start_port={prefill_port}",
        "--use_local=1",
        f"--remote_rpc_server_ip=127.0.0.1:{decode_port}",
    ]
)
wait_sever_done(server_process, prefill_port)

start decode#

[ ]:

import subprocess

from rtp_llm.utils.util import wait_sever_done, stop_server

prefill_port = 8090
decode_port = 27001

# Launch the DECODE server on its own port.
server_process = subprocess.Popen(
    [
        "/opt/conda310/bin/python",
        "-m",
        "rtp_llm.start_server",
        "--checkpoint_path=/mnt/nas1/hf/models--Qwen--Qwen1.5-0.5B-Chat/snapshots/6114e9c18dac0042fa90925f03b046734369472f/",
        "--model_type=qwen_2",
        "--role_type=DECODE",
        f"--start_port={decode_port}",
        "--use_local=1",
        f"--remote_rpc_server_ip=127.0.0.1:{prefill_port}",
    ]
)
wait_sever_done(server_process, decode_port)
[ ]:
import openai

prefill_port=8090
client = openai.Client(base_url=f"http://127.0.0.1:{prefill_port}/v1/chat/completions", api_key="None")

response = client.chat.completions.create(
    model="qwen/qwen2.5-0.5b-instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)

print(f"Response: {response}")
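The same request can also be issued without the `openai` client, using only the standard library. This sketch assumes the prefill server exposes the OpenAI-compatible `/v1/chat/completions` route used above; the network call itself is commented out since it requires both servers to be running.

```python
import json
import urllib.request

prefill_port = 8090
payload = {
    "model": "qwen/qwen2.5-0.5b-instruct",
    "messages": [
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    "temperature": 0,
    "max_tokens": 64,
}

# Build the raw HTTP request to the prefill entry point.
req = urllib.request.Request(
    f"http://127.0.0.1:{prefill_port}/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# with urllib.request.urlopen(req) as resp:  # requires the servers above
#     print(json.loads(resp.read()))
```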

Advanced Configuration#

PD Disaggregation supports the following environment variables.

Prefill Server Configuration#

| Variable | Description | Default |
| --- | --- | --- |
| PREFILL_RETRY_TIMES | Number of retries for the prefill process; 0 disables retries | 0 |
| PREFILL_RETRY_TIMEOUT_MS | Total timeout for prefill retries (milliseconds) | 0 |
| PREFILL_MAX_WAIT_TIMEOUT_MS | Maximum wait timeout for prefill execution (milliseconds) | 600000 |
| LOAD_CACHE_TIMEOUT_MS | Timeout for remote KVCache loading (milliseconds) | 5000 |
| DECODE_RETRY_TIMES | Number of retries for the decode process; 0 disables retries | 100 |
| DECODE_RETRY_TIMEOUT_MS | Total timeout for decode process retries (milliseconds) | 100 |
| RDMA_CONNECT_RETRY_TIMES | Number of retries for RDMA connection establishment | 5000 |
| DECODE_POLLING_KV_CACHE_STEP_MS | Polling interval for KV loading status (milliseconds) | 30 |
| DECODE_ENTRANCE | Whether the Decode server serves as the traffic entry point | false |
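These variables are read from the server process environment, so they can be set when launching a server with `subprocess.Popen`. A minimal sketch for the prefill server (the values here are illustrative, not recommendations):

```python
import os
import subprocess

# Start from the current environment and override the PD tuning knobs.
env = dict(os.environ)
env.update({
    "PREFILL_RETRY_TIMES": "2",           # retry a failed prefill up to 2 times
    "PREFILL_RETRY_TIMEOUT_MS": "10000",  # give up retrying after 10 s total
    "LOAD_CACHE_TIMEOUT_MS": "8000",      # allow slower remote KVCache loads
})

# Pass env= to Popen with the same argv as the prefill cell above, e.g.:
# server_process = subprocess.Popen([...], env=env)
```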